Builds per-block haplotype dosage strings for every individual across the LD blocks listed in blocks. Each block is processed by the C++ engine extract_chr_haplotypes_cpp() (unphased input) or extract_chr_haplotypes_phased_cpp() (phased VCF input). The engine assigns each individual a dosage string of 0/1/2 characters, one per SNP in the block, and identifies the top haplotype alleles ranked by frequency.
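To make the idea of a dosage string concrete, the following base R sketch collapses a toy 0/1/2/NA genotype matrix for a single block into one string per individual and then tabulates how often each string occurs. It is only an illustration under stated assumptions: dosage_strings() is a hypothetical helper, not part of the package, and the C++ engine may treat missing data and rank haplotype alleles differently.

```r
# Toy genotype matrix for one block: 3 individuals x 4 SNPs, values 0/1/2/NA.
geno <- matrix(
  c(0, 1, 2, NA,
    2, 2, 0, 1,
    2, 2, 0, 1),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("ind1", "ind2", "ind3"), paste0("snp", 1:4))
)

# Hypothetical helper (illustration only): one character per SNP,
# with missing genotypes written as na_char.
dosage_strings <- function(block_geno, na_char = ".") {
  chars <- matrix(as.character(block_geno),
                  nrow = nrow(block_geno),
                  dimnames = dimnames(block_geno))
  chars[is.na(block_geno)] <- na_char
  apply(chars, 1, paste, collapse = "")
}

strings <- dosage_strings(geno)
strings
#>   ind1   ind2   ind3
#> "012." "2201" "2201"

# Rank the distinct strings by how many individuals carry them.
hap_freq <- sort(table(strings), decreasing = TRUE)
names(hap_freq)[1]
#> [1] "2201"
```

The na_char argument in this sketch mirrors the documented default of "."; everything else about the helper is an assumption made for illustration.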

Usage

extract_haplotypes(
  geno,
  snp_info,
  blocks,
  chr = NULL,
  min_snps = 3L,
  na_char = "."
)

Arguments

geno

One of:

  • An LDxBlocks_backend from read_geno or read_geno_bigmemory (streaming, one chromosome at a time).

  • A named list with elements hap1 and hap2 (phased SNPs x individuals matrices from read_phased_vcf).

  • A numeric matrix (individuals x SNPs, values 0/1/2/NA).

snp_info

Data frame with columns SNP, CHR, POS.

blocks

Data frame of LD blocks from run_Big_LD_all_chr, with columns CHR, start.bp, end.bp, n_snps.

chr

Character vector of chromosomes to process. NULL (default) processes all chromosomes present in blocks.

min_snps

Integer. Minimum number of SNPs a block must contain to be included. Default 3L.

na_char

Character. Symbol used to denote missing genotype in the dosage string. Default ".".
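The chr and min_snps arguments together decide which rows of blocks are processed. A minimal sketch of that selection in base R follows, assuming the n_snps column of blocks is what min_snps is compared against; select_blocks() is a hypothetical helper written for this illustration and the toy blocks table is made up.

```r
# Toy blocks table with the documented columns.
blocks <- data.frame(
  CHR      = c("1", "1", "2"),
  start.bp = c(1000, 40000, 5000),
  end.bp   = c(25000, 41000, 30000),
  n_snps   = c(12L, 2L, 7L),
  stringsAsFactors = FALSE
)

# Hypothetical helper (illustration only): keep blocks on the requested
# chromosomes that contain at least min_snps SNPs.
select_blocks <- function(blocks, chr = NULL, min_snps = 3L) {
  if (is.null(chr)) chr <- unique(blocks$CHR)  # NULL means all chromosomes
  keep <- blocks$CHR %in% chr & blocks$n_snps >= min_snps
  blocks[keep, , drop = FALSE]
}

kept_all  <- select_blocks(blocks)               # drops the 2-SNP block
kept_chr2 <- select_blocks(blocks, chr = "2")    # only chromosome 2

nrow(kept_all)
#> [1] 2
nrow(kept_chr2)
#> [1] 1
```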

Value

A named list of per-block haplotype dosage matrices (individuals x haplotype alleles, values 0/1/2 for phased data or 0/1 for unphased). The list carries a block_info attribute (data frame with one row per block: block_id, CHR, start_bp, end_bp, n_snps, n_haplotypes, phased).
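The block_info attribute can be read with base R's attr() and used to pick out blocks of interest. The sketch below builds a mock object with the documented shape, so it runs without the package; the block names, the placeholder matrices, and all values are invented for illustration and do not come from real output.

```r
# Mock block_info with the documented columns (values are made up).
block_info <- data.frame(
  block_id     = c("block_1_1000_25000", "block_2_5000_30000"),
  CHR          = c("1", "2"),
  start_bp     = c(1000, 5000),
  end_bp       = c(25000, 30000),
  n_snps       = c(12L, 7L),
  n_haplotypes = c(4L, 3L),
  phased       = c(FALSE, FALSE),
  stringsAsFactors = FALSE
)

# Mock return value: a named list carrying block_info as an attribute.
# The list elements are placeholders, not real dosage data.
haps <- structure(
  list(matrix(0L, nrow = 2, ncol = 4), matrix(0L, nrow = 2, ncol = 3)),
  names = block_info$block_id,
  block_info = block_info
)

# Read the attribute and select blocks with at least 10 SNPs.
info <- attr(haps, "block_info")
big <- info$block_id[info$n_snps >= 10]
haps_big <- haps[big]

names(haps_big)
#> [1] "block_1_1000_25000"
```

Note that subsetting a list with `[` keeps names but drops other attributes, so haps_big no longer carries block_info. Keep a copy of the data frame (info above) if it is needed after subsetting.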

Examples

data(ldx_geno, ldx_snp_info, ldx_blocks)
haps <- extract_haplotypes(ldx_geno, ldx_snp_info, ldx_blocks, min_snps = 3L)
length(haps)                     # one element per block
#> [1] 9
names(haps)[1]                   # e.g. "block_1_1000_25000"
#> [1] "block_1_1000_25027"
dim(haps[[1]])                   # individuals x haplotype alleles
#> NULL