Converts the dosage-encoded haplotype strings produced by
extract_haplotypes (e.g. "02110") into
nucleotide sequences (e.g. "AGTT?") using the REF and ALT
alleles of each SNP in the block.
Usage
decode_haplotype_strings(
  haplotypes,
  snp_info,
  min_freq = 0.01,
  top_n = NULL,
  missing_string = "."
)
Arguments
- haplotypes
List from extract_haplotypes.
- snp_info
Data frame with columns SNP, CHR, POS, REF, ALT. Must contain all SNPs in the blocks.
- min_freq
Minimum haplotype frequency to include. Default 0.01.
- top_n
Integer or NULL. Maximum alleles per block. NULL (default) retains all above min_freq.
- missing_string
Missing genotype marker. Default ".".
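The interaction of min_freq and top_n can be sketched in base R. This is only an illustration under stated assumptions: filter_haps() is a hypothetical helper (not an LDxBlocks function), the frequencies are made up, and the order of operations (threshold first, then the top_n most frequent) and the use of >= for the threshold are guesses at the behaviour, not confirmed by the package.

```r
# Hypothetical per-block haplotype frequencies (made-up values).
freq <- c(hapA = 0.55, hapB = 0.30, hapC = 0.145, hapD = 0.005)

# filter_haps() is an illustrative helper, not part of LDxBlocks.
# Assumption: drop haplotypes below min_freq, then keep the top_n most frequent.
filter_haps <- function(freq, min_freq = 0.01, top_n = NULL) {
  kept <- sort(freq[freq >= min_freq], decreasing = TRUE)
  if (!is.null(top_n)) kept <- head(kept, top_n)
  kept
}

all_above <- filter_haps(freq)             # top_n = NULL keeps everything above min_freq
top_two   <- filter_haps(freq, top_n = 2)  # caps the number of alleles per block

names(all_above)
names(top_two)
```

With these made-up values, hapD falls below the default threshold and is dropped in both calls; top_n = 2 additionally drops hapC.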
Value
A data frame with columns:
- block_id
Block identifier.
- CHR
Chromosome.
- start_bp, end_bp
Block boundaries.
- hap_rank
Rank by frequency (1 = most common).
- hap_id
Column name as it appears in the feature matrix.
- dosage_string
Raw dosage string, e.g. "02110".
- nucleotide_sequence
Decoded nucleotide string, e.g. "AGTT?".
- frequency
Observed frequency across non-missing individuals.
- n_carriers
Number of individuals carrying this haplotype (dosage > 0).
- snp_positions
Semicolon-separated CHR:POS of each SNP in the block.
- snp_alleles
Semicolon-separated REF/ALT for each SNP.
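Because snp_positions and snp_alleles are semicolon-separated strings, they usually need splitting before per-SNP work. A minimal base R sketch, assuming the CHR:POS and REF/ALT layouts described above; the example row is made up and per_snp is just an illustrative name:

```r
# A made-up row in the documented format (values are illustrative only).
row <- data.frame(
  snp_positions = "1:1000;1:2003;1:3001",
  snp_alleles   = "G/C;C/G;T/A",
  stringsAsFactors = FALSE
)

# Split the semicolon-separated fields into one element per SNP.
pos     <- strsplit(row$snp_positions, ";", fixed = TRUE)[[1]]
alleles <- strsplit(row$snp_alleles, ";", fixed = TRUE)[[1]]

# Break each CHR:POS and REF/ALT token into its two parts.
per_snp <- data.frame(
  CHR = sub(":.*$", "", pos),
  POS = as.integer(sub("^.*:", "", pos)),
  REF = sub("/.*$", "", alleles),
  ALT = sub("^.*/", "", alleles),
  stringsAsFactors = FALSE
)
per_snp
```

The resulting data frame has one row per SNP in the block, in the same order as the characters of dosage_string.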
Details
Each character in a haplotype string is the dosage at one SNP in the block:
- "0" = homozygous REF -> REF nucleotide (e.g. A)
- "1" = heterozygous -> IUPAC ambiguity code (e.g. R for A/G)
- "2" = homozygous ALT -> ALT nucleotide (e.g. G)
- "." = missing -> N
The result is a data frame with one row per unique haplotype allele per block, giving its nucleotide sequence, its frequency, and the REF/ALT alleles at each SNP position, so that each haplotype allele can be read directly as a nucleotide sequence.
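The mapping above can be sketched in a few lines of base R. This is not the package's implementation: decode_one() is a hypothetical helper, and the REF/ALT vectors are inferred from the first five SNPs of the output shown under Examples (rows 4, 5 and 6), so treat them as assumptions.

```r
# IUPAC ambiguity codes for unordered pairs of distinct bases.
iupac <- c(AC = "M", AG = "R", AT = "W", CG = "S", CT = "Y", GT = "K")

# decode_one() is an illustrative helper, not part of LDxBlocks.
decode_one <- function(dosage_string, ref, alt, missing_string = ".") {
  codes <- strsplit(dosage_string, "")[[1]]
  stopifnot(length(codes) == length(ref), length(ref) == length(alt))
  out <- vapply(seq_along(codes), function(i) {
    if (codes[i] == "0") {
      ref[i]                              # homozygous REF
    } else if (codes[i] == "2") {
      alt[i]                              # homozygous ALT
    } else if (codes[i] == "1") {
      # Sort the two bases so that A/G and G/A map to the same key.
      key <- paste(sort(c(ref[i], alt[i])), collapse = "")
      unname(iupac[key])                  # heterozygous -> ambiguity code
    } else if (codes[i] == missing_string) {
      "N"                                 # missing genotype
    } else {
      stop("unexpected dosage character: ", codes[i])
    }
  }, character(1))
  paste(out, collapse = "")
}

# Assumed alleles for five SNPs (inferred from the Examples output).
ref <- c("G", "C", "T", "A", "G")
alt <- c("C", "G", "A", "G", "A")

het_mix  <- decode_one("10111", ref, alt)
hom_alt  <- decode_one("20222", ref, alt)
with_gap <- decode_one("0.210", ref, alt)
```

Under these assumptions, "10111" decodes to "SCWRR" and "20222" to "CCAGA", matching the first five characters of rows 1 and 4 in the Examples output, while the missing position in "0.210" becomes N.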
Examples
data(ldx_geno, ldx_snp_info, ldx_blocks, package = "LDxBlocks")
haps <- extract_haplotypes(ldx_geno, ldx_snp_info, ldx_blocks, min_snps = 3)
decoded <- decode_haplotype_strings(haps, ldx_snp_info)
head(decoded[, c("block_id", "hap_rank", "dosage_string",
                 "nucleotide_sequence", "frequency")])
#>             block_id hap_rank             dosage_string
#> 1 block_1_1000_25027        1 1011121012111011210020110
#> 2 block_1_1000_25027        2 0111021121110121201121101
#> 3 block_1_1000_25027        3 0010010011000010101111000
#> 4 block_1_1000_25027        4 2022222002222002220020220
#> 5 block_1_1000_25027        5 1122122111221112211121211
#> 6 block_1_1000_25027        6 0000020022000020200020000
#>         nucleotide_sequence frequency
#> 1 SCWRRTYAKTSSWARYCMGACCWRG      0.22
#> 2 GSWRGTYMTWSSTRAYCAKWCSWAR      0.14
#> 3 GCWAGYTAKWGCTARCYAKWSSTAG      0.12
#> 4 CCAGATCAGTCGAAGTCCGACCAGG      0.12
#> 5 SSAGRTCMKWCGWRRTCMKWCSARR      0.11
#> 6 GCTAGTTATTGCTAACCAGACCTAG      0.10