Harmonize Haplotype Allele Labels Across Panels or Analysis Runs
Source: R/haplotype_inference.R
harmonize_haplotypes.Rd
Ensures that haplotype allele labels are biologically comparable across
different datasets, analysis runs, or training/validation splits. Without
harmonization, the allele string "010110" in one panel is not
guaranteed to correspond to the same biological haplotype in another panel
if block boundaries, SNP ordering, or allele encoding differ between runs.
The function anchors allele identity to a reference dictionary built from a training/reference panel. New (target) haplotypes are then matched against this dictionary:
- Exact match: the allele string exists verbatim in the reference dictionary and is labelled with the reference allele label.
- Nearest-Hamming match: no exact match exists, so the allele is labelled with the most similar reference allele (minimum Hamming distance), provided that distance does not exceed max_hamming.
- Novel: the minimum Hamming distance exceeds max_hamming, so the allele is labelled "<novel>".
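The three matching rules above can be sketched in base R as follows. This is a minimal illustration, not the package's implementation: the helper names `hamming()` and `match_allele()` are hypothetical, the sketch assumes equal-length allele strings, breaks ties by taking the first closest reference allele, and does not treat the missing-data placeholder specially.

```r
# Hamming distance between two equal-length allele strings.
hamming <- function(a, b) {
  stopifnot(nchar(a) == nchar(b))
  sum(strsplit(a, "")[[1]] != strsplit(b, "")[[1]])
}

# Match one target allele against a character vector of reference alleles
# (hypothetical helper; tie-breaking by first minimum is an assumption).
match_allele <- function(allele, dictionary, max_hamming = NULL) {
  # Rule 1: exact match
  if (allele %in% dictionary) {
    return(list(label = allele, type = "exact", dist = 0L))
  }
  # Distance from the target allele to every reference allele
  d <- vapply(dictionary, hamming, integer(1), b = allele)
  best <- which.min(d)
  # Rule 3: closest reference allele is still too far away
  if (!is.null(max_hamming) && d[[best]] > max_hamming) {
    return(list(label = "<novel>", type = "novel", dist = d[[best]]))
  }
  # Rule 2: nearest-Hamming match
  list(label = dictionary[[best]], type = "nearest", dist = d[[best]])
}

dict <- c("010110", "110011")
match_allele("010110", dict)$type                     # exact
match_allele("010111", dict)$label                    # nearest reference allele
match_allele("101001", dict, max_hamming = 2)$label   # beyond the threshold
```

With `max_hamming = NULL`, the sketch never returns `"<novel>"`, which mirrors the documented default of always assigning the nearest reference allele.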
Usage
harmonize_haplotypes(
haplotypes_target,
haplotypes_ref,
min_freq_ref = 0.02,
max_hamming = NULL,
missing_string = "."
)
Arguments
- haplotypes_target
Named list from extract_haplotypes (the panel to harmonize: validation set, new environment, etc.).
- haplotypes_ref
Named list from extract_haplotypes (the reference panel: training set, base population, etc.). Must cover the same blocks as haplotypes_target (extra blocks in either panel are silently skipped).
- min_freq_ref
Numeric. Only alleles above this frequency in the reference panel form the dictionary. Default 0.02.
- max_hamming
Integer. Maximum Hamming distance for a nearest-neighbour match; alleles beyond this distance are labelled "<novel>". Default NULL (no limit; always assigns to the nearest reference allele).
- missing_string
Character. Missing data placeholder. Default ".".
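To illustrate how min_freq_ref shapes the reference dictionary, the sketch below filters the allele strings of a single block by their frequency in a toy reference panel. The helper `build_dictionary()` and the toy data are hypothetical, and the strict comparison (`>`) is an assumption based on the wording "above this frequency"; the package may handle boundary cases and missing data differently.

```r
# Toy allele strings for one block in a reference panel (illustrative only).
ref_alleles <- c(rep("010110", 61), rep("110011", 38), rep("000000", 1))

# Keep only alleles whose frequency in the reference panel exceeds min_freq_ref
# (hypothetical helper, not part of the package).
build_dictionary <- function(alleles, min_freq_ref = 0.02) {
  counts <- table(alleles)
  freq <- as.vector(counts) / length(alleles)
  names(counts)[freq > min_freq_ref]
}

build_dictionary(ref_alleles)
build_dictionary(ref_alleles, min_freq_ref = 0.5)
```

In this toy panel the allele "000000" has frequency 0.01 and is dropped at the default threshold, so a target haplotype carrying it could only receive a nearest-Hamming or "<novel>" label.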
Value
Named list of the same structure as haplotypes_target, with
allele strings replaced by their reference-anchored equivalents. The
block_info attribute from haplotypes_target is preserved.
A harmonization_report attribute is attached: a data frame with
one row per block reporting n_exact, n_nearest,
n_novel, and mean_hamming_dist for matched alleles.
Examples
# \donttest{
data(ldx_geno, ldx_snp_info, ldx_blocks, package = "LDxBlocks")
# Split into training (70%) and validation (30%) sets
n <- nrow(ldx_geno)
idx <- sample(n)  # random permutation; call set.seed() first for a reproducible split
n_ref <- round(n * 0.7)
ref_geno <- ldx_geno[idx[seq_len(n_ref)], ]
tgt_geno <- ldx_geno[idx[(n_ref + 1):n], ]
haps_ref <- extract_haplotypes(ref_geno, ldx_snp_info, ldx_blocks)
haps_tgt <- extract_haplotypes(tgt_geno, ldx_snp_info, ldx_blocks)
haps_harm <- harmonize_haplotypes(haps_tgt, haps_ref)
attr(haps_harm, "harmonization_report")
#> block_id n_exact n_nearest n_novel
#> block_1_1000_25027 block_1_1000_25027 24 12 0
#> block_1_81064_99022 block_1_81064_99022 32 4 0
#> block_1_155368_179371 block_1_155368_179371 31 5 0
#> block_2_1000_30023 block_2_1000_30023 28 8 0
#> block_2_86236_105290 block_2_86236_105290 31 5 0
#> block_2_161515_180473 block_2_161515_180473 33 3 0
#> block_3_1000_19068 block_3_1000_19068 30 6 0
#> block_3_74532_93854 block_3_74532_93854 33 3 0
#> block_3_149647_168376 block_3_149647_168376 31 5 0
#> mean_hamming_dist
#> block_1_1000_25027 4.000000
#> block_1_81064_99022 1.000000
#> block_1_155368_179371 1.400000
#> block_2_1000_30023 4.250000
#> block_2_86236_105290 1.000000
#> block_2_161515_180473 1.000000
#> block_3_1000_19068 1.166667
#> block_3_74532_93854 1.000000
#> block_3_149647_168376 1.000000
# }
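One way to screen the report is to compute the fraction of exact matches per block and flag blocks that fall below a chosen cut-off. The sketch below rebuilds the first three rows of the report shown above as a plain data frame so that it runs without the package; the 0.8 cut-off and the added columns `n_total` and `frac_exact` are arbitrary choices for illustration, not part of the function's output.

```r
# First three rows of the harmonization report shown above, re-entered by hand.
report <- data.frame(
  block_id = c("block_1_1000_25027", "block_1_81064_99022",
               "block_1_155368_179371"),
  n_exact = c(24, 32, 31),
  n_nearest = c(12, 4, 5),
  n_novel = c(0, 0, 0),
  mean_hamming_dist = c(4.0, 1.0, 1.4)
)

# Fraction of exact matches per block
report$n_total <- report$n_exact + report$n_nearest + report$n_novel
report$frac_exact <- report$n_exact / report$n_total

# Blocks where fewer than 80% of matches were exact (arbitrary cut-off)
flagged <- report$block_id[report$frac_exact < 0.8]
flagged
```

Applied to a real result, the same steps would start from `attr(haps_harm, "harmonization_report")` instead of the hand-entered data frame.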