Haplotype Prediction and Block Importance from Pre-Adjusted Phenotypes
Source:R/haplotypes.R
run_haplotype_prediction.RdRuns the complete Tong et al. (2025) haplotype stacking pipeline using pre-adjusted phenotype values (BLUEs, BLUPs, or adjusted entry means). Accepts either a single trait or multiple traits simultaneously.
When a single trait is supplied, GBLUP is fitted via
rrBLUP::kin.blup() and block importance is ranked by
Var(local GEBV) for that trait.
When multiple traits are supplied, a single trait-agnostic GRM is computed
once and shared across all traits. rrBLUP::kin.blup() is fitted per
trait using this shared GRM. Block importance is summarised across traits
with per-trait columns plus cross-trait aggregates.
Usage
run_haplotype_prediction(
geno_matrix,
snp_info,
blocks,
blues,
id_col = "id",
blue_col = "blue",
blue_cols = NULL,
importance_rule = c("any", "all", "mean"),
top_n = NULL,
min_freq = 0.01,
min_snps = 3L,
bend = TRUE,
verbose = TRUE
)Arguments
- geno_matrix
Numeric matrix (individuals x SNPs), values 0/1/2/NA. Row names must be genotype IDs.
- snp_info
Data frame with columns
SNP,CHR,POS.- blocks
Block table from
run_Big_LD_all_chr.- blues
Pre-adjusted phenotype values. See section above for accepted formats.
- id_col
Name of the ID column when
bluesis a data frame. Default"id".- blue_col
Name of the BLUE column for single-trait data frames. Default
"blue".- blue_cols
Character vector of trait column names for multi-trait data frames. Default
NULL– all numeric non-ID columns are used.- importance_rule
How to set the combined
importantflag for multi-trait results:"any"(default): TRUE if important for >= 1 trait."all": TRUE only if important for all traits."mean": TRUE ifvar_scaled_mean>= 0.9.
Ignored for single-trait runs (single-trait
importantis always scaled variance >= 0.9).- top_n
Integer or
NULL. Max haplotype alleles per block. DefaultNULL.- min_freq
Minimum haplotype allele frequency. Default
0.01.- min_snps
Minimum SNPs per block. Default
3L.- bend
Logical. Add ridge to GRM diagonal. Default
TRUE.- verbose
Logical. Print progress. Default
TRUE.
Value
Named list. For single-trait runs, contains:
blocks,diversity,hap_matrix,haplotypes,GCore pipeline outputs.
n_blocks,n_hap_columnsSummary counts.
n_traitsInteger: 1 for single-trait.
traitsCharacter vector of trait names.
solver_usedCharacter: always
"rrBLUP".gebvNamed numeric vector of GEBV.
snp_effectsPer-SNP additive effects.
local_gebvMatrix (individuals x blocks).
block_importanceData frame with
block_id, coordinates,var_local_gebv,var_scaled,important.n_train,n_predictTraining/prediction counts.
For multi-trait runs, the same structure but gebv,
snp_effects, local_gebv are named lists (one per trait),
and block_importance contains additional per-trait and cross-trait
columns as described in the Block importance section.
block_importance_list is added as a named list of single-trait
block importance data frames.
Input format for blues
Accepts any of three formats:
A named numeric vector – single trait, names are genotype IDs:
c(G001 = 4.2, G002 = 3.8).id_col/blue_colare ignored.A data frame with one trait column – single trait,
id_colandblue_colspecify the columns.A data frame with multiple trait columns – multi-trait,
id_colnames the ID column,blue_colsnames the trait columns (orNULLto use all numeric non-ID columns).A named list of named numeric vectors – multi-trait with potentially different individuals per trait:
list(YLD = c(G001=4.2, ...), DIS = c(G001=0.3, ...)).
Multi-trait GBLUP solver strategy
All traits are fitted using rrBLUP::kin.blup() in a per-trait loop
sharing the same GRM. This produces numerically identical results to
sommer::mmes() for single-trait models (verified empirically to 4
decimal places), while avoiding the multi-trait cbind() formula
which fails in sommer <= 4.4.5 (Armadillo fixed-size matrix error in the
C++ coefficient matrix construction). Because all traits use the same GRM,
cross-trait block importance values are directly comparable.
solver_used is always "rrBLUP".
Block importance – single trait
Var(local GEBV) is computed per block after backsolving per-SNP
effects from GEBV. Blocks are scaled to [0,1]; those with scaled variance
>= 0.9 are flagged important. This follows Tong et al. (2024).
Block importance – multi-trait
Per-trait columns var_scaled_<trait> and important_<trait>
are added for every trait. Cross-trait aggregates:
var_scaled_meanMean scaled variance across all traits. Primary ranking criterion – blocks consistently important across traits are more robust stacking candidates.
n_traits_importantCount of traits for which this block is flagged important.
important_anyTRUE if important for at least one trait.
important_allTRUE if important for all traits.
importantControlled by
importance_rule.
References
Tong J et al. (2025). Haplotype stacking to improve stability of stripe rust resistance in wheat. Theoretical and Applied Genetics 138:267. doi:10.1007/s00122-025-05045-0
Tong J et al. (2024). Stacking beneficial haplotypes from the Vavilov wheat collection. Theoretical and Applied Genetics 137:274. doi:10.1007/s00122-024-04784-w
Endelman JB (2011). Ridge regression and other kernels for genomic selection with R package rrBLUP. Plant Genome 4:250-255. doi:10.3835/plantgenome2011.08.0024
Covarrubias-Pazaran G (2016). Genome-assisted prediction of quantitative traits using the R package sommer. PLOS ONE 11:e0156744. doi:10.1371/journal.pone.0156744
Examples
if (FALSE) { # \dontrun{
library(LDxBlocks)
be <- read_geno("mydata.vcf.gz")
blocks <- run_Big_LD_all_chr(be, CLQcut = 0.70)
geno <- read_chunk(be, seq_len(be$n_snps))
rownames(geno) <- be$sample_ids
colnames(geno) <- be$snp_info$SNP
# -- Single trait: named vector ---------------------------------------------
blues_vec <- c(G001 = 4.21, G002 = 3.87, G003 = 5.14)
res <- run_haplotype_prediction(geno, be$snp_info, blocks, blues = blues_vec)
res$block_importance[res$block_importance$important, ]
sort(res$gebv, decreasing = TRUE)
# -- Single trait: data frame -----------------------------------------------
blues_df <- read.csv("blues.csv") # columns: Genotype, YLD_BLUE
res <- run_haplotype_prediction(geno, be$snp_info, blocks,
blues = blues_df,
id_col = "Genotype",
blue_col = "YLD_BLUE")
# -- Multi-trait: data frame ------------------------------------------------
blues_mt <- read.csv("blues_mt.csv") # columns: id, YLD, DIS, PHT
res_mt <- run_haplotype_prediction(geno, be$snp_info, blocks,
blues = blues_mt,
id_col = "id",
blue_cols = c("YLD","DIS","PHT"),
importance_rule = "any")
res_mt$n_traits # 3
res_mt$solver_used # "rrBLUP"
res_mt$block_importance[
res_mt$block_importance$important_any,
c("block_id","var_scaled_YLD","var_scaled_DIS","var_scaled_mean",
"n_traits_important")]
# -- Multi-trait: named list (different individuals per trait) --------------
res_mt2 <- run_haplotype_prediction(geno, be$snp_info, blocks,
blues = list(
YLD = c(G001 = 4.2, G002 = 3.8),
DIS = c(G001 = 0.3, G003 = 0.7)
))
} # }