LDxBlocks -- Genome-Wide LD Block Detection • LDxBlocks

Motivation
Summary
Relationship to the original Big-LD algorithm
- 3.1. Computational core: R loops replaced by C++
- 3.2. Memory model: never-full-genome
- 3.3. Kinship correction: rV²
- 3.4. Singleton SNP handling
- 3.5. Bug fix: zero-row assignment
- 3.6. Downstream pipeline
- 3.7. What is kept exactly
Installation
Quick start
Input formats
- 6.1. Genotype input format
- 6.2. Phenotype input format
  - 6.2.1. Format 1 — Named numeric vector
  - 6.2.2. Format 2 — Data frame, single trait
  - 6.2.3. Format 3 — Data frame, multiple traits
  - 6.2.4. Format 4 — Named list
- 6.3. ID matching rules
- 6.4. Preparing BLUEs from raw phenotype data
Statistical background
- 7.1. MAF filtering
- 7.2. Genotype preparation
- 7.3. Subsegmentation
- 7.4. Clique detection (CLQD)
- 7.5. Block construction
LD metrics: r² versus rV²
Clique detection mode (CLQmode)
- 9.1. Mode 1 — Density
- 9.2. Mode 2 — Maximal
- 9.3. Mode 3 — Louvain
- 9.4. Mode 4 — Leiden
- 9.5. Summary comparison
- 9.6. Recommended configurations
Haplotype analysis
- 10.1. Phase-free haplotype extraction
- 10.2. Haplotype diversity metrics
- 10.3. Haplotype feature matrix for genomic prediction
- 10.4. Haplotype-based genomic prediction pipeline
- 10.5. Cross-validation and prediction accuracy
- 10.6. Between-population comparison
- 10.7. Haplotype network visualisation
- 10.8. Multi-environment stability
- 10.9. Candidate region export
- 10.10. Per-allele effect decomposition
- 10.11. Sliding-window diversity scan
- 10.12. True diplotype inference
- 10.13. Rare-allele collapsing
- 10.14. Cross-panel harmonisation
- 10.15. Haplotype association testing
- 10.16. Breeding decision tools
- 10.17. Cross-population effect concordance (haplotype)
- 10.18. Cross-population effect concordance (external GWAS)
- 10.19. Within-block and between-block epistasis detection
Parameter auto-tuning
Scale strategies and backends
- 12.1. The LDxBlocks_backend interface
- 12.2. Memory requirements by configuration
- 12.3. Recommended configurations by dataset size
Full pipeline walkthrough
Function reference
- 14.1. Main pipeline
- 14.2. I/O
- 14.3. LD computation
- 14.4. C++ kernels (direct access)
- 14.5. Haplotype analysis
- 14.6. Analysis extensions
- 14.7. True haplotype inference and harmonisation
- 14.8. Haplotype association testing
- 14.9. Breeding decision tools
- 14.10. Utilities
Output objects
- 15.1. run_Big_LD_all_chr() — block table
- 15.2. run_ldx_pipeline() — named list
- 15.3. tune_LD_params() — named list
- 15.4. extract_haplotypes() — named list
- 15.5. compute_haplotype_diversity() — data.frame
- 15.6. read_geno() — LDxBlocks_backend
- 15.7. test_block_haplotypes() — LDxBlocks_haplotype_assoc
- 15.8. estimate_diplotype_effects() — LDxBlocks_diplotype
- 15.9. score_favorable_haplotypes() — data frame
- 15.10. summarize_parent_haplotypes() — data frame
- 15.11. compare_block_effects() — LDxBlocks_effect_concordance
- 15.12. compare_gwas_effects() — LDxBlocks_effect_concordance
Memory and performance notes
- 16.1. C++ core
- 16.2. Never-full-genome memory model
- 16.3. OpenMP thread count
Documentation
Citation
Contributing
License
References

1. Motivation

Linkage disequilibrium (LD) block detection is a foundational step in modern genomic analyses. Knowing which SNPs co-segregate as a unit determines how GWAS results are interpreted, how haplotypes are defined for population genetics, and how genomic prediction models should be structured. Despite its importance, most implementations of LD block detection suffer from three problems that limit their usefulness in practice.

The kinship problem. Classical LD estimators (r²) assume independence between individuals. In livestock, crop, or family-based human cohorts this assumption is systematically violated. Cryptic relatedness inflates pairwise correlations, causing LD to appear stronger than it is and blocks to be drawn too broadly or in the wrong places. The kinship-adjusted squared correlation rV² (Mangin et al. 2012, Heredity 108:285-291) corrects for this by whitening the genotype matrix with the inverse square root of the genomic relationship matrix (GRM):

rV²ᵢⱼ = Cov(Xᵛᵢ, Xᵛⱼ)² / [Var(Xᵛᵢ) · Var(Xᵛⱼ)]

where Xᵛ = V⁻¹⁄² G̃ is the kinship-whitened, mean-centred genotype matrix; V = ZZᵀ / (2 Σⱼ pⱼqⱼ) is the VanRaden (2008) GRM; and V⁻¹⁄² is its inverse square root.

The scale problem. The original Big-LD implementation (Kim et al. 2018) contains no compiled code. For modern whole-genome sequencing panels with 2-10 million markers the inner loops are prohibitively slow, and loading the full genotype matrix before detection is impossible on most workstations. LDxBlocks addresses this with a C++/Armadillo computational core compiled via Rcpp, OpenMP-parallelised LD computation, a unified multi-format I/O layer, and a strict never-full-genome memory model.

The pipeline gap. The original Big-LD stops at block boundaries. LDxBlocks adds a complete downstream pipeline: statistical phasing via Beagle 5.x with bigmemory-backed caching, haplotype extraction, diversity metrics, genomic prediction features, association testing, and breeding decision tools.

2. Summary

LDxBlocks is a complete pipeline for genome-wide LD block detection in related or structured populations, with downstream haplotype analysis and genomic prediction utilities. It extends the Big-LD algorithm of Kim et al. (2018) with 15 core improvements:

Dual LD metric — standard r² (default) and kinship-adjusted rV²
C++/Armadillo core — eleven compiled functions handle all expensive operations
OpenMP parallelism — outer loop of compute_r2_cpp() parallelised
Unified multi-format I/O — numeric CSV, HapMap, VCF/VCF.gz, GDS, PLINK BED, R matrix
Never-full-genome memory model — genome never held in RAM at once for any format
MAF filter in C++ — single O(np) pass with NA imputation
C++ boundary scan — replaces R inner loop in subsegmentation
Sparse r² computation — O(p) cost for large sub-segments
Automatic parameter tuning — tune_LD_params() grid search
Haplotype reconstruction — phase-free and Beagle-phased pathways
Diversity metrics — richness, He, Shannon entropy, dominant haplotype frequency
Prediction feature matrix — multi-locus dosage columns for GBLUP/BayesB/ML
Polynomial community detection — Louvain and Leiden for WGS panels
Sparse LD computation — max_bp_distance restricts pairs to a physical window
Memory-mapped genotype store — read_geno_bigmemory() with OS-page access

3. Relationship to the original Big-LD algorithm

3.1. Computational core: R loops replaced by C++

Original R operation	LDxBlocks C++ function	Speedup
`cor(subgeno)` per CLQD call	`compute_r2_cpp()` + OpenMP	~40× for 1,500-SNP window
`apply(Ogeno, 2, ...)` MAF filter	`maf_filter_cpp()` single pass	~10× for 100k+ SNPs
`cor()` inside boundary-scan loop	`boundary_scan_cpp()` compiled	~20× per chromosome
`r2Mat[r2Mat >= CLQcut^2] <- 1`	`build_adj_matrix_cpp()`	eliminates intermediate allocation
Single-column correlation	`col_r2_cpp()`	used in boundary scan helper
Sparse within-window r²	`compute_r2_sparse_cpp()`	avoids O(p²) for large segments
LD-informed overlap resolution	`resolve_overlap_cpp()`	15,700× per-SNP reduction on chr1
Haplotype string building	`build_hap_strings_cpp()`	~20-50× per block
Block-to-SNP interval lookup	`block_snp_ranges_cpp()`	O(p + n_blocks) single sweep
Chromosome haplotype extraction	`extract_chr_haplotypes_cpp()`	strings + freq tabulation in one OpenMP pass
Call-rate filter + imputation	`impute_and_filter_cpp()`	single O(n×p) pass

3.2. Memory model: never-full-genome

The full genotype matrix is never held in RAM at once for any format. Numeric dosage CSV is read in pre-allocated 50,000-row chunks. VCF and HapMap auto-convert to a streaming GDS cache. GDS and PLINK BED backends load only the SNP window per CLQD call. gc(FALSE) is called after each chromosome.

3.3. Kinship correction: rV²

method = "rV2" replaces every pairwise correlation with the kinship-whitened equivalent. The whitening factor A is computed once per chromosome from the VanRaden (2008) GRM via get_V_inv_sqrt() (Cholesky or eigendecomposition), then applied to the centred genotype matrix. In related populations, rV² blocks are typically 10-30% smaller and more precisely delimited than r² blocks.

3.4. Singleton SNP handling

singleton_as_block = TRUE collects SNPs that receive NA from CLQD() and appends them to the block table as single-SNP entries (start == end). Default FALSE preserves original behaviour.

3.5. Bug fix: zero-row assignment

When a sub-segment contains no valid cliques, nowLDblocks has zero rows. LDxBlocks wraps the assignment with if (nrow(nowLD) > 0L), preventing the backwards-sequence error in the original code.

3.6. Downstream pipeline

Capability	Original Big-LD / gpart	LDxBlocks
Block detection	Yes (core algorithm)	Yes (same algorithm + C++ + rV²)
Statistical phasing	No	`run_ldx_pipeline(phase = TRUE)` calls `phase_with_beagle()` internally. Phased hap1/hap2/dosage cached as bigmemory backends for fast restart. `read_phased_vcf()` reads pre-phased VCF output.
Haplotype extraction	No	`extract_haplotypes()` — phased and unphased, backend streaming
Diversity metrics	No	`compute_haplotype_diversity()` — He, Shannon, richness, f_max
Post-GWAS QTL mapping	No	`define_qtl_regions()` — pleiotropic block detection
Genomic prediction features	No	`build_haplotype_feature_matrix()`
Output writers	No	Numeric dosage matrix, nucleotide character matrix, diversity CSV
Parameter auto-tuning	No	`tune_LD_params()` — grid search against GWAS marker coverage
Multi-format I/O	PLINK, VCF (gpart)	Numeric CSV, HapMap, VCF, GDS, BED, R matrix via unified backend
WGS-scale streaming	Partial (gpart GDS)	Full never-full-genome model for all formats

3.7. What is kept exactly

The interval graph modelling of LD bins (cliques of strong pairwise LD SNPs)
CLQD(): bin vector assignment via maximal clique enumeration and greedy density-priority selection
constructLDblock(): maximum-weight independent set via dynamic programming
appendSGTs(): rare-SNP appending logic
cutsequence.modi(): boundary-scan logic and forced-split fall-back
All CLQmode = "Density" and CLQmode = "Maximal" clique scoring
The clstgap physical distance splitting within cliques
Block table column format for drop-in compatibility with downstream tools

4. Installation

Install from GitHub with vignettes (recommended):

install.packages("remotes")
remotes::install_github("FAkohoue/LDxBlocks",
  build_vignettes = TRUE,
  dependencies    = TRUE
)

Required dependencies (installed automatically):

install.packages(c("Rcpp", "RcppArmadillo", "igraph", "data.table", "dplyr"))

Optional dependencies:

# GDS backend — required for .gds files; recommended for panels > 2 M SNPs
BiocManager::install("SNPRelate")

# PLINK BED backend
install.packages("BEDMatrix")

# Kinship-adjusted rV² (method = "rV2")
install.packages(c("AGHmatrix", "ASRgenomics"))

# Parallel parameter tuning
install.packages("future.apply")

# LD block visualisation
install.packages("ggplot2")

# Bigmemory-backed genotype store
install.packages("bigmemory")

5. Quick start

library(LDxBlocks)

# ── Option A: end-to-end pipeline ────────────────────────────────────────────
result <- run_ldx_pipeline(
  geno_source    = "mydata.vcf.gz",
  out_dir        = "ldx_results",
  out_blocks     = "ldx_results/blocks.csv",
  out_diversity  = "ldx_results/diversity.csv",
  out_hap_matrix = "ldx_results/hap_matrix.csv",
  CLQcut         = 0.70,
  n_threads      = 8L,
  verbose        = TRUE
)

result$blocks          # data.frame of LD blocks
result$diversity       # per-block diversity table
result$haplotypes      # named list of haplotype strings

# ── Option B: block detection only ───────────────────────────────────────────
be <- read_geno("mydata.vcf.gz")
blocks <- run_Big_LD_all_chr(be, method = "r2", CLQcut = 0.70, n_threads = 8L)
close_backend(be)

# ── Option C: with Beagle phasing ────────────────────────────────────────────
# Place beagle.jar in out_dir first, then:
result <- run_ldx_pipeline(
  geno_source        = "mydata.vcf.gz",
  out_dir            = "ldx_results",
  out_blocks         = "ldx_results/blocks.csv",
  out_hap_matrix     = "ldx_results/hap_matrix.csv",
  phase              = TRUE,
  beagle_jar         = "ldx_results/beagle.jar",
  beagle_threads     = 8L,
  beagle_java_mem_gb = 16L,
  beagle_seed        = 42L,
  CLQcut             = 0.70,
  n_threads          = 8L
)

result$phase_method   # "beagle"
result$phased_vcf     # path to Beagle-phased VCF.gz

6. Input formats

read_geno() accepts a path to a genotype file (or an in-memory R matrix). The format is auto-detected from the file extension when format is not supplied.

6.1. Genotype input format

Format	Extension	`format =`
Numeric dosage	`.csv`, `.txt`	`"numeric"`
HapMap	`.hmp.txt`	`"hapmap"`
VCF / bgzipped VCF	`.vcf`, `.vcf.gz`	`"vcf"`
SNPRelate GDS	`.gds`	`"gds"`
PLINK binary	`.bed` (+ `.bim`, `.fam`)	`"bed"`
R matrix	(in-memory)	`"matrix"`

Numeric dosage format — one row per SNP; columns SNP, CHR, POS, REF, ALT followed by one column per sample, values in {0, 1, 2, NA}.

VCF — standard VCF v4.2. Both phased (0|1) and unphased (0/1) GT fields are accepted. Multi-allelic sites use the first ALT allele. Missing calls (./.) become NA.

Chromosome normalisation. The leading prefix chr, Chr, or CHR is stripped from chromosome labels at read time. Polyploid sub-genome labels (1A, 2D) are preserved verbatim.

6.2. Phenotype input format

The blues argument of run_haplotype_prediction() and prepare_gblup_inputs() accepts four formats:

Format 1 — Named numeric vector:

blues <- c(G001 = 4.21, G002 = 3.87, G003 = 5.14)

Format 2 — Data frame, single trait:

blues <- read.csv("blues.csv")   # any column names
res <- run_haplotype_prediction(geno, snp_info, blocks,
                                 blues = blues, id_col = "id", blue_col = "YLD")

Format 3 — Data frame, multiple traits:

res <- run_haplotype_prediction(geno, snp_info, blocks,
                                 blues     = blues_df,
                                 id_col    = "id",
                                 blue_cols = c("YLD", "DIS", "PHT"))

Format 4 — Named list (different individuals per trait):

blues <- list(
  YLD = c(G001 = 4.21, G002 = 3.87),
  DIS = c(G001 = 0.32, G003 = 0.28)
)

6.3. ID matching rules

Genotype IDs in phenotype data must match rownames(geno_matrix) exactly (case-sensitive). The function takes the intersection, issues a message for individuals present in only one source, and errors if no common individuals are found.

6.4. Preparing BLUEs from raw phenotype data

# Example with lme4 (single environment)
library(lme4)
m   <- lmer(YLD ~ (1|id) + (1|rep), data = field_data)
blues_lme4 <- data.frame(id  = rownames(coef(m)$id),
                          YLD = coef(m)$id[, 1])

7. Statistical background

7.1. MAF filtering

AF_i = (Σⱼ g_ij) / (2 n_i),    MAF_i = min(AF_i, 1 − AF_i)

SNPs with MAF_i < τ_maf (default 0.05) are removed. Both operations run in a single O(np) C++ pass by maf_filter_cpp().

7.2. Genotype preparation

Standard r² path: Pearson r² via column standardisation followed by BLAS-level matrix multiplication in compute_r2_cpp().

Kinship-adjusted rV² path: VanRaden (2008) GRM computed via AGHmatrix::Gmatrix(), bent and conditioned via ASRgenomics::G.tuneup(), then A = R⁻¹ (Cholesky) or Q Λ⁻¹⁄² Qᵀ (eigen) applied to mean-centred genotypes before the same compute_r2_cpp() kernel.

7.3–7.5. Subsegmentation, clique detection, and block construction

See the full pipeline walkthrough and the original Kim et al. (2018) paper for algorithmic details.

8. LD metrics: r² versus rV²

	r²	rV²
Population type	Random mating, unrelated	Livestock, inbred lines, family-based cohorts
Computational cost	O(np) prep + O(p²) per window	O(n²p) GRM + O(n³) Cholesky + O(np) whitening
RAM	Proportional to one window	n×n GRM + whitening factor held per chromosome
Block accuracy	Slightly inflated in related populations	Correct for structured populations
External dependencies	None	AGHmatrix, ASRgenomics

blocks_r2  <- run_Big_LD_all_chr(be, method = "r2",  CLQcut = 0.70)
blocks_rv2 <- run_Big_LD_all_chr(be, method = "rV2", CLQcut = 0.70, kin_method = "chol")

9. Clique detection mode (CLQmode)

	Density	Maximal	Louvain	Leiden
Algorithm	Bron-Kerbosch + density score	Bron-Kerbosch + size score	Community detection	Community detection
Complexity	Exponential (worst case)	Exponential (worst case)	O(n log n)	O(n log n)
Connected communities?	Yes	Yes	Post-processed in LDxBlocks	Yes (formal guarantee)
WGS feasible?	Only with `max_bp_distance`	Only with `max_bp_distance`	Yes	Yes
Recommended for	Chip panels (< 100k SNPs/chr)	Sparse chip panels	Not recommended (use Leiden)	WGS panels (2M+ SNPs)

# WGS recommended configuration
blocks <- run_Big_LD_all_chr(be, CLQmode = "Leiden", CLQcut = 0.80,
                              max_bp_distance = 500000L,
                              subSegmSize = 500L, leng = 50L,
                              checkLargest = TRUE, n_threads = n_threads)

10. Haplotype analysis

10.1. Phase-free haplotype extraction

extract_haplotypes() concatenates each individual’s allele codes (0, 1, or 2) for all SNPs within a block into a single character string:

Individual i, block b covering SNPs j1–j4:
  haplotype = paste0(g[i,j1], g[i,j2], g[i,j3], g[i,j4]) = "0120"

A note on phasing. True gametic haplotypes require statistical phasing. The strings produced by extract_haplotypes() are diploid allele strings, not gametic phases. Within a high-LD block these strings are nearly 1:1 with true haplotype classes (Calus et al. 2008) and are sufficient for diversity analysis and genomic prediction.

If gametic phases are required, set phase = TRUE in run_ldx_pipeline(), which calls Beagle 5.x after LD block detection and caches phased hap1/hap2/dosage as bigmemory backends for fast restart. Place beagle.jar in out_dir before running.

10.2. Haplotype diversity metrics

compute_haplotype_diversity() returns four metrics per block:

Richness (k): number of unique haplotype strings.

Expected heterozygosity (Hₑ):

Hₑ = n/(n−1) · (1 − Σᵢ pᵢ²)

Shannon entropy (H’):

H' = −Σᵢ pᵢ log₂(pᵢ)

Dominant haplotype frequency (f_max): frequency of the most common haplotype. Values near 1.0 indicate a selective sweep or strong founder effect.

10.3. Haplotype feature matrix for genomic prediction

build_haplotype_feature_matrix() converts haplotype strings to a numeric dosage matrix. For each block, the top_n most frequent haplotypes are selected and each individual receives:

Phased data: 0 (neither gamete), 1 (one gamete — heterozygous), or 2 (both gametes — homozygous). True allele copy number.
Unphased data: 0 (absent) or 1 (present). The value 2 is never produced because the two chromosomes cannot be distinguished — an individual homozygous for an allele and one heterozygous for it produce different dosage strings and are treated as distinct allele classes.

feat  <- build_haplotype_feature_matrix(haps, scale_features = TRUE)$matrix
G_hap <- tcrossprod(feat) / ncol(feat)   # haplotype GRM

10.4. Haplotype-based genomic prediction pipeline

run_haplotype_prediction() produces:

VanRaden GRM from haplotype features
GEBV for all genotyped individuals via REML-based GBLUP
Per-SNP additive effects backsolved from GEBV (Tong et al. 2025)
Local haplotype GEBV per block per individual
Block importance table ranked by scaled Var(local GEBV)

# Single trait
res <- run_haplotype_prediction(geno, snp_info, blocks,
                                 blues = my_blues, id_col = "id", blue_col = "YLD")

# Multiple traits
res_mt <- run_haplotype_prediction(geno, snp_info, blocks,
                                    blues     = my_blues_df,
                                    id_col    = "id",
                                    blue_cols = c("YLD", "DIS"))

# Integrate GWAS evidence
qtl      <- define_qtl_regions(gwas, blocks, snp_info, p_threshold = 5e-8)
priority <- integrate_gwas_haplotypes(qtl, res, diversity = div)
priority[priority$priority_score == 3, ]

10.5. Cross-validation and prediction accuracy

cv_haplotype_prediction() estimates prediction accuracy via k-fold cross-validation. Returns a LDxBlocks_cv object with pa_mean (PA ± SD per trait), pa_summary (per-fold), k, and n_rep.

10.6. Between-population comparison

compare_haplotype_populations() computes per-block Weir-Cockerham FST, maximum allele frequency difference, and a chi-squared test (Monte Carlo, B = 2000). divergent flag: FST > 0.1 AND p < 0.05.

10.7. Haplotype network visualisation

plot_haplotype_network() draws a minimum-spanning network using igraph::mst(). Nodes sized by frequency; edges weighted by Hamming distance. Optional group colouring.

10.8. Multi-environment stability

run_haplotype_stability() runs Finlay-Wilkinson (1963) regression of per-block local GEBV contributions against the environmental index. Returns b (stability coefficient), b_se, R², s²d, and stable flag.

10.9. Candidate region export

export_candidate_regions() converts define_qtl_regions() output to "bed" (UCSC-compatible, 0-based), "csv", or "biomart" format.

10.10. Per-allele effect decomposition

decompose_block_effects() aggregates per-SNP additive effects into a per-haplotype-allele effect table: effect = sum(SNP_effect × allele_dosage).

10.11. Sliding-window diversity scan

scan_diversity_windows() computes He, Shannon entropy, n_eff_alleles, and dominant haplotype frequency in sliding windows across the genome, independently of LD block boundaries.

10.12. True diplotype inference

infer_block_haplotypes() converts raw haplotype strings into a structured per-individual, per-block diplotype table. For phased input (from run_ldx_pipeline(phase = TRUE)), phase_ambiguous is always FALSE.

10.13. Rare-allele collapsing

collapse_haplotypes() merges alleles below min_freq using "rare_to_other", "nearest" (Hamming-based), or "tree_based" (UPGMA) strategies. A label_map attribute records every original → collapsed mapping.

10.14. Cross-panel harmonisation

harmonize_haplotypes() anchors allele identity to a reference dictionary, matching by exact string or nearest Hamming neighbour within max_hamming. Novel alleles receive label "<novel>". A harmonization_report attribute reports match quality per block.

10.15. Haplotype association testing

test_block_haplotypes() uses a unified Q+K mixed linear model with simpleM multiple-testing correction (Gao et al. 2008, 2010, 2011). The model jointly corrects for population structure and polygenic kinship:

y = μ + α · x_hap + Σ β_k · PC_k + g + ε, g ~ MVN(0, σ²_g G)

Key parameters:

n_pcs = 0L (default, EMMAX): GRM random effect absorbs all structure
n_pcs = k: top-k GRM eigenvectors as additional fixed effects (Q+K model)
n_pcs = NULL: auto-selected from the GRM scree-plot elbow
optimize_pcs = FALSE: when TRUE, auto-selects n_pcs by fitting null models for k = 0..optimize_pcs_max and minimising the criterion set by optimize_method. More principled than the elbow heuristic for GWAS.
optimize_pcs_max = 10L: maximum PCs evaluated when optimize_pcs = TRUE
optimize_method = c("bic_lambda", "bic", "lambda"): criterion for PC selection (only used when optimize_pcs = TRUE):
- "bic" — minimise BIC of the null REML model
- "lambda" — minimise |λ_GC − 1| (genomic control calibration)
- "bic_lambda" (default, recommended) — hybrid: |λ−1| + 0.01·scaled_BIC. Minimises inflation/deflation while BIC breaks ties toward fewer PCs
sig_metric: which p-value drives the significant flag:
- "p_wald" — raw Wald p-value (use with a pre-corrected threshold)
- "p_fdr" — Benjamini-Hochberg FDR (recommended for discovery)
- "p_simplem" — simpleM Bonferroni-style: min(p × Meff, 1)
- "p_simplem_sidak" — simpleM Šidák-style: 1−(1−p)^Meff (recommended for correlated haplotype predictors)
meff_scope: scope for estimating the effective number of independent tests (Meff):
- "chromosome" (recommended) — separate Meff per chromosome
- "global" — one genome-wide Meff
- "block" — one Meff per LD block (omnibus tests get Meff = 1)
meff_percent_cut: variance threshold for simpleM eigendecomposition (default 0.995)
plot = TRUE: saves PDF plots (not PNG). Three files per run: manhattan_<trait>.pdf, qq_<trait>.pdf, pca_grm.pdf (GRM PCA coloured by phenotype), grm_scree.pdf (eigenvalue scree with selected PC in red)

All four p-value columns (p_wald, p_fdr, p_simplem, p_simplem_sidak) are always present in every output regardless of sig_metric.

estimate_diplotype_effects() also now supports the full correction set with a new sig_metric parameter. All four p-value columns (p_omnibus_adj, p_omnibus_fdr, p_omnibus_simplem, p_omnibus_simplem_sidak) are always present in $omnibus_tests regardless of the chosen sig_metric.

# FDR-based discovery (EMMAX, default)
assoc_fdr <- test_block_haplotypes(
  haps, blues = blues_vec, blocks = blocks,
  sig_metric = "p_fdr", verbose = FALSE
)

# simpleM Šidák with Q+K correction, chromosome-wise Meff
assoc_sm <- test_block_haplotypes(
  haps, blues = blues_vec, blocks = blocks,
  n_pcs            = 3L,
  sig_metric       = "p_simplem_sidak",
  meff_scope       = "chromosome",
  meff_percent_cut = 0.995,
  verbose          = FALSE
)
assoc_sm$meff$trait$allele$chromosome   # per-chromosome effective test counts
head(assoc_sm$block_tests[order(assoc_sm$block_tests$p_omnibus), ])

estimate_diplotype_effects() decomposes variation at each block into additive (a) and dominance (d) components. Dominance ratio d/a classifies gene action: 0 = additive, ±1 = complete dominance, |d/a| > 1 = overdominance.

10.16. Breeding decision tools

score_favorable_haplotypes() produces a genome-wide stacking index for each individual (sum of allele_effect × dosage across blocks, normalised to [0, 1]).

summarize_parent_haplotypes() produces a long-format allele inventory for candidate parents: which alleles they carry, at what dosage, and with what population frequency. is_rare = TRUE flags alleles with frequency < 10%.

10.17. Cross-population effect concordance

compare_block_effects() takes two test_block_haplotypes() result objects from independent populations and computes per-block statistics quantifying how consistently haplotype allele effects replicate across panels.

Key output columns (one row per block per trait):

Column	What it answers
`n_shared_alleles`	Alleles tested in both populations
`effect_correlation`	Pearson r of per-allele effects across populations (NA when < 3 shared alleles)
`direction_agreement`	Fraction of shared alleles with the same effect sign
`directionally_concordant`	`direction_agreement >= direction_threshold` (default 0.75)
`meta_effect` / `meta_SE` / `meta_p`	IVW meta-analytic effect (inverse-variance weighted, same as two-sample MR)
`Q_stat` / `Q_p`	Cochran Q: significant Q means effect sizes differ between populations
`I2`	I² inconsistency (0–100%); > 50% = substantial heterogeneity
`replicated`	`TRUE` when directionally concordant AND Q_p > 0.05 AND enough shared alleles
`boundary_overlap_ratio`	Automatically computed output (not a user-set value): bp(intersection) / bp(union) of the two populations’ block boundaries. Requires `blocks_pop1` / `blocks_pop2` to be supplied; `NA` otherwise
`boundary_warning`	`TRUE` when `boundary_overlap_ratio < boundary_overlap_warn` (the input parameter, default `0.80`)
`match_type`	How the block was matched: `"exact"` (same `block_id` string), `"position"` (matched by genomic overlap), `"pop1_only"` (no Pop2 block overlaps at `overlap_min`), or `NA` when no block tables were supplied

# Use Pop A's block boundaries for Pop B (maximises shared alleles)
haps_B_harm <- harmonize_haplotypes(
  extract_haplotypes(geno_B, snp_info, blocks_A),   # same block coords
  reference = haps_A
)
assoc_A <- test_block_haplotypes(haps_A, blues = blues_A, blocks = blocks_A,
                                  sig_metric = "p_simplem_sidak")
assoc_B <- test_block_haplotypes(haps_B_harm, blues = blues_B, blocks = blocks_A,
                                  sig_metric = "p_simplem_sidak")

# block_match = "id" (default): match by block_id string — fast, backward-compatible
# block_match = "position":      match by genomic interval overlap — handles different
#                                 LD block boundaries between populations
conc <- compare_block_effects(
  assoc_A, assoc_B,
  pop1_name             = "PopA",
  pop2_name             = "PopB",
  blocks_pop1           = blocks_A,   # required for boundary_overlap_ratio + position matching
  blocks_pop2           = blocks_B,   # supply Pop B's OWN block table (may differ from A)
  block_match           = "position", # recommended when blocks differ between populations
  overlap_min           = 0.50,       # min IoU for two blocks to be considered the same region
  direction_threshold   = 0.75,
  boundary_overlap_warn = 0.80
)
# $concordance$match_type: "exact" | "position" | "pop1_only"
# "position" rows: same QTL region, different boundary definitions
conc$concordance[conc$concordance$replicated, ]   # replicated blocks
head(conc$shared_alleles)                          # per-allele IVW detail
print(conc)

10.18. Cross-population effect concordance (external GWAS)

When GWAS was run externally (GAPIT, TASSEL, FarmCPU, PLINK, or any other tool), compare_gwas_effects() compares block-level lead-SNP effects between two populations. It accepts either raw GWAS data frames or the pre-mapped output of define_qtl_regions().

Key differences from compare_block_effects():

	`compare_block_effects()`	`compare_gwas_effects()`
Input	`test_block_haplotypes()` results	GWAS tables or `define_qtl_regions()` output
Unit per block	Multiple haplotype alleles	One lead SNP
`effect_correlation`	Pearson r across alleles	Always NA
`direction_agreement`	Fraction of alleles with same sign	0 or 1
`Q_stat` / `I2`	Cochran Q, df = n_alleles − 1	Always NA (df = 0)
`replicated`	dir_concordant AND Q_p > 0.05	dir_concordant AND meta_p ≤ 0.05

SE derivation. Many GWAS tools omit the standard error. When SE is absent, compare_gwas_effects() derives it from the z-score: SE = |BETA| / |z|, z = Φ⁻¹(P/2). The se_derived_pop1 / se_derived_pop2 output columns flag which population required this derivation.

# Path 1: pre-mapped (recommended) — most auditable
qtl_A <- define_qtl_regions(gwas_A, blocks, snp_info, p_threshold = 5e-8)
qtl_B <- define_qtl_regions(gwas_B, blocks, snp_info, p_threshold = 5e-8)

conc <- compare_gwas_effects(
  qtl_pop1      = qtl_A,
  qtl_pop2      = qtl_B,
  blocks_pop1   = blocks_A,       # Pop A's block table
  blocks_pop2   = blocks_B,       # Pop B's block table (may differ from A)
  block_match   = "position",     # match by genomic overlap — handles different boundaries
  overlap_min   = 0.50,           # min IoU for a valid match
  pop1_name     = "PopA",
  pop2_name     = "PopB"
)

# Path 2: raw GWAS + blocks (convenience — calls define_qtl_regions internally)
conc2 <- compare_gwas_effects(
  gwas_pop1     = gwas_A,
  gwas_pop2     = gwas_B,
  blocks_pop1   = blocks_A,
  blocks_pop2   = blocks_B,
  snp_info_pop1 = snp_info,
  # snp_info_pop2 = NULL: reuses snp_info_pop1 when marker panels are shared
  pop1_name     = "PopA",
  pop2_name     = "PopB",
  block_match   = "position",  # match blocks by genomic interval overlap
  overlap_min   = 0.50,        # blocks with IoU < 0.5 become "pop1_only"
  p_threshold   = 5e-8,
  beta_col      = "BETA",      # column name — change to match your GWAS tool
  se_col        = "SE",        # NULL or absent: derived from z-score automatically
  p_col         = "P"
)

# Output uses the same LDxBlocks_effect_concordance class
# GWAS-specific extra columns in $concordance:
#   lead_snp_pop1, lead_snp_pop2  — which SNP tagged the block in each pop
#   lead_p_pop1, lead_p_pop2      — lead SNP p-values
#   se_derived_pop1/pop2          — TRUE when SE was derived from z-score
#   both_pleiotropic              — TRUE when block is pleiotropic in both pops
conc$concordance[conc$concordance$replicated, ]
print(conc)

Shared marker panel. When both populations were genotyped on the same array or sequenced with the same reference, set snp_info_pop2 = NULL (default) to reuse snp_info_pop1. When populations have different marker sets, supply snp_info_pop2 explicitly.

10.19. Within-block and between-block epistasis detection

LDxBlocks provides three functions for epistasis detection that extend the haplotype association framework. All operate on GRM-corrected REML residuals from the same null model as test_block_haplotypes(), ensuring population-structure-corrected tests throughout.

scan_block_epistasis() tests all C(p,2) SNP pairs within each significant block for pairwise interaction on GRM-corrected REML residuals. The model for each pair is:

y = μ + aᵢxᵢ + aⱼxⱼ + aaᵢⱼ(xᵢ × xⱼ) + ε

Restricting to significant blocks avoids the genome-wide explosion: for 15 significant blocks with ~200 SNPs each, the total number of tests is ~300,000 rather than ~4.4 billion. Multiple testing is corrected by Bonferroni and simpleM Sidak within each block, where Meff is estimated from the eigenspectrum of the pairwise interaction column matrix.

scan_block_by_block_epistasis() is a trans-haplotype epistasis scan that tests each significant haplotype allele against every allele at all other blocks. This is conceptually analogous to a trans-eQTL scan but for phenotypic haplotype interactions — it detects genetic background dependence where a resistance haplotype at one locus only functions in the presence of a specific background at another locus. For 25 significant alleles × 17,943 total alleles, the scan involves ~450,000 tests corrected by Bonferroni.

fine_map_epistasis_block() fine-maps a single block by identifying the specific interacting SNP pairs. For blocks with p ≤ 200 SNPs it runs an exhaustive pairwise scan. For larger blocks it uses LASSO with pairwise interaction terms (glmnet::cv.glmnet(), lambda.1se), which avoids the multiple-testing burden while identifying the most influential pairs.

# Within-block epistasis scan (significant blocks only)
epi_within <- scan_block_epistasis(
  assoc              = assoc,          # test_block_haplotypes() result
  geno_matrix        = res$geno_matrix,
  snp_info           = snp_info,
  blocks             = blocks,
  blues              = blues_list,
  haplotypes         = haps,
  trait              = "BL",
  sig_blocks         = NULL,           # NULL = use significant_omnibus blocks
  max_snps_per_block = 300L,
  sig_metric         = "p_simplem_sidak",
  sig_threshold      = 0.05
)
print(epi_within)
epi_within$results[epi_within$results$significant, ]
epi_within$scan_summary

# Between-block trans-haplotype epistasis scan
epi_between <- scan_block_by_block_epistasis(
  assoc         = assoc,
  haplotypes    = haps,
  blues         = blues_list,
  blocks        = blocks,
  trait         = "BL",
  sig_alleles   = NULL,    # NULL = significant alleles from assoc
  sig_threshold = 0.05
)
print(epi_between)
epi_between$results[epi_between$results$significant, ]

# Fine-map a single block (auto-dispatches: pairwise or LASSO)
fine <- fine_map_epistasis_block(
  block_id   = "block_12_1054210_1086071",
  geno_matrix = res$geno_matrix,
  snp_info    = snp_info,
  blocks      = blocks,
  y_resid     = my_reml_residuals,   # from .fit_null_reml() or null model
  method      = "auto",              # pairwise <= 200 SNPs, lasso otherwise
  sig_threshold = 0.05
)
head(fine)

11. Parameter auto-tuning

tune_LD_params() selects CLQcut (and optionally other parameters) minimising in priority order: unassigned GWAS markers, forced assignments, number of blocks, deviation from target median block size.

result <- tune_LD_params(
  geno_matrix    = my_geno,
  snp_info       = my_snp_info,
  gwas_df        = my_gwas,
  prefer_perfect = TRUE,
  target_bp_band = c(5e4, 5e5),
  parallel       = FALSE,
  seed           = 42L
)
result$best_params
result$gwas_assigned

12. Scale strategies and backends

12.1. The LDxBlocks_backend interface

be    <- read_geno("mydata.bed")       # opens PLINK BED
chunk <- read_chunk(be, col_idx)       # n_samples × length(col_idx) matrix
close_backend(be)                      # release file handle

12.2. Memory requirements by configuration

Configuration	Peak RAM
Plain matrix, r², all in RAM	~40 GB (500 ind, 10M SNPs)
GDS or BED backend, r², `subSegmSize = 1500`	~300 MB
Any backend, rV², `method = "rV2"`	~4 GB
bigmemory, r², `subSegmSize = 500`	~0.8 MB/window

12.3. Recommended configurations by dataset size

Markers	Individuals	Recommended configuration
< 100 k	any	`method = "r2"`, `format = "matrix"`
100 k – 500 k	< 5,000	`method = "rV2"` for structured populations
100 k – 2 M	any	`method = "r2"`, `format = "vcf"` or `"bed"`
2 M – 10 M	any	`CLQmode = "Leiden"`, `max_bp_distance = 500000L`, `format = "gds"`
> 10 M	any	`CLQmode = "Leiden"`, `max_bp_distance = 500000L`, `read_geno_bigmemory()`

13. Full pipeline walkthrough

Step	Action	Key parameter(s)
1	Accept genotype backend or wrap plain matrix	`read_geno()`, `format =`
2	Extract per-chromosome genotype slice	`read_chunk(backend, chr_idx)`
3	MAF filter + monomorphic removal in C++	`MAFcut`, `maf_filter_cpp()`
4	Centre (r²) or centre + whiten (rV²)	`method`, `kin_method`, `prepare_geno()`
4b	Optional: Beagle phasing (`phase = TRUE`). Calls `phase_with_beagle()` on the original VCF after imputation/filtering. Reads the phased VCF via `read_phased_vcf()`, aligns samples and SNPs via CHR+POS+REF+ALT+SNP composite key, and caches hap1/hap2/dosage as bigmemory backends (`ldxblocks_bm_phased_hap1`, `_hap2`, `_dos`) in `bigmemory_path`. On restart the VCF is not re-read — backends reattach from disk via fingerprint match (VCF path + mtime + n_snps). Haplotype extraction then uses gametic strings (`g1\\|g2`).	`phase`, `beagle_jar`, `beagle_threads`, `beagle_java_mem_gb`, `beagle_seed`, `beagle_ref_panel`, `beagle_map_file`
5	C++ boundary scan — find weak-LD cut points	`leng`, `subSegmSize`, `boundary_scan_cpp()`
6	Divide chromosome into sub-segments	`subSegmSize`
7	Per sub-segment: compute r² or rV² matrix in C++	`CLQcut`, `compute_r2_cpp()`
8	Build binary adjacency matrix in C++	`CLQcut`, `build_adj_matrix_cpp()`
9	Find communities/cliques	`CLQmode`, `checkLargest`, `max_bp_distance`
10	Greedy clique assignment → bin vector	`split`, `clstgap`, `CLQD()`
11	MWIS block construction	internal
12	Re-merge across forced cut-points	automatic
13	Merge overlapping blocks	automatic
14	Map indices to bp position and rsID	`SNPinfo`
15	Re-index over full SNP set including monomorphics	automatic
16	Optionally append rare SNPs	`appendrare`

Detailed example with all parameters:

blocks <- run_Big_LD_all_chr(
  # ── Genotype input ──────────────────────────────────────────────────────────
  geno_matrix  = be,

  # ── LD metric ───────────────────────────────────────────────────────────────
  method       = "r2",          # "r2" (default) or "rV2"
  kin_method   = "chol",        # "chol" (default) or "eigen" for rV2

  # ── Clique detection ────────────────────────────────────────────────────────
  CLQcut       = 0.70,
  CLQmode      = "Density",     # "Density"/"Maximal"/"Louvain"/"Leiden"
  clstgap      = 40000L,
  split        = FALSE,

  # ── Subsegmentation ─────────────────────────────────────────────────────────
  leng         = 200L,
  subSegmSize  = 1500L,

  # ── Filtering ───────────────────────────────────────────────────────────────
  MAFcut       = 0.05,
  appendrare   = FALSE,

  # ── Large window heuristics ─────────────────────────────────────────────────
  checkLargest = FALSE,

  # ── Parallelism ─────────────────────────────────────────────────────────────
  n_threads    = 8L,
  digits       = -1L,

  # ── Chromosome minimum ──────────────────────────────────────────────────────
  min_snps_chr = 10L,

  # ── Reproducibility ─────────────────────────────────────────────────────────
  seed         = 42L,
  verbose      = TRUE
)

Full run_ldx_pipeline() example with phasing:

result <- run_ldx_pipeline(
  # ── Genotype input ──────────────────────────────────────────────────────────
  geno_source    = "mydata.vcf.gz",   # VCF required for phase = TRUE
  out_dir        = "ldx_results",

  # ── Output paths ────────────────────────────────────────────────────────────
  out_blocks     = "ldx_results/blocks.csv",
  out_diversity  = "ldx_results/diversity.csv",
  out_hap_matrix = "ldx_results/hap_matrix.csv",
  hap_format     = "numeric",

  # ── Phasing (optional — requires VCF input and beagle.jar in out_dir) ───────
  phase              = TRUE,
  beagle_jar         = "ldx_results/beagle.jar",
  beagle_threads     = 8L,
  beagle_java_mem_gb = 16L,          # -Xmx16g; NULL = JVM default
  beagle_seed        = 42L,          # integer seed for reproducibility
  beagle_ref_panel   = NULL,         # phased reference VCF (optional)
  beagle_map_file    = NULL,         # genetic map for improved accuracy

  # ── Filtering & imputation ───────────────────────────────────────────────────
  maf_cut        = 0.05,
  impute         = "mean_rounded",

  # ── LD block detection ───────────────────────────────────────────────────────
  CLQcut         = 0.70,
  method         = "r2",
  CLQmode        = "Leiden",
  subSegmSize    = 500L,
  leng           = 50L,
  max_bp_distance = 500000L,
  n_threads      = 8L,

  # ── Haplotype extraction ─────────────────────────────────────────────────────
  min_snps_block = 3L,
  top_n          = 5L,

  # ── Bigmemory caching (restart-safe) ─────────────────────────────────────────
  use_bigmemory  = TRUE,
  bigmemory_path = "ldx_results/bm_cache",
  bigmemory_type = "char",

  # ── General ──────────────────────────────────────────────────────────────────
  verbose = TRUE
)

14. Function reference

14.1. Main pipeline

Function	Description
`run_ldx_pipeline()`	Recommended end-to-end pipeline. One call from genotype source to haplotype matrix, diversity table, and block CSV. Accepts `phase = TRUE` for Beagle phasing (place `beagle.jar` in `out_dir`). Phased data cached as bigmemory backends for fast restart.
`run_Big_LD_all_chr()`	Chromosome-wise LD block detection. Accepts both plain matrices and `LDxBlocks_backend` objects.
`Big_LD()`	Core per-chromosome segmentation. Called internally by `run_Big_LD_all_chr()`. Not exported.
`tune_LD_params()`	Grid-search auto-tuner minimising unassigned GWAS marker placements.

14.2. I/O

Function	Description
`run_ldx_pipeline()`	See Main pipeline. Also handles all I/O for the end-to-end run.
`phase_with_beagle()`	Statistical phasing via Beagle 5.x. Called internally by `run_ldx_pipeline(phase = TRUE)`. Exposes all Beagle parameters: `java_mem_gb`, `map_file`, `chrom`, `seed`, `burnin`, `iterations`, `window`, `overlap`. Log written to `out_prefix.log`. Requires `beagle.jar`.
`read_phased_vcf()`	Read a phased VCF (pipe-separated `0\\|1` GT fields) into `list(hap1, hap2, dosage, snp_info, sample_ids, phased=TRUE)`. Called internally after Beagle phasing.
`read_geno()`	Auto-dispatch genotype reader. Returns an `LDxBlocks_backend` object.
`read_chunk()`	Extract a genotype slice (n_samples × width) from any backend type.
`close_backend()`	Release file handles. No-op for in-memory backends.
`read_geno_bigmemory()`	Create a file-backed memory-mapped store from any source.

14.3. LD computation

Function	Description
`compute_r2()`	Standard r² matrix via C++ Armadillo + optional OpenMP.
`compute_rV2()`	rV² on a pre-whitened matrix.
`prepare_geno()`	Centre (r²) or centre + whiten (rV²).
`get_V_inv_sqrt()`	Whitening factor A such that AVA’ = I.

14.4. C++ kernels (direct access)

Function	Description
`compute_r2_cpp()`	Full r² matrix. OpenMP outer loop.
`maf_filter_cpp()`	MAF + monomorphic filter in one O(np) C++ pass.
`build_adj_matrix_cpp()`	LD threshold → 0/1 integer adjacency matrix.
`col_r2_cpp()`	r² of one query column against all others.
`compute_r2_sparse_cpp()`	Sparse r² for pairs within a bp distance window.
`boundary_scan_cpp()`	Cross-boundary LD scan. Returns 0/1 vector of valid cut positions.
`resolve_overlap_cpp()`	BLAS DGEMM scoring of overlapping block boundaries.

14.5. Haplotype analysis

Function	Description
`extract_haplotypes()`	Phase-free diploid allele strings per block × individual. For phased input (from `run_ldx_pipeline(phase = TRUE)`), produces gametic strings `g1\\|g2`.
`compute_haplotype_diversity()`	Per-block richness, He, n_eff_alleles, Shannon, sweep_flag.
`build_haplotype_feature_matrix()`	Haplotype dosage matrix for genomic prediction. Phased: 0/1/2. Unphased: 0/1.
`compute_haplotype_grm()`	VanRaden GRM from haplotype feature matrix.
`decode_haplotype_strings()`	Decode dosage strings to nucleotide sequences.
`write_haplotype_numeric()`	Write haplotype dosage matrix to file.
`write_haplotype_character()`	Write nucleotide character matrix to file.
`write_haplotype_diversity()`	Write diversity table to CSV.
`define_qtl_regions()`	Map GWAS hits to LD blocks; detect pleiotropic blocks.
`backsolve_snp_effects()`	Derive per-SNP effects from GEBV (Tong et al. 2025).
`compute_local_gebv()`	Local haplotype GEBV per block per individual.
`prepare_gblup_inputs()`	Align phenotype data and haplotype GRM for external GBLUP solvers.
`run_haplotype_prediction()`	Single or multi-trait Tong et al. (2025) haplotype stacking pipeline.
`integrate_gwas_haplotypes()`	Combine GWAS, variance, and diversity evidence per block.
`rank_haplotype_blocks()`	Unified block ranking across 3 use cases.

14.6. Analysis extensions

Function	Description
`cv_haplotype_prediction()`	K-fold cross-validation for the haplotype GBLUP model.
`compare_haplotype_populations()`	Per-block Weir-Cockerham FST between two sample groups.
`plot_haplotype_network()`	Minimum-spanning network of haplotype alleles for one block.
`run_haplotype_stability()`	Finlay-Wilkinson stability regression across environments.
`export_candidate_regions()`	Convert QTL regions to BED, CSV, or biomaRt format.
`decompose_block_effects()`	Per-haplotype-allele effect table from per-SNP additive effects.
`scan_diversity_windows()`	Sliding-window diversity scan across the genome.

14.7. True haplotype inference and harmonisation

Function	Description
`infer_block_haplotypes()`	Structured diplotype table from raw haplotype strings. Phased input: `phase_ambiguous = FALSE`.
`collapse_haplotypes()`	Merge rare alleles via `"rare_to_other"`, `"nearest"`, or `"tree_based"`.
`harmonize_haplotypes()`	Cross-panel allele label harmonisation using reference dictionary.

14.8. Haplotype association testing

Function	Description
`test_block_haplotypes()`	Block-level association tests via Q+K mixed model with simpleM multiple-testing correction (Gao et al. 2008, 2010, 2011). Parameters `sig_metric`, `meff_scope`, `meff_percent_cut`, `meff_max_cols` control correction method and Meff estimation scope. All four p-value flavours always present in output. Returns `LDxBlocks_haplotype_assoc`.
`estimate_diplotype_effects()`	Additive (a) and dominance (d) effects per block. Returns `LDxBlocks_diplotype`.
`compare_block_effects()`	Cross-population haplotype effect concordance. Computes IVW meta-analytic effects, Cochran Q heterogeneity, I² inconsistency, direction agreement, and replication flag per block. New `block_match` parameter: `"id"` (default, backward-compatible) matches by `block_id` string; `"position"` matches by genomic interval overlap (IoU ≥ `overlap_min`, default 0.50) — handles different LD block boundaries between populations. `boundary_overlap_ratio` is automatically computed (not user-set); `boundary_overlap_warn` (default 0.80) controls `boundary_warning`. Output `match_type` column: `"exact"` / `"position"` / `"pop1_only"`. Returns `LDxBlocks_effect_concordance`.
`compare_gwas_effects()`	Cross-population concordance from external GWAS results (GAPIT, TASSEL, FarmCPU, PLINK, etc.). Accepts raw GWAS data frames or pre-mapped `define_qtl_regions()` output. Derives SE from z-score when absent. `block_match = "position"` handles different block boundaries between populations (same as `compare_block_effects()`). One lead SNP per block; `effect_correlation` and Cochran Q are NA. Same output class as `compare_block_effects()`.
`scan_block_epistasis()`	Within-block pairwise SNP epistasis scan. Tests all C(p,2) SNP pairs within significant blocks on GRM-corrected REML residuals. Model: y = mu + a_ix_i + a_jx_j + aa_ij(x_ix_j) + e. Corrected via Bonferroni and simpleM Sidak within each block (Meff from interaction column eigenspectrum). `sig_metric` controls which correction drives the `significant` flag. Returns `LDxBlocks_epistasis`.
`scan_block_by_block_epistasis()`	Trans-haplotype between-block epistasis scan. Tests significant haplotype alleles against all other block alleles: O(n_sig x n_total_alleles) tests corrected by Bonferroni. Identifies genetic background dependence that single-block analyses cannot detect. Returns `LDxBlocks_block_epistasis`.
`fine_map_epistasis_block()`	Single-block epistasis fine-mapping. Dispatches to exhaustive pairwise scan (p <= 200 SNPs) or LASSO interaction search via `glmnet` (p > 200 SNPs). Requires pre-computed REML residuals (`y_resid`).

14.9. Breeding decision tools

Function	Description
`score_favorable_haplotypes()`	Genome-wide haplotype stacking index per individual.
`summarize_parent_haplotypes()`	Long-format allele inventory for candidate parents.

14.10. Utilities

Function	Description
`summarise_blocks()`	Per-chromosome and genome-wide block size summary statistics.
`plot_ld_blocks()`	ggplot2 block diagram coloured by block size or chromosome.

15. Output objects

15.1. `run_Big_LD_all_chr()` — block table

Column	Type	Description
`start`	integer	Index of the first SNP in the block (full SNP set).
`end`	integer	Index of the last SNP.
`start.rsID`	character	SNP identifier at the block start.
`end.rsID`	character	SNP identifier at the block end.
`start.bp`	numeric	Base-pair position of the block start.
`end.bp`	numeric	Base-pair position of the block end.
`CHR`	character	Chromosome label (normalised, no `chr` prefix).
`length_bp`	integer	`end.bp - start.bp + 1`.

15.2. `run_ldx_pipeline()` — named list

Element	Type	Description
`blocks`	data.frame	LD block table (same columns as Section 15.1).
`diversity`	data.frame	Per-block diversity metrics (same columns as Section 15.4).
`hap_matrix`	matrix	Haplotype feature matrix (individuals × haplotype columns).
`hap_matrix_info`	data.frame	Column metadata for `hap_matrix`.
`haplotypes`	named list	Raw haplotype strings per block (same structure as Section 15.3).
`geno_matrix`	matrix	Cleaned, imputed genotype matrix (individuals × SNPs).
`snp_info_filtered`	data.frame	SNP metadata after MAF filtering.
`phased_vcf`	character	Path to Beagle-phased VCF.gz in `out_dir`. `NULL` when `phase = FALSE`.
`phased_backend_desc`	character	Path to `ldxblocks_bm_phased_dos.desc` for reattachment. `NULL` when `use_bigmemory = FALSE` or `phase = FALSE`.
`phase_method`	character	`"beagle"` when Beagle was run; `"unphased"` otherwise.
`n_blocks`	integer	Total number of LD blocks detected.
`n_hap_columns`	integer	Number of columns in `hap_matrix`.

Bigmemory phased backends (written to bigmemory_path when use_bigmemory = TRUE and phase = TRUE):

File	Content	Type
`ldxblocks_bm_phased_hap1.bin/.desc`	Gamete 1 alleles (individuals × SNPs)	`char`, 0/1
`ldxblocks_bm_phased_hap2.bin/.desc`	Gamete 2 alleles (individuals × SNPs)	`char`, 0/1
`ldxblocks_bm_phased_dos.bin/.desc`	Dosage = hap1 + hap2 (individuals × SNPs)	`char`, 0/1/2
`ldxblocks_bm_phased_snpinfo.rds`	Aligned SNP metadata	—
`ldxblocks_bm_phased_sampleids.rds`	Sample IDs in aligned order	—
`ldxblocks_bm_phased_params.rds`	Fingerprint for cache validation	—

On subsequent runs with the same bigmemory_path and unchanged phased VCF, backends are reattached from disk without re-reading the VCF.

15.3. `tune_LD_params()` — named list

Element	Type	Description
`best_params`	named list	Selected parameter values.
`score_table`	data.frame	All grid combinations and scores.
`perfect_table`	data.frame or NULL	Combinations with n_unassigned = 0 and n_forced = 0.
`final_blocks`	data.table	Block table from `best_params`, all chromosomes.
`gwas_assigned`	data.frame	Input GWAS data with `LD_block` column added. Entries ending in `*` denote forced (nearest-block) assignments.

15.4. `extract_haplotypes()` — named list

Element	Type	Description
`block_<start>_<end>`	character vector	One haplotype string per individual.
`attr(., "block_info")`	data.frame	block_id, CHR, start_bp, end_bp, n_snps.

15.5. `compute_haplotype_diversity()` — data.frame

Column	Description
`block_id`	Block name matching `names(haplotypes)`.
`CHR`	Chromosome (normalised).
`start_bp`, `end_bp`	Block coordinates in base pairs.
`n_snps`	Number of SNPs in the block.
`n_ind`	Individuals with non-missing haplotypes.
`n_haplotypes`	Richness: number of unique haplotype strings.
`He`	Expected heterozygosity (Nei 1973), sample-size corrected.
`Shannon`	Shannon entropy in bits.
`n_eff_alleles`	Effective number of alleles = 1/Σpᵢ².
`freq_dominant`	Frequency of the most common haplotype.
`sweep_flag`	TRUE when freq_dominant ≥ 0.90.
`phased`	Logical: was phased input used?

15.6. `read_geno()` — LDxBlocks_backend

Element	Type	Description
`type`	character	`"numeric"`, `"hapmap"`, `"vcf"`, `"gds"`, `"bed"`, or `"matrix"`.
`n_samples`	integer	Number of individuals.
`n_snps`	integer	Number of SNPs.
`sample_ids`	character	Individual identifiers.
`snp_info`	data.frame	SNP metadata: SNP, CHR, POS, REF, ALT.

15.7. `test_block_haplotypes()` — LDxBlocks_haplotype_assoc

$allele_tests — one row per allele per block per trait:

Column	Type	Description
`block_id`, `CHR`, `start_bp`, `end_bp`, `trait`	—	Identifiers
`allele`	character	Haplotype allele string identifier
`allele_freq_tested`	numeric [0,1]	Allele frequency among tested individuals
`effect`	numeric	Additive effect on de-regressed scale
`SE`	numeric	Standard error
`t_stat`	numeric	t-statistic
`p_wald`	numeric	Two-sided Wald p-value (raw)
`p_fdr`	numeric	Benjamini-Hochberg FDR-adjusted p-value
`Meff`	numeric	Effective number of independent tests (simpleM, scope = `meff_scope`)
`alpha_simplem`	numeric	simpleM Bonferroni-style significance threshold: α / Meff
`alpha_simplem_sidak`	numeric	simpleM Šidák-style threshold: 1 − (1−α)^(1/Meff)
`p_simplem`	numeric	simpleM Bonferroni-style adjusted p-value: min(p × Meff, 1)
`p_simplem_sidak`	numeric	simpleM Šidák-style adjusted p-value: 1 − (1−p)^Meff
`significant`	logical	TRUE when the p-value chosen by `sig_metric` ≤ `sig_threshold`

All p-value columns are always present regardless of sig_metric. The significant flag is the only column that changes based on sig_metric.

$block_tests — one row per block per trait:

Column	Type	Description
`n_alleles_tested`	integer	Alleles above `min_freq` tested jointly
`F_stat`	numeric	Omnibus F-statistic
`df_LRT`	integer	Numerator df = n_alleles_tested
`p_omnibus`	numeric	Raw omnibus p-value
`p_omnibus_fdr`	numeric	BH FDR-adjusted omnibus p-value
`p_omnibus_adj`	numeric	Bonferroni-style per-trait adjustment (backward-compat)
`var_explained`	numeric [0,1]	Proportion of de-regressed variance explained
`Meff`	numeric	Block-level effective test count (from block-summary PC1 matrix)
`alpha_simplem`	numeric	Block-level simpleM Bonferroni threshold
`alpha_simplem_sidak`	numeric	Block-level simpleM Šidák threshold
`p_omnibus_simplem`	numeric	simpleM Bonferroni-style adjusted omnibus p-value
`p_omnibus_simplem_sidak`	numeric	simpleM Šidák-style adjusted omnibus p-value
`significant_omnibus`	logical	TRUE when the p-value chosen by `sig_metric` ≤ `sig_threshold`

Additional return list elements:

Element	Description
`sig_metric`	Which p-value was used for `significant` / `significant_omnibus`
`meff_scope`	Scope used for Meff estimation (`"chromosome"`, `"global"`, or `"block"`)
`meff_percent_cut`	Variance threshold used for simpleM eigendecomposition
`meff`	Named list of Meff summaries per trait: `$allele$global`, `$allele$chromosome`, `$allele$block`, `$block$global`, `$block$chromosome`
`pc_model_selection`	`data.frame` with one row per k tested when `optimize_pcs = TRUE`: `n_pcs`, `BIC`, `lambda_gc`, `score`, `selected`. `NULL` when `optimize_pcs = FALSE`

Plots produced when plot = TRUE (all saved as PDF):

File	Description
`manhattan_<trait>.pdf`	Manhattan plot — one per trait
`qq_<trait>.pdf`	QQ plot — one per trait
`pca_grm.pdf`	PCA of GRM eigenvectors in PC1 × PC2, coloured by first trait phenotype
`grm_scree.pdf`	Scree plot of GRM eigenvalues (up to PC30); red bar and vertical line at selected k

15.8. `estimate_diplotype_effects()` — LDxBlocks_diplotype

$omnibus_tests — one row per LD block per trait (updated columns):

Column	Type	Description
`block_id`, `trait`	—	Identifiers
`n_diplotypes`	integer	Diplotype classes with ≥ `min_n_diplotype` individuals
`F_stat`	numeric	F-statistic (one-way ANOVA on GRM-corrected residuals)
`df1`, `df2`	integer	Numerator and denominator degrees of freedom
`p_omnibus`	numeric	Raw p-value
`p_omnibus_adj`	numeric	Plain Bonferroni × n_blocks_per_trait (backward-compat)
`p_omnibus_fdr`	numeric	BH-FDR per trait
`p_omnibus_simplem`	numeric	simpleM Bonferroni-style: min(p × Meff, 1)
`p_omnibus_simplem_sidak`	numeric	simpleM Šidák-style: 1 − (1−p)^Meff (recommended)
`Meff`	numeric	Effective number of blocks (block-summary PC1 eigenspectrum)
`significant`	logical	TRUE when the p-value chosen by `sig_metric` < `sig_threshold`

All five p-value columns are always present regardless of sig_metric.

$dominance_table — one row per allele pair per block per trait:

Column	Type	Description
`allele_A`, `allele_B`	character	The two alleles (A ≤ B alphabetically)
`mean_AA`, `mean_AB`, `mean_BB`	numeric	Diplotype class means
`a`	numeric	Additive effect: (mean_BB − mean_AA) / 2
`d`	numeric	Dominance deviation: mean_AB − midpoint
`d_over_a`	numeric or NA	Dominance ratio. 0 = additive; ±1 = complete dominance; \|d/a\| > 1 = overdominance
`overdominance`	logical	TRUE when \|d/a\| > 1

15.9. `score_favorable_haplotypes()` — data frame

Column	Type	Description
`id`	character	Individual identifier
`stacking_index`	numeric [0,1]	Genome-wide normalised score
`n_blocks_scored`	integer	Blocks with at least one matched allele effect
`mean_block_score`	numeric	Raw mean per-block score
`rank`	integer	Rank by stacking_index (1 = highest)
`score_<block_id>`	numeric	Per-block score columns

15.10. `summarize_parent_haplotypes()` — data frame

Column	Type	Description
`id`, `block_id`, `CHR`, `start_bp`, `end_bp`	—	Identifiers
`allele`	character	Haplotype allele string
`dosage`	integer {0,1,2}	Copies carried. Phased: 0/1/2. Unphased: 0/1.
`allele_freq`	numeric [0,1]	Population frequency in full panel
`allele_effect`	numeric or NA	Effect from allele_effects argument
`is_rare`	logical	TRUE when allele_freq < 0.10

15.11. `compare_block_effects()` — LDxBlocks_effect_concordance

$concordance — one row per block per trait:

Column	Type	Description
`block_id`, `CHR`, `start_bp`, `end_bp`, `trait`	—	Identifiers
`n_alleles_pop1`, `n_alleles_pop2`	integer	Alleles tested in each population before intersection
`n_shared_alleles`	integer	Alleles present (by string match) in both populations
`enough_shared`	logical	`n_shared_alleles >= min_shared_alleles`
`effect_correlation`	numeric	Pearson r of per-allele effects across populations (NA when n_shared < 3)
`direction_agreement`	numeric [0,1]	Fraction of shared alleles with the same effect sign
`directionally_concordant`	logical	`direction_agreement >= direction_threshold`
`meta_effect`	numeric	IVW meta-analytic effect (weighted mean of per-allele IVW estimates)
`meta_SE`	numeric	SE of IVW estimate
`meta_z`	numeric	meta-analytic z-score
`meta_p`	numeric	Two-sided p-value of meta-analytic effect
`Q_stat`	numeric	Cochran Q heterogeneity statistic
`Q_df`	integer	Degrees of freedom (n_shared_alleles − 1)
`Q_p`	numeric	p-value of Q under chi-squared; significant = heterogeneity between populations
`I2`	numeric [0,100]	I² inconsistency: max(0, (Q − df) / Q × 100). > 50% = substantial heterogeneity
`replicated`	logical	`enough_shared AND directionally_concordant AND Q_p > 0.05`
`boundary_overlap_ratio`	numeric [0,1]	Automatically computed output from `blocks_pop1` / `blocks_pop2`: bp(intersection) / bp(union). `NA` when block tables not supplied
`boundary_warning`	logical	`TRUE` when `boundary_overlap_ratio < boundary_overlap_warn` (the input parameter, default 0.80); flags blocks where differing LD structure may make haplotype strings non-comparable

$shared_alleles — one row per shared allele per block per trait:

Column	Type	Description
`allele`	character	Haplotype allele string
`effect_pop1`, `SE_pop1`, `p_wald_pop1`	numeric	Effect, SE, and p-value from population 1
`effect_pop2`, `SE_pop2`, `p_wald_pop2`	numeric	Effect, SE, and p-value from population 2
`direction_agree`	logical	Same sign in both populations?
`ivw_effect`	numeric	Per-allele IVW combined effect (weighted mean of pop1 and pop2 by 1/SE²)
`ivw_SE`	numeric	SE of IVW combined effect
`match_type`	character	How this block was matched: `"exact"` (same block_id), `"position"` (matched by genomic IoU ≥ `overlap_min`), `"pop1_only"` (no Pop2 block overlapped), or `NA` when no block tables supplied. Set when `block_match = "position"`.

15.12. `compare_gwas_effects()` — LDxBlocks_effect_concordance

Same output class as compare_block_effects() (Section 15.11). Additional columns in $concordance specific to external GWAS input:

Column	Type	Description
`lead_snp_pop1`, `lead_snp_pop2`	character	Lead SNP ID from each population (same SNP = same LD tag; different = different proxies of the same QTL region)
`lead_p_pop1`, `lead_p_pop2`	numeric	Lead SNP p-values
`se_derived_pop1`, `se_derived_pop2`	logical	`TRUE` when SE was derived from `BETA` and `P` via z-score rather than read directly
`both_pleiotropic`	logical	`TRUE` when the block is pleiotropic in both populations (requires `pleiotropic` column from `define_qtl_regions()`)

Columns that differ in meaning from compare_block_effects():

Column	In `compare_block_effects()`	In `compare_gwas_effects()`
`n_shared_alleles`	Haplotype alleles in both pops	Always 1 (one lead SNP per block)
`effect_correlation`	Pearson r across alleles	Always `NA` (needs ≥ 3 alleles)
`direction_agreement`	Fraction of alleles with same sign	0 or 1 only
`Q_stat`, `Q_p`, `I2`	Cochran Q heterogeneity test	Always `NA` (df = 0)
`replicated`	dir_concordant AND Q_p > 0.05	dir_concordant AND meta_p ≤ 0.05
`match_type`	`"exact"` / `"position"` / `"pop1_only"` / `NA`	Same semantics as in `compare_block_effects()`. `block_match = "position"` applies to both functions.

The $shared_alleles data frame contains one row per block per trait with lead_snp_pop1, lead_snp_pop2 instead of allele.

16. Memory and performance notes

16.1. C++ core

The eleven compiled functions in src/ld_core.cpp replace the most expensive R operations. Key speedups:

compute_r2_cpp(): ~40× over pure R for a 1,500 × 1,500 window with 500 individuals.
boundary_scan_cpp(): eliminates ~150,000 small R-level matrix operations for a 50,000-SNP chromosome.
maf_filter_cpp(): ~10× faster for panels > 100,000 SNPs.
resolve_overlap_cpp(): 15,700× per-SNP reduction on chr1 via BLAS DGEMM scoring with a lazy column cache.

16.2. Never-full-genome memory model

Numeric dosage CSV: two-pass chunked reading. Peak RAM = one 50,000-row chunk.
VCF and HapMap: auto-converted to GDS cache; all access streaming via read_chunk().
GDS and PLINK BED: read_chunk() called once per sub-segment per chromosome.
Chromosome loop: rm() and gc(FALSE) called after each chromosome.

16.3. OpenMP thread count

n_thr  <- parallel::detectCores(logical = FALSE)
blocks <- run_Big_LD_all_chr(be, n_threads = n_thr)

Efficient scaling up to approximately 8-16 threads for subSegmSize = 1500.

17. Documentation

Full documentation, function reference, and tutorials:

https://FAkohoue.github.io/LDxBlocks/

vignette("LDxBlocks-workflow", package = "LDxBlocks")
vignette("LDxBlocks-phasing",  package = "LDxBlocks")

18. Citation

LDxBlocks Development Team (2025).
LDxBlocks: Genome-Wide LD Block Detection, Haplotype Analysis, and Genomic
Prediction Features with Kinship-Adjusted Correlations.
R package version 0.3.1.
https://github.com/FAkohoue/LDxBlocks

Please also cite:

Kim S-A et al. (2018). Bioinformatics 34(4):588-596.
VanRaden PM (2008). Journal of Dairy Science 91(11):4414-4423.
Calus MPL et al. (2008). Genetics 178(1):553-561.
Nei M (1973). PNAS 70(12):3321-3323.
Blondel VD et al. (2008). J. Stat. Mech. P10008.
Traag VA et al. (2019). Scientific Reports 9:5233.

19. Contributing

https://github.com/FAkohoue/LDxBlocks/issues

Before opening a pull request: run devtools::check() with zero errors/warnings, add tests in tests/testthat/test-core.R, rebuild docs with devtools::document().

20. License

21. References

Gao X, Starmer J, Martin ER (2008). A multiple testing correction method for genetic association studies using correlated single nucleotide polymorphisms. Genetic Epidemiology 32:361-369. doi:10.1002/gepi.20310

Gao X, Becker LC, Becker DM, Starmer JD, Province MA (2010). Avoiding the high Bonferroni penalty in genome-wide association studies. Genetic Epidemiology 34:100-105. doi:10.1002/gepi.20430

Gao X (2011). Multiple testing corrections for imputed SNPs. Genetic Epidemiology 35:154-158. doi:10.1002/gepi.20563

Borenstein M, Hedges LV, Higgins JPT, Rothstein HR (2009). Introduction to Meta-Analysis. Wiley.

Higgins JPT, Thompson SG (2002). Quantifying heterogeneity in a meta-analysis. Statistics in Medicine 21(11):1539-1558. doi:10.1002/sim.1186

Kim S-A, Cho C-S, Kim S-R, Bull SB, Yoo Y-J (2018). A new haplotype block detection method for dense genome sequencing data based on interval graph modeling and dynamic programming. Bioinformatics 34(4):588-596.

VanRaden PM (2008). Efficient methods to compute genomic predictions. Journal of Dairy Science 91(11):4414-4423.

Mangin B et al. (2012). Novel measures of linkage disequilibrium that correct the bias due to population structure and relatedness. Heredity 108(3):285-291.

Calus MPL et al. (2008). Accuracy of genomic selection using different methods to define haplotypes. Genetics 178(1):553-561.

de Roos APW, Hayes BJ, Goddard ME (2009). Reliability of genomic predictions across multiple populations. Genetics 183(4):1545-1553.

Nei M (1973). Analysis of gene diversity in subdivided populations. PNAS 70(12):3321-3323.

Tong J et al. (2024). Stacking beneficial haplotypes from the Vavilov wheat collection. Theor. Appl. Genet. 137:274.

Tong J et al. (2025). Haplotype stacking to improve stability of stripe rust resistance in wheat. Theor. Appl. Genet. 138:267.

Blondel VD et al. (2008). Fast unfolding of communities in large networks. J. Stat. Mech. P10008.

Traag VA, Waltman L, van Eck NJ (2019). From Louvain to Leiden: guaranteeing well-connected communities. Scientific Reports 9:5233.

LDxBlocks — Genome-Wide LD Block Detection, Haplotype Analysis, and Genomic Prediction Features

Table of contents