All code chunks in this vignette have eval = FALSE because they require either optional packages (SNPRelate, BEDMatrix) or large data files not shipped with the package. Copy and adapt them to your own data.


1. Why scale matters: the original Big-LD limitation

The original Big_LD() function requires the complete genotype matrix to be held in RAM at once. For modern whole-genome sequencing (WGS) panels this is untenable:

Dataset                SNPs         Individuals   RAM required (naive)
Typical SNP chip       50,000       5,000         ~2 GB
Medium WGS panel       1,000,000    500           ~4 GB
Large WGS panel        3,000,000    204           ~5 GB
Rice mega-GWAS panel   3,000,000    5,000         ~120 GB
Bovine WGS             10,000,000   2,000         ~160 GB
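
The naive figures are simply SNPs × individuals × 8 bytes for a dense double-precision matrix; a quick check in R (naive_ram_gb() is a throwaway helper, not a package function):

naive_ram_gb <- function(n_snps, n_ind, bytes_per_cell = 8) {
  n_snps * n_ind * bytes_per_cell / 1e9   # dense in-memory matrix, in GB
}
naive_ram_gb(3e6, 5000)   # rice mega-GWAS panel: 120 GB
naive_ram_gb(1e7, 2000)   # bovine WGS: 160 GB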

2. How LDxBlocks solves it: never-full-genome memory model

Format         Memory strategy                                   Peak RAM for 3M × 204 panel
VCF / HapMap   Auto-converted to GDS cache; streams per window   ~60 MB per window
Numeric CSV    Two-pass chunked reader; 50,000-row chunks        ~50 MB per chunk
GDS            snpgdsGetGeno() per window                        ~60 MB per window
PLINK BED      BEDMatrix OS memory-mapping                       ~60 MB per window
bigmemory      filebacked.big.matrix; OS pages on demand         ~0.8 MB per window
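
Every format is opened through the same entry point; a minimal sketch, assuming read_geno() dispatches on the file extension as described in the table above (only the .gds and .bed calls appear verbatim later in this vignette):

be_vcf <- read_geno("mydata.vcf.gz")   # assumed: converted to a GDS cache on first use
be_csv <- read_geno("mydata.csv")      # assumed: two-pass chunked reader
be_gds <- read_geno("mydata.gds")      # SNPRelate-backed GDS file
be_bed <- read_geno("mydata.bed")      # memory-mapped via BEDMatrix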

Peak RAM formula:

RAM_peak ~ n_samples × subSegmSize × 8 bytes
         = 204  × 1500 × 8 ≈ 2.4 MB  (204-sample, 3M-SNP WGS panel)
         = 5000 × 1500 × 8 = 60 MB   (5,000-sample rice mega-GWAS panel)
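
The same arithmetic as a throwaway sizing helper (window_ram_mb() is illustrative, not part of the package):

window_ram_mb <- function(n_samples, subSegmSize, bytes_per_cell = 8) {
  n_samples * subSegmSize * bytes_per_cell / 1e6   # RAM held per window, in MB
}
window_ram_mb(204,   1500)   # ~2.4 MB
window_ram_mb(5000,  1500)   # 60 MB
window_ram_mb(20000, 1500)   # 240 MB (matches the subSegmSize table in section 7.1)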

3. GDS backend (SNPRelate)

# SNPRelate is distributed through Bioconductor
if (!requireNamespace("BiocManager", quietly = TRUE))
  install.packages("BiocManager")
BiocManager::install("SNPRelate")
library(SNPRelate)

# One-time conversion: later runs reuse mydata.gds directly
SNPRelate::snpgdsVCF2GDS(
  vcf.fn      = "mydata.vcf.gz",
  out.fn      = "mydata.gds",
  method      = "biallelic.only",   # import biallelic SNPs only
  snpfirstdim = FALSE
)

# Open the GDS file as a genotype backend
be_gds <- read_geno("mydata.gds")

blocks <- run_Big_LD_all_chr(
  be_gds,
  method       = "r2",      # pairwise r2 as the LD measure
  CLQcut       = 0.70,      # r2 threshold for clique membership
  leng         = 100L,
  subSegmSize  = 1500L,     # SNPs held in memory per window
  n_threads    = 16L,
  min_snps_chr = 100L,      # skip scaffolds/chromosomes with fewer SNPs
  verbose      = TRUE
)
close_backend(be_gds)
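
A typical follow-up, assuming blocks comes back as a data-frame-like table with one row per detected LD block:

head(blocks)      # first few detected blocks
nrow(blocks)      # total number of LD blocks
write.csv(blocks, "ld_blocks_gds.csv", row.names = FALSE)   # persist for downstream analyses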

4. WGS-scale acceleration

4.1 Leiden community detection (CLQmode)

blocks <- run_Big_LD_all_chr(
  be_gds,
  CLQmode         = "Leiden",      # near-linear O(n log n); guarantees connected communities
  CLQcut          = 0.70,
  max_bp_distance = 500000L,       # skip pairs > 500 kb
  subSegmSize     = 500L,
  leng            = 50L,
  checkLargest    = TRUE,
  n_threads       = 16L,
  min_snps_chr    = 100L
)

4.2 Sparse LD computation (max_bp_distance)

With max_bp_distance = 500000L, r² is computed only for SNP pairs within 500 kb of each other; at typical WGS densities this skips 70–90% of pairs, reducing the pairwise work from O(p²) to near-O(p):

blocks <- run_Big_LD_all_chr(
  be_gds,
  CLQmode         = "Leiden",
  max_bp_distance = 500000L,
  CLQcut          = 0.70,
  n_threads       = 16L
)
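
One way to choose max_bp_distance is to estimate LD decay first (section 5) and use that distance as the cutoff; a sketch, assuming decay$decay_dist is reported in base pairs and the remaining compute_ld_decay() arguments have sensible defaults:

decay  <- compute_ld_decay(geno = be_gds, n_pairs = 50000L,
                           max_dist = 5000000L, n_threads = 8L)
blocks <- run_Big_LD_all_chr(
  be_gds,
  CLQmode         = "Leiden",
  CLQcut          = 0.70,
  max_bp_distance = as.integer(decay$decay_dist),   # data-driven cutoff (assumed bp)
  n_threads       = 16L
)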

4.3 bigmemory backend

be_gds <- read_geno("mydata.gds")
be_bm  <- read_geno_bigmemory(
  be_gds,
  backingfile = "mydata_bm",        # file-backed matrix written under backingpath
  backingpath = "/data/ldxblocks",
  type        = "char",             # 1-byte storage per genotype
  verbose     = TRUE
)
close_backend(be_gds)               # the GDS handle is no longer needed
blocks <- run_Big_LD_all_chr(be_bm, CLQmode = "Leiden", CLQcut = 0.70)
close_backend(be_bm)
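
Because the matrix is file-backed, it can be inspected in a later session without another conversion; a minimal sketch using bigmemory directly, assuming read_geno_bigmemory() wrote the usual mydata_bm.desc descriptor next to the backing file:

library(bigmemory)
bm <- attach.big.matrix("/data/ldxblocks/mydata_bm.desc")
dim(bm)   # matrix dimensions; orientation depends on the backend's layout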

5. LD decay analysis on large panels

decay <- compute_ld_decay(
  geno         = be_gds,
  sampling     = "random",
  r2_threshold = "both",
  n_pairs      = 50000L,
  max_dist     = 5000000L,
  fit_model    = "loess",
  n_threads    = 8L,
  verbose      = TRUE
)
decay$critical_r2_param   # critical r2 threshold (parametric estimate)
decay$decay_dist          # estimated LD decay distance
# gwas and snp_info come from your own association results and marker map
qtl <- define_qtl_regions(gwas, blocks, snp_info, ld_decay = decay)

6. PLINK BED backend (BEDMatrix)

install.packages("BEDMatrix")
be_bed <- read_geno("mydata.bed")   # memory-mapped via BEDMatrix
blocks <- run_Big_LD_all_chr(be_bed, method = "r2", CLQcut = 0.65,
                              n_threads = 8L, verbose = FALSE)
close_backend(be_bed)

7. Parameter selection

7.1 subSegmSize

Individuals (n)   subSegmSize (w)   Window RAM (n × w × 8 bytes)
1,000             1,500             12 MB
5,000             1,500             60 MB
5,000             5,000             200 MB
20,000            1,500             240 MB
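
To go the other way, pick the largest window that fits a given RAM budget (max_subSegmSize() is illustrative, not a package helper):

max_subSegmSize <- function(n_samples, ram_mb, bytes_per_cell = 8) {
  floor(ram_mb * 1e6 / (n_samples * bytes_per_cell))
}
max_subSegmSize(20000, 240)   # 1500, matching the last row of the table
max_subSegmSize(5000,  60)    # 1500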

7.2 leng

For dense WGS panels, reduce leng to 50–100:

blocks <- run_Big_LD_all_chr(be_gds, method = "r2", CLQcut = 0.70,
                              leng = 50L, subSegmSize = 3000L, n_threads = 16L)

8. Haplotype analysis on large datasets

be     <- read_geno("wgs_panel.gds")
blocks <- run_Big_LD_all_chr(be, method = "r2", CLQcut = 0.70,
                              n_threads = 16L, verbose = FALSE)
haps   <- extract_haplotypes(be, be$snp_info, blocks, min_snps = 5)
div    <- compute_haplotype_diversity(haps)
feat   <- build_haplotype_feature_matrix(haps, encoding = "additive_012",
                                          scale_features = TRUE)$matrix
blues  <- read.csv("blues.csv")
pred   <- run_haplotype_prediction(
  geno_matrix = as.matrix(read_chunk(be, seq_len(be$n_snps))),
  snp_info    = be$snp_info,
  blocks      = blocks,
  blues       = blues,
  id_col      = "id",
  blue_col    = "YLD"
)
close_backend(be)

9. Troubleshooting

GDS handle errors on Windows. Avoid FORK-based parallelism; use n_threads > 1 (OpenMP within a single process) instead.

“fewer than min_snps_chr SNPs” messages. Set min_snps_chr = 100 or higher to skip scaffold chromosomes automatically.

CLQD hangs or takes hours. Switch to CLQmode = "Leiden" with CLQcut = 0.70 and max_bp_distance = 500000L.

Processing stalls after “Segment N/N done”. Update to v0.3.1+ which routes overlap resolution through resolve_overlap_cpp() — a 15,000× reduction in per-SNP LD computation cost.
