OptiSparseMET: A Unified Framework for Sparse Multi-Environment Trial Design
Source:vignettes/OptiSparseMET-introduction.Rmd
OptiSparseMET-introduction.Rmd1. Introduction
1.1 The core problem
A multi-environment trial (MET) evaluates a set of candidate lines — varieties, hybrids, inbred lines — across multiple environments (locations, years, or location-year combinations). The goal is to estimate the performance of each line across environments, partition genetic from environment effects, and identify lines with broad or specific adaptation.
The fundamental constraint is that the number of candidate lines \(J\) almost always exceeds the number of plots available in any single environment. Testing all \(J\) lines in all \(I\) environments is infeasible; some form of incomplete testing is unavoidable. The question is not whether to test incompletely, but how to structure that incompleteness so the resulting data still support reliable inference.
Sparse testing is the formal approach. A sparse MET design specifies, for each line \(j\) and environment \(e\), whether that combination appears in the trial. The goal is a structure that is statistically efficient, genetically connected, and operationally deployable given seed and plot constraints.
1.2 What OptiSparseMET does
OptiSparseMET addresses sparse MET design at two linked
levels:
Across-environment allocation — which lines go to which environments, how many times each line is replicated across the trial, and whether the resulting connectivity is sufficient for cross-environment inference.
Within-environment design — how the lines assigned to each environment are arranged in the field, how checks are distributed, how blocks control local heterogeneity, and how spatial gradients are managed.
These two levels are not independent. They are linked through the statistical model, through the information matrix that governs estimation precision, and through the physical constraint that allocation decisions determine what within-environment designs are even possible.
How the two levels are linked
Link 1 — The incidence matrix couples allocation to estimation. In the linear mixed model \(y = X\beta + Zg + e\), the design matrix \(Z\) has entry \(Z_{pj} = 1\) if plot \(p\) is assigned to line \(j\), and \(0\) otherwise. The coefficient matrix for line effects is:
\[C = Z^\top V^{-1} Z - Z^\top V^{-1} X (X^\top V^{-1} X)^{-1} X^\top V^{-1} Z\]
where \(V = ZKZ^\top \sigma_g^2 + R\sigma_e^2\) is the phenotypic variance matrix, \(K\) is the genomic relationship matrix, and \(R\) encodes the residual covariance structure. The precision of every genetic value estimate — and therefore every selection decision — depends on \(C\). The allocation decision (which lines appear where) determines the sparsity pattern of \(Z\), which directly shapes \(C\). The within-environment blocking structure determines \(R\), which enters \(V\) and therefore \(C^{-1}\) as well. You cannot optimize \(C\) by fixing \(Z\) and \(R\) independently: they interact inside \(V\).
Link 2 — Allocation fixes which within-environment designs are feasible. The number of lines assigned to environment \(e\) is \(k_e = \sum_j Z_{pj}\) summed over plots \(p\) in environment \(e\). Once allocation is complete, \(k_e\) is fixed. The within-environment design must then arrange exactly \(k_e\) treatments across the available \(n_{\text{rows}} \times n_{\text{cols}}\) plots. If \(k_e\) is incompatible with the blocking structure — for example, if \(k_e\) is not a multiple of the block size, or if \(k_e\) exceeds field capacity — the within-environment design is infeasible regardless of how good the allocation was. Allocation and field geometry must be co-designed.
Link 3 — Block structure affects cross-environment inference. Consider two environments sharing \(n_{\text{shared}}\) lines. The precision of the cross-environment genetic correlation estimate depends on how well those shared lines are estimated within each environment, which in turn depends on the block efficiency within each environment. Formally, the effective replication of a shared line \(j\) across environments is modulated by the efficiency factor \(e_j\) of the within-environment design:
\[\text{Var}(\hat{g}_j^{(e)}) = \frac{\sigma_e^2}{e_j \, r_j^{(e)}}\]
where \(r_j^{(e)}\) is the number of plots for line \(j\) in environment \(e\) and \(e_j \in (0, 1]\) is the efficiency factor relative to a completely randomized design. A poor within-environment design (low \(e_j\)) inflates the variance of each BLUP, which propagates into the cross-environment covariance estimates and degrades G×E inference even when the allocation incidence structure is ideal. The block design within each environment affects the quality of cross-environment comparisons.
Link 4 — CDmean depends on both levels simultaneously. The CDmean criterion for genomic selection,
\[\text{CDmean} = 1 - \frac{1}{J} \sum_{j=1}^{J} \frac{\text{PEV}_j}{\sigma_g^2}\]
where \(\text{PEV}_j = \left[(K^{-1} + Z^\top R^{-1} Z \, / \, \sigma_e^2)^{-1}\right]_{jj} \sigma_g^{-2}\), depends on \(Z\) through allocation (which environments each line enters) and on \(R\) through within-environment design (how plots are blocked and arranged). A sparse allocation that spreads diverse lines across environments improves the genomic connectivity encoded in the numerator \(K\)-weighted terms, while efficient blocking reduces the residual variance \(R\) that appears in the denominator. Neither level can substitute for the other.
Standard MET tools address one level at a time: either choosing an
allocation strategy without regard to field feasibility, or constructing
within-environment designs without regard to how many lines the
allocation will deliver. This leads to inconsistencies — for example, an
allocation that requires a non-integer number of blocks per replicate,
or a field design that assumes more lines than the allocation provides.
OptiSparseMET makes both levels explicit and coordinates
them: the allocation output specifies exactly which lines enter each
environment, and the within-environment engine receives precisely that
set, ensuring that both the incidence structure and the blocking
structure are optimized consistently within the same statistical
framework.
2. Statistical Framework
2.1 The linear mixed model
Phenotypic observations in plant breeding trials are analyzed using a linear mixed model. For a single environment, a typical form is:
\[y = X\beta + Zg + e\]
where:
- \(y\) is the vector of observed phenotypic values (one per plot)
- \(X\beta\) captures fixed effects: the overall mean, environment effects, check effects, and any fixed treatment contrasts
- \(Zg\) captures random line effects: \(g \sim N(0,\, K\sigma_g^2)\), where \(K\) is a relationship matrix (genomic or pedigree-based) and \(\sigma_g^2\) is the genetic variance
- \(e \sim N(0,\, R\sigma_e^2)\) is the vector of residuals, with \(R\) encoding the spatial covariance structure
The design matrix \(Z\) — whose columns correspond to lines and whose rows correspond to plots — is determined entirely by the experimental design. A good design is one that makes \(Z\) well-conditioned for estimation.
2.2 Why design affects inference
When \(Z\) is sparse, not all lines are observed in all environments. Cross-environment inference then depends on either:
- Design-based connectivity: lines that co-occur in the same environment create direct comparisons; common treatments that appear in every environment create bridges between sites.
- Model-based connectivity: the relationship matrix \(K\) propagates information between genetically related lines, even if they never appear in the same environment.
Both mechanisms are important, but model-based connectivity is less robust because it depends on the assumed covariance structure being correct. A well-designed sparse MET builds in design-based connectivity as a safeguard.
2.3 Efficiency criteria
Given a design, its statistical efficiency can be measured by how
well it supports estimation and prediction of genetic values. The three
main criteria used by OptiSparseMET are:
A-optimality minimizes the average variance of treatment comparison estimates (contrasts). Formally it minimizes \(\text{trace}(C^{-1})\) where \(C\) is the coefficient matrix of the treatment effects. A lower value is better. Its reciprocal, A-efficiency \(= 1 / A_{\text{criterion}}\), is sometimes more interpretable because larger values are better.
D-optimality maximizes the determinant \(|C|\), which corresponds to maximizing the volume of the confidence ellipsoid for treatment estimates. It is more sensitive to the overall spread of information across all contrasts than A-optimality, which focuses on the average. Again, D-efficiency \(= 1 / D_{\text{criterion}}\) is used so larger values are better.
CDmean (coefficient of determination of the mean) is the criterion most directly tied to genomic prediction. It is defined as:
\[\text{CDmean} = 1 - \frac{\overline{\text{PEV}}}{\sigma_g^2}\]
where \(\overline{\text{PEV}}\) is the mean prediction error variance (PEV) of the BLUP estimates and \(\sigma_g^2\) is the genetic variance. CDmean lies in \([0, 1]\); values close to 1 indicate that the design supports precise prediction. It captures how accurately genomic estimated breeding values (GEBVs) can be recovered from the design, incorporating both the relationship matrix \(K\) and the block structure.
For novice readers: Think of CDmean as “what fraction of the true genetic value the design allows you to recover on average.” A CDmean of 0.85 means the design, on average, captures 85% of the genetic signal.
3. Two-Level Design Structure
3.1 Across-environment allocation
The resource identity underlying balanced sparse allocation is:
\[N = J \times r = I \times k\]
where:
| Symbol | Meaning |
|---|---|
| \(N\) | Total number of line-by-environment observations |
| \(J\) | Total number of candidate lines |
| \(r\) | Number of environments each line enters (replication depth) |
| \(I\) | Number of environments |
| \(k\) | Number of lines tested per environment (coverage breadth) |
Given a fixed total resource \(N\), this identity formalizes a fundamental tradeoff: increasing coverage breadth (testing more lines per environment, larger \(k\)) necessarily reduces replication depth (each line appears in fewer environments, smaller \(r\)). No design can simultaneously maximize both.
For novice readers: If you have 100 lines, 4 environments, and enough seed for 400 plots total, you could test 100 lines in 4 environments (full replication, but only if all 400 plots are available per environment), or test 50 lines per environment and have each line appear in 2 environments (\(r = 2, k = 100\)), or test 25 lines per environment and have each appear in 4 environments (\(r = 4, k = 100\)). The product \(J \times r\) always equals \(N = I \times k\).
The practical constraint is that \(k\) is bounded by the number of plots available per environment, and \(r\) is bounded by seed availability. The allocation problem is to choose which lines go to which environments, subject to these bounds.
When common treatments are included, a number \(C\) of lines are assigned to every environment before sparse allocation begins. Their count is subtracted from each environment’s capacity:
\[k_e^* = k_e - C\]
The sparse-allocatable lines number \(J^* = J - C\), and the modified resource identity applies to them:
\[J^* \times r = I \times k^*\]
3.2 Within-environment design
Each environment receives an independent field design constructed by one of two engines:
-
met_prep_famoptg()for block-based repeated-check designs, including augmented, partially replicated (p-rep), and RCBD-type layouts -
met_alpha_rc_stream()for row-column alpha designs suited to large-scale spatial field structure
The choice between them depends on field geometry, check structure, replication goals, and the spatial model anticipated at analysis. They can differ across environments within the same trial.
4. Pipeline Overview
The OptiSparseMET workflow proceeds in four stages, with
an optional shortcut that runs all four together.
STAGE 0 Verify per-environment capacity
suggest_safe_k() OR min_k_for_full_coverage()
|
STAGE 1 Allocate lines across environments
allocate_sparse_met()
|
STAGE 2 Verify replication feasibility given seed inventory
assign_replication_by_seed()
|
STAGE 3 Build within-environment field designs
met_prep_famoptg() OR met_alpha_rc_stream()
(optional: evaluate and optimise each design)
|
STAGE 4 Assemble the combined MET field book
combine_met_fieldbooks()
--- OR: run stages 1-4 together ---
plan_sparse_met_design()
library(DiagrammeR)
grViz("
digraph OptiSparseMET {
graph [layout = dot, rankdir = LR]
node [shape = rectangle, style = rounded, fontname = Helvetica]
A [label = 'Input\nLines Environments\nFamilies / GRM / A\nSeed inventory']
B [label = 'Stage 0: Verify capacity\nsuggest_safe_k()']
C [label = 'Stage 1: Allocation\nallocate_sparse_met()\n(M3 or M4)']
D [label = 'Stage 2: Seed replication\nassign_replication_by_seed()']
E [label = 'Stage 3: Field design\nmet_prep_famoptg()\nor met_alpha_rc_stream()']
F [label = 'Stage 4: Assemble\ncombine_met_fieldbooks()']
G [label = 'Outputs\ncombined_field_book\nenvironment_summary\nefficiency_summary']
A -> B -> C -> D -> E -> F -> G
}
")5. Feasibility Helpers
Before any allocation, it is essential to verify that the chosen
per-environment capacity \(k\) is large
enough to assign every non-common line at least once. Passing a capacity
that is too small causes allocate_sparse_met() to stop with
an informative error.
The minimum capacity for full coverage under equal environment sizes is:
\[k_{\min} = \left\lceil \frac{J^*}{I} \right\rceil + C\]
where \(\lceil \cdot \rceil\) is the ceiling function. This is the floor below which any sparse allocation must leave some lines unassigned.
OptiSparseMET provides three helpers:
| Function | Action |
|---|---|
min_k_for_full_coverage() |
Returns \(k_{\min}\) exactly |
suggest_safe_k() |
Returns \(k_{\min}\) plus a user-defined buffer |
warn_if_k_too_small() |
Emits a non-fatal warning if \(k < k_{\min}\) |
Always call one of these before
allocate_sparse_met().
library(OptiSparseMET)
data("OptiSparseMET_example_data")
x <- OptiSparseMET_example_data
# Strict arithmetic minimum
min_k_for_full_coverage(
n_treatments_total = length(x$treatments), # 120
n_environments = length(x$environments), # 4
n_common_treatments = length(x$common_treatments) # 8; J*=112
)
# min_sparse_slots_per_environment = 28 (= ceil(112/4))
# min_total_entries_per_environment = 36 (= 28 + 8)
# Safe value with buffer of 3
suggest_safe_k(
treatments = x$treatments,
environments = x$environments,
common_treatments = x$common_treatments,
buffer = 3
)
# Returns 39
# Non-fatal check (useful inside scripts)
warn_if_k_too_small(
treatments = x$treatments,
environments = x$environments,
n_test_entries_per_environment = 39,
common_treatments = x$common_treatments
)
# No warning: 4 * (39 - 8) = 124 >= 112 sparse lines5.5 Pipeline inputs: required and optional
Before running any pipeline function, it helps to know exactly what each function needs. The tables below list every input for the four main pipeline functions, distinguishing what is strictly required from what is optional.
allocate_sparse_met() — across-environment
allocation
Required inputs
| Argument | Type | Description |
|---|---|---|
treatments |
character vector | All candidate line IDs (J total) |
environments |
character vector | Environment names (≥ 2) |
allocation_method |
character |
"random_balanced" (M3) or
"balanced_incomplete" (M4) |
n_test_entries_per_environment |
integer | Total entries per environment including common treatments (k) |
Optional inputs
| Argument | Default | Description |
|---|---|---|
target_replications |
inferred | Target environments per sparse line (r); computed from slot identity if NULL |
common_treatments |
none | Lines forced into every environment before sparse allocation |
allow_approximate |
FALSE |
FALSE = strict equal replication; TRUE =
relaxed fallback |
allocation_group_source |
"none" |
Genetic grouping: "Family", "GRM", or
"A"
|
treatment_info |
NULL | Data frame with Treatment and Family
columns (required when
allocation_group_source = "Family") |
GRM |
NULL | Genomic relationship matrix (required when
allocation_group_source = "GRM") |
A |
NULL | Pedigree relationship matrix (required when
allocation_group_source = "A") |
min_groups_per_environment |
NULL | Minimum genetic groups per environment |
min_env_per_group |
NULL | Minimum environments per genetic group |
seed |
NULL | Integer seed for reproducibility |
assign_replication_by_seed() — seed-aware
replication
Required inputs
| Argument | Type | Description |
|---|---|---|
treatments |
character vector | All candidate line IDs |
seed_available |
data frame | Must contain Treatment and SeedAvailable
columns |
seed_required_per_plot |
integer | Seeds needed per plot (scalar or named vector per environment) |
replication_mode |
character |
"augmented", "p_rep", or
"rcbd_type"
|
Optional inputs
| Argument | Default | Description |
|---|---|---|
desired_replications |
2 | Target number of plots per replicated line |
shortage_action |
"downgrade" |
What to do when seed is insufficient: "downgrade",
"exclude", or "error"
|
max_prep |
NULL | Maximum number of p-rep treatments (p-rep mode only) |
priority |
"seed_available" |
Selection criterion for p-rep candidates |
minimum_seed_buffer |
0 | Extra seeds reserved per line beyond the plot requirement |
seed |
NULL | Integer seed for reproducibility |
met_prep_famoptg() — block-based field design
Required inputs
| Argument | Type | Description |
|---|---|---|
check_treatments |
character vector | Check (control) treatment IDs; appear in every block |
check_families |
character vector | Family labels for checks; same length as
check_treatments
|
n_blocks |
integer | Number of incomplete blocks |
n_rows |
integer | Number of field rows |
n_cols |
integer | Number of field columns |
At least one of p_rep_treatments or
unreplicated_treatments must be supplied.
Optional inputs
| Argument | Default | Description |
|---|---|---|
p_rep_treatments |
NULL | Treatments to replicate; typically
rep_plan$p_rep_treatments
|
p_rep_reps |
NULL | Replication count per p-rep line; typically
rep_plan$p_rep_reps
|
p_rep_families |
NULL | Family labels for p-rep treatments |
unreplicated_treatments |
NULL | Treatments to appear once; typically
rep_plan$unreplicated_treatments
|
unreplicated_families |
NULL | Family labels for unreplicated treatments |
replication_mode |
"p_rep" |
"p_rep", "augmented", or
"rcbd_type"
|
cluster_source |
"none" |
Genetic dispersion grouping: "none",
"Family", "GRM", "A"
|
eval_efficiency |
FALSE |
Compute A, D, CDmean efficiency metrics |
order |
"row" |
Plot traversal order: "row", "col", or
"serpentine"
|
seed |
NULL | Integer seed for reproducibility |
met_alpha_rc_stream() — row-column alpha design
Required inputs
| Argument | Type | Description |
|---|---|---|
check_treatments |
character vector | Check treatment IDs; appear in every incomplete block |
check_families |
character vector | Family labels for checks |
entry_treatments |
character vector | Entry (non-check) treatment IDs |
entry_families |
character vector | Family labels for entries |
n_reps |
integer | Number of field replicates |
n_rows |
integer | Number of field rows |
n_cols |
integer | Number of field columns |
Optional inputs
| Argument | Default | Description |
|---|---|---|
min_block_size |
6 | Minimum entries (excluding checks) per incomplete block |
max_block_size |
NULL | Maximum entries per incomplete block |
cluster_source |
"none" |
Genetic dispersion grouping: "none",
"Family", "GRM", "A"
|
eval_efficiency |
FALSE |
Compute A, D, CDmean efficiency metrics |
order |
"row" |
Plot traversal order: "row", "col", or
"serpentine"
|
serpentine |
FALSE |
Reverse alternating rows/columns for physical continuity |
seed |
NULL | Integer seed for reproducibility |
Minimum working example
The absolute minimum to run the full pipeline from allocation to field book:
library(OptiSparseMET)
# Minimum inputs: just lines, environments, and field dimensions
treatments <- paste0("L", sprintf("%03d", 1:120))
envs <- c("E1", "E2", "E3", "E4")
# Stage 0: verify k
k <- suggest_safe_k(treatments, envs, buffer = 3) # 33
# Stage 1: M3 allocation (no common treatments, no grouping)
alloc <- allocate_sparse_met(
treatments = treatments,
environments = envs,
allocation_method = "random_balanced",
n_test_entries_per_environment = k,
seed = 1
)
# Stage 2: seed plan (uniform seed, no shortage)
seed_df <- data.frame(
Treatment = treatments,
SeedAvailable = 100L
)
rep_plan <- assign_replication_by_seed(
treatments = treatments,
seed_available = seed_df,
seed_required_per_plot = 10L,
replication_mode = "augmented"
)
# Stage 3: within-environment design (checks + unreplicated entries)
design <- met_prep_famoptg(
check_treatments = c("CHK1", "CHK2"),
check_families = c("CHECK", "CHECK"),
unreplicated_treatments = rep_plan$unreplicated_treatments,
unreplicated_families = rep("F1", length(rep_plan$unreplicated_treatments)),
n_blocks = 4L, n_rows = 10L, n_cols = 12L,
seed = 1
)
# Stage 4: combine
met_book <- combine_met_fieldbooks(
field_books = list(E1 = design$field_book)
)6. Allocation Strategies
6.1 The random balanced strategy (M3)
The "random_balanced" method is a coverage-first
stochastic allocation inspired by the M3 method of Montesinos-López et
al. (2023). It operates in two phases:
Phase 1 (coverage guarantee): Every non-common line
is assigned to at least one environment. Lines are processed in random
order; each is placed in the environment with the most remaining
capacity, with a preference for environments where the line’s genetic
group is not yet represented when allocation_group_source
is active.
Phase 2 (replication filling): Remaining slots are
filled by repeatedly selecting lines with the largest replication
deficit relative to target_replications, subject to
group-balance penalties.
This strategy is appropriate when: - Environment capacities are heterogeneous - Exact BIBD parameters are not achievable for the trial dimensions - Some flexibility in the incidence structure is acceptable
Unlike the original M3, which allocates environment by environment and may silently leave some lines with no assignment, this implementation guarantees full coverage before attempting any additional replication.
6.2 The balanced incomplete strategy (M4)
The "balanced_incomplete" method implements the M4
allocation of Montesinos-López et al. (2023). It enforces two structural
guarantees:
Condition 1 — Equal replication: every non-common line appears in exactly \(r\) environments. This ensures no line is systematically under-evaluated and is the defining property that distinguishes M4 from M3.
Condition 2 — Equal environment sizes: every environment receives exactly \(k^*\) sparse lines. This ensures environments are comparable and the resource identity \(J^* \times r = I \times k^*\) holds exactly.
These two conditions are what matter in plant breeding programs where thousands of candidate lines are tested across a few environments. Enforcing them means every line gets the same number of chances to be evaluated — a fundamental fairness and precision requirement.
For novice readers: Equal replication means that if two breeders each select the “top 10% of lines” based on the trial data, they are working from the same quality of information for every line. Without equal replication, lines tested more often have lower-variance estimates and a systematic advantage in selection.
allow_approximate = FALSE (the default) enforces both
conditions strictly. The slot identity \(J^*
\times r = I \times k^*\) must hold exactly, and the function
stops with a clear error if it does not. Use
check_balanced_incomplete_feasibility() to verify the slot
identity before allocating:
# With k=39, J*=112, I=4, r=1: available slots = 4*(39-8) = 124,
# required slots = 112*1 = 112. 124 ≠ 112 -> slot identity fails.
check_balanced_incomplete_feasibility(
n_treatments_total = 120,
n_environments = 4,
n_test_entries_per_environment = 39,
target_replications = 1,
n_common_treatments = 8
)
# feasible = FALSE (available = 124, required = 112, difference = 12)
# Adjust k so that J*×r = I×k*: e.g. k=36 -> k*=28, 4*28=112=112*1.
# With k=36: slot identity holds.
check_balanced_incomplete_feasibility(
n_treatments_total = 120,
n_environments = 4,
n_test_entries_per_environment = 36,
target_replications = 1,
n_common_treatments = 8
)
# feasible = TRUE (difference = 0)Construction uses a greedy load-balanced constructor that guarantees equal replication.
allow_approximate = TRUE relaxes the slot identity and
accepts minor replication imbalances. This is a fallback for exploratory
use when exact slot identity cannot be achieved, not the primary
path.
6.3 M3 versus M4: choosing the right strategy
| Feature | M3 "random_balanced"
|
M4 "balanced_incomplete"
|
|---|---|---|
| Equal replication guarantee | No (approximate) | Yes (allow_approximate = FALSE) |
| Equal environment sizes | No | Yes (both modes) |
| Pairwise co-occurrence | Stochastic | Approximately uniform; compute from
$allocation_matrix %*% t($allocation_matrix)
|
| Handles unequal environment sizes | Yes | Only with allow_approximate = TRUE
|
| Requires slot identity \(J^* r = I k^*\) | No | Yes (strict mode) |
| Coverage guarantee | Yes (phase 1) | Yes |
| Typical use | Heterogeneous programs, exploratory | Standard MET networks with equal replication |
In practice, M3 is more flexible and tolerates heterogeneous
constraints. M4 with allow_approximate = FALSE (the
default) is the correct choice whenever equal replication is a hard
requirement — which it should be in most standard breeding programs.
Verify the slot identity first with
check_balanced_incomplete_feasibility() and adjust \(k\) or \(r\) so that \(J^*
\times r = I \times k^*\) holds. M4 with
allow_approximate = TRUE is a fallback when the slot
identity cannot be satisfied.
library(OptiSparseMET)
data("OptiSparseMET_example_data")
x <- OptiSparseMET_example_data
# Verify feasibility first
suggest_safe_k(x$treatments, x$environments,
common_treatments = x$common_treatments, buffer = 3)
# Returns 39: ceil(112/4) + 8 + 3 = 39
# M3: random balanced allocation
alloc_M3 <- allocate_sparse_met(
treatments = x$treatments,
environments = x$environments,
allocation_method = "random_balanced",
n_test_entries_per_environment = 39,
target_replications = 1,
common_treatments = x$common_treatments,
seed = 123
)
# M4: balanced incomplete allocation with equal replication.
# J*=112, I=4, r=1: need k* = 112/4 = 28, k = 28+8 = 36.
# Slot identity: 112*1 = 4*28 = 112. Verified above.
alloc_M4 <- allocate_sparse_met(
treatments = x$treatments,
environments = x$environments,
allocation_method = "balanced_incomplete",
n_test_entries_per_environment = 36,
target_replications = 1,
common_treatments = x$common_treatments,
allow_approximate = FALSE,
seed = 123
)
alloc_M3$summary # allocation_method, min/max/mean replication, etc.
alloc_M4$summary # min_sparse_replication = max_sparse_replication = 1 (equal)
alloc_M4$summary$max_sparse_replication # equals min_sparse_replication: equal replication confirmed6.4 Slot identity feasibility by J*, I, and r
The slot identity \(J^* \times r = I \times k^*\) requires that \(J^* \times r\) be exactly divisible by \(I\). Whether this is achievable for a given combination of sparse treatments (\(J^*\)), environments (\(I\)), and replication (\(r\)) depends on the shared factors of these three numbers.
The divisibility rule
For the slot identity to hold, \(I\) must divide \(J^* \times r\) exactly. Every prime factor of \(I\) that is absent from \(J^*\) must be supplied by \(r\). This has a practical consequence for the most common case in plant breeding:
\(I = 4\) environments, \(J^*\) odd: \(J^* \times 1 = \text{odd}\) (not divisible by 4); \(J^* \times 2 = 2 \times \text{odd}\) (divisible by 2 but not by 4 for odd \(J^*\)); only \(r = 4\) guarantees divisibility. But \(r = 4\) gives \(k^* = J^*\) — full replication — which defeats the purpose of sparse testing. Practical fix: adjust \(C\) by 1 so that \(J^* = J - C\) becomes even.
\(I = 4\) environments, \(J^*\) even but not divisible by 4: \(r = 2\) always works (e.g. \(J^* = 110\): \(110 \times 2 / 4 = 55\)).
\(I = 3\) environments: feasibility depends on divisibility by 3. If \(J^*\) is divisible by 3, any \(r\) works. Otherwise \(r\) must be a multiple of 3.
\(I = 6\) environments: requires divisibility by \(2 \times 3 = 6\). Odd \(J^*\) not divisible by 3 requires \(r\) divisible by 6.
Feasibility table: \(r = 2\)
The table shows \(k^*\) when the
slot identity holds, and -- when it does not for that
combination. Add \(C\) (common
treatments) to \(k^*\) to get the
n_test_entries_per_environment argument.
| \(J^*\) | \(I=3\) | \(I=4\) | \(I=5\) | \(I=6\) | \(I=7\) | \(I=8\) | \(I=9\) | \(I=10\) |
|---|---|---|---|---|---|---|---|---|
| 60 | 40 | 30 | 24 | 20 | – | 15 | – | 12 |
| 70 | – | 35 | 28 | – | 20 | – | – | 14 |
| 75 | 50 | – | 30 | 25 | – | – | – | 15 |
| 80 | – | 40 | 32 | – | – | 20 | – | 16 |
| 90 | 60 | 45 | 36 | 30 | – | – | 20 | 18 |
| 100 | – | 50 | 40 | – | – | 25 | – | 20 |
| 110 | – | 55 | 44 | – | – | – | – | 22 |
| 112 | – | 56 | – | – | 32 | 28 | – | – |
| 120 | 80 | 60 | 48 | 40 | – | 30 | – | 24 |
| 150 | 100 | 75 | 60 | 50 | – | – | – | 30 |
| 200 | – | 100 | 80 | – | – | 50 | – | 40 |
Feasibility table: \(r = 3\)
| \(J^*\) | \(I=3\) | \(I=4\) | \(I=5\) | \(I=6\) | \(I=7\) | \(I=8\) | \(I=9\) | \(I=10\) |
|---|---|---|---|---|---|---|---|---|
| 60 | 60 | 45 | 36 | 30 | – | – | 20 | 18 |
| 70 | 70 | – | 42 | 35 | 30 | – | – | 21 |
| 75 | 75 | – | 45 | – | – | – | 25 | – |
| 80 | 80 | 60 | 48 | 40 | – | 30 | – | 24 |
| 90 | 90 | – | 54 | 45 | – | – | 30 | 27 |
| 100 | 100 | 75 | 60 | 50 | – | – | – | 30 |
| 110 | 110 | – | 66 | 55 | – | – | – | 33 |
| 112 | 112 | 84 | – | 56 | 48 | 42 | – | – |
| 120 | 120 | 90 | 72 | 60 | – | 45 | 40 | 36 |
| 150 | 150 | – | 90 | 75 | – | – | 50 | 45 |
| 200 | 200 | 150 | 120 | 100 | – | 75 | – | 60 |
What to do when your combination gives --
Use check_balanced_incomplete_feasibility() to diagnose
the problem and try one of these adjustments:
- Adjust \(C\) by 1: adding or removing one common treatment changes \(J^*\) by 1, which may make it divisible by \(I\) for the chosen \(r\).
- Try \(r = 2\) instead of \(r = 1\), or \(r = 3\) instead of \(r = 2\) — the extra factor may resolve the divisibility.
-
Use
random_balanced(M3) if exact equal replication is not essential. M3 does not require the slot identity and tolerates odd \(J^*\) freely. -
Use
allow_approximate = TRUEas a fallback — the allocation proceeds with the closest possible balance, accepting minor replication differences.
# Quick check: is your combination feasible?
# J* = 75 (odd), I = 4, r = 2 -- should give --
check_balanced_incomplete_feasibility(
n_treatments_total = 83, # J = J* + C = 75 + 8
n_environments = 4,
n_test_entries_per_environment = 30, # k* guess: 30 - 8 = 22, 4*22=88 != 75*2=150
target_replications = 2,
n_common_treatments = 8
)
# feasible = FALSE -> adjust
# Fix: change C from 8 to 9 -> J* = 74 (even), r=2: 74*2/4 = 37
check_balanced_incomplete_feasibility(
n_treatments_total = 83,
n_environments = 4,
n_test_entries_per_environment = 46, # k* = 37, k = 37+9 = 46
target_replications = 2,
n_common_treatments = 9
)
# feasible = TRUE7. Genetic Structure-Aware Allocation
Standard allocation treats lines as exchangeable — any line could go to any environment with equal probability. This is inadequate when:
- Genomic prediction is a downstream objective (related lines clustered in the same environments provide redundant rather than complementary information)
- Family structure is strong and environments differ in their genetic composition
- Genetic disconnectedness would bias G×E estimation
OptiSparseMET incorporates genetic structure through
three mechanisms, controlled by allocation_group_source in
allocate_sparse_met().
7.1 Family-based allocation ("Family")
Family labels (e.g., full-sib or half-sib families) are read from
treatment_info$Family. Lines from the same family are
treated as a group and the allocator distributes groups across
environments rather than concentrating them. This prevents any one
environment from containing a disproportionate fraction of one family’s
members, which would inflate within-family variance estimates and reduce
the effective diversity of each site’s test set.
7.2 GRM-based allocation ("GRM")
When marker data are available, the genomic relationship matrix (GRM) encodes pairwise genetic similarity between all candidate lines. The allocator performs PCA on the GRM, then clusters lines by their positions in genomic principal component space using k-means or hierarchical clustering. The resulting cluster labels guide allocation just as family labels do.
This approach is more precise than family-based allocation when families are heterogeneous in size or when marker data captures structure not visible from pedigree labels alone. It directly improves the conditions under which GBLUP models operate by ensuring that genomically similar lines are spread across environments.
7.3 Pedigree-based allocation ("A")
When genomic data are unavailable, the pedigree numerator relationship matrix \(A\) plays the same role as the GRM. PCA is performed on \(A\) and clustering proceeds identically. The resulting groups reflect pedigree relatedness rather than marker-estimated genomic similarity.
7.4 Why this matters for prediction
Under a GBLUP model, prediction accuracy for unobserved lines depends on their relatedness to observed lines. When genetically similar lines are clustered in the same environments, the model gets redundant replication of the same genetic profiles while other profiles go unobserved. Spreading related lines across environments ensures the full genetic diversity of the candidate set is represented at each site, maximizing the amount of novel genetic information available for calibrating the prediction model.
alloc_grm <- allocate_sparse_met(
treatments = x$treatments,
environments = x$environments,
allocation_method = "random_balanced",
n_test_entries_per_environment = 39,
target_replications = 1,
common_treatments = x$common_treatments,
allocation_group_source = "GRM",
GRM = x$OptiSparseMET_GRM,
treatment_info = x$treatment_info,
group_method = "kmeans",
group_seed = 42,
group_attempts = 25,
min_groups_per_environment = 4,
min_env_per_group = 2,
balance_groups_across_env = TRUE,
force_group_connectivity = TRUE,
seed = 123
)
alloc_grm$group_by_environment # how many lines per group per environment
alloc_grm$group_overlap_matrix # how many groups are shared across env pairs8. Common Treatments
Common treatments are lines assigned to every environment before sparse allocation begins. If \(C\) is the number of common treatments, each environment has its effective sparse capacity reduced:
\[k_e^* = k_e - C\]
and the number of sparse-allocatable lines is:
\[J^* = J - C\]
8.1 Why common treatments matter
Common treatments serve two statistically distinct purposes.
Design-based connectivity: Two environments that share no tested lines can only be linked through the model — through the assumed structure of \(K\). Common treatments provide direct comparisons that do not depend on any model assumption. This is important when environments are weakly genetically correlated or when the researcher wants robust cross-site inference.
Estimation stability: Common treatments appear in every environment, so their effects are estimated with the highest precision. They function as stable benchmarks against which the performance of sparse treatments is measured. In practice, they are often elite lines or standard checks whose behavior is already partially known.
For novice readers: Common treatments are like reference points in a measurement system. Even if two environments test completely different sets of experimental lines, they can still be compared through their shared reference lines — just as two labs can compare results if they all measure the same control sample.
8.2 When to include them
Include common treatments when: - Environments are expected to be weakly correlated (e.g., very different climate zones, different growing seasons) - Direct cross-environment comparisons need to be reliable without assuming a specific genetic covariance structure - A stable reference set is needed for cross-trial benchmarking
Minimize the number of common treatments when: - All environments are similar and \(K\)-based connectivity is sufficient - Sparse capacity is already tight and common treatments would leave too few slots for the experimental lines
9. Seed-Aware Replication
9.1 The feasibility constraint
A design that requests replication without accounting for available seed cannot be deployed. The feasibility condition for assigning replication \(r_i\) to line \(i\) in environment \(e\) is:
\[s_i \geq r_i \times q_e + b\]
where: - \(s_i\) = seeds available for line \(i\) - \(r_i\) = desired replication level - \(q_e\) = seeds required per plot in environment \(e\) - \(b\) = optional safety buffer (e.g., for germination losses)
9.2 Replication roles
assign_replication_by_seed() evaluates this condition
for every line and assigns one of three roles:
-
"p_rep"— sufficient seed for \(r_i\) plots; assigneddesired_replicationsplots -
"unreplicated"— sufficient seed for one plot but not fordesired_replicationsplots (downgraded) -
"excluded"— insufficient seed even for one plot
Three design modes control how these roles are assigned:
| Mode | "augmented" |
"p_rep" |
"rcbd_type" |
|---|---|---|---|
| Target | 1 plot per line | \(r\) plots for a feasible subset | \(r\) plots for all lines |
| Downgrade | Excluded if seed < \(q_e\) | Non-candidates get 1 plot | Yes, if shortage_action = "downgrade"
|
| P-rep selection | — | By priority (seed, order, random) | — |
seed_df <- data.frame(
Treatment = x$treatments,
SeedAvailable = sample(20:120, length(x$treatments), replace = TRUE)
)
rep_plan <- assign_replication_by_seed(
treatments = x$treatments,
seed_available = seed_df,
seed_required_per_plot = 10,
replication_mode = "p_rep",
desired_replications = 2,
priority = "seed_available",
max_prep = 20,
minimum_seed_buffer = 5,
shortage_action = "downgrade",
seed = 123
)
length(rep_plan$p_rep_treatments) # replicated at 2 reps
length(rep_plan$unreplicated_treatments) # assigned 1 plot (downgraded)
length(rep_plan$excluded_treatments) # insufficient seed
rep_plan$seed_summary # full per-line feasibility table10. Within-Environment Design Engines
10.1 met_prep_famoptg() — block-based repeated-check
designs
This engine constructs repeated-check block designs for environments where the primary local control mechanism is blocking (incomplete block designs). It supports three design classes:
Augmented repeated-check design
All non-check lines are unreplicated. Each block contains the same set of check treatments plus a unique subset of experimental lines. This design is appropriate for early screening stages when the number of candidates far exceeds available resources and seed is very limited. The checks allow estimation of block effects, which are subtracted to improve comparisons among the unreplicated lines.
Partially replicated (p-rep) design
A subset of lines — those with more seed or higher priority — is
replicated at desired_replications plots, distributed
across distinct blocks. The remainder appear unreplicated. P-rep designs
are efficient because they concentrate replication where it provides the
most information while still including a broad set of candidates.
RCBD-type repeated-check design
All non-check lines are replicated at a uniform level. This is the closest to a standard RCBD within a single environment, extended to include repeated checks in every block.
Structural guarantees enforced by
met_prep_famoptg():
- Check treatments appear in every block
- Replicated lines appear in distinct blocks (at most once per block)
- Unreplicated lines appear exactly once in the design
Optional features include family-based or GRM/A-based genetic dispersion optimization (minimizing the average relatedness of adjacent plots) and mixed-model efficiency evaluation (A, D, CDmean).
design_E1 <- met_prep_famoptg(
check_treatments = c("CHK1", "CHK2"),
check_families = c("CHECK", "CHECK"),
p_rep_treatments = rep_plan$p_rep_treatments,
p_rep_reps = rep_plan$p_rep_reps,
p_rep_families = x$treatment_info$Family[
match(rep_plan$p_rep_treatments,
x$treatment_info$Treatment)],
unreplicated_treatments = rep_plan$unreplicated_treatments,
unreplicated_families = x$treatment_info$Family[
match(rep_plan$unreplicated_treatments,
x$treatment_info$Treatment)],
n_blocks = 4L,
n_rows = 10L,
n_cols = 12L,
order = "row",
serpentine = TRUE,
seed = 123
)
design_E1$layout_matrix # plot-level field map
design_E1$field_book # row-level data frame ready for export10.2 met_alpha_rc_stream() — row-column alpha
designs
This engine constructs alpha row-column designs for large fields where both row and column gradients are present. It is most appropriate when:
- The field is large and rectangular
- Row and column variation is substantial relative to treatment differences
- A spatial residual model (AR1 or AR1×AR1) will be used at analysis
The design is built by streaming plots across the field in a specified traversal order (row-major, column-major, or serpentine). Each replicate occupies a contiguous block of rows or columns. Within each replicate, entries are partitioned into incomplete blocks, with check treatments inserted at regular intervals within every incomplete block.
Structural guarantees:
- Each entry appears exactly once per replicate
- Check treatments appear in every incomplete block
- Block sizes are constrained within
[min_block_size, max_block_size]
design_E2 <- met_alpha_rc_stream(
check_treatments = c("CHK1", "CHK2"),
check_families = c("CHECK", "CHECK"),
entry_treatments = rep_plan$p_rep_treatments,
entry_families = x$treatment_info$Family[
match(rep_plan$p_rep_treatments,
x$treatment_info$Treatment)],
n_reps = 2L,
n_rows = 10L,
n_cols = 12L,
min_block_size = 8L,
max_block_size = 12L,
order = "row",
serpentine = TRUE,
seed = 123
)
design_E2$layout_matrix
design_E2$field_book
design_E2$design_info # n_reps, n_blocks_per_rep, total_used_plots10.3 Design engine selection guide
Use met_prep_famoptg() when:
- Repeated check distribution across blocks is a structural requirement
- The design needs to support p-rep, augmented, or RCBD-type structure
- The field is subdivided into blocks but row-column spatial modeling is not the primary concern
Use met_alpha_rc_stream() when:
- The field is large and continuous, with both row and column gradients
- Spatial models (AR1 or AR1×AR1) will be used at analysis
- The design needs fixed-grid compatibility for automated plot management
The engines can be mixed across environments within the same trial.
11. Efficiency Evaluation
11.1 Block-based designs:
met_evaluate_famoptg_efficiency()
For designs produced by met_prep_famoptg(), efficiency
is evaluated under a model with a block random effect:
\[y = X\beta + Zg + Wb + e\]
where \(b \sim N(0,\, I\sigma_b^2)\)
is the block effect. The variance components are specified in
varcomp using sigma_b2 for the block variance.
Note that this model has a single block variance
component — there is no separate replicate or
incomplete-block-within-replicate structure.
The function computes A-criterion, D-criterion (for fixed treatment
effects), and CDmean (for random treatment effects), under any of three
residual structures: "IID", "AR1" (along rows
or columns), or "AR1xAR1" (spatial gradient in both
dimensions).
eff_E1 <- met_evaluate_famoptg_efficiency(
field_book = design_E1$field_book,
n_rows = 10L,
n_cols = 12L,
check_treatments = c("CHK1", "CHK2"),
varcomp = list(
sigma_e2 = 1.0, # residual variance
sigma_g2 = 0.5, # genetic variance (for CDmean)
sigma_b2 = 0.3, # block variance
sigma_r2 = 0.05, # row variance (for spatial structures)
sigma_c2 = 0.05 # column variance
),
treatment_effect = "random",
prediction_type = "IID",
residual_structure = "IID"
)
eff_E1$CDmean # in [0, 1]: higher is better
eff_E1$mean_PEV # mean prediction error variance
eff_E1$A # A-efficiency (1 / A_criterion): higher is better
eff_E1$D # D-efficiency: higher is better11.2 Row-column designs:
met_evaluate_alpha_efficiency()
For designs produced by met_alpha_rc_stream(),
efficiency is evaluated under a model with replicate and
incomplete-block-within-replicate random effects:
\[y = X\beta + Zg + W_r r + W_{b} b + e\]
where \(r \sim N(0,\,
I\sigma_{\text{rep}}^2)\) and \(b \sim
N(0,\, I\sigma_{\text{ib}}^2)\) are the replicate and incomplete
block effects respectively. The varcomp list must contain
sigma_rep2 and sigma_ib2 (not
sigma_b2). This two-level blocking structure reflects the
hierarchical design of alpha row-column layouts.
eff_E2 <- met_evaluate_alpha_efficiency(
field_book = design_E2$field_book,
n_rows = 10L,
n_cols = 12L,
check_treatments = c("CHK1", "CHK2"),
varcomp = list(
sigma_e2 = 1.0,
sigma_g2 = 0.5,
sigma_rep2 = 0.4, # replicate variance
sigma_ib2 = 0.2, # incomplete block within replicate
sigma_r2 = 0.05,
sigma_c2 = 0.05
),
treatment_effect = "random",
prediction_type = "IID",
residual_structure = "IID"
)
eff_E2$CDmean
eff_E2$mean_PEVKey distinction:
met_evaluate_famoptg_efficiency()usessigma_b2(one block level);met_evaluate_alpha_efficiency()usessigma_rep2+sigma_ib2(two block levels: replicate and incomplete block within replicate). Using the wrong function for the wrong design will produce an error or incorrect results.
12. Design Optimisation
Once a base design is constructed and its efficiency evaluated, the optimizer functions search for a better arrangement of the same treatments within the same blocking structure.
12.1 met_optimize_famoptg() — block-based
optimisation
met_optimize_famoptg() uses Random Restart (RS): it
generates multiple independent random designs, evaluates each, and
returns the best. All structural guarantees of
met_prep_famoptg() are enforced in every candidate
design.
opt_E1 <- met_optimize_famoptg(
# All arguments of met_prep_famoptg(), plus:
check_treatments = c("CHK1", "CHK2"),
check_families = c("CHECK", "CHECK"),
p_rep_treatments = rep_plan$p_rep_treatments,
p_rep_reps = rep_plan$p_rep_reps,
p_rep_families = x$treatment_info$Family[
match(rep_plan$p_rep_treatments,
x$treatment_info$Treatment)],
unreplicated_treatments = rep_plan$unreplicated_treatments,
unreplicated_families = x$treatment_info$Family[
match(rep_plan$unreplicated_treatments,
x$treatment_info$Treatment)],
n_blocks = 4L,
n_rows = 10L,
n_cols = 12L,
# Optimisation arguments:
treatment_effect = "random",
prediction_type = "IID",
criterion = "CDmean",
varcomp = list(sigma_e2=1, sigma_g2=0.5, sigma_b2=0.3,
sigma_r2=0.05, sigma_c2=0.05),
n_restarts = 20L,
seed = 123
)
opt_E1$optimization$best_score # best CDmean found
opt_E1$optimization$score_history # CDmean at each restart
opt_E1$optimization$n_failed # restarts that produced invalid designs
opt_E1$field_book # the best design's field book
opt_E1$efficiency # efficiency metrics of the best design12.2 met_optimize_alpha_rc() — row-column
optimisation
met_optimize_alpha_rc() supports three optimisation
algorithms:
RS (Random Restart): Same as
met_optimize_famoptg(). Generates multiple independent
candidates and keeps the best. Simple and parallelizable.
SA (Simulated Annealing): Starts from one design and
iteratively perturbs it. At each step, a worse design is accepted with
probability \(\exp(-\Delta / T)\),
where \(\Delta\) is the score decrease
and \(T\) is the current temperature.
As \(T\) decreases from
sa_temp_start to sa_temp_end, the algorithm
transitions from broad exploration to local refinement. SA is effective
when the fitness landscape has many local optima.
GA (Genetic Algorithm): Maintains a population of designs. At each generation, pairs of high-scoring designs are combined (“crossover”) and random perturbations are applied (“mutation”). Elite designs are carried forward unchanged. GA explores a broader search space than RS or SA for the same number of objective function evaluations, at the cost of higher implementation complexity.
# Simulated Annealing
opt_E2_sa <- met_optimize_alpha_rc(
check_treatments = c("CHK1", "CHK2"),
check_families = c("CHECK", "CHECK"),
entry_treatments = rep_plan$p_rep_treatments,
entry_families = x$treatment_info$Family[
match(rep_plan$p_rep_treatments,
x$treatment_info$Treatment)],
n_reps = 2L,
n_rows = 10L,
n_cols = 12L,
method = "SA",
treatment_effect = "random",
prediction_type = "IID",
criterion = "CDmean",
varcomp = list(sigma_e2=1, sigma_g2=0.5, sigma_rep2=0.4,
sigma_ib2=0.2, sigma_r2=0.05, sigma_c2=0.05),
sa_max_iter = 500L,
sa_temp_start = 1.0,
sa_temp_end = 0.01,
seed = 123
)
opt_E2_sa$optimization$best_score
opt_E2_sa$optimization$sa_params # SA-specific settings
opt_E2_sa$optimization$ga_params # NULL (not used)
# Genetic Algorithm
opt_E2_ga <- met_optimize_alpha_rc(
check_treatments = c("CHK1", "CHK2"),
check_families = c("CHECK", "CHECK"),
entry_treatments = rep_plan$p_rep_treatments,
entry_families = x$treatment_info$Family[
match(rep_plan$p_rep_treatments,
x$treatment_info$Treatment)],
n_reps = 2L,
n_rows = 10L,
n_cols = 12L,
method = "GA",
treatment_effect = "random",
prediction_type = "IID",
criterion = "CDmean",
varcomp = list(sigma_e2=1, sigma_g2=0.5, sigma_rep2=0.4,
sigma_ib2=0.2, sigma_r2=0.05, sigma_c2=0.05),
ga_pop_size = 20L,
ga_n_generations = 50L,
ga_elitism = 4L,
seed = 123
)
opt_E2_ga$optimization$best_score
opt_E2_ga$optimization$ga_params13. Full Pipeline Example
13.1 Step-by-step
library(OptiSparseMET)
data("OptiSparseMET_example_data")
x <- OptiSparseMET_example_data
# Stage 0: verify capacity
k_safe <- suggest_safe_k(x$treatments, x$environments,
common_treatments = x$common_treatments,
buffer = 3) # 39
# Stage 1: allocation
# k_safe = 39 does not satisfy the slot identity for M4 (112*1=112 ≠ 4*31=124).
# Use random_balanced (M3) here, or adjust k for M4 slot identity.
# For M4 with r=1: k* = 112/4 = 28, k = 36.
alloc <- allocate_sparse_met(
treatments = x$treatments,
environments = x$environments,
allocation_method = "random_balanced",
n_test_entries_per_environment = k_safe,
target_replications = 1,
common_treatments = x$common_treatments,
allocation_group_source = "Family",
treatment_info = x$treatment_info,
seed = 123
)
alloc$summary
alloc$overlap_matrix # environments x environments: how many lines shared
# Stage 2: seed-aware replication
rep_plan <- assign_replication_by_seed(
treatments = x$treatments,
seed_available = x$seed_info,
seed_required_per_plot = 8,
replication_mode = "p_rep",
desired_replications = 2,
max_prep = 15,
shortage_action = "downgrade",
seed = 123
)
# Stage 3: design per environment (example for Env_A)
design_A <- met_prep_famoptg(
check_treatments = x$env_design_specs$Env_A$check_treatments,
check_families = x$env_design_specs$Env_A$check_families,
p_rep_treatments = rep_plan$p_rep_treatments,
p_rep_reps = rep_plan$p_rep_reps,
p_rep_families = x$treatment_info$Family[
match(rep_plan$p_rep_treatments,
x$treatment_info$Treatment)],
unreplicated_treatments = rep_plan$unreplicated_treatments,
unreplicated_families = x$treatment_info$Family[
match(rep_plan$unreplicated_treatments,
x$treatment_info$Treatment)],
n_blocks = x$env_design_specs$Env_A$n_blocks,
n_rows = x$env_design_specs$Env_A$n_rows,
n_cols = x$env_design_specs$Env_A$n_cols,
seed = 123
)
# Stage 4: combine field books
met_book <- combine_met_fieldbooks(
field_books = list(Env_A = design_A$field_book),
local_designs = c(Env_A = "met_prep_famoptg"),
replication_modes = c(Env_A = "p_rep"),
sparse_method = "balanced_incomplete",
common_treatments = x$common_treatments
)
head(met_book)13.2 End-to-end pipeline
plan_sparse_met_design() executes all four stages in a
single call. Each environment’s local design is controlled by
env_design_specs, a named list where the
design field specifies "met_prep_famoptg" or
"met_alpha_rc_stream":
out <- plan_sparse_met_design(
treatments = x$treatments,
environments = x$environments,
allocation_method = "random_balanced",
n_test_entries_per_environment = 39,
target_replications = 1,
common_treatments = x$common_treatments,
env_design_specs = x$env_design_specs,
treatment_info = x$treatment_info,
seed_info = x$seed_info,
seed_required_per_plot = x$seed_required_per_plot,
seed = 123
)
names(out)
# "sparse_allocation" "environment_designs" "combined_field_book"
# "environment_summary" "group_environment_summary"
# "efficiency_summary" "summary" "seed_used"
head(out$combined_field_book)
out$environment_summary # one row per environment
out$efficiency_summary # efficiency metrics per environment13.3 Environment-specific design specifications
The env_design_specs list allows each environment to use
a different design engine and replication mode. The design
field routes to the correct constructor. Additional fields are passed
directly as arguments:
env_design_specs <- list(
# p-rep block design
Env_A = list(
design = "met_prep_famoptg",
replication_mode = "p_rep",
desired_replications = 2L,
max_prep = 15L,
shortage_action = "downgrade",
check_treatments = c("CHK1", "CHK2"),
check_families = c("CHECK", "CHECK"),
n_blocks = 4L,
n_rows = 10L,
n_cols = 12L,
cluster_source = "Family",
eval_efficiency = TRUE
),
# Row-column alpha design
Env_B = list(
design = "met_alpha_rc_stream",
check_treatments = c("CHK1", "CHK2"),
check_families = c("CHECK", "CHECK"),
n_reps = 2L,
n_rows = 10L,
n_cols = 12L,
min_entry_slots_per_block = 8L,
cluster_source = "Family",
eval_efficiency = TRUE
),
# Augmented repeated-check design
Env_C = list(
design = "met_prep_famoptg",
replication_mode = "augmented",
check_treatments = c("CHK1", "CHK2"),
check_families = c("CHECK", "CHECK"),
n_blocks = 4L,
n_rows = 10L,
n_cols = 10L,
cluster_source = "Family",
eval_efficiency = FALSE
)
)14. Connectivity and Genomic Prediction: Simulation Illustration
The benefit of better cross-environment connectivity can be explored through simulation. The following code generates sparse allocations under M3 and M4, simulates phenotypic data under a genomic model, fits a GBLUP model to each dataset, and compares the resulting prediction quality.
library(sommer)
library(MASS)
set.seed(1)
n_geno <- 200
n_env <- 4
# Simulate a genomic relationship matrix
G0 <- matrix(rnorm(n_geno^2), n_geno)
G0 <- tcrossprod(G0) / n_geno
diag(G0) <- diag(G0) + 0.5
simulate_MET <- function(allocation_matrix) {
geno_ids <- rownames(allocation_matrix)
env_ids <- colnames(allocation_matrix)
# Simulate genetic values from GRM-based covariance structure
g_eff <- MASS::mvrnorm(1, mu = rep(0, length(geno_ids)), Sigma = G0)
e_eff <- rnorm(length(env_ids), 0, 1)
dat <- expand.grid(Geno = geno_ids, Env = env_ids, stringsAsFactors = FALSE)
dat$present <- as.vector(allocation_matrix)
dat <- dat[dat$present == 1, ]
dat$y <- g_eff[match(dat$Geno, geno_ids)] +
e_eff[match(dat$Env, env_ids)] +
rnorm(nrow(dat), 0, 1)
dat
}
# M3 allocation: n=200, I=4, k=80, r~=1.6 (total=320, target=2 but approx)
alloc_M3 <- allocate_sparse_met(
treatments = paste0("L", seq_len(n_geno)),
environments = paste0("E", seq_len(n_env)),
allocation_method = "random_balanced",
n_test_entries_per_environment = 80,
target_replications = 2,
allow_approximate = TRUE,
seed = 1
)
# M4 allocation: equal replication (allow_approximate = FALSE).
# J=200, I=4, r=2: k = J*r/I = 200*2/4 = 100. Slot identity: 200*2=4*100=400.
alloc_M4 <- allocate_sparse_met(
treatments = paste0("L", seq_len(n_geno)),
environments = paste0("E", seq_len(n_env)),
allocation_method = "balanced_incomplete",
n_test_entries_per_environment = 100,
target_replications = 2,
allow_approximate = FALSE,
seed = 1
)
dat_M3 <- simulate_MET(alloc_M3$allocation_matrix)
dat_M4 <- simulate_MET(alloc_M4$allocation_matrix)
# Fit GBLUP model to each dataset
fit_M3 <- sommer::mmer(
y ~ Env,
random = ~ sommer::vsr(Geno, Gu = G0),
data = dat_M3
)
fit_M4 <- sommer::mmer(
y ~ Env,
random = ~ sommer::vsr(Geno, Gu = G0),
data = dat_M4
)
# Compare mean PEV: lower is better (more precise prediction)
mean(fit_M3$PevU$`u:Geno`$y)
mean(fit_M4$PevU$`u:Geno`$y)In many settings the more uniform co-occurrence structure of M4 leads to lower average PEV and stronger cross-environment connectivity. The magnitude of the advantage depends on the degree of imbalance, the genetic covariance structure, and the ratio of genetic to residual variance.
15. Design Component Summary
| Component | Governing principle | Package function(s) |
|---|---|---|
| Capacity verification | All treatments must be assignable at least once |
suggest_safe_k(),
min_k_for_full_coverage()
|
| Allocation | Balance replication depth against coverage breadth | allocate_sparse_met() |
| BIBD feasibility | Slot identity \(J^* \\times r = I \\times k^*\) must hold for equal replication | check_balanced_incomplete_feasibility() |
| Genetic structure | Prevent clustering; improve prediction conditions | derive_allocation_groups() |
| Connectivity | Common treatments provide model-free cross-site linkage |
common_treatments argument |
| Seed feasibility | Replication must be achievable with available seed | assign_replication_by_seed() |
| Block design | Control local heterogeneity within environments | met_prep_famoptg() |
| Row-column design | Control spatial gradients in large fixed-grid fields | met_alpha_rc_stream() |
| Efficiency evaluation | Quantify design quality before field deployment |
met_evaluate_famoptg_efficiency(),
met_evaluate_alpha_efficiency()
|
| Optimisation | Search for higher-efficiency arrangements |
met_optimize_famoptg(),
met_optimize_alpha_rc()
|
| Assembly | Produce a unified field book for all environments |
combine_met_fieldbooks(),
plan_sparse_met_design()
|
16. Conclusions
Sparse MET design is a two-level problem and requires a framework
that addresses both levels consistently. OptiSparseMET
provides the following:
At the allocation level: two complementary strategies (M3 for flexible approximate balance, M4 for design-based uniformity), diagnostic tools for verifying feasibility before any allocation is attempted, and genetic structure awareness through family, GRM, or pedigree-based grouping.
At the field design level: two design engines with
complementary strengths (met_prep_famoptg() for block-based
repeated-check layouts, met_alpha_rc_stream() for
fixed-grid row-column designs), efficiency evaluation under A, D, and
CDmean criteria, and criterion-driven optimisation via three methods
(RS, SA, GA).
Linking both levels: seed feasibility verification
ensures designs are operationally deployable; common treatment support
ensures design-based connectivity; the pipeline functions
(plan_sparse_met_design(),
combine_met_fieldbooks()) ensure that outputs from both
levels are consistent and assembled into a unified MET field book.
The result is a workflow that produces designs that are statistically grounded and operationally deployable — rather than optimizing one at the expense of the other.
17. References
Montesinos-López, O. A., Mosqueda-González, B. A., Salinas-Ruiz, J., Montesinos-López, A., & Crossa, J. (2023). Sparse multi-trait genomic prediction under balanced incomplete block design. The Plant Genome, 16, e20305. https://doi.org/10.1002/tpg2.20305
Rincent, R., Laloe, D., Nicolas, S., Altmann, T., Brunel, D., Revilla, P., Rodriguez, V. M., Moreno-Gonzalez, J., Melchinger, A., Bauer, E., Schoen, C.-C., Meyer, N., Giauffret, C., Bauland, C., Jamin, P., Laborde, J., Monod, H., Flament, P., Charcosset, A., & Moreau, L. (2012). Maximizing the reliability of genomic selection by optimizing the calibration set of reference individuals: Comparison of methods in two diverse groups of maize inbreds (Zea mays L.). Genetics, 192, 715–728. https://doi.org/10.1534/genetics.112.141473