Skip to content

composite artboards for main Figs 2–5 and SuppFig1–6; merge Fig4/5, rename dynamics to Fig5#74

Open
priyanka9991 wants to merge 49 commits into
devfrom
add/fig_panels
Open

composite artboards for main Figs 2–5 and SuppFig1–6; merge Fig4/5, rename dynamics to Fig5#74
priyanka9991 wants to merge 49 commits into
devfrom
add/fig_panels

Conversation

@priyanka9991
Copy link
Copy Markdown
Collaborator

Main figures
Figure 2 (figure2_melanoma_analysis.py): combined / multi-panel layout work (generate_combined_panel_Fig2), plus follow-up overlap/label fixes.
Figure 3 (figure3_robustness_benchmarking.py): updates (including use from composite flows where relevant).
Figure 4 (figure4_biological_discovery_multi_dataset.py): large expansion — merges former Fig 4 + Fig 5 into one script (commit message: “merge fig4/5 to new fig 4 and rename fig6 to fig5”).
Figure 5 (figure5_validation_dynamics.py): renamed from former Fig 6 dynamics script; layout/composite updates.
Removed figure5_multi_dataset.py (superseded by the new Fig 4).
run_all.py: registry updated so main figures 2–5 point at the new modules and titles (Fig 4 = biological discovery + multi-dataset; Fig 5 = validation & dynamics).

Supplementary figures (SuppFig1–6)
Each of supp_fig1 … supp_fig6 gains composite / full-artboard figure generation (multi-row layouts, shared styling, panel lettering), plus supporting refactors. Notable themes from commits:
SuppFig1: composite figure.
SuppFig4: composite + result caching.
SuppFig2 / SuppFig3: composite figures; later gene overlap / layout fixes.
SuppFig5 / SuppFig6: composite figures.

priyanka9991 and others added 30 commits March 23, 2026 14:31
Benchmark infrastructure for systematic method comparison:
- Hierarchical gamma-Poisson simulator calibrated from TNBC raw counts,
  validated against vaccine holdout (matched-gene library-size comparison)
- 6 method runners: sctrial FE, edgeR-QLF, limma-voom, dreamlet, NEBULA,
  Wilcoxon paired (all via standardized interface)
- Orchestrator supporting parallel execution across simulation grid
- Metrics module with 11 metrics including non-convergence tracking
  and top-k Jaccard stability
- Permutation testing, subsampling reproducibility, and ablation modules
- Redesigned SF4 simulation panels (H-K) using existing results:
  power curves, calibration strips, faceted QQ plots
- 18 passing tests covering simulator, runners, and orchestrator

Simulator validated at both cell-level and pseudobulk-level against
TNBC (calibration) and vaccine (holdout). Melanoma excluded from
count-scale calibration (normalized data only).
- All R runners (edgeR, limma-voom, dreamlet, NEBULA) now use
  design-appropriate models: ~arm*visit for two-arm, ~visit with
  participant blocking for single-arm
- Fixed "All samples belong to same group" error when running
  single-arm scenarios through edgeR/limma/dreamlet
- Fixed sample-name mismatch between count matrix and metadata
  by using consistent sample IDs (S0, S1, ...) in both CSVs
- Orchestrator now threads design_type through to all runners
- Fixed missing pandas import in phase_ablation
- Fixed aggregate_pseudobulk → pseudobulk_expression import
  in permutation.py and subsample.py
- Fixed matched-gene library-size validation comparison
…ate on log-pseudobulk

sctrial_fe and wilcoxon_paired were receiving raw count-scale pseudobulk
means while edgeR/limma/dreamlet internally log-transform. This caused
betas in the thousands vs ~0.5 for R methods, making cross-method bias
comparisons invalid.

Fix: orchestrator now log1p-transforms pseudobulk means before passing
to sctrial_fe and wilcoxon_paired. sctrial_fe gains a from_pseudobulk
path that runs OLS DiD with participant FE directly on the log-pseudobulk
DataFrame. All 6 methods now report betas on comparable log-fold-change
scale.

Also includes: SLURM submission script, ablation calibration fix,
stale docstring corrections, ChatGPT/CODEX review fixes for permutation
and subsample modules.
- simulator.py: guard X_counts None before .sum/.mean/.var
- ablation.py: guard fn None before calling
- orchestrator.py: use module imports to avoid run name clashes
- R runners: fix tmpdir str/Path type mismatch
Explicit ndarray type annotation and np.asarray wrapping for
rng.uniform/rng.lognormal/np.clip return values.
Results are now flushed to disk every 20 iterations instead of
only after all 200 complete. Protects against losing hours of
computation if the job crashes mid-scenario.
edgeR: pass group argument to filterByExpr (prevents over-filtering),
add robust=TRUE to estimateDisp and glmQLFit, ensure row alignment
between counts and metadata.

sctrial_fe: use HC1 SEs instead of cluster-robust when n_participants
<= 20. Cluster-robust SEs with few clusters (fewer than parameters)
are known to be overly conservative, inflating p-values.
HC1 drops within-participant dependence correction which is the core
of the method. Keep cluster-robust SEs uniformly and report the
known small-sample conservatism honestly rather than relaxing the
variance estimator. The conservatism with few clusters (n<=20) is a
real property of the method, not a bug to fix by changing estimators.
The participant FE model with T=2 periods has 18 parameters on 32 obs
(n=8/arm), leaving only 14 residual df. This near-saturated model makes
both cluster-robust SEs and wild cluster bootstrap unreliable (FPR=0
under the null). First-differencing (delta_i = post_i - pre_i) is
numerically equivalent to FE with T=2 (Wooldridge Ch.14) but eliminates
the saturation problem. Welch t-test on participant-level deltas gives
FPR ~0.04 at n=8 — slightly conservative but well-calibrated.
The benchmark runner uses first-difference participant-level DiD, not
the full FE regression from the manuscript. Renaming makes explicit that
this is a benchmark surrogate for the same T=2 DiD estimand, not the
literal manuscript inferential engine.
Adds the _fail_result helper that returns NaN results for failed genes,
resolving the F821 ruff/mypy errors in CI.
Two changes fix the ultra-conservative edgeR behavior (FPR=0.002):

1. DGEList(counts=t(counts), group=group) instead of passing group
   only to filterByExpr. This ensures filterByExpr.DGEList recognizes
   the group structure instead of treating all samples as one group.

2. Remove robust=TRUE from estimateDisp and glmQLFit. With sparse
   simulated pseudobulk (many genes with >75% zeros), robust dispersion
   estimation over-shrinks, producing inflated p-values. Standard
   estimation gives well-calibrated null behavior (FPR=0.057, median
   p=0.48 across 20 iterations × 20 genes).
rpy2's embedded R corrupts in forked multiprocessing workers,
causing edgeR to produce systematically conservative p-values
(FPR=0.002 vs expected 0.05). All 4 R runners now call Rscript
via subprocess, which spawns a fresh R process per call and
avoids the fork corruption issue.
fork() inherits parent R/rpy2 state which corrupts R subprocess calls
in workers, causing edgeR FPR=0.002 instead of 0.05. Confirmed by
running _run_single_iteration directly (FPR=0.053) vs via mp.Pool
with fork (FPR=0.002). Using mp.get_context("spawn") creates fresh
worker processes without inherited state.
Scripts verify: parallel equivalence, observed-scale truth definition,
runner calibration audit, and direct-vs-worker execution comparison.
All 4 pass locally: edgeR/limma calibrated, truth definition valid,
execution modes produce identical results.
R runners use subprocess+Rscript, not rpy2. Also add git commit
hash and package path to job log for reproducibility.
@priyanka9991 priyanka9991 self-assigned this Mar 31, 2026
priyanka9991 and others added 15 commits April 1, 2026 17:03
Addresses reviewer concern that 50-gene/20%-signal benchmark is
regime-specific. Tests whether dreamlet/NEBULA null-gene FPR inflation
attenuates with larger panels and lower signal fractions.

New sensitivity grid:
  Panel sizes:     50, 200, 500, 2000 genes
  Signal fractions: 1%, 5%, 10%, 20% + pure null per size
  = 20 scenarios × 200 iterations × 4 methods (two-arm only)

Changes:
- orchestrator.py: add build_sensitivity_grid() and
  run_sensitivity_benchmark() functions
- run_benchmark.py: add --phase sensitivity entry point
- slurm_sensitivity.sh: dedicated SLURM job (72h, 32 CPUs, 256GB)
- supp_fig7_signal_fraction_sensitivity.py: new SF7 with 4 panels:
  A) FPR heatmap (method × panel size × signal fraction)
  B) FPR line plots per panel size
  C) Pure-null lambda_GC across panel sizes
  D) QQ contrast: 50g/20% vs 2000g/1%

Smoke tested locally: all 4 methods produce valid results across
50-gene to 2000-gene panels. 2000-gene dreamlet ~65s/iter,
NEBULA ~318s/iter; estimated HPC total ~4-8 hours with 30 workers.
- dreamlet: timeout 600s → 1800s for large-panel scenarios
- NEBULA: timeout 1200s → 2400s for large-panel scenarios
- New SLURM script slurm_sensitivity_2k_rerun.sh: re-runs only the
  5 × 2000-gene scenarios with reduced parallelism (10 workers vs 30)
  to avoid memory pressure and subprocess timeouts
Replaces the seven original benchmark panels with publication-quality
versions driven by the signal-fraction sensitivity dataset (11M rows,
4 panel sizes x 5 signal fractions x 4 methods).

New panels:
  H  Null-gene FPR curves faceted by panel size (50/200/500/2000)
  I  Null-gene FPR heatmap (method x panel size x signal fraction)
  J  Pure-null lambda_GC across panel sizes
  K  QQ plots at a challenging regime (200 genes, 10% signal)
  L  Pure-null FPR dot-and-whisker with Wilson 95% CIs
  M  Effect-size estimation accuracy (bias + RMSE on signal genes)
  N  Runtime scaling across panel sizes (log-log)

Key findings surfaced by the new panels:
- All four methods are nominally calibrated under pure null
  (lambda_GC ~= 1, FPR within 3-7% band)
- dreamlet's null-gene FPR inflation scales with signal fraction and
  is independent of panel size (same inflation at 50 vs 2000 genes)
- dreamlet also exhibits ~50% positive bias in effect-size estimation
  on signal genes, with 3-5x higher RMSE than sctrial, Wilcoxon, NEBULA
- sctrial and Wilcoxon (Delta scores) are the only methods that
  maintain both nominal calibration and unbiased estimation across all
  tested regimes
- sctrial and Wilcoxon are ~1000x faster than the R-based methods

Implementation notes:
- New helper functions: _compute_null_fpr_table,
  _compute_signal_power_table, _compute_signal_bias_rmse_table
- Consistent styling via _method_style, _add_nominal_band, _style_axis
- Data source updated to manuscript/benchmark/sensitivity/sensitivity_combined.csv
- Removed the standalone supp_fig7_signal_fraction_sensitivity.py (content
  folded into SF4)
Panel H (FPR curves): Wilcoxon was hidden underneath sctrial at the
  nominal 5% line. Added tiny per-method x-offsets (-0.3 to +0.3 signal
  %-points) so overlapping methods are visible side-by-side.

Panel I (FPR heatmap): y-axis signal-fraction labels were only drawn on
  the first subplot because sharey=True stripped them from the others.
  Switched to sharey=False so every subplot shows its own y-tick labels.
  The first subplot still carries the axis-label text.

Panel L (pure-null calibration): the old horizontal dot-and-whisker with
  per-method y-offsets made all four panel sizes cluster visually on top
  of each other. Redesigned as a log-scale line plot: x = panel size,
  y = pure-null FPR, one line per method with 95% Wilson CIs. All four
  panel sizes are now distinct points along each line.

Panel M (effect-size accuracy): the bias axis used a symmetric y-range
  that wasted the bottom half (no method has large negative bias), making
  bars look "cut in half". Switched to asymmetric [min, max] limits.
  The in-axis legend overlapped with dreamlet's tall bars; moved it to
  a figure-level legend above the subplots. sctrial/Wilcoxon/NEBULA bars
  remain near zero because those methods ARE essentially unbiased —
  that is the correct, honest visual encoding.
Panels H, J, L, and N previously plotted panel size (50/200/500/2000)
or signal fraction (1/5/10/20%) at their literal numeric positions on
a linear or log axis, which produced uneven visual spacing between the
four discrete levels. Switched all four panels to categorical x-positions
(0, 1, 2, 3) with the raw values used only as tick labels, so every
subplot now has four evenly-spaced x-ticks regardless of value span.

Also set explicit MultipleLocator on y-axes that relied on matplotlib
auto-ticking:
  H: y step 0.1 on [0.0, 0.7]
  J: y step 0.05 on [0.90, 1.15]
  L: y step 0.01 on [0.025, 0.085]
  M (bias row): y step 0.05
  M (RMSE row): y step 0.05
  N: y log scale (decade ticks) unchanged

Panel I (heatmap) and Panel K (QQ with shared axes) already have
uniform tick grids so they are unchanged.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant