composite artboards for main Figs 2–5 and SuppFig1–6; merge Fig4/5, rename dynamics to Fig5#74
Open
priyanka9991 wants to merge 49 commits into
Open
composite artboards for main Figs 2–5 and SuppFig1–6; merge Fig4/5, rename dynamics to Fig5#74priyanka9991 wants to merge 49 commits into
priyanka9991 wants to merge 49 commits into
Conversation
Benchmark infrastructure for systematic method comparison: - Hierarchical gamma-Poisson simulator calibrated from TNBC raw counts, validated against vaccine holdout (matched-gene library-size comparison) - 6 method runners: sctrial FE, edgeR-QLF, limma-voom, dreamlet, NEBULA, Wilcoxon paired (all via standardized interface) - Orchestrator supporting parallel execution across simulation grid - Metrics module with 11 metrics including non-convergence tracking and top-k Jaccard stability - Permutation testing, subsampling reproducibility, and ablation modules - Redesigned SF4 simulation panels (H-K) using existing results: power curves, calibration strips, faceted QQ plots - 18 passing tests covering simulator, runners, and orchestrator Simulator validated at both cell-level and pseudobulk-level against TNBC (calibration) and vaccine (holdout). Melanoma excluded from count-scale calibration (normalized data only).
- All R runners (edgeR, limma-voom, dreamlet, NEBULA) now use design-appropriate models: ~arm*visit for two-arm, ~visit with participant blocking for single-arm - Fixed "All samples belong to same group" error when running single-arm scenarios through edgeR/limma/dreamlet - Fixed sample-name mismatch between count matrix and metadata by using consistent sample IDs (S0, S1, ...) in both CSVs - Orchestrator now threads design_type through to all runners - Fixed missing pandas import in phase_ablation - Fixed aggregate_pseudobulk → pseudobulk_expression import in permutation.py and subsample.py - Fixed matched-gene library-size validation comparison
…ate on log-pseudobulk sctrial_fe and wilcoxon_paired were receiving raw count-scale pseudobulk means while edgeR/limma/dreamlet internally log-transform. This caused betas in the thousands vs ~0.5 for R methods, making cross-method bias comparisons invalid. Fix: orchestrator now log1p-transforms pseudobulk means before passing to sctrial_fe and wilcoxon_paired. sctrial_fe gains a from_pseudobulk path that runs OLS DiD with participant FE directly on the log-pseudobulk DataFrame. All 6 methods now report betas on comparable log-fold-change scale. Also includes: SLURM submission script, ablation calibration fix, stale docstring corrections, ChatGPT/CODEX review fixes for permutation and subsample modules.
- simulator.py: guard X_counts None before .sum/.mean/.var - ablation.py: guard fn None before calling - orchestrator.py: use module imports to avoid run name clashes - R runners: fix tmpdir str/Path type mismatch
Explicit ndarray type annotation and np.asarray wrapping for rng.uniform/rng.lognormal/np.clip return values.
Results are now flushed to disk every 20 iterations instead of only after all 200 complete. Protects against losing hours of computation if the job crashes mid-scenario.
edgeR: pass group argument to filterByExpr (prevents over-filtering), add robust=TRUE to estimateDisp and glmQLFit, ensure row alignment between counts and metadata. sctrial_fe: use HC1 SEs instead of cluster-robust when n_participants <= 20. Cluster-robust SEs with few clusters (fewer than parameters) are known to be overly conservative, inflating p-values.
HC1 drops within-participant dependence correction which is the core of the method. Keep cluster-robust SEs uniformly and report the known small-sample conservatism honestly rather than relaxing the variance estimator. The conservatism with few clusters (n<=20) is a real property of the method, not a bug to fix by changing estimators.
The participant FE model with T=2 periods has 18 parameters on 32 obs (n=8/arm), leaving only 14 residual df. This near-saturated model makes both cluster-robust SEs and wild cluster bootstrap unreliable (FPR=0 under the null). First-differencing (delta_i = post_i - pre_i) is numerically equivalent to FE with T=2 (Wooldridge Ch.14) but eliminates the saturation problem. Welch t-test on participant-level deltas gives FPR ~0.04 at n=8 — slightly conservative but well-calibrated.
The benchmark runner uses first-difference participant-level DiD, not the full FE regression from the manuscript. Renaming makes explicit that this is a benchmark surrogate for the same T=2 DiD estimand, not the literal manuscript inferential engine.
Adds the _fail_result helper that returns NaN results for failed genes, resolving the F821 ruff/mypy errors in CI.
Two changes fix the ultra-conservative edgeR behavior (FPR=0.002): 1. DGEList(counts=t(counts), group=group) instead of passing group only to filterByExpr. This ensures filterByExpr.DGEList recognizes the group structure instead of treating all samples as one group. 2. Remove robust=TRUE from estimateDisp and glmQLFit. With sparse simulated pseudobulk (many genes with >75% zeros), robust dispersion estimation over-shrinks, producing inflated p-values. Standard estimation gives well-calibrated null behavior (FPR=0.057, median p=0.48 across 20 iterations × 20 genes).
rpy2's embedded R corrupts in forked multiprocessing workers, causing edgeR to produce systematically conservative p-values (FPR=0.002 vs expected 0.05). All 4 R runners now call Rscript via subprocess, which spawns a fresh R process per call and avoids the fork corruption issue.
fork() inherits parent R/rpy2 state which corrupts R subprocess calls
in workers, causing edgeR FPR=0.002 instead of 0.05. Confirmed by
running _run_single_iteration directly (FPR=0.053) vs via mp.Pool
with fork (FPR=0.002). Using mp.get_context("spawn") creates fresh
worker processes without inherited state.
Scripts verify: parallel equivalence, observed-scale truth definition, runner calibration audit, and direct-vs-worker execution comparison. All 4 pass locally: edgeR/limma calibrated, truth definition valid, execution modes produce identical results.
R runners use subprocess+Rscript, not rpy2. Also add git commit hash and package path to job log for reproducibility.
Addresses reviewer concern that 50-gene/20%-signal benchmark is regime-specific. Tests whether dreamlet/NEBULA null-gene FPR inflation attenuates with larger panels and lower signal fractions. New sensitivity grid: Panel sizes: 50, 200, 500, 2000 genes Signal fractions: 1%, 5%, 10%, 20% + pure null per size = 20 scenarios × 200 iterations × 4 methods (two-arm only) Changes: - orchestrator.py: add build_sensitivity_grid() and run_sensitivity_benchmark() functions - run_benchmark.py: add --phase sensitivity entry point - slurm_sensitivity.sh: dedicated SLURM job (72h, 32 CPUs, 256GB) - supp_fig7_signal_fraction_sensitivity.py: new SF7 with 4 panels: A) FPR heatmap (method × panel size × signal fraction) B) FPR line plots per panel size C) Pure-null lambda_GC across panel sizes D) QQ contrast: 50g/20% vs 2000g/1% Smoke tested locally: all 4 methods produce valid results across 50-gene to 2000-gene panels. 2000-gene dreamlet ~65s/iter, NEBULA ~318s/iter; estimated HPC total ~4-8 hours with 30 workers.
- dreamlet: timeout 600s → 1800s for large-panel scenarios - NEBULA: timeout 1200s → 2400s for large-panel scenarios - New SLURM script slurm_sensitivity_2k_rerun.sh: re-runs only the 5 × 2000-gene scenarios with reduced parallelism (10 workers vs 30) to avoid memory pressure and subprocess timeouts
Replaces the seven original benchmark panels with publication-quality versions driven by the signal-fraction sensitivity dataset (11M rows, 4 panel sizes x 5 signal fractions x 4 methods). New panels: H Null-gene FPR curves faceted by panel size (50/200/500/2000) I Null-gene FPR heatmap (method x panel size x signal fraction) J Pure-null lambda_GC across panel sizes K QQ plots at a challenging regime (200 genes, 10% signal) L Pure-null FPR dot-and-whisker with Wilson 95% CIs M Effect-size estimation accuracy (bias + RMSE on signal genes) N Runtime scaling across panel sizes (log-log) Key findings surfaced by the new panels: - All four methods are nominally calibrated under pure null (lambda_GC ~= 1, FPR within 3-7% band) - dreamlet's null-gene FPR inflation scales with signal fraction and is independent of panel size (same inflation at 50 vs 2000 genes) - dreamlet also exhibits ~50% positive bias in effect-size estimation on signal genes, with 3-5x higher RMSE than sctrial, Wilcoxon, NEBULA - sctrial and Wilcoxon (Delta scores) are the only methods that maintain both nominal calibration and unbiased estimation across all tested regimes - sctrial and Wilcoxon are ~1000x faster than the R-based methods Implementation notes: - New helper functions: _compute_null_fpr_table, _compute_signal_power_table, _compute_signal_bias_rmse_table - Consistent styling via _method_style, _add_nominal_band, _style_axis - Data source updated to manuscript/benchmark/sensitivity/sensitivity_combined.csv - Removed the standalone supp_fig7_signal_fraction_sensitivity.py (content folded into SF4)
Panel H (FPR curves): Wilcoxon was hidden underneath sctrial at the nominal 5% line. Added tiny per-method x-offsets (-0.3 to +0.3 signal %-points) so overlapping methods are visible side-by-side. Panel I (FPR heatmap): y-axis signal-fraction labels were only drawn on the first subplot because sharey=True stripped them from the others. Switched to sharey=False so every subplot shows its own y-tick labels. The first subplot still carries the axis-label text. Panel L (pure-null calibration): the old horizontal dot-and-whisker with per-method y-offsets made all four panel sizes cluster visually on top of each other. Redesigned as a log-scale line plot: x = panel size, y = pure-null FPR, one line per method with 95% Wilson CIs. All four panel sizes are now distinct points along each line. Panel M (effect-size accuracy): the bias axis used a symmetric y-range that wasted the bottom half (no method has large negative bias), making bars look "cut in half". Switched to asymmetric [min, max] limits. The in-axis legend overlapped with dreamlet's tall bars; moved it to a figure-level legend above the subplots. sctrial/Wilcoxon/NEBULA bars remain near zero because those methods ARE essentially unbiased — that is the correct, honest visual encoding.
Panels H, J, L, and N previously plotted panel size (50/200/500/2000) or signal fraction (1/5/10/20%) at their literal numeric positions on a linear or log axis, which produced uneven visual spacing between the four discrete levels. Switched all four panels to categorical x-positions (0, 1, 2, 3) with the raw values used only as tick labels, so every subplot now has four evenly-spaced x-ticks regardless of value span. Also set explicit MultipleLocator on y-axes that relied on matplotlib auto-ticking: H: y step 0.1 on [0.0, 0.7] J: y step 0.05 on [0.90, 1.15] L: y step 0.01 on [0.025, 0.085] M (bias row): y step 0.05 M (RMSE row): y step 0.05 N: y log scale (decade ticks) unchanged Panel I (heatmap) and Panel K (QQ with shared axes) already have uniform tick grids so they are unchanged.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Main figures
Figure 2 (figure2_melanoma_analysis.py): combined / multi-panel layout work (generate_combined_panel_Fig2), plus follow-up overlap/label fixes.
Figure 3 (figure3_robustness_benchmarking.py): updates (including use from composite flows where relevant).
Figure 4 (figure4_biological_discovery_multi_dataset.py): large expansion — merges former Fig 4 + Fig 5 into one script (commit message: “merge fig4/5 to new fig 4 and rename fig6 to fig5”).
Figure 5 (figure5_validation_dynamics.py): renamed from former Fig 6 dynamics script; layout/composite updates.
Removed figure5_multi_dataset.py (superseded by the new Fig 4).
run_all.py: registry updated so main figures 2–5 point at the new modules and titles (Fig 4 = biological discovery + multi-dataset; Fig 5 = validation & dynamics).
Supplementary figures (SuppFig1–6)
Each of supp_fig1 … supp_fig6 gains composite / full-artboard figure generation (multi-row layouts, shared styling, panel lettering), plus supporting refactors. Notable themes from commits:
SuppFig1: composite figure.
SuppFig4: composite + result caching.
SuppFig2 / SuppFig3: composite figures; later gene overlap / layout fixes.
SuppFig5 / SuppFig6: composite figures.