Add signal-fraction sensitivity benchmark#75
Conversation
Addresses reviewer concern that 50-gene/20%-signal benchmark is regime-specific. Tests whether dreamlet/NEBULA null-gene FPR inflation attenuates with larger panels and lower signal fractions. New sensitivity grid: Panel sizes: 50, 200, 500, 2000 genes Signal fractions: 1%, 5%, 10%, 20% + pure null per size = 20 scenarios × 200 iterations × 4 methods (two-arm only) Changes: - orchestrator.py: add build_sensitivity_grid() and run_sensitivity_benchmark() functions - run_benchmark.py: add --phase sensitivity entry point - slurm_sensitivity.sh: dedicated SLURM job (72h, 32 CPUs, 256GB) - supp_fig7_signal_fraction_sensitivity.py: new SF7 with 4 panels: A) FPR heatmap (method × panel size × signal fraction) B) FPR line plots per panel size C) Pure-null lambda_GC across panel sizes D) QQ contrast: 50g/20% vs 2000g/1% Smoke tested locally: all 4 methods produce valid results across 50-gene to 2000-gene panels. 2000-gene dreamlet ~65s/iter, NEBULA ~318s/iter; estimated HPC total ~4-8 hours with 30 workers.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: f3d84398b5
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| FIGURE_NAME = "SuppFig7_signal_fraction_sensitivity" | ||
|
|
||
| _SENSITIVITY_CSV = ( | ||
| Path(__file__).resolve().parents[4] |
There was a problem hiding this comment.
Point sensitivity CSV path at project root
_SENSITIVITY_CSV uses Path(__file__).resolve().parents[4], which resolves to / for this file layout (.../sctrial/manuscript_figures/supp/...), so the script looks for /manuscript/benchmark/sensitivity/sensitivity_combined.csv instead of the repository path and raises FileNotFoundError in normal runs. This blocks Supplementary Figure 7 generation unless a matching absolute path happens to exist at the filesystem root.
Useful? React with 👍 / 👎.
- dreamlet: timeout 600s → 1800s for large-panel scenarios - NEBULA: timeout 1200s → 2400s for large-panel scenarios - New SLURM script slurm_sensitivity_2k_rerun.sh: re-runs only the 5 × 2000-gene scenarios with reduced parallelism (10 workers vs 30) to avoid memory pressure and subprocess timeouts
Replaces the seven original benchmark panels with publication-quality versions driven by the signal-fraction sensitivity dataset (11M rows, 4 panel sizes x 5 signal fractions x 4 methods). New panels: H Null-gene FPR curves faceted by panel size (50/200/500/2000) I Null-gene FPR heatmap (method x panel size x signal fraction) J Pure-null lambda_GC across panel sizes K QQ plots at a challenging regime (200 genes, 10% signal) L Pure-null FPR dot-and-whisker with Wilson 95% CIs M Effect-size estimation accuracy (bias + RMSE on signal genes) N Runtime scaling across panel sizes (log-log) Key findings surfaced by the new panels: - All four methods are nominally calibrated under pure null (lambda_GC ~= 1, FPR within 3-7% band) - dreamlet's null-gene FPR inflation scales with signal fraction and is independent of panel size (same inflation at 50 vs 2000 genes) - dreamlet also exhibits ~50% positive bias in effect-size estimation on signal genes, with 3-5x higher RMSE than sctrial, Wilcoxon, NEBULA - sctrial and Wilcoxon (Delta scores) are the only methods that maintain both nominal calibration and unbiased estimation across all tested regimes - sctrial and Wilcoxon are ~1000x faster than the R-based methods Implementation notes: - New helper functions: _compute_null_fpr_table, _compute_signal_power_table, _compute_signal_bias_rmse_table - Consistent styling via _method_style, _add_nominal_band, _style_axis - Data source updated to manuscript/benchmark/sensitivity/sensitivity_combined.csv - Removed the standalone supp_fig7_signal_fraction_sensitivity.py (content folded into SF4)
Panel H (FPR curves): Wilcoxon was hidden underneath sctrial at the nominal 5% line. Added tiny per-method x-offsets (-0.3 to +0.3 signal %-points) so overlapping methods are visible side-by-side. Panel I (FPR heatmap): y-axis signal-fraction labels were only drawn on the first subplot because sharey=True stripped them from the others. Switched to sharey=False so every subplot shows its own y-tick labels. The first subplot still carries the axis-label text. Panel L (pure-null calibration): the old horizontal dot-and-whisker with per-method y-offsets made all four panel sizes cluster visually on top of each other. Redesigned as a log-scale line plot: x = panel size, y = pure-null FPR, one line per method with 95% Wilson CIs. All four panel sizes are now distinct points along each line. Panel M (effect-size accuracy): the bias axis used a symmetric y-range that wasted the bottom half (no method has large negative bias), making bars look "cut in half". Switched to asymmetric [min, max] limits. The in-axis legend overlapped with dreamlet's tall bars; moved it to a figure-level legend above the subplots. sctrial/Wilcoxon/NEBULA bars remain near zero because those methods ARE essentially unbiased — that is the correct, honest visual encoding.
Panels H, J, L, and N previously plotted panel size (50/200/500/2000) or signal fraction (1/5/10/20%) at their literal numeric positions on a linear or log axis, which produced uneven visual spacing between the four discrete levels. Switched all four panels to categorical x-positions (0, 1, 2, 3) with the raw values used only as tick labels, so every subplot now has four evenly-spaced x-ticks regardless of value span. Also set explicit MultipleLocator on y-axes that relied on matplotlib auto-ticking: H: y step 0.1 on [0.0, 0.7] J: y step 0.05 on [0.90, 1.15] L: y step 0.01 on [0.025, 0.085] M (bias row): y step 0.05 M (RMSE row): y step 0.05 N: y log scale (decade ticks) unchanged Panel I (heatmap) and Panel K (QQ with shared axes) already have uniform tick grids so they are unchanged.
Summary
slurm_sensitivity.sh) and benchmark phase (--phase sensitivity)Changes
src/sctrial/benchmark/orchestrator.py:build_sensitivity_grid()+run_sensitivity_benchmark()scripts/run_benchmark.py:--phase sensitivityentry pointscripts/slurm_sensitivity.sh: 72h SLURM job (2000-gene NEBULA is slow)manuscript_figures/supp/supp_fig7_signal_fraction_sensitivity.py: SF7 panels A-DSmoke test results
All 4 methods validated on 50-gene and 2000-gene panels:
Test plan