Add signal-fraction sensitivity benchmark by MohamedOmar2020 · Pull Request #75 · TheOmarLab/sctrial

MohamedOmar2020 · 2026-04-02T23:23:16Z

Summary

Adds signal-fraction sensitivity analysis to determine whether dreamlet/NEBULA null-gene FPR inflation is regime-specific to small panels with high signal fractions
New sensitivity grid: 4 panel sizes (50, 200, 500, 2000 genes) × 4 signal fractions (1%, 5%, 10%, 20%) + pure null = 20 scenarios
Dedicated SLURM script (slurm_sensitivity.sh) and benchmark phase (--phase sensitivity)
New Supplementary Figure 7 with 4 panels: FPR heatmap, FPR line plots, pure-null lambda_GC, QQ contrast

Changes

src/sctrial/benchmark/orchestrator.py: build_sensitivity_grid() + run_sensitivity_benchmark()
scripts/run_benchmark.py: --phase sensitivity entry point
scripts/slurm_sensitivity.sh: 72h SLURM job (2000-gene NEBULA is slow)
manuscript_figures/supp/supp_fig7_signal_fraction_sensitivity.py: SF7 panels A-D

Smoke test results

All 4 methods validated on 50-gene and 2000-gene panels:

2000-gene dreamlet: ~65s/iter, NEBULA: ~318s/iter
Estimated HPC total: 4-8 hours with 30 workers
No NaN, no crashes, calibration looks correct in single-iteration spot checks

Test plan

Smoke test: single iteration, 50-gene panel, all 4 methods
Smoke test: single iteration, 2000-gene panel, sctrial + wilcoxon
Smoke test: single iteration, 2000-gene panel, dreamlet only
Smoke test: single iteration, 2000-gene panel, NEBULA only
Smoke test: single iteration, 200-gene panel, all 4 methods
mypy clean
ruff clean
Full HPC run (20 scenarios × 200 iterations × 4 methods)
Generate SF7 panels from HPC results

Addresses reviewer concern that 50-gene/20%-signal benchmark is regime-specific. Tests whether dreamlet/NEBULA null-gene FPR inflation attenuates with larger panels and lower signal fractions. New sensitivity grid: Panel sizes: 50, 200, 500, 2000 genes Signal fractions: 1%, 5%, 10%, 20% + pure null per size = 20 scenarios × 200 iterations × 4 methods (two-arm only) Changes: - orchestrator.py: add build_sensitivity_grid() and run_sensitivity_benchmark() functions - run_benchmark.py: add --phase sensitivity entry point - slurm_sensitivity.sh: dedicated SLURM job (72h, 32 CPUs, 256GB) - supp_fig7_signal_fraction_sensitivity.py: new SF7 with 4 panels: A) FPR heatmap (method × panel size × signal fraction) B) FPR line plots per panel size C) Pure-null lambda_GC across panel sizes D) QQ contrast: 50g/20% vs 2000g/1% Smoke tested locally: all 4 methods produce valid results across 50-gene to 2000-gene panels. 2000-gene dreamlet ~65s/iter, NEBULA ~318s/iter; estimated HPC total ~4-8 hours with 30 workers.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f3d84398b5

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-04-02T23:25:42Z

+FIGURE_NAME = "SuppFig7_signal_fraction_sensitivity"
+
+_SENSITIVITY_CSV = (
+    Path(__file__).resolve().parents[4]


Point sensitivity CSV path at project root

_SENSITIVITY_CSV uses Path(__file__).resolve().parents[4], which resolves to / for this file layout (.../sctrial/manuscript_figures/supp/...), so the script looks for /manuscript/benchmark/sensitivity/sensitivity_combined.csv instead of the repository path and raises FileNotFoundError in normal runs. This blocks Supplementary Figure 7 generation unless a matching absolute path happens to exist at the filesystem root.

Useful? React with 👍 / 👎.

- dreamlet: timeout 600s → 1800s for large-panel scenarios - NEBULA: timeout 1200s → 2400s for large-panel scenarios - New SLURM script slurm_sensitivity_2k_rerun.sh: re-runs only the 5 × 2000-gene scenarios with reduced parallelism (10 workers vs 30) to avoid memory pressure and subprocess timeouts

Replaces the seven original benchmark panels with publication-quality versions driven by the signal-fraction sensitivity dataset (11M rows, 4 panel sizes x 5 signal fractions x 4 methods). New panels: H Null-gene FPR curves faceted by panel size (50/200/500/2000) I Null-gene FPR heatmap (method x panel size x signal fraction) J Pure-null lambda_GC across panel sizes K QQ plots at a challenging regime (200 genes, 10% signal) L Pure-null FPR dot-and-whisker with Wilson 95% CIs M Effect-size estimation accuracy (bias + RMSE on signal genes) N Runtime scaling across panel sizes (log-log) Key findings surfaced by the new panels: - All four methods are nominally calibrated under pure null (lambda_GC ~= 1, FPR within 3-7% band) - dreamlet's null-gene FPR inflation scales with signal fraction and is independent of panel size (same inflation at 50 vs 2000 genes) - dreamlet also exhibits ~50% positive bias in effect-size estimation on signal genes, with 3-5x higher RMSE than sctrial, Wilcoxon, NEBULA - sctrial and Wilcoxon (Delta scores) are the only methods that maintain both nominal calibration and unbiased estimation across all tested regimes - sctrial and Wilcoxon are ~1000x faster than the R-based methods Implementation notes: - New helper functions: _compute_null_fpr_table, _compute_signal_power_table, _compute_signal_bias_rmse_table - Consistent styling via _method_style, _add_nominal_band, _style_axis - Data source updated to manuscript/benchmark/sensitivity/sensitivity_combined.csv - Removed the standalone supp_fig7_signal_fraction_sensitivity.py (content folded into SF4)

Panel H (FPR curves): Wilcoxon was hidden underneath sctrial at the nominal 5% line. Added tiny per-method x-offsets (-0.3 to +0.3 signal %-points) so overlapping methods are visible side-by-side. Panel I (FPR heatmap): y-axis signal-fraction labels were only drawn on the first subplot because sharey=True stripped them from the others. Switched to sharey=False so every subplot shows its own y-tick labels. The first subplot still carries the axis-label text. Panel L (pure-null calibration): the old horizontal dot-and-whisker with per-method y-offsets made all four panel sizes cluster visually on top of each other. Redesigned as a log-scale line plot: x = panel size, y = pure-null FPR, one line per method with 95% Wilson CIs. All four panel sizes are now distinct points along each line. Panel M (effect-size accuracy): the bias axis used a symmetric y-range that wasted the bottom half (no method has large negative bias), making bars look "cut in half". Switched to asymmetric [min, max] limits. The in-axis legend overlapped with dreamlet's tall bars; moved it to a figure-level legend above the subplots. sctrial/Wilcoxon/NEBULA bars remain near zero because those methods ARE essentially unbiased — that is the correct, honest visual encoding.

Panels H, J, L, and N previously plotted panel size (50/200/500/2000) or signal fraction (1/5/10/20%) at their literal numeric positions on a linear or log axis, which produced uneven visual spacing between the four discrete levels. Switched all four panels to categorical x-positions (0, 1, 2, 3) with the raw values used only as tick labels, so every subplot now has four evenly-spaced x-ticks regardless of value span. Also set explicit MultipleLocator on y-axes that relied on matplotlib auto-ticking: H: y step 0.1 on [0.0, 0.7] J: y step 0.05 on [0.90, 1.15] L: y step 0.01 on [0.025, 0.085] M (bias row): y step 0.05 M (RMSE row): y step 0.05 N: y log scale (decade ticks) unchanged Panel I (heatmap) and Panel K (QQ with shared axes) already have uniform tick grids so they are unchanged.

chatgpt-codex-connector Bot reviewed Apr 2, 2026

View reviewed changes

MohamedOmar2020 added 4 commits April 6, 2026 10:23

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add signal-fraction sensitivity benchmark#75

Add signal-fraction sensitivity benchmark#75
MohamedOmar2020 wants to merge 5 commits into
devfrom
benchmark/signal-fraction-sensitivity

MohamedOmar2020 commented Apr 2, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

chatgpt-codex-connector Bot Apr 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

MohamedOmar2020 commented Apr 2, 2026

Summary

Changes

Smoke test results

Test plan

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Apr 2, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant