Skip to content

Add signal-fraction sensitivity benchmark#75

Open
MohamedOmar2020 wants to merge 5 commits into
devfrom
benchmark/signal-fraction-sensitivity
Open

Add signal-fraction sensitivity benchmark#75
MohamedOmar2020 wants to merge 5 commits into
devfrom
benchmark/signal-fraction-sensitivity

Conversation

@MohamedOmar2020
Copy link
Copy Markdown
Contributor

Summary

  • Adds signal-fraction sensitivity analysis to determine whether dreamlet/NEBULA null-gene FPR inflation is regime-specific to small panels with high signal fractions
  • New sensitivity grid: 4 panel sizes (50, 200, 500, 2000 genes) × 4 signal fractions (1%, 5%, 10%, 20%) + pure null = 20 scenarios
  • Dedicated SLURM script (slurm_sensitivity.sh) and benchmark phase (--phase sensitivity)
  • New Supplementary Figure 7 with 4 panels: FPR heatmap, FPR line plots, pure-null lambda_GC, QQ contrast

Changes

  • src/sctrial/benchmark/orchestrator.py: build_sensitivity_grid() + run_sensitivity_benchmark()
  • scripts/run_benchmark.py: --phase sensitivity entry point
  • scripts/slurm_sensitivity.sh: 72h SLURM job (2000-gene NEBULA is slow)
  • manuscript_figures/supp/supp_fig7_signal_fraction_sensitivity.py: SF7 panels A-D

Smoke test results

All 4 methods validated on 50-gene and 2000-gene panels:

  • 2000-gene dreamlet: ~65s/iter, NEBULA: ~318s/iter
  • Estimated HPC total: 4-8 hours with 30 workers
  • No NaN, no crashes, calibration looks correct in single-iteration spot checks

Test plan

  • Smoke test: single iteration, 50-gene panel, all 4 methods
  • Smoke test: single iteration, 2000-gene panel, sctrial + wilcoxon
  • Smoke test: single iteration, 2000-gene panel, dreamlet only
  • Smoke test: single iteration, 2000-gene panel, NEBULA only
  • Smoke test: single iteration, 200-gene panel, all 4 methods
  • mypy clean
  • ruff clean
  • Full HPC run (20 scenarios × 200 iterations × 4 methods)
  • Generate SF7 panels from HPC results

Addresses reviewer concern that 50-gene/20%-signal benchmark is
regime-specific. Tests whether dreamlet/NEBULA null-gene FPR inflation
attenuates with larger panels and lower signal fractions.

New sensitivity grid:
  Panel sizes:     50, 200, 500, 2000 genes
  Signal fractions: 1%, 5%, 10%, 20% + pure null per size
  = 20 scenarios × 200 iterations × 4 methods (two-arm only)

Changes:
- orchestrator.py: add build_sensitivity_grid() and
  run_sensitivity_benchmark() functions
- run_benchmark.py: add --phase sensitivity entry point
- slurm_sensitivity.sh: dedicated SLURM job (72h, 32 CPUs, 256GB)
- supp_fig7_signal_fraction_sensitivity.py: new SF7 with 4 panels:
  A) FPR heatmap (method × panel size × signal fraction)
  B) FPR line plots per panel size
  C) Pure-null lambda_GC across panel sizes
  D) QQ contrast: 50g/20% vs 2000g/1%

Smoke tested locally: all 4 methods produce valid results across
50-gene to 2000-gene panels. 2000-gene dreamlet ~65s/iter,
NEBULA ~318s/iter; estimated HPC total ~4-8 hours with 30 workers.
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f3d84398b5

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

FIGURE_NAME = "SuppFig7_signal_fraction_sensitivity"

_SENSITIVITY_CSV = (
Path(__file__).resolve().parents[4]
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Point sensitivity CSV path at project root

_SENSITIVITY_CSV uses Path(__file__).resolve().parents[4], which resolves to / for this file layout (.../sctrial/manuscript_figures/supp/...), so the script looks for /manuscript/benchmark/sensitivity/sensitivity_combined.csv instead of the repository path and raises FileNotFoundError in normal runs. This blocks Supplementary Figure 7 generation unless a matching absolute path happens to exist at the filesystem root.

Useful? React with 👍 / 👎.

MohamedOmar2020 added 4 commits April 6, 2026 10:23
- dreamlet: timeout 600s → 1800s for large-panel scenarios
- NEBULA: timeout 1200s → 2400s for large-panel scenarios
- New SLURM script slurm_sensitivity_2k_rerun.sh: re-runs only the
  5 × 2000-gene scenarios with reduced parallelism (10 workers vs 30)
  to avoid memory pressure and subprocess timeouts
Replaces the seven original benchmark panels with publication-quality
versions driven by the signal-fraction sensitivity dataset (11M rows,
4 panel sizes x 5 signal fractions x 4 methods).

New panels:
  H  Null-gene FPR curves faceted by panel size (50/200/500/2000)
  I  Null-gene FPR heatmap (method x panel size x signal fraction)
  J  Pure-null lambda_GC across panel sizes
  K  QQ plots at a challenging regime (200 genes, 10% signal)
  L  Pure-null FPR dot-and-whisker with Wilson 95% CIs
  M  Effect-size estimation accuracy (bias + RMSE on signal genes)
  N  Runtime scaling across panel sizes (log-log)

Key findings surfaced by the new panels:
- All four methods are nominally calibrated under pure null
  (lambda_GC ~= 1, FPR within 3-7% band)
- dreamlet's null-gene FPR inflation scales with signal fraction and
  is independent of panel size (same inflation at 50 vs 2000 genes)
- dreamlet also exhibits ~50% positive bias in effect-size estimation
  on signal genes, with 3-5x higher RMSE than sctrial, Wilcoxon, NEBULA
- sctrial and Wilcoxon (Delta scores) are the only methods that
  maintain both nominal calibration and unbiased estimation across all
  tested regimes
- sctrial and Wilcoxon are ~1000x faster than the R-based methods

Implementation notes:
- New helper functions: _compute_null_fpr_table,
  _compute_signal_power_table, _compute_signal_bias_rmse_table
- Consistent styling via _method_style, _add_nominal_band, _style_axis
- Data source updated to manuscript/benchmark/sensitivity/sensitivity_combined.csv
- Removed the standalone supp_fig7_signal_fraction_sensitivity.py (content
  folded into SF4)
Panel H (FPR curves): Wilcoxon was hidden underneath sctrial at the
  nominal 5% line. Added tiny per-method x-offsets (-0.3 to +0.3 signal
  %-points) so overlapping methods are visible side-by-side.

Panel I (FPR heatmap): y-axis signal-fraction labels were only drawn on
  the first subplot because sharey=True stripped them from the others.
  Switched to sharey=False so every subplot shows its own y-tick labels.
  The first subplot still carries the axis-label text.

Panel L (pure-null calibration): the old horizontal dot-and-whisker with
  per-method y-offsets made all four panel sizes cluster visually on top
  of each other. Redesigned as a log-scale line plot: x = panel size,
  y = pure-null FPR, one line per method with 95% Wilson CIs. All four
  panel sizes are now distinct points along each line.

Panel M (effect-size accuracy): the bias axis used a symmetric y-range
  that wasted the bottom half (no method has large negative bias), making
  bars look "cut in half". Switched to asymmetric [min, max] limits.
  The in-axis legend overlapped with dreamlet's tall bars; moved it to
  a figure-level legend above the subplots. sctrial/Wilcoxon/NEBULA bars
  remain near zero because those methods ARE essentially unbiased —
  that is the correct, honest visual encoding.
Panels H, J, L, and N previously plotted panel size (50/200/500/2000)
or signal fraction (1/5/10/20%) at their literal numeric positions on
a linear or log axis, which produced uneven visual spacing between the
four discrete levels. Switched all four panels to categorical x-positions
(0, 1, 2, 3) with the raw values used only as tick labels, so every
subplot now has four evenly-spaced x-ticks regardless of value span.

Also set explicit MultipleLocator on y-axes that relied on matplotlib
auto-ticking:
  H: y step 0.1 on [0.0, 0.7]
  J: y step 0.05 on [0.90, 1.15]
  L: y step 0.01 on [0.025, 0.085]
  M (bias row): y step 0.05
  M (RMSE row): y step 0.05
  N: y log scale (decade ticks) unchanged

Panel I (heatmap) and Panel K (QQ with shared axes) already have
uniform tick grids so they are unchanged.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant