composite artboards for main Figs 2–5 and SuppFig1–6; merge Fig4/5, rename dynamics to Fig5 by priyanka9991 · Pull Request #74 · TheOmarLab/sctrial

priyanka9991 · 2026-03-31T17:18:54Z

Main figures
Figure 2 (figure2_melanoma_analysis.py): combined / multi-panel layout work (generate_combined_panel_Fig2), plus follow-up overlap/label fixes.
Figure 3 (figure3_robustness_benchmarking.py): updates (including use from composite flows where relevant).
Figure 4 (figure4_biological_discovery_multi_dataset.py): large expansion — merges former Fig 4 + Fig 5 into one script (commit message: “merge fig4/5 to new fig 4 and rename fig6 to fig5”).
Figure 5 (figure5_validation_dynamics.py): renamed from former Fig 6 dynamics script; layout/composite updates.
Removed figure5_multi_dataset.py (superseded by the new Fig 4).
run_all.py: registry updated so main figures 2–5 point at the new modules and titles (Fig 4 = biological discovery + multi-dataset; Fig 5 = validation & dynamics).

Supplementary figures (SuppFig1–6)
Each of supp_fig1 … supp_fig6 gains composite / full-artboard figure generation (multi-row layouts, shared styling, panel lettering), plus supporting refactors. Notable themes from commits:
SuppFig1: composite figure.
SuppFig4: composite + result caching.
SuppFig2 / SuppFig3: composite figures; later gene overlap / layout fixes.
SuppFig5 / SuppFig6: composite figures.

Benchmark infrastructure for systematic method comparison: - Hierarchical gamma-Poisson simulator calibrated from TNBC raw counts, validated against vaccine holdout (matched-gene library-size comparison) - 6 method runners: sctrial FE, edgeR-QLF, limma-voom, dreamlet, NEBULA, Wilcoxon paired (all via standardized interface) - Orchestrator supporting parallel execution across simulation grid - Metrics module with 11 metrics including non-convergence tracking and top-k Jaccard stability - Permutation testing, subsampling reproducibility, and ablation modules - Redesigned SF4 simulation panels (H-K) using existing results: power curves, calibration strips, faceted QQ plots - 18 passing tests covering simulator, runners, and orchestrator Simulator validated at both cell-level and pseudobulk-level against TNBC (calibration) and vaccine (holdout). Melanoma excluded from count-scale calibration (normalized data only).

- All R runners (edgeR, limma-voom, dreamlet, NEBULA) now use design-appropriate models: ~arm*visit for two-arm, ~visit with participant blocking for single-arm - Fixed "All samples belong to same group" error when running single-arm scenarios through edgeR/limma/dreamlet - Fixed sample-name mismatch between count matrix and metadata by using consistent sample IDs (S0, S1, ...) in both CSVs - Orchestrator now threads design_type through to all runners - Fixed missing pandas import in phase_ablation - Fixed aggregate_pseudobulk → pseudobulk_expression import in permutation.py and subsample.py - Fixed matched-gene library-size validation comparison

…ate on log-pseudobulk sctrial_fe and wilcoxon_paired were receiving raw count-scale pseudobulk means while edgeR/limma/dreamlet internally log-transform. This caused betas in the thousands vs ~0.5 for R methods, making cross-method bias comparisons invalid. Fix: orchestrator now log1p-transforms pseudobulk means before passing to sctrial_fe and wilcoxon_paired. sctrial_fe gains a from_pseudobulk path that runs OLS DiD with participant FE directly on the log-pseudobulk DataFrame. All 6 methods now report betas on comparable log-fold-change scale. Also includes: SLURM submission script, ablation calibration fix, stale docstring corrections, ChatGPT/CODEX review fixes for permutation and subsample modules.

- simulator.py: guard X_counts None before .sum/.mean/.var - ablation.py: guard fn None before calling - orchestrator.py: use module imports to avoid run name clashes - R runners: fix tmpdir str/Path type mismatch

Explicit ndarray type annotation and np.asarray wrapping for rng.uniform/rng.lognormal/np.clip return values.

Results are now flushed to disk every 20 iterations instead of only after all 200 complete. Protects against losing hours of computation if the job crashes mid-scenario.

edgeR: pass group argument to filterByExpr (prevents over-filtering), add robust=TRUE to estimateDisp and glmQLFit, ensure row alignment between counts and metadata. sctrial_fe: use HC1 SEs instead of cluster-robust when n_participants <= 20. Cluster-robust SEs with few clusters (fewer than parameters) are known to be overly conservative, inflating p-values.

HC1 drops within-participant dependence correction which is the core of the method. Keep cluster-robust SEs uniformly and report the known small-sample conservatism honestly rather than relaxing the variance estimator. The conservatism with few clusters (n<=20) is a real property of the method, not a bug to fix by changing estimators.

The participant FE model with T=2 periods has 18 parameters on 32 obs (n=8/arm), leaving only 14 residual df. This near-saturated model makes both cluster-robust SEs and wild cluster bootstrap unreliable (FPR=0 under the null). First-differencing (delta_i = post_i - pre_i) is numerically equivalent to FE with T=2 (Wooldridge Ch.14) but eliminates the saturation problem. Welch t-test on participant-level deltas gives FPR ~0.04 at n=8 — slightly conservative but well-calibrated.

The benchmark runner uses first-difference participant-level DiD, not the full FE regression from the manuscript. Renaming makes explicit that this is a benchmark surrogate for the same T=2 DiD estimand, not the literal manuscript inferential engine.

Adds the _fail_result helper that returns NaN results for failed genes, resolving the F821 ruff/mypy errors in CI.

Two changes fix the ultra-conservative edgeR behavior (FPR=0.002): 1. DGEList(counts=t(counts), group=group) instead of passing group only to filterByExpr. This ensures filterByExpr.DGEList recognizes the group structure instead of treating all samples as one group. 2. Remove robust=TRUE from estimateDisp and glmQLFit. With sparse simulated pseudobulk (many genes with >75% zeros), robust dispersion estimation over-shrinks, producing inflated p-values. Standard estimation gives well-calibrated null behavior (FPR=0.057, median p=0.48 across 20 iterations × 20 genes).

rpy2's embedded R corrupts in forked multiprocessing workers, causing edgeR to produce systematically conservative p-values (FPR=0.002 vs expected 0.05). All 4 R runners now call Rscript via subprocess, which spawns a fresh R process per call and avoids the fork corruption issue.

fork() inherits parent R/rpy2 state which corrupts R subprocess calls in workers, causing edgeR FPR=0.002 instead of 0.05. Confirmed by running _run_single_iteration directly (FPR=0.053) vs via mp.Pool with fork (FPR=0.002). Using mp.get_context("spawn") creates fresh worker processes without inherited state.

Scripts verify: parallel equivalence, observed-scale truth definition, runner calibration audit, and direct-vs-worker execution comparison. All 4 pass locally: edgeR/limma calibrated, truth definition valid, execution modes produce identical results.

R runners use subprocess+Rscript, not rpy2. Also add git commit hash and package path to job log for reproducibility.

Addresses reviewer concern that 50-gene/20%-signal benchmark is regime-specific. Tests whether dreamlet/NEBULA null-gene FPR inflation attenuates with larger panels and lower signal fractions. New sensitivity grid: Panel sizes: 50, 200, 500, 2000 genes Signal fractions: 1%, 5%, 10%, 20% + pure null per size = 20 scenarios × 200 iterations × 4 methods (two-arm only) Changes: - orchestrator.py: add build_sensitivity_grid() and run_sensitivity_benchmark() functions - run_benchmark.py: add --phase sensitivity entry point - slurm_sensitivity.sh: dedicated SLURM job (72h, 32 CPUs, 256GB) - supp_fig7_signal_fraction_sensitivity.py: new SF7 with 4 panels: A) FPR heatmap (method × panel size × signal fraction) B) FPR line plots per panel size C) Pure-null lambda_GC across panel sizes D) QQ contrast: 50g/20% vs 2000g/1% Smoke tested locally: all 4 methods produce valid results across 50-gene to 2000-gene panels. 2000-gene dreamlet ~65s/iter, NEBULA ~318s/iter; estimated HPC total ~4-8 hours with 30 workers.

- dreamlet: timeout 600s → 1800s for large-panel scenarios - NEBULA: timeout 1200s → 2400s for large-panel scenarios - New SLURM script slurm_sensitivity_2k_rerun.sh: re-runs only the 5 × 2000-gene scenarios with reduced parallelism (10 workers vs 30) to avoid memory pressure and subprocess timeouts

Replaces the seven original benchmark panels with publication-quality versions driven by the signal-fraction sensitivity dataset (11M rows, 4 panel sizes x 5 signal fractions x 4 methods). New panels: H Null-gene FPR curves faceted by panel size (50/200/500/2000) I Null-gene FPR heatmap (method x panel size x signal fraction) J Pure-null lambda_GC across panel sizes K QQ plots at a challenging regime (200 genes, 10% signal) L Pure-null FPR dot-and-whisker with Wilson 95% CIs M Effect-size estimation accuracy (bias + RMSE on signal genes) N Runtime scaling across panel sizes (log-log) Key findings surfaced by the new panels: - All four methods are nominally calibrated under pure null (lambda_GC ~= 1, FPR within 3-7% band) - dreamlet's null-gene FPR inflation scales with signal fraction and is independent of panel size (same inflation at 50 vs 2000 genes) - dreamlet also exhibits ~50% positive bias in effect-size estimation on signal genes, with 3-5x higher RMSE than sctrial, Wilcoxon, NEBULA - sctrial and Wilcoxon (Delta scores) are the only methods that maintain both nominal calibration and unbiased estimation across all tested regimes - sctrial and Wilcoxon are ~1000x faster than the R-based methods Implementation notes: - New helper functions: _compute_null_fpr_table, _compute_signal_power_table, _compute_signal_bias_rmse_table - Consistent styling via _method_style, _add_nominal_band, _style_axis - Data source updated to manuscript/benchmark/sensitivity/sensitivity_combined.csv - Removed the standalone supp_fig7_signal_fraction_sensitivity.py (content folded into SF4)

Panel H (FPR curves): Wilcoxon was hidden underneath sctrial at the nominal 5% line. Added tiny per-method x-offsets (-0.3 to +0.3 signal %-points) so overlapping methods are visible side-by-side. Panel I (FPR heatmap): y-axis signal-fraction labels were only drawn on the first subplot because sharey=True stripped them from the others. Switched to sharey=False so every subplot shows its own y-tick labels. The first subplot still carries the axis-label text. Panel L (pure-null calibration): the old horizontal dot-and-whisker with per-method y-offsets made all four panel sizes cluster visually on top of each other. Redesigned as a log-scale line plot: x = panel size, y = pure-null FPR, one line per method with 95% Wilson CIs. All four panel sizes are now distinct points along each line. Panel M (effect-size accuracy): the bias axis used a symmetric y-range that wasted the bottom half (no method has large negative bias), making bars look "cut in half". Switched to asymmetric [min, max] limits. The in-axis legend overlapped with dreamlet's tall bars; moved it to a figure-level legend above the subplots. sctrial/Wilcoxon/NEBULA bars remain near zero because those methods ARE essentially unbiased — that is the correct, honest visual encoding.

Panels H, J, L, and N previously plotted panel size (50/200/500/2000) or signal fraction (1/5/10/20%) at their literal numeric positions on a linear or log axis, which produced uneven visual spacing between the four discrete levels. Switched all four panels to categorical x-positions (0, 1, 2, 3) with the raw values used only as tick labels, so every subplot now has four evenly-spaced x-ticks regardless of value span. Also set explicit MultipleLocator on y-axes that relied on matplotlib auto-ticking: H: y step 0.1 on [0.0, 0.7] J: y step 0.05 on [0.90, 1.15] L: y step 0.01 on [0.025, 0.085] M (bias row): y step 0.05 M (RMSE row): y step 0.05 N: y log scale (decade ticks) unchanged Panel I (heatmap) and Panel K (QQ with shared axes) already have uniform tick grids so they are unchanged.

priyanka9991 and others added 30 commits March 23, 2026 14:31

generate_combined_panel_Fig2

5174815

add combined panel

00f3ff8

add composite fig - fig4

8207077

add composite fig for fig5

0807da9

add composite figure - fig6

a93b514

fix overlap

a5e7f28

add composite - supp fig1

fe76034

add composite figure and result caching - supp4

12601cb

add composit fig - supp5

462dbd9

add composite fig - supp6

6d65c9f

Fix 33 mypy type errors in benchmark package

f69143d

- simulator.py: guard X_counts None before .sum/.mean/.var - ablation.py: guard fn None before calling - orchestrator.py: use module imports to avoid run name clashes - R runners: fix tmpdir str/Path type mismatch

Fix remaining mypy type errors in benchmark simulator

52caeb3

Explicit ndarray type annotation and np.asarray wrapping for rng.uniform/rng.lognormal/np.clip return values.

Add incremental CSV writing to benchmark orchestrator

0672fd8

Results are now flushed to disk every 20 iterations instead of only after all 200 complete. Protects against losing hours of computation if the job crashes mid-scenario.

Fix missing _fail_result function in sctrial_did runner

2922504

Adds the _fail_result helper that returns NaN results for failed genes, resolving the F821 ruff/mypy errors in CI.

delete old panels

4fe8b44

merge fig4/5 to new fig 4 and rename fig6 to fig5

dba5b8a

Replace rpy2 smoke test with Rscript check in SLURM script

6483957

R runners use subprocess+Rscript, not rpy2. Also add git commit hash and package path to job log for reproducibility.

add composite fig - supp2

15d7971

add composite fig - supp3

7513a5a

priyanka9991 added 2 commits March 30, 2026 16:57

update

cc24486

fix gene overlap

695effc

priyanka9991 requested a review from MohamedOmar2020 March 31, 2026 17:18

priyanka9991 self-assigned this Mar 31, 2026

priyanka9991 added 2 commits March 31, 2026 10:42

Merge branch 'dev' into add/fig_panels

d466221

fix xtick panelE

7acb405

MohamedOmar2020 force-pushed the dev branch from 9cdcbae to 1a4cdf1 Compare April 1, 2026 17:03

priyanka9991 and others added 15 commits April 1, 2026 17:03

update suppfig4 panels

7a18905

fix legend and suptitle

3333db8

new panels

cab3740

Merge branch 'dev' into add/fig_panels

9bacb0c

merge benchmark/signal-fraction-sensitivity

c1389ba

add updated benchmarking panels H-N

5cec8f8

reorganize panels

bd0ef72

fix overlaps and layout

f6c7178

fix overlaps

58342d2

rearrange panels

b541da6

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

composite artboards for main Figs 2–5 and SuppFig1–6; merge Fig4/5, rename dynamics to Fig5#74

composite artboards for main Figs 2–5 and SuppFig1–6; merge Fig4/5, rename dynamics to Fig5#74
priyanka9991 wants to merge 49 commits into
devfrom
add/fig_panels

priyanka9991 commented Mar 31, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

priyanka9991 commented Mar 31, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant