Benchmarks — COVID Status Prediction (CITE-seq)

Task: donor-level binary classification (COVID+ vs COVID-)
Evaluation: 5-fold stratified donor CV throughout
Cell lineages: NK, T (CD4+CD8), B (B+PB)
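
The donor-level CV above can be sketched as follows. This is a minimal illustration, not the repository's code: `donor_folds`, the synthetic donor IDs and the label assignment are hypothetical, and scikit-learn's `StratifiedKFold` is assumed as the splitter. It shows folds being drawn over donors, so that no donor's cells land in both train and test.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold


def donor_folds(cell_donor, donor_label, n_splits=5, seed=0):
    """Yield (train_cells, test_cells) index arrays; folds are drawn over donors."""
    donors = np.array(sorted(donor_label))
    y = np.array([donor_label[d] for d in donors])
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_d, test_d in skf.split(donors, y):
        # Map the donor-level split back to cell indices.
        train_cells = np.flatnonzero(np.isin(cell_donor, donors[train_d]))
        test_cells = np.flatnonzero(np.isin(cell_donor, donors[test_d]))
        yield train_cells, test_cells


# Synthetic stand-in data (hypothetical): 41 donors, 2,000 cells.
rng = np.random.default_rng(0)
donor_label = {f"d{i:02d}": int(i % 3 == 0) for i in range(41)}
cell_donor = rng.choice(sorted(donor_label), size=2000)
folds = list(donor_folds(cell_donor, donor_label))
```

Stratifying on the donor label keeps the COVID+/COVID- ratio roughly equal across folds, which matters with only ~41 donors.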

10,000-cell sample

Random subsample of 10k cells from 174,753 total. Cells per donor are sparse — NK median 10, T median 30, B median 21 per donor — so pseudobulk means are noisy. RNA benchmark included here since it was cheap to compute at this scale.
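
For reference, a pseudobulk feature matrix of the kind benchmarked below could be built roughly like this. The 627 protein features are consistent with 3 lineages × 209 antibodies, and that layout is assumed here; the function name and toy arrays are hypothetical.

```python
import numpy as np


def pseudobulk(X, donor, lineage, lineages=("NK", "T", "B")):
    """Mean of each feature per (donor, lineage), concatenated lineage by lineage."""
    donors = np.unique(donor)
    n_feat = X.shape[1]
    # NaN marks a donor with no cells in a lineage (must be imputed or dropped later).
    out = np.full((len(donors), len(lineages) * n_feat), np.nan)
    for i, d in enumerate(donors):
        for j, lin in enumerate(lineages):
            rows = (donor == d) & (lineage == lin)
            if rows.any():
                out[i, j * n_feat:(j + 1) * n_feat] = X[rows].mean(axis=0)
    return donors, out


# Toy data: 4 cells, 2 features.
X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]])
donor = np.array(["a", "a", "a", "b"])
lineage = np.array(["NK", "NK", "T", "B"])
donors, pb = pseudobulk(X, donor, lineage)
```

With a median of 10 to 30 cells per donor and lineage, each of these means is an average over very few cells, which is the noise problem described above.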

Model	Features	n_features	n_donors	Acc	AUC
Pseudobulk LR (ElasticNet)	RNA, Reactome-filtered	28,980	39	0.571 ± 0.163	0.698 ± 0.190
Pseudobulk LR (ElasticNet)	Protein-only (209 ab)	627	39	0.846 ± 0.049	0.876 ± 0.063
Pseudobulk XGBoost	Protein-only (209 ab)	627	39	0.871 ± 0.079	0.905 ± 0.078
Per-lineage LR — NK only	Protein	209	40	0.675 ± 0.150	0.593 ± 0.114
Per-lineage LR — T only	Protein	209	41	0.806 ± 0.058	0.514 ± 0.029
Per-lineage LR — B only	Protein	209	40	0.800 ± 0.170	0.917 ± 0.129
Cell composition only	Fine cell type fractions	9	41	0.831 ± 0.125	0.881 ± 0.109
BINN GCN donor attention†	RNA+Protein, Reactome	9,660	41	0.778	—

†Single 80/20 donor split (9 test donors), no CV — not directly comparable.

All cells (174,753)

Stable pseudobulk means — NK median 186, T median 474, B median 344 per donor. RNA computed from sparse matrix donor-by-donor (avoids 6.3 GB dense materialization).
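
The donor-by-donor computation mentioned above might look like the following sketch, which only ever densifies one row of means per donor. It assumes the RNA counts are held in a SciPy CSR matrix; `donor_means_sparse` and the toy data are hypothetical.

```python
import numpy as np
import scipy.sparse as sp


def donor_means_sparse(X, donor):
    """Per-donor feature means from a sparse cells x genes matrix, without densifying X."""
    X = sp.csr_matrix(X)  # CSR makes row slicing cheap
    donors = np.unique(donor)
    out = np.zeros((len(donors), X.shape[1]), dtype=np.float64)
    for i, d in enumerate(donors):
        rows = np.flatnonzero(donor == d)
        # Only a single (1 x n_genes) row of means is materialised per donor.
        out[i] = np.asarray(X[rows].mean(axis=0)).ravel()
    return donors, out


# Toy data: 200 cells, 30 genes, 4 donors.
rng = np.random.default_rng(0)
dense = rng.poisson(0.1, size=(200, 30)).astype(float)
donor = np.repeat(np.arange(4), 50)
donors, means = donor_means_sparse(sp.csr_matrix(dense), donor)
```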

Pseudobulk LR + XGBoost

Model	Features	n_features	n_donors	Acc	AUC
PB LR (ElasticNet)	RNA only (Reactome)	28,971	41	0.761 ± 0.117	0.571 ± 0.143
PB XGBoost	RNA only (Reactome)	28,971	41	0.806 ± 0.058	0.860 ± 0.177
PB LR (ElasticNet)	Protein only	627	41	0.761 ± 0.117	0.514 ± 0.029
PB XGBoost	Protein only	627	41	0.781 ± 0.093	0.879 ± 0.105
PB LR (ElasticNet)	RNA + Protein	29,463	41	0.783 ± 0.081	0.571 ± 0.143
PB XGBoost	RNA + Protein	29,463	41	0.828 ± 0.064	0.821 ± 0.150

Per-lineage LR

Lineage	RNA Acc	RNA AUC	Protein Acc	Protein AUC
NK	0.806 ± 0.058	0.543 ± 0.086	0.783 ± 0.081	0.557 ± 0.114
T	0.806 ± 0.058	0.557 ± 0.114	0.783 ± 0.081	0.457 ± 0.086
B	0.806 ± 0.058	0.571 ± 0.143	0.806 ± 0.058	0.571 ± 0.143

Cell composition

Model	Features	n_features	n_donors	Acc	AUC
Cell composition only	Fine cell type fractions	9	41	0.803 ± 0.065	0.900 ± 0.162
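
A composition feature matrix of this kind is just per-donor cell type fractions. A minimal sketch, with hypothetical names and toy labels (the real run uses 9 fine cell types):

```python
import numpy as np


def composition(donor, cell_type):
    """Return donors, cell types, and a donors x types matrix of fractions."""
    donors, d_idx = np.unique(donor, return_inverse=True)
    types, t_idx = np.unique(cell_type, return_inverse=True)
    counts = np.zeros((len(donors), len(types)))
    np.add.at(counts, (d_idx, t_idx), 1)  # count cells per (donor, type)
    return donors, types, counts / counts.sum(axis=1, keepdims=True)


donor = np.array(["a", "a", "a", "b", "b"])
cell_type = np.array(["NK", "T", "T", "B", "B"])
donors, types, frac = composition(donor, cell_type)
```

Each row sums to 1, so the features carry no expression information at all, only how a donor's cells are distributed across types.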

BINN (reference)

Model	Features	n_donors	Acc	AUC
BINN GCN donor attention†	RNA+Protein, Reactome	41	0.778	—

†Single 80/20 split (9 test donors), no CV — not directly comparable.

BINN — all cells (5-fold donor CV, 3 trials)

RNA+Protein mix features, kNN=5, early stopping (val_size=0.2, patience=50).
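
A kNN cell graph with k=5 over a PCA embedding could be constructed roughly as below. This is a sketch using scikit-learn; the number of principal components (`n_pcs=20`) and the function name are assumptions, since the text only fixes kNN=5.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import kneighbors_graph


def pca_knn_edges(X, n_pcs=20, k=5, seed=0):
    """Return a (2, n_cells * k) edge index: each cell -> its k nearest neighbours in PCA space."""
    Z = PCA(n_components=n_pcs, random_state=seed).fit_transform(X)
    A = kneighbors_graph(Z, n_neighbors=k, mode="connectivity", include_self=False)
    src, dst = A.nonzero()  # row = cell, column = neighbour
    return np.vstack([src, dst])


# Toy expression matrix: 300 cells, 50 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 50))
edges = pca_knn_edges(X)
```

The resulting array has the (2, n_edges) layout that PyG expects for `edge_index`, although it would still need converting to a tensor.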

Donor-level fusion via attention pooling (learned softmax over cells per donor).

Model	Features	n_features	n_donors	Acc	AUC
BINN GCN (Reactome mask)	RNA+Protein (mix)	9,661	41	0.910 ± 0.042	1.000 ± 0.000
GCN unconstrained	RNA+Protein (mix)	9,661	41	0.936 ± 0.012	0.984 ± 0.023
GAT unconstrained	RNA+Protein (mix)	9,661	41	0.837 ± 0.078	1.000 ± 0.000
BINN ANN (Reactome mask)	RNA+Protein (mix)	9,661	41	0.803 ± 0.033	0.832 ± 0.034
ANN unconstrained	RNA+Protein (mix)	9,661	41	0.892 ± 0.024	0.990 ± 0.014
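
The attention pooling used for the table above (a learned softmax over cells per donor) can be illustrated with a forward pass in NumPy. The scoring function here is a single linear projection `w`, which is an assumption; the actual model may score cells differently.

```python
import numpy as np


def attention_pool(H, donor_idx, w, n_donors):
    """Softmax-weighted sum of cell embeddings H within each donor."""
    scores = H @ w  # one scalar score per cell
    # Subtract the per-donor max before exponentiating, for numerical stability.
    mx = np.full(n_donors, -np.inf)
    np.maximum.at(mx, donor_idx, scores)
    e = np.exp(scores - mx[donor_idx])
    denom = np.zeros(n_donors)
    np.add.at(denom, donor_idx, e)
    alpha = e / denom[donor_idx]  # attention weights, sum to 1 per donor
    pooled = np.zeros((n_donors, H.shape[1]))
    np.add.at(pooled, donor_idx, alpha[:, None] * H)
    return pooled, alpha


rng = np.random.default_rng(0)
H = rng.normal(size=(6, 3))                # 6 cells, 3-dim embeddings
donor_idx = np.array([0, 0, 0, 1, 1, 2])   # donor of each cell
w = rng.normal(size=3)
pooled, alpha = attention_pool(H, donor_idx, w, n_donors=3)
```

With `w` set to zero every cell gets equal weight and the pooling reduces to a plain per-donor mean, so any gain over mean pooling has to come from the learned weights.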

Donor-level fusion via moments pooling (mean + variance of cell embeddings per donor, no learned weights).

Model	Features	n_features	n_donors	Acc	AUC
BINN GCN moments (Reactome mask)	RNA+Protein (mix)	9,661	41	0.697 ± 0.096	0.847 ± 0.086
BINN ANN moments (Reactome mask)	RNA+Protein (mix)	9,661	41	0.588 ± 0.129	0.792 ± 0.080
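
For comparison, moments pooling has no learned weights. A minimal sketch, assuming the mean and the (population) variance are simply concatenated per donor:

```python
import numpy as np


def moments_pool(H, donor_idx, n_donors):
    """Concatenate per-donor mean and variance of cell embeddings."""
    out = np.zeros((n_donors, 2 * H.shape[1]))
    for d in range(n_donors):
        Hd = H[donor_idx == d]
        out[d] = np.concatenate([Hd.mean(axis=0), Hd.var(axis=0)])
    return out


H = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
donor_idx = np.array([0, 0, 1])
M = moments_pool(H, donor_idx, n_donors=2)
```

Every cell contributes equally to both moments, so a small diagnostic subpopulation is diluted by the rest of the donor's cells.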

Attention vs moments pooling: attention pooling substantially outperforms moments on both accuracy and AUC — GCN drops from 0.910 → 0.697 and ANN from 0.803 → 0.588 when switching to moments. The learned cell weighting in the attention mechanism is load-bearing: simply summarising the embedding distribution (mean/variance) discards which cells are most diagnostically informative for a given donor.

Key findings

1. Sample size badly inflated the 10k results.
Protein LR dropped from 0.846 → 0.761 and per-lineage AUCs collapsed completely (B cell AUC went 0.917 → 0.571). The 10k pseudobulk means were too noisy to trust. All future comparisons should use all cells.

2. Cell composition alone is the strongest single signal.
9 cell type fractions, no expression, gives 0.803 acc / 0.900 AUC. COVID shifts the immune landscape (plasmablast expansion, NK/T redistribution) enough that counting cell types predicts status at least as well as pseudobulk protein levels (XGBoost: 0.781 acc / 0.879 AUC).

3. XGBoost marginally beats LR on proteins (0.781 vs 0.761).
Non-linear interactions exist but are not the dominant signal.

4. Per-lineage AUCs are all weak on full data (0.46–0.57).
No single lineage protein profile cleanly separates COVID+ from COVID-. The signal requires combining lineages — consistent with the multi-modality motivation for BINN, but the combination is captured by cell composition (cell type fractions) just as well as by expression values.

5. BINN (0.778, single split) is roughly competitive with XGBoost (0.781 CV).
But BINN needs proper CV before this can be claimed confidently.

Tier-2 benchmarks: BINN architecture ablations

5-fold stratified donor CV. Early stopping on validation loss (patience=50, min_delta=1e-4), val_size=0.2 of train donors. Checkpoint restored via parameter tensor cloning (bypasses PyG Linear._save_to_state_dict which always writes weight rather than weight_orig/weight_mask, breaking the standard load_state_dict restore path for pruned modules).
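
The early-stopping rule (patience plus min_delta, with the best state kept aside) can be sketched independently of the framework. `EarlyStopper` and the toy loss curve are hypothetical; the snapshot callback is where the parameter cloning described above would go.

```python
class EarlyStopper:
    """Track validation loss and keep a snapshot of the best state seen so far."""

    def __init__(self, patience=50, min_delta=1e-4):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0
        self.snapshot = None

    def step(self, val_loss, take_snapshot):
        """Record one epoch; return True when training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0
            self.snapshot = take_snapshot()
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Toy run with a short patience so the behaviour is visible.
stopper = EarlyStopper(patience=3, min_delta=1e-4)
val_losses = [1.00, 0.90, 0.95, 0.96, 0.97, 0.50]
stopped_at = None
for epoch, loss in enumerate(val_losses):
    # In PyTorch the callback could be
    #   lambda: {n: p.data.clone() for n, p in model.named_parameters()}
    if stopper.step(loss, take_snapshot=lambda e=epoch: {"epoch": e}):
        stopped_at = epoch
        break
```

Restoring would then copy each saved tensor back into the matching parameter (for example with `p.data.copy_(...)`, an assumption about the restore step), which sidesteps `load_state_dict` entirely.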

The 10k and 50k v1/v2 tables below predate the early stopping fix (400 fixed epochs, no val split); results with early stopping are in the 50k v3 section.

10k-cell sample

Model	Features	Graph	Pathway mask	Acc	AUC
BINN GCN	RNA+Protein (Reactome)	RNA PCA kNN	Yes	0.806 ± 0.126	0.890 ± 0.124
GCN unconstrained	RNA+Protein (Reactome)	RNA PCA kNN	No	0.881 ± 0.106	0.936 ± 0.062
GCN protein-only	Protein (209 ab)	Protein PCA kNN	No	0.761 ± 0.162	0.831 ± 0.162

Fold detail (acc):

Model	Fold 1	Fold 2	Fold 3	Fold 4	Fold 5
BINN GCN (Reactome mask)	0.778	0.875	0.750	0.625	1.000
GCN unconstrained	0.778	1.000	1.000	0.750	0.875
GCN protein-only	0.556	0.875	1.000	0.625	0.750
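
The fold detail can be used to check how the ± values in the 10k table were computed. The sketch below shows that the population standard deviation (ddof=0) over the five folds reproduces the table to three decimals; whether the other tables use the same convention is not stated.

```python
from statistics import mean, pstdev

fold_acc = {
    "BINN GCN (Reactome mask)": [0.778, 0.875, 0.750, 0.625, 1.000],
    "GCN unconstrained": [0.778, 1.000, 1.000, 0.750, 0.875],
    "GCN protein-only": [0.556, 0.875, 1.000, 0.625, 0.750],
}

# (mean, population std) per model, rounded as in the table.
summary = {m: (round(mean(a), 3), round(pstdev(a), 3)) for m, a in fold_acc.items()}
for model, (mu, sd) in summary.items():
    print(f"{model}: {mu:.3f} ± {sd:.3f}")
# BINN GCN (Reactome mask): 0.806 ± 0.126
# GCN unconstrained: 0.881 ± 0.106
# GCN protein-only: 0.761 ± 0.162
```

With test folds of 8 or 9 donors, one donor is worth 11 to 12.5 accuracy points, which explains the coarse fold values and the wide spreads.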

50k-cell sample (v1 — protein names not properly resolved)

Cells per donor: NK ~930, T ~1,900, B ~1,700 (median). Protein features in this run used a simple normalization (strip -\d+, uppercase) that missed ~77% of antibodies (only 51/224 matched Reactome). Protein-only benchmark used a flat synthetic pathway map.
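
The v1 normalization can be read as a one-line regex. The antibody names below are hypothetical examples, used only to show why most CD-style names miss: the normalized string is a CD antigen name rather than a gene symbol.

```python
import re


def normalize_v1(name):
    """v1 rule as described: strip '-<digits>' suffixes, then uppercase."""
    return re.sub(r"-\d+", "", name).upper()


examples = ["CD3-1", "cd56-2", "CD45"]  # hypothetical panel entries
normalized = [normalize_v1(n) for n in examples]
```

Here all three names normalize to CD3, CD56 and CD45, none of which is the gene symbol under which Reactome annotates the protein.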

Model	Features	Graph	Pathway mask	Acc	AUC
BINN GCN	RNA+Protein (Reactome)	RNA PCA kNN	Yes	0.903 ± 0.049	0.971 ± 0.057
GCN unconstrained	RNA+Protein (Reactome)	RNA PCA kNN	No	0.883 ± 0.122	0.943 ± 0.114
GCN protein-only	Protein (flat map)	Protein PCA kNN	No	0.633 ± 0.238	0.671 ± 0.178

50k-cell sample (v2 — CD antigen → HGNC synonym map applied)

Added protein_synonyms.py mapping 209 antibody panel names to HGNC gene symbols (CD3→CD3E, CD45→PTPRC, CD56→NCAM1, etc.). 182/224 antibodies now matched to Reactome (164 unique HGNC genes). Protein-only benchmark uses the Reactome-filtered protein set with the actual Reactome pathway map.
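
A synonym lookup of this kind, and the way matched antibodies can outnumber unique genes, can be sketched as follows. Only the three CD→HGNC pairs quoted above are taken from the text; the panel names, the toy Reactome gene set and the function names are hypothetical and do not reflect the contents of protein_synonyms.py.

```python
import re

# The three pairs given in the text; the real map covers the whole panel.
CD_TO_HGNC = {"CD3": "CD3E", "CD45": "PTPRC", "CD56": "NCAM1"}


def to_hgnc(antibody_name, synonyms=CD_TO_HGNC):
    """Normalize the antibody name, then translate it if a synonym is known."""
    key = re.sub(r"-\d+", "", antibody_name).upper()
    return synonyms.get(key, key)


def match_rate(antibody_names, reactome_genes):
    """Return (antibodies matched, unique genes matched)."""
    symbols = [to_hgnc(n) for n in antibody_names]
    matched = [s for s in symbols if s in reactome_genes]
    return len(matched), len(set(matched))


panel = ["CD3-1", "CD3-2", "CD45", "CD56", "CADHERIN"]  # hypothetical names
reactome_genes = {"CD3E", "PTPRC", "NCAM1"}             # toy stand-in
n_matched, n_unique = match_rate(panel, reactome_genes)
```

In the toy panel two antibodies resolve to the same gene, so 4 matched antibodies yield 3 unique symbols, the same effect that turns 182 matched antibodies into 164 unique HGNC genes.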

Feature sets: FEATURE_SET in _run_tier2.py controls "rna" / "prot" / "mix".

mix (RNA + protein, merged by HGNC name, RNA PCA kNN):

Model	Features	n_features	Graph	Mask	Acc	AUC
BINN GCN	RNA+Protein (Reactome)	9,661	RNA PCA kNN	Yes	0.903 ± 0.049	0.938 ± 0.055
GCN unconstrained	RNA+Protein (Reactome)	9,661	RNA PCA kNN	No	0.856 ± 0.039	0.957 ± 0.057
GCN protein-only	Protein HGNC→Reactome	164	Protein PCA kNN	Yes	0.731 ± 0.054	0.843 ± 0.106
BINN GAT	RNA+Protein (Reactome)	9,661	RNA PCA kNN	Yes	0.878 ± 0.006	0.950 ± 0.067
BINN ANN	RNA+Protein (Reactome)	9,661	— (no graph)	Yes	0.778 ± 0.097	0.971 ± 0.057

50k-cell sample (v3 — early stopping enabled)

Same setup as v2 (mix feature set, 9,661 features, RNA PCA kNN, 3 trials). Early stopping active: val_size=0.2, patience=50. Epochs actually used per fold ranged ~51–267 (vs fixed 400).

Active params = post-mask nonzero weights (masked models are pruned to ~25K; unconstrained retain all ~4.1M).
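
Counting active parameters under a pathway mask amounts to counting nonzero weights after an elementwise multiply. A toy sketch in NumPy, with a hypothetical gene→pathway membership (the real mask comes from Reactome):

```python
import numpy as np


def build_mask(membership, n_genes, n_pathways):
    """Boolean genes x pathways mask: True where the gene is annotated to the pathway."""
    mask = np.zeros((n_genes, n_pathways), dtype=bool)
    for gene, pathways in membership.items():
        mask[gene, list(pathways)] = True
    return mask


# Toy membership: gene index -> pathway indices.
membership = {0: [0], 1: [0, 1], 2: [1], 3: [2], 4: [0, 2]}
mask = build_mask(membership, n_genes=5, n_pathways=3)

rng = np.random.default_rng(0)
W = rng.normal(size=mask.shape)  # dense weights
W_masked = W * mask              # entries outside the mask become zero
active = int(np.count_nonzero(W_masked))
density = active / mask.size
```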

Why the 160× gap. The layer widths derived from the N_LEVELS=3 pathway map are 10,127 → 401 → 113 → 24 → 2. The first weight matrix alone is 10,127 × 401 = 4,060,927 entries, ~99% of all parameters. In an unconstrained model it is dense. The Reactome mask keeps entry W[g, p] trainable (mask value 1) only if gene g is annotated to sub-pathway p in Reactome; every other entry is fixed at zero. Because 57% of genes belong to exactly one layer-1 pathway and the mean membership is 2.42, the mask has roughly 10,127 × 2.42 ≈ 24,500 nonzero entries out of 4,060,927, a density of 0.6%. The remaining layers (401→113, 113→24, 24→2) are small enough that even at full density they contribute only ~48K params total, so their sparsity is a minor effect. The 160× gap is almost entirely one matrix: the gene→pathway projection layer.
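
The arithmetic in this paragraph can be reproduced directly from the layer widths. Adding one bias per output unit (an assumption about how parameters were counted) gives exactly the 4,109,540 reported for the unconstrained GCN in the table below.

```python
widths = [10_127, 401, 113, 24, 2]
pairs = list(zip(widths[:-1], widths[1:]))

dense_weights = [a * b for a, b in pairs]   # weights per layer at full density
biases = sum(widths[1:])                    # one bias per output unit (assumed)
dense_total = sum(dense_weights) + biases

first_share = dense_weights[0] / sum(dense_weights)  # share held by the first matrix
mask_nonzeros = widths[0] * 2.42                     # genes x mean pathway membership
mask_density = mask_nonzeros / dense_weights[0]
gap = 4_109_540 / 25_609                             # active params: unconstrained / masked GCN

print(dense_weights)             # [4060927, 45313, 2712, 48]
print(dense_total)               # 4109540
print(round(first_share, 3))     # 0.988
print(round(mask_nonzeros))      # 24507
print(round(100 * mask_density, 1))  # 0.6
print(round(gap))                # 160
```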

Model	Features	n_features	Graph	Mask	Acc	AUC	Active params
BINN GCN	RNA+Protein (Reactome)	9,661	RNA PCA kNN	Yes	0.815 ± 0.012	0.889 ± 0.043	25,609
GCN unconstrained	RNA+Protein (Reactome)	9,661	RNA PCA kNN	No	0.878 ± 0.018	0.926 ± 0.062	4,109,540
GCN protein-only	Protein HGNC→Reactome	164	Protein PCA kNN	Yes	0.656 ± 0.108	0.570 ± 0.016	666
BINN GAT	RNA+Protein (Reactome)	9,661	RNA PCA kNN	Yes	0.795 ± 0.066	0.808 ± 0.078	25,609
GAT unconstrained	RNA+Protein (Reactome)	9,661	RNA PCA kNN	No	0.821 ± 0.052	0.955 ± 0.012	4,109,539
BINN ANN	RNA+Protein (Reactome)	9,661	— (no graph)	Yes	0.692 ± 0.008	0.733 ± 0.148	26,147
ANN unconstrained	RNA+Protein (Reactome)	9,661	— (no graph)	No	0.830 ± 0.020	0.960 ± 0.023	4,110,078

Key findings

1. The pathway mask advantage seen at 50k (v2) was an overfitting artefact. At v2 (400 fixed epochs) BINN GCN led 0.903 vs 0.856. With proper early stopping (v3) the ordering reverses: unconstrained GCN 0.878 vs BINN GCN 0.815. The mask helped when models could overfit to donor identities over many epochs; with early stopping the unconstrained model generalises better. The mask-as-regulariser hypothesis is not supported.

2. Unconstrained GCN leads on both accuracy and AUC with early stopping. GCN free: 0.878 acc / 0.926 AUC vs BINN GCN: 0.815 acc / 0.889 AUC (v3). The v2 split between "mask wins accuracy, free wins AUC" collapsed once overfitting was controlled.

3. Protein-only GCN (Reactome-matched, 164 proteins) is substantially weaker. 0.656 acc / 0.570 AUC — well below both RNA+protein models. The protein kNN graph topology (protein PCA neighbours) carries less structural signal for donor classification than the RNA PCA kNN graph. Protein features are informative but need RNA graph structure.

4. Protein synonym mapping resolved 182/224 → 164 unique HGNC symbols. Previously only 51/224 matched. The improvement is due to CD antigen → HGNC lookup (CD3→CD3E, CD45→PTPRC, CD56→NCAM1, etc.). The remaining 42 unmapped antibodies use non-standard names (CADHERIN, INTEGRIN) or map to genes absent from Reactome.

5. mix vs rna: the one extra feature has negligible effect. mix has 9,661 features vs rna's 9,660, so after merging by HGNC name almost every resolved protein coincides with an RNA gene that is already present. The signal improvement from proteins comes mainly from the synonym mapping enabling correct Reactome pathway routing.

6. All models overfit in v2 (train loss → 0 by epoch 40); early stopping fixes this. With only 41 donors donor-level classification is trivially memorised without regularisation. Early stopping (v3): val_size=0.2 of train donors, patience=50. The PyG Linear state dict incompatibility with pruned weights was resolved by checkpointing via {name: p.data.clone()} over model.named_parameters() rather than state_dict().

7. GAT stability advantage disappears with early stopping. v2 GAT: 0.878±0.006 (near-zero fold variance). v3 GAT: 0.795±0.066 (high variance, 2 points below BINN GCN). Fixed-epoch training happened to suppress GAT variance by averaging over a long tail; early stopping exposes fold sensitivity.

8. Graph structure is load-bearing: removing it (ANN) costs ~12 points accuracy. BINN ANN (pathway-masked MLP, no edges) drops to 0.692 acc vs GCN's 0.815 (v3). The kNN cell-similarity graph is not decorative: neighbourhood aggregation over the RNA PCA graph is a meaningful part of the signal. In v2 the ANN's AUC (0.971) was paradoxically the best, suggesting it ranked donors correctly but was less confident at the decision boundary without graph context; in v3 its AUC falls to 0.733, the lowest among the RNA+Protein models, so that pattern did not persist under early stopping.
