Benchmarks — COVID Status Prediction (CITE-seq)

Task: donor-level binary classification (COVID+ vs COVID-)
Evaluation: 5-fold stratified donor CV throughout
Cell lineages: NK, T (CD4+CD8), B (B+PB)


10,000-cell sample

Random subsample of 10k cells from 174,753 total. Cells per donor are sparse — NK median 10, T median 30, B median 21 per donor — so pseudobulk means are noisy. RNA benchmark included here since it was cheap to compute at this scale.

| Model | Features | n_features | n_donors | Acc | AUC |
|---|---|---|---|---|---|
| Pseudobulk LR (ElasticNet) | RNA, Reactome-filtered | 28,980 | 39 | 0.571 ± 0.163 | 0.698 ± 0.190 |
| Pseudobulk LR (ElasticNet) | Protein-only (209 ab) | 627 | 39 | 0.846 ± 0.049 | 0.876 ± 0.063 |
| Pseudobulk XGBoost | Protein-only (209 ab) | 627 | 39 | 0.871 ± 0.079 | 0.905 ± 0.078 |
| Per-lineage LR, NK only | Protein | 209 | 40 | 0.675 ± 0.150 | 0.593 ± 0.114 |
| Per-lineage LR, T only | Protein | 209 | 41 | 0.806 ± 0.058 | 0.514 ± 0.029 |
| Per-lineage LR, B only | Protein | 209 | 40 | 0.800 ± 0.170 | 0.917 ± 0.129 |
| Cell composition only | Fine cell type fractions | 9 | 41 | 0.831 ± 0.125 | 0.881 ± 0.109 |
| BINN GCN donor attention† | RNA+Protein, Reactome | 9,660 | 41 | 0.778 | |

†Single 80/20 donor split (9 test donors), no CV — not directly comparable.


All cells (174,753)

Stable pseudobulk means — NK median 186, T median 474, B median 344 per donor. RNA computed from sparse matrix donor-by-donor (avoids 6.3 GB dense materialization).
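As an illustration of the donor-by-donor computation mentioned above, here is a minimal sketch of pseudobulk means taken directly from a CSR layout, so that only one dense gene vector per donor is ever held. The function name, argument names, and toy data are hypothetical; the plain-list CSR representation is an assumption chosen to keep the example dependency-free (a real pipeline would more likely use a `scipy.sparse` matrix).

```python
from collections import defaultdict


def pseudobulk_means(indptr, indices, data, n_genes, donor_ids):
    """Mean expression per donor from a cells-x-genes matrix in CSR layout.

    indptr/indices/data follow the usual CSR convention; donor_ids[i] is the
    donor of cell i. The full dense matrix is never materialised.
    """
    rows_by_donor = defaultdict(list)
    for cell, donor in enumerate(donor_ids):
        rows_by_donor[donor].append(cell)

    means = {}
    for donor, rows in rows_by_donor.items():
        total = [0.0] * n_genes              # one dense vector per donor
        for r in rows:
            for k in range(indptr[r], indptr[r + 1]):
                total[indices[k]] += data[k]  # accumulate nonzeros only
        means[donor] = [v / len(rows) for v in total]
    return means


# Hypothetical toy matrix, 3 cells x 4 genes:
#   cell0 = [1, 0, 0, 2], cell1 = [0, 0, 3, 0], cell2 = [3, 0, 0, 0]
indptr = [0, 2, 3, 4]
indices = [0, 3, 2, 0]
data = [1.0, 2.0, 3.0, 3.0]
donors = ["d1", "d2", "d1"]
pb = pseudobulk_means(indptr, indices, data, 4, donors)
```

With few cells per donor (as in the 10k sample) each mean is an average over a handful of rows, which is why those pseudobulk profiles are noisy.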

Pseudobulk LR + XGBoost

| Model | Features | n_features | n_donors | Acc | AUC |
|---|---|---|---|---|---|
| PB LR (ElasticNet) | RNA only (Reactome) | 28,971 | 41 | 0.761 ± 0.117 | 0.571 ± 0.143 |
| PB XGBoost | RNA only (Reactome) | 28,971 | 41 | 0.806 ± 0.058 | 0.860 ± 0.177 |
| PB LR (ElasticNet) | Protein only | 627 | 41 | 0.761 ± 0.117 | 0.514 ± 0.029 |
| PB XGBoost | Protein only | 627 | 41 | 0.781 ± 0.093 | 0.879 ± 0.105 |
| PB LR (ElasticNet) | RNA + Protein | 29,463 | 41 | 0.783 ± 0.081 | 0.571 ± 0.143 |
| PB XGBoost | RNA + Protein | 29,463 | 41 | 0.828 ± 0.064 | 0.821 ± 0.150 |

Per-lineage LR

| Lineage | RNA Acc | RNA AUC | Protein Acc | Protein AUC |
|---|---|---|---|---|
| NK | 0.806 ± 0.058 | 0.543 ± 0.086 | 0.783 ± 0.081 | 0.557 ± 0.114 |
| T | 0.806 ± 0.058 | 0.557 ± 0.114 | 0.783 ± 0.081 | 0.457 ± 0.086 |
| B | 0.806 ± 0.058 | 0.571 ± 0.143 | 0.806 ± 0.058 | 0.571 ± 0.143 |

Cell composition

| Model | Features | n_features | n_donors | Acc | AUC |
|---|---|---|---|---|---|
| Cell composition only | Fine cell type fractions | 9 | 41 | 0.803 ± 0.065 | 0.900 ± 0.162 |
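The composition baseline uses no expression values at all, only the share of each cell type per donor. A minimal sketch of how such features could be built follows; the function name, the label strings, and the toy data are hypothetical, and the actual set of 9 fine cell types is not listed here.

```python
from collections import Counter


def composition_features(cell_types, donor_ids, type_order):
    """Per-donor cell type fractions, in the column order given by type_order."""
    counts = {}
    for cell_type, donor in zip(cell_types, donor_ids):
        counts.setdefault(donor, Counter())[cell_type] += 1

    features = {}
    for donor, counter in counts.items():
        n_cells = sum(counter.values())
        # Counter returns 0 for cell types a donor does not have.
        features[donor] = [counter[t] / n_cells for t in type_order]
    return features


# Hypothetical toy labels: donor d1 has 1 NK and 2 T cells, donor d2 has 1 B cell.
types = ["NK", "T", "T", "B"]
donors = ["d1", "d1", "d1", "d2"]
fractions = composition_features(types, donors, ["NK", "T", "B"])
```

Each donor row sums to 1, so the resulting matrix (41 donors x 9 fractions in the table above) can be passed to any standard classifier.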

BINN (reference)

| Model | Features | n_donors | Acc | AUC |
|---|---|---|---|---|
| BINN GCN donor attention† | RNA+Protein, Reactome | 41 | 0.778 | |

†Single 80/20 split (9 test donors), no CV — not directly comparable.

BINN — all cells (5-fold donor CV, 3 trials)

RNA+Protein mix features, kNN=5, early stopping (val_size=0.2, patience=50).
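For readers unfamiliar with the kNN graph setting, the following is a brute-force sketch of building directed k-nearest-neighbour edges between cells from low-dimensional coordinates. It assumes Euclidean distance on PCA coordinates and does not symmetrise the edges; neither choice is confirmed by these notes, and the function name is hypothetical.

```python
import math


def knn_edges(points, k=5):
    """Directed edges (i, j) from each point i to its k nearest neighbours j."""
    edges = []
    for i, p in enumerate(points):
        # Sort all other points by distance; ties are broken by index.
        dists = sorted((math.dist(p, q), j) for j, q in enumerate(points) if j != i)
        edges.extend((i, j) for _, j in dists[:k])
    return edges


# Hypothetical 1-D "PCA" coordinates for four cells, using k=1 for readability.
toy_edges = knn_edges([[0.0], [1.0], [2.0], [10.0]], k=1)
```

The quadratic loop is only suitable for toy inputs; at 50k cells or more an approximate or tree-based neighbour search would be needed.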

Donor-level fusion via attention pooling (learned softmax over cells per donor).

| Model | Features | n_features | n_donors | Acc | AUC |
|---|---|---|---|---|---|
| BINN GCN (Reactome mask) | RNA+Protein (mix) | 9,661 | 41 | 0.910 ± 0.042 | 1.000 ± 0.000 |
| GCN unconstrained | RNA+Protein (mix) | 9,661 | 41 | 0.936 ± 0.012 | 0.984 ± 0.023 |
| GAT unconstrained | RNA+Protein (mix) | 9,661 | 41 | 0.837 ± 0.078 | 1.000 ± 0.000 |
| BINN ANN (Reactome mask) | RNA+Protein (mix) | 9,661 | 41 | 0.803 ± 0.033 | 0.832 ± 0.034 |
| ANN unconstrained | RNA+Protein (mix) | 9,661 | 41 | 0.892 ± 0.024 | 0.990 ± 0.014 |

Donor-level fusion via moments pooling (mean + variance of cell embeddings per donor, no learned weights).

| Model | Features | n_features | n_donors | Acc | AUC |
|---|---|---|---|---|---|
| BINN GCN moments (Reactome mask) | RNA+Protein (mix) | 9,661 | 41 | 0.697 ± 0.096 | 0.847 ± 0.086 |
| BINN ANN moments (Reactome mask) | RNA+Protein (mix) | 9,661 | 41 | 0.588 ± 0.129 | 0.792 ± 0.080 |

Attention vs moments pooling: attention pooling substantially outperforms moments pooling on both accuracy and AUC. For the Reactome-masked models, GCN accuracy drops from 0.910 → 0.697 (AUC 1.000 → 0.847) and ANN accuracy from 0.803 → 0.588 (AUC 0.832 → 0.792) when switching to moments. The learned cell weighting in the attention mechanism is load-bearing: simply summarising the embedding distribution (mean/variance) discards which cells are most diagnostically informative for a given donor.
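The two donor-level pooling schemes compared above can be sketched side by side. This is an illustration only: in the real model the attention scores would come from a learned scoring layer, whereas here they are passed in as plain numbers, and the use of population variance (dividing by n) is an assumption. Function names are hypothetical.

```python
import math


def attention_pool(embeddings, scores):
    """Softmax-weighted mean of one donor's cell embeddings.

    scores holds one (assumed already computed) attention logit per cell.
    """
    m = max(scores)                                # subtract max for stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    dim = len(embeddings[0])
    pooled = [sum(w * e[j] for w, e in zip(weights, embeddings)) for j in range(dim)]
    return pooled, weights


def moments_pool(embeddings):
    """Concatenated per-dimension mean and variance, with no learned weights."""
    n = len(embeddings)
    dim = len(embeddings[0])
    mean = [sum(e[j] for e in embeddings) / n for j in range(dim)]
    var = [sum((e[j] - mean[j]) ** 2 for e in embeddings) / n for j in range(dim)]
    return mean + var


# Hypothetical donor with two cells and 2-D embeddings.
cells = [[0.0, 0.0], [2.0, 4.0]]
uniform, w_uniform = attention_pool(cells, [0.0, 0.0])    # equal weights
focused, w_focused = attention_pool(cells, [0.0, 100.0])  # nearly all weight on cell 2
moments = moments_pool(cells)
```

With equal scores, attention pooling reduces to the mean, which is the first half of the moments vector. The difference only appears once the scores single out particular cells, which is the behaviour moments pooling cannot express.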


Key findings

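1. The small 10k subsample badly inflated results.
Protein LR accuracy dropped from 0.846 → 0.761 and its AUC from 0.876 → 0.514 once all cells were used. Per-lineage protein AUCs were already weak for NK and T at 10k (0.593 and 0.514), and the one strong value, B cell AUC, fell from 0.917 → 0.571. The 10k pseudobulk means were too noisy to trust. All future comparisons should use all cells.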

2. Cell composition alone is the strongest single signal.
9 cell type fractions, no expression, gives 0.803 acc / 0.900 AUC. COVID shifts the immune landscape (plasmablast expansion, NK/T redistribution) enough that counting cell types predicts status almost as well as measuring protein levels.

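3. XGBoost marginally beats LR on proteins in accuracy (0.781 vs 0.761).
Non-linear interactions exist but are not the dominant signal for accuracy. The AUC gap is far larger (0.879 vs 0.514), so LR's ranking of donors is near chance even though its thresholded accuracy is close.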

4. Per-lineage AUCs are all weak on full data (0.46–0.57).
No single lineage protein profile cleanly separates COVID+ from COVID-. The signal requires combining lineages — consistent with the multi-modality motivation for BINN, but the combination is captured by cell composition (cell type fractions) just as well as by expression values.

5. BINN (0.778, single split) is roughly competitive with XGBoost (0.781 CV).
But BINN needs proper CV before this can be claimed confidently.



Tier-2 benchmarks: BINN architecture ablations

5-fold stratified donor CV. Early stopping on validation loss (patience=50, min_delta=1e-4), with val_size=0.2 of train donors held out for validation. The best checkpoint is restored by cloning parameter tensors. This bypasses PyG Linear._save_to_state_dict, which always writes weight rather than weight_orig/weight_mask and therefore breaks the standard load_state_dict restore path for pruned modules.

The 10k and 50k v1/v2 tables below predate the early stopping fix (400 fixed epochs, no val split); results with early stopping are in the 50k v3 section.
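Because the unit of evaluation is the donor, folds have to be assigned to donors first and only then expanded to cells, so that no donor contributes cells to both train and test. A minimal sketch of such a stratified donor split follows; the function names, the seed, and the 28/13 class balance in the toy data are hypothetical (the actual class balance is not stated in these notes), and a library splitter such as scikit-learn's `StratifiedKFold` applied to the donor table would serve the same purpose.

```python
import random


def stratified_donor_folds(donor_labels, n_splits=5, seed=0):
    """donor_labels maps donor id -> class label. Returns (train, test) donor lists."""
    rng = random.Random(seed)
    by_class = {}
    for donor, label in sorted(donor_labels.items()):
        by_class.setdefault(label, []).append(donor)

    folds = [[] for _ in range(n_splits)]
    offset = 0
    for label in sorted(by_class):
        donors = by_class[label]
        rng.shuffle(donors)
        # Deal donors of each class round-robin so every fold sees both classes.
        for i, donor in enumerate(donors):
            folds[(i + offset) % n_splits].append(donor)
        offset += len(donors)

    splits = []
    for k in range(n_splits):
        train = [d for j, fold in enumerate(folds) if j != k for d in fold]
        splits.append((train, folds[k]))
    return splits


def cells_for(donors, cell_donor_ids):
    """Indices of the cells that belong to the given donors."""
    keep = set(donors)
    return [i for i, d in enumerate(cell_donor_ids) if d in keep]


# Hypothetical cohort of 41 donors with an assumed 28/13 class balance.
toy_labels = {f"d{i:02d}": (1 if i < 28 else 0) for i in range(41)}
toy_splits = stratified_donor_folds(toy_labels, n_splits=5, seed=0)
```

The same helper can be applied a second time to the training donors of each fold to carve out the validation donors used for early stopping.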

10k-cell sample

| Model | Features | Graph | Pathway mask | Acc | AUC |
|---|---|---|---|---|---|
| BINN GCN | RNA+Protein (Reactome) | RNA PCA kNN | Yes | 0.806 ± 0.126 | 0.890 ± 0.124 |
| GCN unconstrained | RNA+Protein (Reactome) | RNA PCA kNN | No | 0.881 ± 0.106 | 0.936 ± 0.062 |
| GCN protein-only | Protein (209 ab) | Protein PCA kNN | No | 0.761 ± 0.162 | 0.831 ± 0.162 |

Fold detail (acc):

| Model | Fold 1 | Fold 2 | Fold 3 | Fold 4 | Fold 5 |
|---|---|---|---|---|---|
| BINN GCN (Reactome mask) | 0.778 | 0.875 | 0.750 | 0.625 | 1.000 |
| GCN unconstrained | 0.778 | 1.000 | 1.000 | 0.750 | 0.875 |
| GCN protein-only | 0.556 | 0.875 | 1.000 | 0.625 | 0.750 |

50k-cell sample (v1 — protein names not properly resolved)

Cells per donor: NK ~930, T ~1,900, B ~1,700 (median). Protein features in this run used a simple normalization (strip -\d+, uppercase) that missed ~75% of proteins (51/224 matched Reactome). Protein-only benchmark used a flat synthetic pathway map.

| Model | Features | Graph | Pathway mask | Acc | AUC |
|---|---|---|---|---|---|
| BINN GCN | RNA+Protein (Reactome) | RNA PCA kNN | Yes | 0.903 ± 0.049 | 0.971 ± 0.057 |
| GCN unconstrained | RNA+Protein (Reactome) | RNA PCA kNN | No | 0.883 ± 0.122 | 0.943 ± 0.114 |
| GCN protein-only | Protein (flat map) | Protein PCA kNN | No | 0.633 ± 0.238 | 0.671 ± 0.178 |

50k-cell sample (v2 — CD antigen → HGNC synonym map applied)

Added protein_synonyms.py mapping 209 antibody panel names to HGNC gene symbols (CD3→CD3E, CD45→PTPRC, CD56→NCAM1, etc.). 182/224 antibodies now matched to Reactome (164 unique HGNC genes). Protein-only benchmark uses the Reactome-filtered protein set with the actual Reactome pathway map.
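The difference between the v1 normalisation and the v2 synonym lookup can be illustrated with a small sketch. This is not the content of protein_synonyms.py: the dictionary holds only the three pairs quoted above, the function names are hypothetical, the example antibody names are made up, and treating `-\d+` as a trailing suffix is an assumption.

```python
import re

# Illustrative subset only: the three CD antigen -> HGNC pairs quoted in the text.
CD_TO_HGNC = {"CD3": "CD3E", "CD45": "PTPRC", "CD56": "NCAM1"}


def normalize_v1(name):
    """v1 rule: strip a trailing -<digits> suffix (assumed) and uppercase."""
    return re.sub(r"-\d+$", "", name).upper()


def resolve_hgnc(name, synonyms=CD_TO_HGNC):
    """v2 rule: normalise, then map through the synonym table if present."""
    key = normalize_v1(name)
    return synonyms.get(key, key)


def match_counts(names, reactome_genes, resolver):
    """Return (antibodies matched, unique genes matched) against a gene set."""
    hits = [g for g in map(resolver, names) if g in reactome_genes]
    return len(hits), len(set(hits))


# Hypothetical panel entries and a hypothetical Reactome gene set.
panel = ["CD3-1", "CD3-2", "CD56"]
reactome = {"CD3E", "NCAM1"}
v1_counts = match_counts(panel, reactome, normalize_v1)
v2_counts = match_counts(panel, reactome, resolve_hgnc)
```

Several antibodies can resolve to the same gene, which is why the number of matched antibodies (182) exceeds the number of unique HGNC genes (164).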

Feature sets: FEATURE_SET in _run_tier2.py controls "rna" / "prot" / "mix".

mix (RNA + protein, merged by HGNC name, RNA PCA kNN):

| Model | Features | n_features | Graph | Mask | Acc | AUC |
|---|---|---|---|---|---|---|
| BINN GCN | RNA+Protein (Reactome) | 9,661 | RNA PCA kNN | Yes | 0.903 ± 0.049 | 0.938 ± 0.055 |
| GCN unconstrained | RNA+Protein (Reactome) | 9,661 | RNA PCA kNN | No | 0.856 ± 0.039 | 0.957 ± 0.057 |
| GCN protein-only | Protein HGNC→Reactome | 164 | Protein PCA kNN | Yes | 0.731 ± 0.054 | 0.843 ± 0.106 |
| BINN GAT | RNA+Protein (Reactome) | 9,661 | RNA PCA kNN | Yes | 0.878 ± 0.006 | 0.950 ± 0.067 |
| BINN ANN | RNA+Protein (Reactome) | 9,661 | (no graph) | Yes | 0.778 ± 0.097 | 0.971 ± 0.057 |

50k-cell sample (v3 — early stopping enabled)

Same setup as v2 (mix feature set, 9,661 features, RNA PCA kNN, 3 trials). Early stopping active: val_size=0.2, patience=50. Epochs actually used per fold ranged ~51–267 (vs fixed 400).
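The early stopping rule described here (patience=50, min_delta=1e-4, restore the best checkpoint) can be sketched without any framework. The class name and the toy loss curve are hypothetical. The `snapshot` callable stands in for the parameter-cloning checkpoint, which in a PyTorch loop would be something like `{name: p.data.clone()}` over `model.named_parameters()`.

```python
class EarlyStopper:
    """Stop when validation loss has not improved by min_delta for `patience` epochs."""

    def __init__(self, patience=50, min_delta=1e-4):
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = float("inf")
        self.best_epoch = -1
        self.best_state = None
        self.bad_epochs = 0

    def step(self, epoch, val_loss, snapshot):
        """Record one epoch; return True when training should stop."""
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.best_epoch = epoch
            self.best_state = snapshot()   # copy of the parameters at the best epoch
            self.bad_epochs = 0
            return False
        self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Toy run with a synthetic validation curve whose minimum is at epoch 10.
stopper = EarlyStopper(patience=50, min_delta=1e-4)
params = {"w": 0}
stopped_at = None
for epoch in range(400):
    params["w"] = epoch                        # stand-in for one optimiser step
    val_loss = 0.5 + 0.01 * abs(epoch - 10)
    if stopper.step(epoch, val_loss, lambda: dict(params)):
        stopped_at = epoch
        break
params = stopper.best_state                    # restore the best checkpoint
```

Under this rule a fold can never stop before `patience` epochs have passed since its best epoch, which is consistent with the lower end of the observed range sitting just above 50.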

Active params = post-mask nonzero weights (masked models are pruned to ~25K; unconstrained retain all ~4.1M).

Why the 160× gap. The layer widths derived from the N_LEVELS=3 pathway map are 10,127 → 401 → 113 → 24 → 2. The first weight matrix alone has 10,127 × 401 = 4,060,927 entries, about 99% of all parameters. In an unconstrained model it is dense. In a masked model, the Reactome mask keeps entry W[g, p] (mask value 1) only if gene g is annotated to sub-pathway p in Reactome. Because 57% of genes belong to exactly one layer-1 pathway and the mean membership is 2.42, the mask has roughly 10,127 × 2.42 ≈ 24,500 nonzero entries out of 4,060,927, a density of 0.6%. The remaining layers (401→113, 113→24, 24→2) are small enough that even at full density they contribute only ~48K weights in total, so their sparsity is a minor effect. The 160× gap is almost entirely one matrix: the gene→pathway projection layer.
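The arithmetic in the paragraph above can be checked in a few lines. The layer widths, mean membership, and reported active-parameter counts are taken from this section; biases and any graph-layer parameters are ignored, which is a simplifying assumption, so the totals are approximate.

```python
widths = [10_127, 401, 113, 24, 2]   # layer widths from the N_LEVELS=3 pathway map
mean_membership = 2.42               # mean layer-1 pathways per gene

# Dense weight count for each consecutive pair of layers.
dense_weights = [a * b for a, b in zip(widths, widths[1:])]
first_dense = dense_weights[0]
rest_dense = sum(dense_weights[1:])

# Expected nonzeros in the masked first layer and the resulting density.
first_masked = round(widths[0] * mean_membership)
density = first_masked / first_dense

# Share of all dense weights that sit in the first matrix.
first_share = first_dense / sum(dense_weights)

# Ratio of the active parameter counts reported for GCN unconstrained vs BINN GCN.
reported_ratio = 4_109_540 / 25_609

print(f"first layer dense:  {first_dense:,}")
print(f"other layers dense: {rest_dense:,}")
print(f"first layer masked: {first_masked:,} ({density:.2%} density)")
print(f"first layer share:  {first_share:.1%}")
print(f"reported ratio:     {reported_ratio:.1f}x")
```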

| Model | Features | n_features | Graph | Mask | Acc | AUC | Active params |
|---|---|---|---|---|---|---|---|
| BINN GCN | RNA+Protein (Reactome) | 9,661 | RNA PCA kNN | Yes | 0.815 ± 0.012 | 0.889 ± 0.043 | 25,609 |
| GCN unconstrained | RNA+Protein (Reactome) | 9,661 | RNA PCA kNN | No | 0.878 ± 0.018 | 0.926 ± 0.062 | 4,109,540 |
| GCN protein-only | Protein HGNC→Reactome | 164 | Protein PCA kNN | Yes | 0.656 ± 0.108 | 0.570 ± 0.016 | 666 |
| BINN GAT | RNA+Protein (Reactome) | 9,661 | RNA PCA kNN | Yes | 0.795 ± 0.066 | 0.808 ± 0.078 | 25,609 |
| GAT unconstrained | RNA+Protein (Reactome) | 9,661 | RNA PCA kNN | No | 0.821 ± 0.052 | 0.955 ± 0.012 | 4,109,539 |
| BINN ANN | RNA+Protein (Reactome) | 9,661 | (no graph) | Yes | 0.692 ± 0.008 | 0.733 ± 0.148 | 26,147 |
| ANN unconstrained | RNA+Protein (Reactome) | 9,661 | (no graph) | No | 0.830 ± 0.020 | 0.960 ± 0.023 | 4,110,078 |

Key findings

1. The pathway mask advantage seen at 50k (v2) was an overfitting artefact. At v2 (400 fixed epochs) BINN GCN led 0.903 vs 0.856. With proper early stopping (v3) the ordering reverses: unconstrained GCN 0.878 vs BINN GCN 0.815. The mask helped when models could overfit to donor identities over many epochs; with early stopping the unconstrained model generalises better. The mask-as-regulariser hypothesis is not supported.

2. Unconstrained GCN leads on both accuracy and AUC with early stopping. GCN free: 0.878 acc / 0.926 AUC vs BINN GCN: 0.815 acc / 0.889 AUC (v3). The v2 split between "mask wins accuracy, free wins AUC" collapsed once overfitting was controlled.

3. Protein-only GCN (Reactome-matched, 164 proteins) is substantially weaker. 0.656 acc / 0.570 AUC, well below both RNA+protein models. This run changes both the feature set and the graph (protein PCA neighbours instead of RNA PCA neighbours), so the two effects are not separated. The working interpretation is that the protein kNN graph topology carries less structural signal for donor classification than the RNA PCA kNN graph, and that protein features are informative but need RNA graph structure.

4. Protein synonym mapping resolved 182/224 → 164 unique HGNC symbols. Previously only 51/224 matched. The improvement is due to CD antigen → HGNC lookup (CD3→CD3E, CD45→PTPRC, CD56→NCAM1, etc.). The remaining 42 unmapped antibodies use non-standard names (CADHERIN, INTEGRIN) or map to genes absent from Reactome.

5. mix vs rna: the single extra feature has a negligible effect. mix has 9,661 features vs 9,660 for rna, so only one HGNC-resolved protein feature does not overlap RNA. The signal improvement from proteins comes mainly from the synonym mapping enabling correct Reactome pathway routing.

6. All models overfit in v2 (train loss → 0 by epoch 40); early stopping fixes this. With only 41 donors, donor-level classification is trivially memorised without regularisation. Early stopping (v3) uses val_size=0.2 of train donors and patience=50. The PyG Linear state dict incompatibility with pruned weights was resolved by checkpointing via {name: p.data.clone()} over model.named_parameters() rather than state_dict().

7. GAT stability advantage disappears with early stopping. v2 GAT: 0.878±0.006 (near-zero fold variance). v3 GAT: 0.795±0.066 (high variance, 2 points below BINN GCN). Fixed-epoch training happened to suppress GAT variance by averaging over a long tail; early stopping exposes fold sensitivity.

8. Graph structure is load-bearing for the masked models: removing it (ANN) costs ~12 points accuracy. In v3, BINN ANN (pathway-masked MLP, no edges) drops to 0.692 acc vs BINN GCN's 0.815. The kNN cell-similarity graph is not decorative: neighbourhood aggregation over the RNA PCA graph is a meaningful part of the signal. In v2, BINN ANN's AUC (0.971) was paradoxically the best of any model despite its low accuracy (0.778), suggesting it ranked donors correctly but placed them poorly relative to the decision boundary without graph context. That advantage does not survive early stopping: in v3 its AUC (0.733) is the lowest among the RNA+Protein models.
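Several rows in these tables pair a high AUC with a noticeably lower accuracy. The toy example below shows how that can happen: AUC depends only on how donors are ranked, while accuracy depends on where the scores fall relative to the 0.5 threshold. The scores are invented for illustration and are not taken from any run.

```python
def accuracy(labels, probs, threshold=0.5):
    """Fraction of correct predictions after thresholding the scores."""
    preds = [1 if p >= threshold else 0 for p in probs]
    return sum(int(p == y) for p, y in zip(preds, labels)) / len(labels)


def roc_auc(labels, probs):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [p for p, y in zip(probs, labels) if y == 1]
    neg = [p for p, y in zip(probs, labels) if y == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical test fold: every positive is ranked above every negative,
# but three of the four positives score below the 0.5 threshold.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
probs = [0.48, 0.45, 0.44, 0.60, 0.40, 0.35, 0.30, 0.20]

toy_auc = roc_auc(labels, probs)
toy_acc = accuracy(labels, probs)
toy_acc_shifted = accuracy(labels, probs, threshold=0.42)
```

In this toy case the ranking is perfect and only the threshold is wrong, so moving the cut-off recovers full accuracy. With about 8 test donors per fold, a single misplaced donor moves accuracy by more than 10 points, which is worth keeping in mind when reading the fold-level standard deviations.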