A calibrated machine-learning classifier for the pathogenicity of missense variants. It ensembles established in-silico predictors, validates with gene-grouped cross-validation, calibrates its probabilities, and maps them to ACMG/AMP computational evidence strengths (PP3 / BP4). It trains, evaluates, and classifies new variants.
- Input: a missense variant's in-silico features (e.g. REVEL score and population allele frequency; the design also supports CADD, AlphaMissense, conservation, SIFT/PolyPhen, etc.).
- Output: a calibrated probability of pathogenicity, a Pathogenic/Benign call, and an ACMG computational-evidence tier (PP3_Supporting … PP3_VeryStrong / BP4_Supporting … BP4_Strong).
- Scope: missense variants. It supplies the computational evidence component, which is one input to a full clinical ACMG/AMP classification, not a standalone clinical verdict.
- Model: gradient-boosted trees (LightGBM; scikit-learn `HistGradientBoostingClassifier` fallback). Both handle the heavy missingness of in-silico scores natively. Monotone constraints enforce sensible directionality.
- Anti-circularity (Grimm et al., Hum Mutat 2015): gene-grouped CV (`GroupKFold`) defeats type-1 circularity; `--exclude-meta` (primary features only) probes type-2 circularity.
- Calibration: isotonic regression on out-of-fold predictions → trustworthy posteriors.
- Clinical mapping: posteriors → local likelihood ratios → ACMG evidence tiers (Tavtigian/Pejaver).
- Interpretability: SHAP (falls back to permutation importance).
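The gene-grouped split in the list above can be illustrated without any library. The sketch below is an assumed, simplified stand-in for what scikit-learn's `GroupKFold` provides, not the code in `variant_classifier.py`: `gene_grouped_folds` is a hypothetical helper and the gene list is toy data. It keeps every variant of a gene in a single fold, so no gene contributes to both training and evaluation.

```python
from collections import defaultdict


def gene_grouped_folds(genes, n_splits=5):
    """Yield (train_idx, test_idx) so that all variants of a gene share one fold.

    Hypothetical helper for illustration only.
    """
    by_gene = defaultdict(list)
    for idx, gene in enumerate(genes):
        by_gene[gene].append(idx)

    fold_members = [[] for _ in range(n_splits)]
    # Greedy balancing: place the largest genes first, each into the
    # fold that currently holds the fewest variants.
    for gene in sorted(by_gene, key=lambda g: (-len(by_gene[g]), g)):
        smallest = min(range(n_splits), key=lambda k: len(fold_members[k]))
        fold_members[smallest].extend(by_gene[gene])

    for k in range(n_splits):
        test_idx = sorted(fold_members[k])
        held_out = set(test_idx)
        train_idx = [i for i in range(len(genes)) if i not in held_out]
        yield train_idx, test_idx


# Toy gene column (one entry per variant).
genes = ["BRCA1", "BRCA1", "CFTR", "TP53", "TP53", "TP53", "MYH7", "CFTR"]

for train_idx, test_idx in gene_grouped_folds(genes, n_splits=3):
    train_genes = {genes[i] for i in train_idx}
    test_genes = {genes[i] for i in test_idx}
    # No gene may appear on both sides of the split.
    assert not train_genes & test_genes
    print(sorted(test_genes), test_idx)
```

With a random per-variant split, variants from the same gene would land on both sides, and a model could score well by memorising gene identity rather than variant effect. That is the type-1 circularity the grouped split is meant to remove.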

Install the dependencies and run the synthetic demo:

    pip install -r requirements.txt
    python variant_classifier.py --mode demo

Labels come from ClinVar (free); features from REVEL (free, no institutional email). See `DATASET_GUIDE.md` for the full recipe.

    # ClinVar VCF (labels + allele frequency) and REVEL CSV (predictor score), then:
    python build_from_clinvar_revel.py \
        --clinvar-vcf clinvar.vcf.gz \
        --revel revel_with_transcript_ids \
        --out clinvar_training.tsv \
        --min-stars 2 --drop-empty-genes

For the full multi-score feature set (REVEL, CADD, AlphaMissense, conservation, …), `build_dataset.py` ingests ClinVar annotated with dbNSFP (requires dbNSFP's institutional-email academic download).

    # evaluate (gene-grouped CV) and save the trained model + calibrator
    python variant_classifier.py --mode real --data clinvar_training.tsv \
        --save-model variant_model.joblib

    # classify NEW variants (a TSV with the model's feature columns)
    printf "gene\tREVEL_score\tgnomAD_AF\nBRCA1\t0.95\t0.000001\nCFTR\t0.08\t0.03\n" > new_variants.tsv
    python variant_classifier.py --mode predict \
        --load-model variant_model.joblib --data new_variants.tsv --out preds.tsv

`preds.tsv` gains three columns: `pathogenic_prob`, `predicted_class`, `acmg_evidence`. (To score a genuinely novel variant you must first annotate it with the same features, REVEL score and allele frequency, exactly as the training data was built.)
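To make the relationship between a calibrated probability and an evidence tier concrete, here is a minimal sketch of a posterior → likelihood ratio → tier mapping. It is not the repo's implementation. The prior of 0.10 and the very-strong odds of 350 (with strong, moderate, and supporting as its 1/2, 1/4, and 1/8 powers) are the values from Tavtigian et al. (2018) and are assumptions here, as are the function names and the `Indeterminate` label. The sketch also assumes the input probability is expressed under that same prior.

```python
PRIOR = 0.10            # assumed prior probability of pathogenicity
O_VERY_STRONG = 350.0   # assumed odds of pathogenicity for very strong evidence

# Strong, moderate and supporting are successive square roots of the
# very-strong odds (about 18.7, 4.33 and 2.08).
THRESHOLDS = [
    ("VeryStrong", O_VERY_STRONG),
    ("Strong", O_VERY_STRONG ** 0.5),
    ("Moderate", O_VERY_STRONG ** 0.25),
    ("Supporting", O_VERY_STRONG ** 0.125),
]


def likelihood_ratio(prob, prior=PRIOR, eps=1e-6):
    """Posterior odds divided by prior odds."""
    prob = min(max(prob, eps), 1.0 - eps)  # avoid division by zero at 0 or 1
    return (prob / (1.0 - prob)) / (prior / (1.0 - prior))


def acmg_tier(prob, prior=PRIOR):
    """Map a calibrated probability to a PP3/BP4 tier (hypothetical helper)."""
    lr = likelihood_ratio(prob, prior)
    for name, cutoff in THRESHOLDS:
        if lr >= cutoff:
            return f"PP3_{name}"
    # Benign side: reciprocal cutoffs, topping out at Strong.
    for name, cutoff in THRESHOLDS[1:]:
        if lr <= 1.0 / cutoff:
            return f"BP4_{name}"
    return "Indeterminate"


for p in (0.001, 0.01, 0.10, 0.50, 0.70, 0.99):
    print(p, round(likelihood_ratio(p), 3), acmg_tier(p))
```

The cutoffs depend on the chosen prior, so a different prior (or a posterior calibrated against a different class balance) would need its own thresholds.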
74,368 ClinVar 2-star missense variants across 7,431 genes (34.5% pathogenic); gene-grouped 5-fold cross-validation.
| Model | Features | AUROC | AUPRC |
|---|---|---|---|
| Full | REVEL + allele frequency | 0.989 | 0.981 |
| REVEL only | REVEL | 0.959 | 0.941 |
| Allele frequency only | allele frequency | 0.916 | 0.772 |
Calibration is good: the Brier score improves from 0.039 to 0.036 after isotonic regression.
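For readers unfamiliar with the two ingredients of that statement, the sketch below computes a Brier score and fits an isotonic (monotone, non-decreasing) calibration map with the pool-adjacent-violators algorithm. It is a pure-Python illustration on made-up scores, not the project's code. The helper names are hypothetical, scores are assumed to be distinct, and for brevity the map is fitted and scored on the same toy data rather than on out-of-fold predictions.

```python
def brier_score(labels, probs):
    """Mean squared difference between predicted probability and 0/1 label."""
    return sum((p - y) ** 2 for y, p in zip(labels, probs)) / len(labels)


def fit_isotonic(scores, labels):
    """Pool-adjacent-violators fit; returns calibrated values in input order.

    Assumes distinct scores (ties are not pooled in this sketch).
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    blocks = []  # each block is [sum_of_labels, count]
    for i in order:
        blocks.append([float(labels[i]), 1])
        # Merge backwards while the previous block's mean exceeds the current one.
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count

    fitted_sorted = []
    for total, count in blocks:
        fitted_sorted.extend([total / count] * count)

    calibrated = [0.0] * len(scores)
    for rank, i in enumerate(order):
        calibrated[i] = fitted_sorted[rank]
    return calibrated


# Toy raw scores and labels (1 = pathogenic), invented for illustration.
scores = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
labels = [0, 0, 1, 0, 0, 1, 1, 1]

calibrated = fit_isotonic(scores, labels)
print("Brier before:", brier_score(labels, scores))
print("Brier after: ", brier_score(labels, calibrated))
```

Fitting and scoring on the same data can only lower the Brier score, because the identity map is itself monotone. Fitting on out-of-fold predictions avoids that optimism.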
- Allele-frequency label leakage. ClinVar applies frequency-based ACMG criteria (BA1/BS1) when assigning many benign labels, so allele frequency is partly upstream of the label. Allele frequency alone reaching AUROC 0.92 but only AUPRC 0.77 is the signature of this leakage (it separates the easy bulk but ranks pathogenicity poorly).
- REVEL circularity (type-2). REVEL was trained on disease/neutral sets that overlap ClinVar's ascertainment, so evaluating it on ClinVar is inherently optimistic.
The REVEL-only result (AUROC 0.96, AUPRC 0.94) better reflects genuine predictive content. Reported numbers should be read with these caveats.
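The gap between AUROC and AUPRC discussed in these caveats is easier to see on a small example. The sketch below computes both metrics from scratch on invented scores. The data and helper names are assumptions for illustration, and `average_precision` assumes there are no tied scores. Most negatives are easy to separate, which keeps AUROC high, but one negative outranks every positive, which pulls average precision down.

```python
def auroc(labels, scores):
    """Probability that a random positive outscores a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Mean of the precision values at the rank of each positive."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / hits


# Toy data: 2 positives, 8 negatives; one negative has the top score.
labels = [0, 1, 1, 0, 0, 0, 0, 0, 0, 0]
scores = [0.95, 0.90, 0.85, 0.40, 0.35, 0.30, 0.25, 0.20, 0.15, 0.10]

print("AUROC:", auroc(labels, scores))
print("AUPRC (average precision):", average_precision(labels, scores))
```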
| File | Purpose |
|---|---|
| `variant_classifier.py` | train / evaluate (gene-grouped CV) / calibrate / save / predict |
| `build_from_clinvar_revel.py` | ClinVar (labels+AF) + REVEL (score) → training TSV (recommended, free) |
| `build_dataset.py` | ClinVar/dbNSFP → training TSV (full multi-score feature set) |
| `load_kaggle_clinvar.py` | alternative loader: Kaggle ClinVar VEP features + ClinVar labels |
| `DATASET_GUIDE.md` | data download & build recipes |
This repo ships code only. The data sources have their own terms:
REVEL (free, non-commercial), AlphaMissense (CC BY-NC-SA), dbNSFP
(academic / non-commercial; some component scores restricted). Download them
yourself; do not redistribute them here. The synthetic --mode demo lets
anyone run the pipeline without any external data.
REVEL (Ioannidis 2016) · ClinPred (Alirezaie 2018) · CADD (Rentzsch 2019) · AlphaMissense (Cheng 2023) · EVE (Frazer 2021) · ESM1b (Brandes 2023) · Pejaver (2022) · Tavtigian (2018) · Grimm circularity (2015) · dbNSFP (Liu 2020).
MIT (code). See LICENSE.