chumawinnie/variant-pathogenicity-classifier

Variant Pathogenicity Classifier

A calibrated machine-learning classifier for the pathogenicity of missense variants. It ensembles established in-silico predictors, validates with gene-grouped cross-validation, calibrates its probabilities, and maps them to ACMG/AMP computational evidence strengths (PP3 / BP4). It trains, evaluates, and classifies new variants.


What it does

  • Input: a missense variant's in-silico features (e.g. REVEL score and population allele frequency; the design also supports CADD, AlphaMissense, conservation, SIFT/PolyPhen, etc.).
  • Output: a calibrated probability of pathogenicity, a Pathogenic/Benign call, and an ACMG computational-evidence tier (PP3_Supporting … PP3_VeryStrong / BP4_Supporting … BP4_Strong).
  • Scope: missense variants. It supplies the computational-evidence component, which is one input to a full clinical ACMG/AMP classification, not a standalone clinical verdict.
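
The step from a calibrated probability to an evidence tier can be sketched as follows. This is a minimal illustration, not the repository's implementation. It assumes the Tavtigian et al. (2018) odds scale (very strong = 350, each weaker level the square root of the one above) and a prior of 0.10. The intermediate tier names and the `Indeterminate` label are hypothetical. It also converts the posterior to a likelihood ratio directly instead of estimating local ratios in score windows.

```python
# Odds-of-pathogenicity thresholds on the Tavtigian et al. (2018) scale.
O_VERY_STRONG = 350.0
LR_THRESHOLDS = {
    "Supporting": O_VERY_STRONG ** (1 / 8),  # about 2.08
    "Moderate": O_VERY_STRONG ** (1 / 4),    # about 4.33
    "Strong": O_VERY_STRONG ** (1 / 2),      # about 18.7
    "VeryStrong": O_VERY_STRONG,             # 350
}


def likelihood_ratio(posterior: float, prior: float) -> float:
    """Posterior odds divided by prior odds."""
    eps = 1e-12  # keep the odds finite at posterior 0 or 1
    posterior = min(max(posterior, eps), 1 - eps)
    return (posterior / (1 - posterior)) / (prior / (1 - prior))


def acmg_tier(posterior: float, prior: float = 0.10) -> str:
    """Map a calibrated posterior to a PP3/BP4 tier (tier names are assumptions)."""
    lr = likelihood_ratio(posterior, prior)
    # Pathogenic side: return the strongest tier whose threshold is met.
    for name in ("VeryStrong", "Strong", "Moderate", "Supporting"):
        if lr >= LR_THRESHOLDS[name]:
            return f"PP3_{name}"
    # Benign side: thresholds are the reciprocals; capped at Strong as in the README.
    for name in ("Strong", "Moderate", "Supporting"):
        if lr <= 1 / LR_THRESHOLDS[name]:
            return f"BP4_{name}"
    return "Indeterminate"


for p in (0.99, 0.50, 0.10, 0.03, 0.001):
    print(p, acmg_tier(p))
```

The tier depends on the prior as much as on the posterior, so a different assumed prior shifts every boundary.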

Methodology

  • Model: gradient-boosted trees (LightGBM, with scikit-learn's HistGradientBoostingClassifier as a fallback). Both handle the heavy missingness of in-silico scores natively. Monotone constraints enforce a sensible direction of effect (for example, a higher predictor score should never lower the predicted probability).
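  • Anti-circularity (Grimm et al., Hum Mutat 2015): gene-grouped CV (GroupKFold) keeps all variants of a gene in the same fold, which addresses type 2 circularity (different variants from the same gene appearing in both training and evaluation data); --exclude-meta (primary features only) probes type 1 circularity (meta-predictors trained on variants that overlap the evaluation set).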
  • Calibration: isotonic regression on out-of-fold predictions → trustworthy posteriors.
  • Clinical mapping: posteriors → local likelihood ratios → ACMG evidence tiers (Tavtigian/Pejaver).
  • Interpretability: SHAP (falls back to permutation importance).
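
The guarantee that gene-grouped cross-validation provides can be illustrated without any dependencies. The sketch below is not how the repository or scikit-learn assigns folds; `gene_grouped_folds` is a hypothetical helper that places genes largest-first into the currently smallest fold, and the gene list is made up for the example.

```python
from collections import defaultdict


def gene_grouped_folds(genes, n_splits=5):
    """Assign each variant a fold index so that all variants of a gene share one fold."""
    by_gene = defaultdict(list)
    for idx, gene in enumerate(genes):
        by_gene[gene].append(idx)

    fold_sizes = [0] * n_splits
    fold_of = [None] * len(genes)
    # Largest genes first (ties broken by name) keeps fold sizes roughly balanced.
    for gene, idxs in sorted(by_gene.items(), key=lambda kv: (-len(kv[1]), kv[0])):
        target = fold_sizes.index(min(fold_sizes))
        for i in idxs:
            fold_of[i] = target
        fold_sizes[target] += len(idxs)
    return fold_of


genes = ["BRCA1", "BRCA1", "BRCA1", "CFTR", "CFTR", "TP53", "MLH1", "TTN"]
folds = gene_grouped_folds(genes, n_splits=3)

# No gene ever appears on both sides of a train/test split.
for k in range(3):
    test_genes = {g for g, f in zip(genes, folds) if f == k}
    train_genes = {g for g, f in zip(genes, folds) if f != k}
    assert not (test_genes & train_genes)
print(folds)
```

With scikit-learn, `GroupKFold(n_splits=5).split(X, y, groups=genes)` yields train/test index pairs with the same property.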

Install

pip install -r requirements.txt

Quickstart (synthetic data, runs anywhere)

python variant_classifier.py --mode demo

Build a real training set (free, no institutional access)

Labels come from ClinVar (free); features from REVEL (free, no institutional email). See DATASET_GUIDE.md for the full recipe.

# ClinVar VCF (labels + allele frequency) and REVEL CSV (predictor score), then:
python build_from_clinvar_revel.py \
  --clinvar-vcf clinvar.vcf.gz \
  --revel revel_with_transcript_ids \
  --out clinvar_training.tsv \
  --min-stars 2 --drop-empty-genes
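
As background for the `--min-stars` filter, the sketch below maps ClinVar review-status strings, as they appear in the `CLNREVSTAT` INFO field of the ClinVar VCF, to star ratings. How build_from_clinvar_revel.py actually parses the VCF is not shown here, so the helper names `clinvar_stars` and `passes_min_stars` are hypothetical.

```python
# CLNREVSTAT values mapped to ClinVar's review-status stars.
REVIEW_STARS = {
    "practice_guideline": 4,
    "reviewed_by_expert_panel": 3,
    "criteria_provided,_multiple_submitters,_no_conflicts": 2,
    "criteria_provided,_single_submitter": 1,
    "criteria_provided,_conflicting_classifications": 1,
    "criteria_provided,_conflicting_interpretations": 1,  # wording in older releases
    "no_assertion_criteria_provided": 0,
    "no_classification_provided": 0,
}


def clinvar_stars(info: str) -> int:
    """Return the star rating encoded in a VCF INFO string (0 if absent or unknown)."""
    for field in info.split(";"):
        key, _, value = field.partition("=")
        if key == "CLNREVSTAT":
            return REVIEW_STARS.get(value, 0)
    return 0


def passes_min_stars(info: str, min_stars: int = 2) -> bool:
    """True if the record's review status has at least `min_stars` stars."""
    return clinvar_stars(info) >= min_stars


record = "ALLELEID=1;CLNREVSTAT=criteria_provided,_multiple_submitters,_no_conflicts;CLNSIG=Pathogenic"
print(clinvar_stars(record), passes_min_stars(record))
```

A threshold of two stars keeps only records with multiple concordant submitters, expert-panel review, or practice-guideline status.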

For the full multi-score feature set (REVEL, CADD, AlphaMissense, conservation, …), build_dataset.py ingests ClinVar annotated with dbNSFP (requires dbNSFP's institutional-email academic download).

Train, evaluate, and classify

# evaluate (gene-grouped CV) and save the trained model + calibrator
python variant_classifier.py --mode real --data clinvar_training.tsv \
  --save-model variant_model.joblib

# classify NEW variants (a TSV with the model's feature columns)
printf "gene\tREVEL_score\tgnomAD_AF\nBRCA1\t0.95\t0.000001\nCFTR\t0.08\t0.03\n" > new_variants.tsv
python variant_classifier.py --mode predict \
  --load-model variant_model.joblib --data new_variants.tsv --out preds.tsv

preds.tsv gains three columns: pathogenic_prob, predicted_class, acmg_evidence. To score a genuinely novel variant, you must first annotate it with the same features (REVEL score and allele frequency), exactly as the training data was built.
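
Before running predict mode, it can help to sanity-check the input file. The sketch below is a hypothetical pre-flight check, not part of the repository. It assumes the two feature columns from the new_variants.tsv example, treats an empty field, `.`, or `NA` as missing (an assumed convention), and relies on the fact that both REVEL scores and allele frequencies lie in [0, 1].

```python
import csv
import io

# Feature columns taken from the new_variants.tsv example; a model trained on
# the multi-score feature set would expect more (assumption).
REQUIRED_COLUMNS = ("REVEL_score", "gnomAD_AF")
MISSING_TOKENS = ("", ".", "NA")  # assumed missing-value conventions


def check_variant_tsv(handle, required=REQUIRED_COLUMNS):
    """Return a list of problems found in a tab-separated variant file."""
    reader = csv.DictReader(handle, delimiter="\t")
    header = reader.fieldnames or []
    missing = [c for c in required if c not in header]
    if missing:
        return [f"missing column(s): {', '.join(missing)}"]

    problems = []
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        for col in required:
            value = (row[col] or "").strip()
            if value in MISSING_TOKENS:
                continue  # left for the model's native missing-value handling
            try:
                x = float(value)
            except ValueError:
                problems.append(f"line {line_no}: {col}={value!r} is not numeric")
                continue
            if not 0.0 <= x <= 1.0:
                problems.append(f"line {line_no}: {col}={x} outside [0, 1]")
    return problems


sample = "gene\tREVEL_score\tgnomAD_AF\nBRCA1\t0.95\t0.000001\nCFTR\t1.8\t0.03\n"
print(check_variant_tsv(io.StringIO(sample)))
```

For a file on disk, pass an open handle instead, for example `check_variant_tsv(open("new_variants.tsv", newline=""))`.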

Results (real ClinVar data)

74,368 ClinVar missense variants with a review status of at least two stars, across 7,431 genes (34.5% pathogenic), evaluated with gene-grouped 5-fold cross-validation.

| Model | Features | AUROC | AUPRC |
|---|---|---|---|
| Full | REVEL + allele frequency | 0.989 | 0.981 |
| REVEL only | REVEL | 0.959 | 0.941 |
| Allele frequency only | allele frequency | 0.916 | 0.772 |

Calibration is good: the Brier score improves from 0.039 to 0.036 after isotonic calibration.
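
For readers unfamiliar with the metric, the Brier score is the mean squared difference between predicted probability and the 0/1 label, and isotonic calibration fits a non-decreasing map from raw scores to observed label frequencies. The toy sketch below uses a simplified step-function version of the pool-adjacent-violators algorithm on made-up data. It fits and scores on the same points, which flatters the result; the README's pipeline calibrates on out-of-fold predictions instead.

```python
def brier_score(y_true, y_prob):
    """Mean squared error between predicted probabilities and 0/1 labels."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def isotonic_fit(scores, labels):
    """Pool-adjacent-violators: return (upper_score, value) steps, non-decreasing in value."""
    blocks = []  # each block is [sum_of_labels, count, largest_score_in_block]
    for s, y in sorted(zip(scores, labels)):
        blocks.append([float(y), 1, s])
        # Merge backwards while the previous block's mean exceeds the latest one.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            total, count, upper = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
            blocks[-1][2] = upper
    return [(upper, total / count) for total, count, upper in blocks]


def isotonic_predict(steps, score):
    """Look up the calibrated value for a raw score (clipped at the top step)."""
    for upper, value in steps:
        if score <= upper:
            return value
    return steps[-1][1]


# Made-up raw scores and labels, for illustration only.
raw = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
y = [0, 0, 1, 0, 1, 0, 1, 1]

steps = isotonic_fit(raw, y)
calibrated = [isotonic_predict(steps, s) for s in raw]
before = brier_score(y, raw)
after = brier_score(y, calibrated)
print(round(before, 3), round(after, 3))
```

scikit-learn's `IsotonicRegression` interpolates between fitted points rather than returning steps, so its outputs differ slightly from this sketch.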

Two caveats apply to these numbers:

  • Allele-frequency label leakage. ClinVar submitters apply frequency-based ACMG criteria (BA1/BS1) when assigning many benign labels, so allele frequency is partly upstream of the label. Allele frequency alone reaches AUROC 0.92 but only AUPRC 0.77, which is consistent with this leakage: it separates the easy bulk of common benign variants but ranks pathogenic variants poorly.
  • REVEL circularity (type 1). REVEL was trained on disease/neutral sets that overlap ClinVar's ascertainment, so evaluating it on ClinVar is inherently optimistic.

The REVEL-only result (AUROC 0.96, AUPRC 0.94) better reflects genuine predictive content. Reported numbers should be read with these caveats.
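
The gap between AUROC and AUPRC discussed above can be reproduced on a toy ranking. The sketch below uses made-up labels and scores, computes AUROC as the fraction of positive/negative pairs ordered correctly, and computes AUPRC as average precision; whether the README's AUPRC uses this estimator is an assumption.

```python
def auroc(y_true, scores):
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(y_true, scores):
    """Mean of precision@k over the ranks k at which a positive is retrieved."""
    ranked = sorted(zip(scores, y_true), key=lambda t: -t[0])
    hits, total = 0, 0.0
    for k, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / k
    return total / hits


# Made-up example: one negative outranks both positives, while the remaining
# seven negatives sit far below them.
y = [0, 1, 1, 0, 0, 0, 0, 0, 0, 0]
s = [0.90, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30, 0.20, 0.10, 0.05]

print(round(auroc(y, s), 3), round(average_precision(y, s), 3))
```

AUROC gives credit for every easy negative pushed to the bottom of the ranking, while average precision is dominated by what sits at the top, so a feature that mostly removes easy negatives can score well on the first and poorly on the second.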

Repository layout

| File | Purpose |
|---|---|
| variant_classifier.py | train / evaluate (gene-grouped CV) / calibrate / save / predict |
| build_from_clinvar_revel.py | ClinVar (labels + AF) + REVEL (score) → training TSV (recommended, free) |
| build_dataset.py | ClinVar/dbNSFP → training TSV (full multi-score feature set) |
| load_kaggle_clinvar.py | alternative loader: Kaggle ClinVar VEP features + ClinVar labels |
| DATASET_GUIDE.md | data download & build recipes |

Data licensing

This repo ships code only. The data sources have their own terms: REVEL (free, non-commercial), AlphaMissense (CC BY-NC-SA), and dbNSFP (academic / non-commercial; some component scores restricted). Download them yourself; do not redistribute them here. The synthetic --mode demo lets anyone run the pipeline without any external data.

References

REVEL (Ioannidis 2016) · ClinPred (Alirezaie 2018) · CADD (Rentzsch 2019) · AlphaMissense (Cheng 2023) · EVE (Frazer 2021) · ESM1b (Brandes 2023) · Pejaver (2022) · Tavtigian (2018) · Grimm circularity (2015) · dbNSFP (Liu 2020).

License

MIT (code). See LICENSE.
