Variant Pathogenicity Classifier

A calibrated machine-learning classifier for the pathogenicity of missense variants. It ensembles established in-silico predictors, validates with gene-grouped cross-validation, calibrates its probabilities, and maps them to ACMG/AMP computational evidence strengths (PP3 / BP4). It trains, evaluates, and classifies new variants.

What it does

Input: a missense variant's in-silico features (e.g. REVEL score and population allele frequency; the design also supports CADD, AlphaMissense, conservation, SIFT/PolyPhen, etc.).
Output: a calibrated probability of pathogenicity, a Pathogenic/Benign call, and an ACMG computational-evidence tier (PP3_Supporting … PP3_VeryStrong / BP4_Supporting … BP4_Strong).
Scope: missense variants. It supplies the computational evidence component, which is one input to a full clinical ACMG/AMP classification, not a standalone clinical verdict.

Methodology

Model: gradient-boosted trees (LightGBM; scikit-learn HistGradientBoostingClassifier fallback). Both handle the heavy missingness of in-silico scores natively. Monotone constraints enforce sensible directionality.
Anti-circularity (Grimm et al., Hum Mutat 2015): gene-grouped CV (GroupKFold) defeats type-1 circularity; --exclude-meta (primary features only) probes type-2 circularity.
Calibration: isotonic regression on out-of-fold predictions → trustworthy posteriors.
Clinical mapping: posteriors → local likelihood ratios → ACMG evidence tiers (Tavtigian/Pejaver).
Interpretability: SHAP (falls back to permutation importance).
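The clinical-mapping step can be sketched as below: a calibrated posterior is converted to a likelihood ratio (posterior odds divided by prior odds) and then bucketed into an evidence tier. This is a minimal illustration, not the repository's code. The prior of 0.10 and the cut-offs (2.08, 4.33, 18.7, 350) are the odds-of-pathogenicity values from Tavtigian et al. (2018) and are used here as assumptions; the thresholds in `variant_classifier.py` may differ. The intermediate `_Moderate` labels and the `Indeterminate` label are hypothetical names chosen for the example.

```python
# Assumed prior and cut-offs (Tavtigian et al. 2018); the repository may use others.
PRIOR = 0.10

# (minimum likelihood ratio, tier label), strongest first.
PATHOGENIC_TIERS = [
    (350.0, "PP3_VeryStrong"),
    (18.7, "PP3_Strong"),
    (4.33, "PP3_Moderate"),      # hypothetical label name
    (2.08, "PP3_Supporting"),
]

# (maximum likelihood ratio, tier label), strongest first.
BENIGN_TIERS = [
    (1 / 18.7, "BP4_Strong"),
    (1 / 4.33, "BP4_Moderate"),  # hypothetical label name
    (1 / 2.08, "BP4_Supporting"),
]


def likelihood_ratio(posterior, prior=PRIOR):
    """Posterior odds divided by prior odds."""
    eps = 1e-9  # keep the odds finite at posterior 0 or 1
    p = min(max(posterior, eps), 1 - eps)
    return (p / (1 - p)) / (prior / (1 - prior))


def acmg_tier(posterior, prior=PRIOR):
    """Bucket a calibrated posterior into a computational-evidence tier."""
    lr = likelihood_ratio(posterior, prior)
    for threshold, label in PATHOGENIC_TIERS:
        if lr >= threshold:
            return label
    for threshold, label in BENIGN_TIERS:
        if lr <= threshold:
            return label
    return "Indeterminate"       # hypothetical label for the middle band


for p in (0.99, 0.50, 0.10, 0.03, 0.005):
    print(p, round(likelihood_ratio(p), 3), acmg_tier(p))
```

Because the tier depends on the ratio of posterior odds to prior odds, the same posterior maps to a different tier under a different prior, which is why the prior has to be stated alongside any reported tier.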

Install

pip install -r requirements.txt

Quickstart (synthetic data, runs anywhere)

python variant_classifier.py --mode demo

Build a real training set (free, no institutional access)

Labels come from ClinVar (free); features from REVEL (free, no institutional email). See DATASET_GUIDE.md for the full recipe.

# ClinVar VCF (labels + allele frequency) and REVEL CSV (predictor score), then:
python build_from_clinvar_revel.py \
  --clinvar-vcf clinvar.vcf.gz \
  --revel revel_with_transcript_ids \
  --out clinvar_training.tsv \
  --min-stars 2 --drop-empty-genes

For the full multi-score feature set (REVEL, CADD, AlphaMissense, conservation, …), build_dataset.py ingests ClinVar annotated with dbNSFP (requires dbNSFP's institutional-email academic download).

Train, evaluate, and classify

# evaluate (gene-grouped CV) and save the trained model + calibrator
python variant_classifier.py --mode real --data clinvar_training.tsv \
  --save-model variant_model.joblib

# classify NEW variants (a TSV with the model's feature columns)
printf "gene\tREVEL_score\tgnomAD_AF\nBRCA1\t0.95\t0.000001\nCFTR\t0.08\t0.03\n" > new_variants.tsv
python variant_classifier.py --mode predict \
  --load-model variant_model.joblib --data new_variants.tsv --out preds.tsv

preds.tsv gains three columns: pathogenic_prob, predicted_class, acmg_evidence. (To score a genuinely novel variant you must first annotate it with the same features, REVEL score and allele frequency, exactly as the training data was built.)

Results (real ClinVar data)

74,368 ClinVar 2-star missense variants across 7,431 genes (34.5% pathogenic); gene-grouped 5-fold cross-validation.

Model	Features	AUROC	AUPRC
Full	REVEL + allele frequency	0.989	0.981
REVEL only	REVEL	0.959	0.941
Allele frequency only	allele frequency	0.916	0.772

Calibration is good (Brier 0.039 → 0.036 after isotonic).
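A before/after Brier comparison of this kind can be sketched as below. It is an illustration on synthetic scores, not the repository's code: the distorted scores stand in for out-of-fold model outputs, and the squared distortion is an arbitrary assumption. One deliberate difference is that the sketch fits the isotonic calibrator on one half of the data and scores it on the other half, so the "after" figure is not measured on the points used to fit the calibrator; whether the repository does the same is not stated here.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression
from sklearn.metrics import brier_score_loss

rng = np.random.default_rng(0)
n = 20000

# Synthetic stand-in (assumption): true risk per variant, and a raw score that
# ranks variants correctly but is systematically miscalibrated.
true_p = rng.uniform(0.02, 0.98, size=n)
y = rng.binomial(1, true_p)
raw = true_p ** 2  # monotone distortion: ordering preserved, probabilities too low

half = n // 2
fit_raw, fit_y = raw[:half], y[:half]
eval_raw, eval_y = raw[half:], y[half:]

# Isotonic regression learns a non-decreasing map from raw score to probability.
calibrator = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
calibrator.fit(fit_raw, fit_y)
calibrated = calibrator.predict(eval_raw)

brier_before = brier_score_loss(eval_y, eval_raw)
brier_after = brier_score_loss(eval_y, calibrated)
print(f"Brier before={brier_before:.4f}  after={brier_after:.4f}")
```

Isotonic calibration only reshapes the score-to-probability map and leaves the ordering of variants unchanged up to ties, so it can improve the Brier score without improving ranking metrics such as AUROC.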

Allele-frequency label leakage. ClinVar applies frequency-based ACMG criteria (BA1/BS1) when assigning many benign labels, so allele frequency is partly upstream of the label. Allele frequency alone reaching AUROC 0.92 but only AUPRC 0.77 is the signature of this leakage (it separates the easy bulk but ranks pathogenicity poorly).
REVEL circularity (type-2). REVEL was trained on disease/neutral sets that overlap ClinVar's ascertainment, so evaluating it on ClinVar is inherently optimistic.

The REVEL-only result (AUROC 0.96, AUPRC 0.94) better reflects genuine predictive content. Reported numbers should be read with these caveats.

Repository layout

File	Purpose
`variant_classifier.py`	train / evaluate (gene-grouped CV) / calibrate / save / predict
`build_from_clinvar_revel.py`	ClinVar (labels+AF) + REVEL (score) → training TSV (recommended, free)
`build_dataset.py`	ClinVar/dbNSFP → training TSV (full multi-score feature set)
`load_kaggle_clinvar.py`	alternative loader: Kaggle ClinVar VEP features + ClinVar labels
`DATASET_GUIDE.md`	data download & build recipes

Data licensing

This repo ships code only. The data sources have their own terms: REVEL (free, non-commercial), AlphaMissense (CC BY-NC-SA), dbNSFP (academic / non-commercial; some component scores restricted). Download them yourself; do not redistribute them here. The synthetic --mode demo lets anyone run the pipeline without any external data.

References

REVEL (Ioannidis 2016) · ClinPred (Alirezaie 2018) · CADD (Rentzsch 2019) · AlphaMissense (Cheng 2023) · EVE (Frazer 2021) · ESM1b (Brandes 2023) · Pejaver (2022) · Tavtigian (2018) · Grimm circularity (2015) · dbNSFP (Liu 2020).

License

MIT (code). See LICENSE.
