Virtual screening pipeline for kinase inhibitor discovery using LightGBM on sparse molecular fingerprints
Protein kinases are master regulators of cell proliferation and angiogenesis — their dysregulation drives multiple cancer types. This pipeline predicts the inhibitory activity of small molecules against 8 oncologically critical kinases using molecular fingerprints, then applies the trained models to perform virtual screening across 70,249 PubChem compounds to identify novel drug candidates.
Key results: AUC-PRC > 0.80 on membrane-located kinases (EGFR, VEGFR2); identification of the Top 100 high-confidence candidates (probability > 0.95), which converge on Lck as the most druggable target in the screened library.
Kinase inhibitors represent one of the main pillars of modern targeted cancer therapy (imatinib, erlotinib, sorafenib). Their rational design requires identifying molecules capable of binding the ATP pocket and outcompeting the natural substrate — a needle-in-a-haystack problem across chemical space with millions of candidate molecules.
This project addresses two questions:
- Structural modeling: Which molecular substructures (fingerprints) determine binding selectivity for membrane vs. intracellular kinases?
- Virtual discovery: Can a gradient boosting model trained on validated ChEMBL inhibitors generalize to identify novel candidates from an untested PubChem library?
Source: Kaggle — Cancer Inhibitors (Protein Kinase)
Training data derived from ChEMBL (validated inhibitors, IC50 < 10 µM). Virtual screening target: a 70,249-molecule PubChem library of uncharacterized compounds.
| Set | Purpose | Size | Class balance |
|---|---|---|---|
| `[kinase].h5` files | Training / validation per target | 1,000 – 6,000 molecules | Severe imbalance (~1:10 positives) |
| `pubchem_neg...` | Virtual screening candidates | 70,249 molecules | Unknown (searching for false negatives) |
8 kinase targets covered: EGFR (erbB1), erbB2, VEGFR2, SRC, ABL, CDK2, Lck, ALK — spanning membrane receptors and intracellular kinases.
Rather than applying dimensionality reduction (PCA/MCA) that destroys chemical interpretability, this pipeline preserves the full 6,117-dimensional molecular fingerprint space using scipy.sparse matrices, reducing memory usage by 95% relative to dense representation.
Three complementary RDKit fingerprints are computed per molecule:
| Fingerprint type | Dimensions | Biochemical relevance |
|---|---|---|
| Atom Pairs (AP) | 2,039 | Captures topological distances between atom pairs; key for fitting deep binding pockets of intracellular kinases |
| Morgan circular (MG) | 2,039 | Encodes the local atomic environment; models lipophilicity and aromatic rings critical for membrane permeability |
| Topological Torsion (TT) | 2,039 | Encodes sequences of four consecutively bonded atoms; a 2D topological proxy for torsional (conformational) flexibility rather than explicit 3D torsion angles |
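
To make the sparse representation concrete, the following is a minimal sketch that stacks three 2,039-bit blocks into one 6,117-column CSR matrix and compares its footprint with a dense array. The fingerprint bits are random stand-ins rather than real RDKit output, and the batch size, bit density, `float32` dense baseline, and AP | MG | TT column order are all assumptions of this example, so the printed saving will not match the project's 95% figure exactly.

```python
import numpy as np
from scipy import sparse

N_BITS = 2039        # bits per fingerprint type, as in the table above
N_MOLECULES = 1000   # hypothetical batch size
DENSITY = 0.02       # assumed fraction of set bits (illustrative only)

rng = np.random.default_rng(0)


def random_fingerprint_block(n_rows, n_bits, density):
    """Stand-in for one fingerprint family: a random sparse 0/1 matrix."""
    dense = (rng.random((n_rows, n_bits)) < density).astype(np.uint8)
    return sparse.csr_matrix(dense)


ap, mg, tt = (
    random_fingerprint_block(N_MOLECULES, N_BITS, DENSITY) for _ in range(3)
)

# Concatenate the blocks column-wise; the AP | MG | TT order is an assumption.
X = sparse.hstack([ap, mg, tt], format="csr")


def csr_nbytes(m):
    """Bytes held by the three arrays that back a CSR matrix."""
    return m.data.nbytes + m.indices.nbytes + m.indptr.nbytes


dense_bytes = X.shape[0] * X.shape[1] * np.dtype(np.float32).itemsize
sparse_bytes = csr_nbytes(X)
saving = 1 - sparse_bytes / dense_bytes

print(X.shape, f"{saving:.1%} smaller than a dense float32 array")
```

The saving depends almost entirely on how many bits are set per molecule, so it is worth measuring on the real `.h5` data rather than relying on a synthetic density.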
Algorithm: LightGBM (gradient boosting optimized for sparse binary data)
The primary statistical challenge is severe class imbalance (~1:10). Implemented solutions:
- `scale_pos_weight`: dynamic false-negative penalization computed fold-by-fold
- Optuna TPE sampler: Bayesian hyperparameter search, `max_bin=63` adapted for sparse binary inputs
- Target metric: AUC-PRC (Precision-Recall AUC), because accuracy is misleading under severe imbalance in drug screening contexts
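
One way the fold-by-fold weighting could look is sketched below. It assumes `scale_pos_weight` is set to the negative-to-positive ratio of each training fold, which is a common convention but is not confirmed by this README. The labels are synthetic, `fold_scale_pos_weight` is a hypothetical helper, and the parameter dictionary uses LightGBM's documented option names without actually training a model.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold


def fold_scale_pos_weight(y_train):
    """Ratio of negatives to positives in one training fold."""
    n_pos = int(np.sum(y_train == 1))
    n_neg = int(np.sum(y_train == 0))
    if n_pos == 0:
        raise ValueError("training fold contains no positives")
    return n_neg / n_pos


# Synthetic labels at 1:10 positives (illustrative, not project data).
y = np.array([1] * 100 + [0] * 1000)
X_placeholder = np.zeros((len(y), 1))  # features do not affect the split

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_params = []
for train_idx, valid_idx in skf.split(X_placeholder, y):
    spw = fold_scale_pos_weight(y[train_idx])
    # Only max_bin=63 comes from the text above; the other values are
    # assumptions of this sketch.
    fold_params.append({
        "objective": "binary",
        "metric": "average_precision",
        "scale_pos_weight": spw,
        "max_bin": 63,
    })

print([round(p["scale_pos_weight"], 2) for p in fold_params])
```

Computing the weight inside the loop keeps validation labels out of the calculation, so each fold's penalty reflects only the data the model is allowed to see.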
| Common error in public notebooks | Impact | Solution implemented |
|---|---|---|
| Evaluating on training data | Artificially inflated ~99% accuracy | Stratified split + cross-validation |
| MCA to 800 components | RAM crash (>30 min) + black-box | Native sparse handling (trains in seconds) |
| Ignoring class imbalance | Recall as low as 0.29 | scale_pos_weight + F1/AUC-PRC optimization |
ROC curves suggest near-perfect performance for most kinases, but ROC AUC is largely insensitive to class imbalance. PRC curves reveal the true capacity to find inhibitors (signal) without generating excessive false positives (noise), which is why AUC-PRC is the metric favored in industrial screening.
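
The gap between these metrics can be illustrated with two uninformative baselines on synthetic labels at 1:10 positives. This sketch uses scikit-learn's `average_precision_score` as the AUC-PRC estimator, which is an assumption, since the README does not say which estimator the pipeline uses.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    average_precision_score,
    roc_auc_score,
)

# Synthetic labels at 1:10 positives (illustrative only).
y_true = np.array([1] * 100 + [0] * 1000)
prevalence = y_true.mean()

# Baseline 1: call every molecule inactive.
always_negative = np.zeros_like(y_true)
baseline_accuracy = accuracy_score(y_true, always_negative)

# Baseline 2: an uninformative scorer that gives every molecule the same score.
constant_scores = np.full(len(y_true), 0.5)
baseline_ap = average_precision_score(y_true, constant_scores)
baseline_roc = roc_auc_score(y_true, constant_scores)

print(f"accuracy of 'always inactive': {baseline_accuracy:.3f}")
print(f"ROC AUC of constant scorer:    {baseline_roc:.3f}")
print(f"AP of constant scorer:         {baseline_ap:.3f}")
print(f"positive prevalence:           {prevalence:.3f}")
```

A model that finds no inhibitors at all still reaches about 91% accuracy here, while the average precision of an uninformative scorer collapses to the positive prevalence. That prevalence is the floor against which an AUC-PRC above 0.80 should be read.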
Left: ROC curves (AUC > 0.95 for most targets). Right: PRC curves showing true performance adjusted for class imbalance.
Post-hoc stratification by cellular localization reveals that membrane kinases (EGFR, VEGFR2) are significantly more predictable than intracellular targets (CDK2). This is consistent with the more restricted chemical space of membrane-binding inhibitors.
Membrane kinases (red) show systematically higher AUC-PRC and F1-Score than cytoplasmic kinases (blue).
Feature importance analysis (Gain) shows that Morgan fingerprints (MG) dominate predictions for membrane kinases. Biologically, this validates that the model learned to identify lipophilic and aromatic local groups required for transmembrane receptor interaction — not statistical noise.
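
A per-fingerprint breakdown of this kind can be sketched as follows. The code assumes the 6,117 columns are ordered AP | MG | TT and that no columns were dropped beforehand (for example by `VarianceThreshold`), neither of which is confirmed here. The gain vector is synthetic; in practice it might come from LightGBM's `Booster.feature_importance(importance_type="gain")`. `gain_by_block` and `feature_name` are hypothetical helpers.

```python
import numpy as np

N_BITS = 2039
BLOCKS = ("AP", "MG", "TT")  # assumed column order


def gain_by_block(gain, blocks=BLOCKS, n_bits=N_BITS):
    """Share of total gain contributed by each fingerprint block."""
    gain = np.asarray(gain, dtype=float)
    expected = len(blocks) * n_bits
    if gain.shape != (expected,):
        raise ValueError(f"expected {expected} importances, got {gain.shape}")
    per_block = gain.reshape(len(blocks), n_bits).sum(axis=1)
    total = per_block.sum()
    shares = per_block / total if total > 0 else np.zeros(len(blocks))
    return {name: float(share) for name, share in zip(blocks, shares)}


def feature_name(index, blocks=BLOCKS, n_bits=N_BITS):
    """Translate a flat column index into '<block>_<bit>'."""
    block, bit = divmod(index, n_bits)
    return f"{blocks[block]}_{bit}"


# Synthetic gain values, chosen only to exercise the functions.
gain = np.zeros(len(BLOCKS) * N_BITS)
gain[5] = 1.0               # a bit in the AP block
gain[N_BITS + 10] = 6.0     # a bit in the MG block
gain[2 * N_BITS + 3] = 3.0  # a bit in the TT block

shares = gain_by_block(gain)
top_feature = feature_name(int(np.argmax(gain)))
print(shares, top_feature)
```

If low-variance columns are removed before training, the surviving column indices would need to be mapped back to their original positions before applying this offset arithmetic.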
For specific targets like EGFR, the exact fingerprint bits driving inhibition predictions can be mapped to molecular substructures.
Applied to 70,249 uncharacterized PubChem molecules, the pipeline distilled the chemical space to the Top 100 high-confidence candidates (probability > 0.95). The highest-scoring compounds converge on Lck (Lymphocyte-specific protein tyrosine kinase), a clinically relevant target in T-cell Acute Lymphoblastic Leukemia (T-ALL).
| PubChem CID | Predicted target | Confidence | Structural rationale |
|---|---|---|---|
| 68058875 | TPK_LCK | 98.5% | Pyrazolopyrimidine core — privileged scaffold for hinge-region hydrogen bonding in Lck |
| 67593796 | TPK_LCK | 97.2% | Aromatic substituents complement the deep hydrophobic pocket of the Src kinase family |
| 68058868 | TPK_LCK | 96.8% | Compound identified in high-throughput screening (HTS) for SFK family — validates model generalization |
| 58289963 | TPK_LCK | 95.1% | Strict Lipinski Rule-of-5 compliance; robust selectivity profile for in vitro follow-up |
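
The selection step behind a table like this can be sketched as below. It assumes each compound is assigned to the target with the highest predicted probability and is then filtered at 0.95 and ranked, which is one plausible reading of the text rather than a confirmed description of `virtual_screening.py`. The compound IDs and probabilities are made up, `select_candidates` is a hypothetical helper, and every target label other than `TPK_LCK` is a placeholder.

```python
import numpy as np


def select_candidates(cids, proba, targets, threshold=0.95, top_n=100):
    """Return (cid, best_target, probability) for the highest-scoring hits.

    proba has shape (n_compounds, n_targets), one column per kinase model.
    """
    proba = np.asarray(proba, dtype=float)
    best_target = proba.argmax(axis=1)
    best_score = proba.max(axis=1)

    # Keep compounds above the confidence threshold, then rank descending.
    keep = np.flatnonzero(best_score > threshold)
    order = keep[np.argsort(-best_score[keep], kind="stable")][:top_n]

    return [
        (int(cids[i]), targets[best_target[i]], float(best_score[i]))
        for i in order
    ]


# Made-up compound IDs and probabilities, for illustration only.
cids = [101, 102, 103, 104]
targets = ("TPK_LCK", "EGFR")
proba = [
    [0.97, 0.10],
    [0.50, 0.96],
    [0.94, 0.20],  # below threshold, dropped
    [0.99, 0.98],
]

hits = select_candidates(cids, proba, targets)
for cid, target, score in hits:
    print(cid, target, f"{score:.1%}")
```

Because each kinase model is trained independently, the probabilities are not calibrated against each other, so comparing them across targets is itself an assumption worth checking before drawing conclusions about which target dominates.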
The convergence toward Lck is not coincidental — Lck's role in TCR signaling and its lymphoid-restricted expression make it a high-precision target that minimizes off-target toxicity.
The nitrogen-containing heterocyclic cores identified by the model are consistent with experimentally validated kinase inhibitor scaffolds:
- Quinoxaline derivatives are established privileged scaffolds for Src kinase family inhibition, with demonstrated antitumor activity (Al-Marhabi et al., 2015, Molecules)
- 4-Phenoxyquinoline analogs show that substitution at positions 6 and 7 critically determines potency and selectivity for membrane tyrosine kinases (Liu et al., 2014, Bioorganic & Medicinal Chemistry)
The model's preference for Morgan fingerprints (local aromatic environments) maps directly to these experimentally critical structural positions.
- SMILES extraction: Map top `PubChem_CID` hits to SMILES via ChEMBL API for structural visualization
- Scaffold analysis: `MurckoScaffold` (RDKit) to cluster candidates by base scaffold and prioritize non-patented structural families
- ADMET filtering: Pass top candidates through ADMET predictors (pkCSM, SwissADME) before considering synthesis
- Experimental validation: IC50 measurements for top 5–10 candidates against recombinant Lck
CanInh-Model/
├── notebooks/
│ └── analysis_notebook.ipynb # EDA, Plotly visualizations, virtual screening
├── src/
│ ├── config.py # Paths and hyperparameters (single source of truth)
│ ├── audit_dataset.py # Memory-efficient HDF5/CSV auditing
│ ├── data_loading.py # Sparse matrix loading
│ ├── feature_engineering.py # Fingerprint processing + VarianceThreshold
│ ├── lgbm_model.py # LightGBM training + Optuna tuning
│ ├── virtual_screening.py # PubChem prediction engine
│ └── interpretation.py # Feature importance + cellular localization analysis
├── results/
│ ├── models/ # Trained .pkl artifacts (not included)
│ └── plots/ # Performance and validation figures
├── run_pipeline.py # Main CLI orchestrator
├── requirements.txt
└── README.md
Optimized for macOS M1/M2 with Miniforge/Conda:
# 1. Clone and set up environment
git clone https://github.com/CANOLIO/CanInh-Model.git
cd CanInh-Model
conda create -n inhibitors python=3.11
conda activate inhibitors
pip install -r requirements.txt
conda install -c conda-forge rdkit
# 2. Configure data paths
# Edit src/config.py to point to the directory with the .h5 files downloaded from Kaggle.
# 3. Run full training + hyperparameter tuning
python run_pipeline.py --tune
# 4. Train only high-priority membrane targets
python run_pipeline.py --membrane_only

For interactive virtual screening, open notebooks/analysis_notebook.ipynb after training.
- Al-Marhabi, A. R., Abbas, H. S., & Ammar, Y. A. (2015). Synthesis, Characterization and Biological Evaluation of Some Quinoxaline Derivatives. Molecules, 20(11), 19805–19822.
- Liu, Z., et al. (2014). Design, synthesis and biological evaluation of novel 6,7-disubstituted-4-phenoxyquinoline derivatives as c-Met kinase inhibitors. Bioorganic & Medicinal Chemistry, 22(14), 3642–3653.
- Ke, G., et al. (2017). LightGBM: A highly efficient gradient boosting decision tree. NeurIPS.
Fabián Rojas — Biochemist & Computational Biologist · Valdivia, Chile
MIT License — see LICENSE for details.



