A deep learning system for predicting adverse drug reactions (ADRs) using Extended Connectivity Fingerprints (ECFP) molecular representations and multi-label classification.
This project develops a machine learning pipeline to predict multiple side effects for drugs based on their molecular structure. It combines:
- Molecular fingerprinting: ECFP-4 representation of drug molecules
- Deep learning: Multi-label classification using PyTorch
- Drug data enrichment: ChEMBL and PubChem integration for SMILES canonicalization
- Web interface: Flask-based drug lookup and prediction UI
- Multi-label prediction on 300+ side effects
- Molecular fingerprint-based features (ECFP-4)
- PyTorch neural network with optimized thresholds
- Interactive web interface for drug queries
- Comprehensive performance metrics and visualizations
- Data processing pipeline for drug SMILES enrichment
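The ECFP-4 features listed above can be computed with RDKit roughly as follows. This is a minimal sketch rather than the project's own code: the helper name `ecfp4_bits` is hypothetical, and it assumes that ECFP-4 here means a Morgan fingerprint of radius 2 folded to 2048 bits, which matches the input size this README gives for the model.

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import rdFingerprintGenerator


def ecfp4_bits(smiles: str, n_bits: int = 2048) -> np.ndarray:
    """Return a 0/1 vector of length n_bits for one SMILES string (hypothetical helper)."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Could not parse SMILES: {smiles!r}")
    # Radius 2 corresponds to a diameter of 4 bonds, i.e. ECFP-4.
    generator = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=n_bits)
    return generator.GetFingerprintAsNumPy(mol)


# Aspirin, written as a SMILES string.
aspirin = ecfp4_bits("CC(=O)OC1=CC=CC=C1C(=O)O")
print(aspirin.shape, int(aspirin.sum()))
```

Each drug becomes one fixed-length bit vector, so molecules of any size can be fed to the same network input layer.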
    .
    ├── app.py                        # Flask web application
    ├── model.py                      # Neural network architecture
    ├── predict.py                    # Inference module
    ├── drug_lookup.py                # Drug database queries
    ├── generate_metrics.py           # Performance evaluation
    ├── requirements.txt              # Python dependencies
    ├── checkpoints/                  # Trained model files
    │   ├── best_model.pt             # Best performing model
    │   ├── final_model.pt            # Final trained model
    │   ├── inference_bundle.joblib   # Pre-processed inference data
    │   ├── label_names.json          # Side effect label mapping
    │   ├── thresholds.npy            # Optimized prediction thresholds
    │   └── training_history.json     # Training metrics
    ├── Data Processing/              # ETL pipeline scripts
    │   ├── smiles_from_chembl.py     # Fetch SMILES from ChEMBL
    │   ├── pubchem_for_missing.py    # Backfill missing SMILES
    │   └── Outliers_removal.py       # Data cleaning
    ├── static/                       # Web UI assets (CSS, JS)
    ├── templates/                    # HTML templates
    └── plots/                        # Generated visualizations

- Python 3.8+
- pip or conda

- Clone the repository

      git clone https://github.com/CharithKalasi/DrugSideEffectPrediction.git
      cd DrugSideEffectPrediction

- Create a virtual environment (recommended)

      # Windows (PowerShell)
      python -m venv venv
      .\venv\Scripts\Activate.ps1

      # Linux/macOS
      python -m venv venv
      source venv/bin/activate

- Install dependencies

      pip install -r requirements.txt

- Verify setup

      python validate_setup.py

Launch the interactive drug lookup and prediction interface:

    python app.py

Access at http://localhost:5000 in your browser.

Get predictions for a drug:

    from predict import predict_side_effects

    # Predict side effects for a drug
    predictions = predict_side_effects("aspirin")
    print(predictions)

Generate and display performance metrics:

    python generate_metrics.py
    python view_results.py

The Data Processing/ folder contains scripts for enriching drug SMILES data:

    cd "Data Processing"

    # 1. Fetch SMILES from ChEMBL
    python smiles_from_chembl.py

    # 2. Backfill missing SMILES from PubChem
    python pubchem_for_missing.py

    # 3. Apply data cleaning corrections
    python Outliers_removal.py

Inputs/Outputs:
- drug_names.csv → drug_smiles.csv → drug_smiles_completed.csv
- final_multilabel_ADR_dataset.csv → final_multilabel_ADR_dataset_updated.csv
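
The backfill step can be pictured as filling empty SMILES cells from a second source. The sketch below is illustrative only and does not reproduce pubchem_for_missing.py: the column names `drug_name` and `smiles`, the helper `backfill_smiles`, and the in-memory lookup dictionary are assumptions, and the real script presumably queries PubChem instead of a dictionary.

```python
import csv
import io


def backfill_smiles(rows, fallback):
    """Fill empty 'smiles' cells from a fallback {drug_name: smiles} mapping.

    Returns (completed_rows, still_missing_names).
    """
    completed, still_missing = [], []
    for row in rows:
        row = dict(row)  # copy so the caller's rows are not mutated
        if not (row.get("smiles") or "").strip():
            found = fallback.get(row["drug_name"].strip().lower())
            if found:
                row["smiles"] = found
            else:
                still_missing.append(row["drug_name"])
        completed.append(row)
    return completed, still_missing


# Stand-in for drug_smiles.csv after the ChEMBL step (contents are made up).
chembl_csv = io.StringIO(
    "drug_name,smiles\n"
    "aspirin,CC(=O)OC1=CC=CC=C1C(=O)O\n"
    "ibuprofen,\n"
    "unknown_drug,\n"
)
# Stand-in for PubChem lookups (hypothetical; a real run would call the service).
pubchem_lookup = {"ibuprofen": "CC(C)CC1=CC=C(C=C1)C(C)C(=O)O"}

rows, missing = backfill_smiles(csv.DictReader(chembl_csv), pubchem_lookup)
print(missing)
```

Returning the names that are still unresolved makes it easy to review them by hand before the cleaning step.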
The neural network for multi-label classification:
- Input: ECFP-4 fingerprints (2048 features)
- Architecture: 3-layer fully connected network with batch normalization
- Output: 300+ binary predictions (one per side effect)
- Optimization: Binary cross-entropy loss with label smoothing
For details, see model.py.
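
To make the loss concrete, here is a small pure-Python sketch of binary cross-entropy with label smoothing for one drug. It assumes one common smoothing convention, in which a hard target y becomes y(1 - s) + s/2; model.py may use a different variant or smoothing value, and the function name `smoothed_bce` is hypothetical.

```python
import math


def smoothed_bce(probs, targets, smoothing=0.1, eps=1e-7):
    """Mean binary cross-entropy over labels with smoothed 0/1 targets."""
    total = 0.0
    for p, y in zip(probs, targets):
        # Pull hard targets toward 0.5: 1 -> 1 - s/2, 0 -> s/2.
        y_smooth = y * (1.0 - smoothing) + 0.5 * smoothing
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total += -(y_smooth * math.log(p) + (1.0 - y_smooth) * math.log(1.0 - p))
    return total / len(probs)


probs = [0.9, 0.2, 0.6]   # sigmoid outputs for three side-effect labels
targets = [1, 0, 1]       # ground-truth multi-hot vector
print(smoothed_bce(probs, targets, smoothing=0.0))  # plain BCE
print(smoothed_bce(probs, targets, smoothing=0.1))  # smoothed targets
```

Because each label is scored independently, a drug can be positive for any number of side effects at once, which is what distinguishes multi-label from multi-class classification.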
| Package | Version | Purpose |
|---|---|---|
| torch | ≥2.0.0 | Deep learning framework |
| pandas | ≥2.0.0 | Data manipulation |
| numpy | ≥1.24.0 | Numerical computing |
| scikit-learn | ≥1.3.0 | ML utilities & metrics |
| rdkit | ≥2023.3.1 | Molecular fingerprinting |
| flask | (implicit) | Web framework |
| tqdm | ≥4.65.0 | Progress bars |
| joblib | ≥1.3.0 | Serialization |
| matplotlib | ≥3.7.0 | Visualization |
| seaborn | ≥0.12.0 | Statistical plots |
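
A quick way to see which of these packages are present in the active environment is the standard library's `importlib.metadata`. This is only a sketch under the assumption that the distribution names match the table; it is not the content of validate_setup.py, and it reports installed versions without comparing them against the minimums.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


required = ["torch", "pandas", "numpy", "scikit-learn", "rdkit", "flask",
            "tqdm", "joblib", "matplotlib", "seaborn"]
for name, version in installed_versions(required).items():
    print(f"{name:<14}{version or 'NOT INSTALLED'}")
```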
The model is reported to achieve strong multi-label classification performance on 300+ side effects; the supporting numbers are in the following files:
- Detailed metrics are available in plots/detailed_metrics.txt
- Confusion matrices and performance plots are in plots/
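
Multi-label results are commonly summarized with micro- and macro-averaged F1. The sketch below shows how the two differ on a toy example; it is an assumption that these are among the reported metrics, and the function `f1_scores` and the data are made up for illustration.

```python
def f1_scores(y_true, y_pred):
    """Return (micro_f1, macro_f1) for multi-hot rows of equal length."""
    n_labels = len(y_true[0])
    tp = [0] * n_labels
    fp = [0] * n_labels
    fn = [0] * n_labels
    for true_row, pred_row in zip(y_true, y_pred):
        for j, (t, p) in enumerate(zip(true_row, pred_row)):
            tp[j] += int(t == 1 and p == 1)
            fp[j] += int(t == 0 and p == 1)
            fn[j] += int(t == 1 and p == 0)

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    # Micro: pool counts over all labels. Macro: average the per-label F1 values.
    micro = f1(sum(tp), sum(fp), sum(fn))
    macro = sum(f1(a, b, c) for a, b, c in zip(tp, fp, fn)) / n_labels
    return micro, macro


# Three drugs, three side-effect labels (toy data).
y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0]]
y_pred = [[1, 0, 0], [0, 1, 0], [1, 0, 0]]
micro, macro = f1_scores(y_true, y_pred)
print(f"micro-F1={micro:.3f}  macro-F1={macro:.3f}")
```

Macro-F1 weights every side effect equally, so rare labels that are never predicted pull it down much more than they affect micro-F1.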
Look up a drug in the web interface:

    python app.py
    # Navigate to http://localhost:5000
    # Enter drug name to get predictions

Predict side effects for several drugs:

    from predict import predict_side_effects

    drugs = ["aspirin", "ibuprofen", "acetaminophen"]
    for drug in drugs:
        results = predict_side_effects(drug)
        print(f"{drug}: {results['top_side_effects']}")

Export metrics:

    python generate_metrics.py
    # Metrics exported to plots/detailed_metrics_table.csv

Key parameters in model.py:
- BATCH_SIZE: Training batch size (default: 32)
- LEARNING_RATE: Optimizer learning rate (default: 0.001)
- THRESHOLD: Prediction threshold (optimized per label, stored in checkpoints/thresholds.npy)
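
Per-label thresholds mean that each side effect has its own cut-off instead of a global 0.5. A minimal sketch of applying them is shown below; the label names, threshold values, and the helper `apply_thresholds` are invented for illustration, whereas the real values live in checkpoints/label_names.json and checkpoints/thresholds.npy.

```python
def apply_thresholds(probs, thresholds, label_names):
    """Return (label, probability) pairs that meet their own threshold, highest first."""
    if not (len(probs) == len(thresholds) == len(label_names)):
        raise ValueError("probs, thresholds and label_names must have equal length")
    hits = [(name, p) for name, p, t in zip(label_names, probs, thresholds) if p >= t]
    return sorted(hits, key=lambda item: item[1], reverse=True)


label_names = ["nausea", "headache", "rash"]   # hypothetical labels
thresholds = [0.30, 0.55, 0.20]                # hypothetical per-label cut-offs
probs = [0.42, 0.50, 0.25]                     # model outputs for one drug
print(apply_thresholds(probs, thresholds, label_names))
```

In this toy case "headache" is rejected at 0.50 while "rash" is accepted at 0.25, which a single global threshold could not express.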
Issue: RDKit import error
- Solution: Ensure rdkit is installed: pip install rdkit

Issue: Model checkpoint not found
- Solution: Verify that the checkpoint files exist in the checkpoints/ directory

Issue: Web app connection refused
- Solution: Check whether port 5000 is available, or change the Flask port in app.py
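
To check whether port 5000 is already taken, one option is to try binding to it with the standard library. This is a small sketch; the helper name `port_is_free` is hypothetical and the check only covers the local interface given in `host`.

```python
import socket


def port_is_free(port, host="127.0.0.1"):
    """Return True if a TCP socket can be bound to (host, port) right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError:
            return False
    return True


print("Port 5000 free:", port_is_free(5000))
```

If the port is in use and app.py starts the server through Flask's `app.run(...)` (an assumption about how the app is launched), passing a different `port=` value there is the usual way to move it.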
Contributions welcome! Please:
- Fork the repository
- Create a feature branch
- Submit a pull request with a clear description
This project is open source. See LICENSE file for details.
If you use this project in research, please cite:

    @software{DrugSideEffectPrediction2024,
      title={AI-Based Drug Side-Effect Prediction Using Molecular Fingerprints and Deep Learning},
      author={Kalasi, Charith},
      year={2024},
      url={https://github.com/CharithKalasi/DrugSideEffectPrediction}
    }

- RDKit: Molecular fingerprinting - https://www.rdkit.org/
- ChEMBL: Drug database - https://www.ebi.ac.uk/chembl/
- PubChem: Chemical compound database - https://pubchem.ncbi.nlm.nih.gov/
- PyTorch: Deep learning framework - https://pytorch.org/
For questions or issues, please open a GitHub issue or contact the maintainer.
Last Updated: December 2024
Status: Active Development