AI-Based Drug Side-Effect Prediction Using Molecular Fingerprints and Deep Learning

A deep learning system for predicting adverse drug reactions (ADRs) using Extended Connectivity Fingerprints (ECFP) molecular representations and multi-label classification.

Overview

This project develops a machine learning pipeline to predict multiple side effects for drugs based on their molecular structure. It combines:

Molecular fingerprinting: ECFP-4 representation of drug molecules
Deep learning: Multi-label classification using PyTorch
Drug data enrichment: ChEMBL and PubChem integration for SMILES canonicalization
Web interface: Flask-based drug lookup and prediction UI
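
As a rough illustration of the fingerprinting step, the sketch below canonicalizes a SMILES string and computes a 2048-bit ECFP-4 vector (a Morgan fingerprint of radius 2) with RDKit. The helper name `smiles_to_ecfp4` is hypothetical and not taken from this repository, so the project's own featurization code may differ.

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import rdFingerprintGenerator

# ECFP-4 corresponds to a Morgan fingerprint of radius 2 (diameter 4).
_GENERATOR = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=2048)


def smiles_to_ecfp4(smiles):
    """Return (canonical_smiles, 2048-element 0/1 array), or None if parsing fails."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:  # RDKit returns None for SMILES it cannot parse
        return None
    canonical = Chem.MolToSmiles(mol)  # canonical SMILES is the default output
    bits = _GENERATOR.GetFingerprintAsNumPy(mol)
    return canonical, bits.astype(np.float32)


result = smiles_to_ecfp4("CC(=O)OC1=CC=CC=C1C(=O)O")  # aspirin
```

Returning `None` for unparsable input lets a caller skip or log bad records instead of crashing partway through a batch.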

Features

📊 Multi-label prediction on 300+ side effects
🔬 Molecular fingerprint-based features (ECFP-4)
🧠 PyTorch neural network with optimized thresholds
🌐 Interactive web interface for drug queries
📈 Comprehensive performance metrics and visualizations
🔄 Data processing pipeline for drug SMILES enrichment

Project Structure

.
├── app.py                          # Flask web application
├── model.py                        # Neural network architecture
├── predict.py                      # Inference module
├── drug_lookup.py                  # Drug database queries
├── generate_metrics.py             # Performance evaluation
├── requirements.txt                # Python dependencies
├── checkpoints/                    # Trained model files
│   ├── best_model.pt              # Best performing model
│   ├── final_model.pt             # Final trained model
│   ├── inference_bundle.joblib    # Pre-processed inference data
│   ├── label_names.json           # Side effect label mapping
│   ├── thresholds.npy             # Optimized prediction thresholds
│   └── training_history.json      # Training metrics
├── Data Processing/               # ETL pipeline scripts
│   ├── smiles_from_chembl.py      # Fetch SMILES from ChEMBL
│   ├── pubchem_for_missing.py     # Backfill missing SMILES
│   └── Outliers_removal.py        # Data cleaning
├── static/                         # Web UI assets (CSS, JS)
├── templates/                      # HTML templates
└── plots/                         # Generated visualizations

Installation

Prerequisites

Python 3.8+
pip or conda

Setup

Clone the repository

git clone https://github.com/CharithKalasi/DrugSideEffectPrediction.git
cd DrugSideEffectPrediction

Create a virtual environment (recommended)

# Windows (PowerShell)
python -m venv venv
.\venv\Scripts\Activate.ps1

# Linux/macOS
python -m venv venv
source venv/bin/activate

Install dependencies
```
pip install -r requirements.txt
```
Verify setup
```
python validate_setup.py
```

Quick Start

Web Interface

Launch the interactive drug lookup and prediction interface:

python app.py

Access at http://localhost:5000 in your browser.

Command-Line Predictions

Get predictions for a drug from a Python session or script:

from predict import predict_side_effects

# Predict side effects for a drug
predictions = predict_side_effects("aspirin")
print(predictions)
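
The snippet above runs inside Python. For a true command-line entry point, one option is a thin `argparse` wrapper, sketched here. `run_cli` and `fake_predictor` are hypothetical names; the stub stands in for `predict_side_effects`, whose return format is not documented in this README.

```python
import argparse
import json


def run_cli(argv, predictor):
    """Parse drug names from argv and return {drug_name: prediction}."""
    parser = argparse.ArgumentParser(description="Predict side effects for drugs")
    parser.add_argument("drugs", nargs="+", help="one or more drug names")
    args = parser.parse_args(argv)
    return {name: predictor(name) for name in args.drugs}


def fake_predictor(name):
    """Stand-in for predict.predict_side_effects (assumed to take a drug name)."""
    return {"drug": name}


output = run_cli(["aspirin", "ibuprofen"], fake_predictor)
print(json.dumps(output, indent=2))
```

In the repository, `fake_predictor` would be replaced by `predict_side_effects` and the argument list by `sys.argv[1:]`.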

View Results & Metrics

Generate and display performance metrics:

python generate_metrics.py
python view_results.py

Data Processing Pipeline

The Data Processing/ folder contains scripts for enriching drug SMILES data:

cd "Data Processing"

# 1. Fetch SMILES from ChEMBL
python smiles_from_chembl.py

# 2. Backfill missing SMILES from PubChem
python pubchem_for_missing.py

# 3. Apply data cleaning corrections
python Outliers_removal.py

Inputs/Outputs:

drug_names.csv → drug_smiles.csv → drug_smiles_completed.csv
final_multilabel_ADR_dataset.csv → final_multilabel_ADR_dataset_updated.csv
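
The first two steps amount to a lookup with a fallback: try ChEMBL, then backfill from PubChem when no SMILES is found. A minimal sketch of that control flow follows. The column names, the `source` field, and the stub lookups are assumptions for illustration; the real scripts query the ChEMBL and PubChem services and may structure their output differently.

```python
import csv
import io


def enrich_smiles(names, primary_lookup, fallback_lookup):
    """Return one row per drug, trying the primary source before the fallback."""
    rows = []
    for name in names:
        smiles = primary_lookup(name)
        source = "chembl"
        if not smiles:  # missing from the primary source: backfill
            smiles = fallback_lookup(name)
            source = "pubchem" if smiles else "missing"
        rows.append({"drug_name": name, "smiles": smiles or "", "source": source})
    return rows


def rows_to_csv(rows):
    """Serialize rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["drug_name", "smiles", "source"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Stub lookups standing in for real ChEMBL / PubChem queries.
chembl_stub = {"aspirin": "CC(=O)Oc1ccccc1C(=O)O"}.get
pubchem_stub = {"ibuprofen": "CC(C)Cc1ccc(cc1)C(C)C(=O)O"}.get

rows = enrich_smiles(["aspirin", "ibuprofen", "unknown_drug"], chembl_stub, pubchem_stub)
print(rows_to_csv(rows))
```

Recording where each SMILES came from makes it easier to audit records that remain unresolved after both passes.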

Model Architecture

The neural network for multi-label classification:

Input: ECFP-4 fingerprints (2048 features)
Architecture: 3-layer fully connected network with batch normalization
Output: 300+ binary predictions (one per side effect)
Optimization: Binary cross-entropy loss with label smoothing

For details, see model.py.
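
For orientation, a network matching the description above could look like the following PyTorch sketch. The hidden sizes, dropout rate, label count of 300, smoothing value, and the names `SideEffectNet` and `smoothed_bce_loss` are all assumptions; `model.py` is the authoritative definition.

```python
import torch
from torch import nn


class SideEffectNet(nn.Module):
    """Three fully connected layers with batch norm; one logit per side effect."""

    def __init__(self, n_bits=2048, n_labels=300, hidden=(1024, 512), dropout=0.3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_bits, hidden[0]),
            nn.BatchNorm1d(hidden[0]),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden[0], hidden[1]),
            nn.BatchNorm1d(hidden[1]),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden[1], n_labels),
        )

    def forward(self, x):
        return self.net(x)  # raw logits; apply sigmoid to get probabilities


def smoothed_bce_loss(logits, targets, smoothing=0.05):
    """Binary cross-entropy on logits, with targets pulled toward 0.5."""
    soft_targets = targets * (1.0 - smoothing) + 0.5 * smoothing
    return nn.functional.binary_cross_entropy_with_logits(logits, soft_targets)


model = SideEffectNet()
fingerprints = torch.randint(0, 2, (8, 2048)).float()  # fake batch of 8 drugs
labels = torch.randint(0, 2, (8, 300)).float()         # fake multi-label targets
loss = smoothed_bce_loss(model(fingerprints), labels)
```

Emitting logits and applying the sigmoid inside the loss is numerically more stable than calling `sigmoid` followed by a plain BCE.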

Dependencies

| Package | Version | Purpose |
|---|---|---|
| `torch` | ≥2.0.0 | Deep learning framework |
| `pandas` | ≥2.0.0 | Data manipulation |
| `numpy` | ≥1.24.0 | Numerical computing |
| `scikit-learn` | ≥1.3.0 | ML utilities & metrics |
| `rdkit` | ≥2023.3.1 | Molecular fingerprinting |
| `flask` | (implicit) | Web framework |
| `tqdm` | ≥4.65.0 | Progress bars |
| `joblib` | ≥1.3.0 | Serialization |
| `matplotlib` | ≥3.7.0 | Visualization |
| `seaborn` | ≥0.12.0 | Statistical plots |

Performance Metrics

The model is evaluated as a multi-label classifier across 300+ side effects. Headline figures are not reproduced in this README; see the generated reports:

Detailed metrics available in plots/detailed_metrics.txt
Confusion matrices and performance plots in plots/
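
Multi-label results are commonly summarized with micro- and macro-averaged F1. The sketch below computes both from 0/1 matrices in plain Python to show what the two averages mean. It is not a copy of `generate_metrics.py`, whose exact metric set is not documented here, and treating an undefined F1 as 0.0 is an assumed convention.

```python
def f1(tp, fp, fn):
    """F1 from raw counts; defined as 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def multilabel_f1(y_true, y_pred):
    """Return (micro_f1, macro_f1) for equally shaped 0/1 matrices (rows = drugs)."""
    n_labels = len(y_true[0])
    counts = [[0, 0, 0] for _ in range(n_labels)]  # [tp, fp, fn] per label
    for true_row, pred_row in zip(y_true, y_pred):
        for j, (t, p) in enumerate(zip(true_row, pred_row)):
            if t and p:
                counts[j][0] += 1
            elif p:
                counts[j][1] += 1
            elif t:
                counts[j][2] += 1
    macro = sum(f1(*c) for c in counts) / n_labels           # mean of per-label F1
    micro = f1(*[sum(c[i] for c in counts) for i in range(3)])  # F1 of pooled counts
    return micro, macro


y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0]]
y_pred = [[1, 0, 0], [0, 1, 0], [1, 0, 0]]
micro, macro = multilabel_f1(y_true, y_pred)
```

Macro F1 weights every side effect equally, so rare labels that are never predicted pull it down more than they affect micro F1.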

Usage Examples

Example 1: Predict side effects for a drug

python app.py
# Navigate to http://localhost:5000
# Enter drug name to get predictions

Example 2: Batch predictions

from predict import predict_side_effects

drugs = ["aspirin", "ibuprofen", "acetaminophen"]
for drug in drugs:
    results = predict_side_effects(drug)
    print(f"{drug}: {results['top_side_effects']}")

Example 3: Export results

python generate_metrics.py
# Metrics exported to plots/detailed_metrics_table.csv

Configuration

Key parameters in model.py:

BATCH_SIZE: Training batch size (default: 32)
LEARNING_RATE: Optimizer learning rate (default: 0.001)
THRESHOLD: Prediction threshold (optimized per label, stored in checkpoints/thresholds.npy)
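
Per-label thresholds turn sigmoid probabilities into yes/no predictions one label at a time. The sketch below assumes `thresholds.npy` holds a 1-D array with one entry per label, in the same order as `label_names.json`, and that a probability equal to its threshold counts as positive. Both are assumptions, and the label names are made up for illustration.

```python
import numpy as np


def apply_thresholds(probabilities, thresholds, label_names):
    """Return, per drug, the labels whose probability meets that label's threshold."""
    probabilities = np.atleast_2d(probabilities)
    hits = probabilities >= thresholds  # thresholds broadcast across rows
    return [[label_names[j] for j in np.flatnonzero(row)] for row in hits]


label_names = ["nausea", "headache", "rash"]  # illustrative labels only
# In the project this would come from np.load("checkpoints/thresholds.npy").
thresholds = np.array([0.5, 0.3, 0.7])
probabilities = np.array([[0.6, 0.2, 0.9], [0.4, 0.35, 0.1]])
predicted = apply_thresholds(probabilities, thresholds, label_names)
```

A single global cutoff such as 0.5 tends to under-predict rare side effects, which is the usual motivation for tuning one threshold per label.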

Troubleshooting

Issue: RDKit import error

Solution: Ensure rdkit is installed: pip install rdkit

Issue: Model checkpoint not found

Solution: Verify checkpoint files exist in checkpoints/ directory

Issue: Web app connection refused

Solution: Check if port 5000 is available or modify Flask port in app.py
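
To check whether port 5000 is already taken before launching the app, a small standard-library probe such as the one below can help. The function names and the candidate port list are hypothetical and not part of this repository.

```python
import socket


def port_is_free(port, host="127.0.0.1"):
    """Return True if a TCP socket can be bound to (host, port) right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError:  # typically "address already in use"
            return False
    return True


def first_free_port(candidates=(5000, 5001, 8000, 8080)):
    """Return the first bindable port from candidates, or None if all are taken."""
    return next((port for port in candidates if port_is_free(port)), None)


print(first_free_port())
```

If `app.py` starts the server through Flask's `app.run`, the chosen value can be passed as its `port` argument.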

Contributing

Contributions welcome! Please:

Fork the repository
Create a feature branch
Submit a pull request with a clear description

License

This project is open source. See LICENSE file for details.

Citation

If you use this project in research, please cite:

@software{DrugSideEffectPrediction2024,
  title={AI-Based Drug Side-Effect Prediction Using Molecular Fingerprints and Deep Learning},
  author={Kalasi, Charith},
  year={2024},
  url={https://github.com/CharithKalasi/DrugSideEffectPrediction}
}

References

RDKit: Molecular fingerprinting - https://www.rdkit.org/
ChEMBL: Drug database - https://www.ebi.ac.uk/chembl/
PubChem: Chemical compound database - https://pubchem.ncbi.nlm.nih.gov/
PyTorch: Deep learning framework - https://pytorch.org/

Contact

For questions or issues, please open a GitHub issue or contact the maintainer.

Last Updated: December 2024
Status: Active Development
