AI-Based Drug Side-Effect Prediction Using Molecular Fingerprints and Deep Learning

A deep learning system for predicting adverse drug reactions (ADRs) using Extended Connectivity Fingerprints (ECFP) molecular representations and multi-label classification.

Overview

This project develops a machine learning pipeline to predict multiple side effects for drugs based on their molecular structure. It combines:

  • Molecular fingerprinting: ECFP-4 representation of drug molecules
  • Deep learning: Multi-label classification using PyTorch
  • Drug data enrichment: ChEMBL and PubChem integration for SMILES canonicalization
  • Web interface: Flask-based drug lookup and prediction UI
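
As a rough illustration of the fingerprinting step, the sketch below computes an ECFP-4 bit vector with RDKit. ECFP-4 corresponds to a Morgan fingerprint of radius 2. The function name `ecfp4_bits` and the 2048-bit default are choices made for this example; the repository's own featurization code may differ.

```python
from rdkit import Chem
from rdkit.Chem import rdFingerprintGenerator


def ecfp4_bits(smiles, n_bits=2048):
    """Return the ECFP-4 fingerprint of a SMILES string as a list of 0/1 ints.

    Hypothetical helper for illustration; not taken from this repository.
    """
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"could not parse SMILES: {smiles!r}")
    # ECFP-4 has diameter 4, which is Morgan radius 2 in RDKit.
    generator = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=n_bits)
    fingerprint = generator.GetFingerprint(mol)
    return [int(bit) for bit in fingerprint.ToBitString()]


aspirin = "CC(=O)OC1=CC=CC=C1C(=O)O"
bits = ecfp4_bits(aspirin)
print(len(bits), "bits,", sum(bits), "set")
```

The resulting list can be converted to a tensor or array and used as the model input.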

Features

  • 📊 Multi-label prediction on 300+ side effects
  • 🔬 Molecular fingerprint-based features (ECFP-4)
  • 🧠 PyTorch neural network with optimized thresholds
  • 🌐 Interactive web interface for drug queries
  • 📈 Comprehensive performance metrics and visualizations
  • 🔄 Data processing pipeline for drug SMILES enrichment

Project Structure

.
├── app.py                          # Flask web application
├── model.py                        # Neural network architecture
├── predict.py                      # Inference module
├── drug_lookup.py                  # Drug database queries
├── generate_metrics.py             # Performance evaluation
├── requirements.txt                # Python dependencies
├── checkpoints/                    # Trained model files
│   ├── best_model.pt               # Best performing model
│   ├── final_model.pt              # Final trained model
│   ├── inference_bundle.joblib     # Pre-processed inference data
│   ├── label_names.json            # Side effect label mapping
│   ├── thresholds.npy              # Optimized prediction thresholds
│   └── training_history.json       # Training metrics
├── Data Processing/                # ETL pipeline scripts
│   ├── smiles_from_chembl.py       # Fetch SMILES from ChEMBL
│   ├── pubchem_for_missing.py      # Backfill missing SMILES
│   └── Outliers_removal.py         # Data cleaning
├── static/                         # Web UI assets (CSS, JS)
├── templates/                      # HTML templates
└── plots/                          # Generated visualizations

Installation

Prerequisites

  • Python 3.8+
  • pip or conda

Setup

  1. Clone the repository

    git clone https://github.com/CharithKalasi/DrugSideEffectPrediction.git
    cd DrugSideEffectPrediction
  2. Create a virtual environment (recommended)

    # Windows (PowerShell)
    python -m venv venv
    .\venv\Scripts\Activate.ps1
    
    # Linux/macOS
    python -m venv venv
    source venv/bin/activate
  3. Install dependencies

    pip install -r requirements.txt
  4. Verify setup

    python validate_setup.py

Quick Start

Web Interface

Launch the interactive drug lookup and prediction interface:

python app.py

Access at http://localhost:5000 in your browser.

Command-Line Predictions

Get predictions for a drug from a Python session or script:

from predict import predict_side_effects

# Predict side effects for a drug
predictions = predict_side_effects("aspirin")
print(predictions)
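
The snippet above uses the Python API. For shell use, a thin argparse wrapper is one option. The sketch below is hypothetical and not part of the repository: `build_parser` and `run` are names invented here, and the predictor is passed in as an argument so the example runs on its own with a stand-in function.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(
        description="Predict side effects for one or more drugs by name."
    )
    parser.add_argument("drugs", nargs="+", help="drug names, e.g. aspirin")
    return parser


def run(argv, predictor):
    """Parse argv and return (drug, prediction) pairs from the given predictor."""
    args = build_parser().parse_args(argv)
    return [(drug, predictor(drug)) for drug in args.drugs]


def stand_in_predictor(drug):
    # Placeholder so the sketch runs without the trained model.
    return f"<prediction for {drug}>"


# In the repository, one would pass predict.predict_side_effects instead
# (assuming it accepts a drug name, as shown in this README).
for drug, prediction in run(["aspirin", "ibuprofen"], stand_in_predictor):
    print(f"{drug}: {prediction}")
```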

View Results & Metrics

Generate and display performance metrics:

python generate_metrics.py
python view_results.py

Data Processing Pipeline

The Data Processing/ folder contains scripts for enriching drug SMILES data:

cd "Data Processing"

# 1. Fetch SMILES from ChEMBL
python smiles_from_chembl.py

# 2. Backfill missing SMILES from PubChem
python pubchem_for_missing.py

# 3. Apply data cleaning corrections
python Outliers_removal.py

Inputs/Outputs:

  • drug_names.csv → drug_smiles.csv → drug_smiles_completed.csv
  • final_multilabel_ADR_dataset.csv → final_multilabel_ADR_dataset_updated.csv
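
The backfill step (filling in SMILES that the first source did not return) can be sketched as follows. This is a simplified illustration, not the repository's script: the column names `drug_name` and `smiles` are assumptions, and the PubChem query is replaced by a dictionary lookup so the example runs offline.

```python
import csv
import io


def backfill_smiles(rows, lookup):
    """Fill empty 'smiles' fields via lookup(name); unresolved rows stay empty."""
    completed = []
    for row in rows:
        smiles = (row.get("smiles") or "").strip()
        if not smiles:
            smiles = lookup(row["drug_name"]) or ""
        completed.append({**row, "smiles": smiles})
    return completed


# Tiny in-memory stand-in for drug_smiles.csv (column names are assumptions).
chembl_csv = (
    "drug_name,smiles\n"
    "aspirin,CC(=O)OC1=CC=CC=C1C(=O)O\n"
    "exampledrug,\n"
    "unknowndrug,\n"
)
rows = list(csv.DictReader(io.StringIO(chembl_csv)))

# Stand-in for a PubChem query; a real lookup would call the PubChem API.
fallback = {"exampledrug": "CCO"}
completed = backfill_smiles(rows, fallback.get)

for row in completed:
    print(row["drug_name"], "->", row["smiles"] or "(still missing)")
```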

Model Architecture

The neural network for multi-label classification:

  • Input: ECFP-4 fingerprints (2048 features)
  • Architecture: 3-layer fully connected network with batch normalization
  • Output: 300+ binary predictions (one per side effect)
  • Optimization: Binary cross-entropy loss with label smoothing

For details, see model.py.
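
For orientation, a network matching the description above could look like the sketch below. It is an assumption-laden illustration, not a copy of model.py: the class name `SideEffectMLP`, the hidden sizes (1024, 512), the label count of 300, and the smoothing factor 0.1 are all placeholders. PyTorch's binary cross-entropy has no built-in label smoothing argument, so the targets are softened by hand here.

```python
import torch
from torch import nn


class SideEffectMLP(nn.Module):
    """Three linear layers with batch normalization (hypothetical sizes)."""

    def __init__(self, n_bits=2048, n_labels=300, hidden=(1024, 512)):
        super().__init__()
        h1, h2 = hidden
        self.net = nn.Sequential(
            nn.Linear(n_bits, h1), nn.BatchNorm1d(h1), nn.ReLU(),
            nn.Linear(h1, h2), nn.BatchNorm1d(h2), nn.ReLU(),
            nn.Linear(h2, n_labels),  # raw logits, one per side effect
        )

    def forward(self, x):
        return self.net(x)


def smoothed_bce(logits, targets, eps=0.1):
    """Binary cross-entropy on logits with smoothed targets."""
    # Label smoothing for binary targets: 1 -> 1 - eps/2, 0 -> eps/2.
    soft_targets = targets * (1.0 - eps) + 0.5 * eps
    return nn.functional.binary_cross_entropy_with_logits(logits, soft_targets)


model = SideEffectMLP()
fingerprints = torch.randint(0, 2, (4, 2048)).float()  # fake ECFP-4 batch
labels = torch.randint(0, 2, (4, 300)).float()         # fake multi-label targets
loss = smoothed_bce(model(fingerprints), labels)
print(loss.item())
```

Applying a sigmoid to the logits gives per-label probabilities, which are then compared against the per-label thresholds.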

Dependencies

Package        Version      Purpose
torch          ≥2.0.0       Deep learning framework
pandas         ≥2.0.0       Data manipulation
numpy          ≥1.24.0      Numerical computing
scikit-learn   ≥1.3.0       ML utilities & metrics
rdkit          ≥2023.3.1    Molecular fingerprinting
flask          (implicit)   Web framework
tqdm           ≥4.65.0      Progress bars
joblib         ≥1.3.0       Serialization
matplotlib     ≥3.7.0       Visualization
seaborn        ≥0.12.0      Statistical plots

Performance Metrics

The model is evaluated as a multi-label classifier across 300+ side effects. Results are written to plots/:

  • Detailed metrics available in plots/detailed_metrics.txt
  • Confusion matrices and performance plots in plots/
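
When reading multi-label metrics, it helps to distinguish micro-averaged from macro-averaged F1, since rare side effects affect the two differently. The sketch below shows the arithmetic in plain Python on made-up data; it is not the repository's evaluation code, and which averages generate_metrics.py reports is not stated in this README.

```python
def f1(tp, fp, fn):
    """F1 from true-positive, false-positive, and false-negative counts."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def multilabel_f1(y_true, y_pred):
    """Return (micro_f1, macro_f1) for lists of 0/1 label rows."""
    n_labels = len(y_true[0])
    counts = [[0, 0, 0] for _ in range(n_labels)]  # [tp, fp, fn] per label
    for true_row, pred_row in zip(y_true, y_pred):
        for j, (t, p) in enumerate(zip(true_row, pred_row)):
            if t and p:
                counts[j][0] += 1
            elif p:
                counts[j][1] += 1
            elif t:
                counts[j][2] += 1
    # Macro: average the per-label F1 scores, so each label counts equally.
    macro = sum(f1(*c) for c in counts) / n_labels
    # Micro: pool the counts over all labels, so frequent labels dominate.
    micro = f1(*(sum(c[k] for c in counts) for k in range(3)))
    return micro, macro


# Made-up data: label 0 is common and always right, label 1 is rare and missed.
y_true = [[1, 0], [1, 0], [1, 1]]
y_pred = [[1, 0], [1, 0], [1, 0]]
micro, macro = multilabel_f1(y_true, y_pred)
print(f"micro={micro:.3f} macro={macro:.3f}")
```

In this toy case the missed rare label pulls macro F1 down to 0.5 while micro F1 stays near 0.86.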

Usage Examples

Example 1: Predict side effects for a drug

python app.py
# Navigate to http://localhost:5000
# Enter drug name to get predictions

Example 2: Batch predictions

from predict import predict_side_effects

# Run the predictor once per drug name and print the top-ranked side effects
drugs = ["aspirin", "ibuprofen", "acetaminophen"]
for drug in drugs:
    results = predict_side_effects(drug)
    print(f"{drug}: {results['top_side_effects']}")

Example 3: Export results

python generate_metrics.py
# Metrics exported to plots/detailed_metrics_table.csv

Configuration

Key parameters in model.py:

  • BATCH_SIZE: Training batch size (default: 32)
  • LEARNING_RATE: Optimizer learning rate (default: 0.001)
  • THRESHOLD: Prediction threshold (optimized per label, stored in checkpoints/thresholds.npy)
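
One common way to obtain per-label thresholds is to pick, for each label, the cut-off that maximizes F1 on validation data. The sketch below illustrates that idea in plain Python. Whether this repository uses F1 as the criterion is an assumption, and the helper names and label names are hypothetical.

```python
def best_threshold(probs, truths, candidates):
    """Return the candidate threshold with the highest F1 for one label."""
    best_t, best_score = candidates[0], -1.0
    for t in candidates:
        tp = sum(1 for p, y in zip(probs, truths) if p >= t and y)
        fp = sum(1 for p, y in zip(probs, truths) if p >= t and not y)
        fn = sum(1 for p, y in zip(probs, truths) if p < t and y)
        denom = 2 * tp + fp + fn
        score = 2 * tp / denom if denom else 0.0
        if score > best_score:  # ties keep the earlier (lower) candidate
            best_t, best_score = t, score
    return best_t


def apply_thresholds(probs, thresholds, label_names):
    """Return the names of labels whose probability meets its own threshold."""
    return [n for p, t, n in zip(probs, thresholds, label_names) if p >= t]


# Validation scores for one label: a default 0.5 cut-off would miss the 0.3 case.
val_probs = [0.9, 0.8, 0.3, 0.2]
val_truth = [1, 1, 1, 0]
threshold = best_threshold(val_probs, val_truth, [0.25, 0.5, 0.75])
print(threshold)

# Hypothetical label names; the real ones live in checkpoints/label_names.json.
print(apply_thresholds([0.6, 0.2], [0.5, 0.25], ["nausea", "headache"]))
```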

Troubleshooting

Issue: RDKit import error

  • Solution: Ensure rdkit is installed: pip install rdkit

Issue: Model checkpoint not found

  • Solution: Verify checkpoint files exist in checkpoints/ directory
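
A quick way to see which checkpoint files are missing is sketched below. The file names come from the Project Structure listing; which of them inference actually requires is not stated here, so treating all six as expected is an assumption, and `missing_checkpoints` is a hypothetical helper.

```python
from pathlib import Path

# File names taken from the Project Structure listing in this README.
EXPECTED_FILES = (
    "best_model.pt",
    "final_model.pt",
    "inference_bundle.joblib",
    "label_names.json",
    "thresholds.npy",
    "training_history.json",
)


def missing_checkpoints(directory="checkpoints"):
    """Return the expected file names that are absent from the directory."""
    root = Path(directory)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


missing = missing_checkpoints()
if missing:
    print("Missing checkpoint files:", ", ".join(missing))
else:
    print("All expected checkpoint files are present.")
```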

Issue: Web app connection refused

  • Solution: Check if port 5000 is available or modify Flask port in app.py
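
To check whether the port is already taken, one can try to bind it from Python. The sketch below uses only the standard library; `port_is_free` is a hypothetical helper and not part of app.py.

```python
import socket


def port_is_free(port, host="127.0.0.1"):
    """Return True if a TCP socket can be bound to host:port right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            return False
    return True


print("port 5000 free:", port_is_free(5000))
```

If the port is busy, Flask's `app.run()` accepts a `port` argument, so a different value can be set where app.py starts the server.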

Contributing

Contributions welcome! Please:

  1. Fork the repository
  2. Create a feature branch
  3. Submit a pull request with a clear description

License

This project is open source. See the LICENSE file for details.

Citation

If you use this project in research, please cite:

@software{DrugSideEffectPrediction2024,
  title={AI-Based Drug Side-Effect Prediction Using Molecular Fingerprints and Deep Learning},
  author={Kalasi, Charith},
  year={2024},
  url={https://github.com/CharithKalasi/DrugSideEffectPrediction}
}

Contact

For questions or issues, please open a GitHub issue or contact the maintainer.


Last Updated: December 2024
Status: Active Development
