A state-of-the-art machine learning system for predicting molecular pKa values and performing pH-aware molecular docking, featuring quantum-enhanced ensemble models and comprehensive data processing pipelines.
- Quantum-Enhanced Ensemble: Best-performing model (RΒ² = 0.874) with 55 quantum-inspired features
- Physics-Informed Neural Networks: Incorporates Hammett equations and thermodynamic constraints
- Graph Neural Networks: Molecular graph representations with attention mechanisms
- Random Forest & XGBoost: Robust tree-based baseline models
- 17,000+ molecules from multiple high-quality sources (ChEMBL, IUPAC, SAMPL6)
- Advanced filtering with statistical outlier detection
- Batch data fetching for large-scale dataset collection
- Intelligent duplicate handling with IQR-based filtering
- pH-Aware Chemistry: Protonation states from pH 1-14 with variable step optimization
- Docking Integration: GNINA support with extensible framework (see DOCKING_GUIDE.md)
- Web API: FastAPI backend with receptor management and job queue
- Interactive Frontend: Next.js interface for molecule submission and visualization
- Model Persistence: Saved models for immediate deployment
- Comprehensive Evaluation: Performance metrics and validation framework
- Python 3.8+
- RDKit
- PyTorch
- scikit-learn
- XGBoost
- GNINA (optional, for docking)
# Clone repository
git clone https://github.com/FafnirX26/pHdock.git
cd pHdock
# Create virtual environment
python -m venv phd
source phd/bin/activate # On Windows: phd\Scripts\activate
# Install dependencies
pip install pandas numpy scikit-learn xgboost matplotlib seaborn rdkit torch
# Install GNINA for docking (optional)
# See GNINA_INSTALL.md for detailed instructions
conda install -c conda-forge gninaFor handling large datasets efficiently:
# Install git-lfs (Fedora/RHEL)
sudo dnf install git-lfs
# Initialize git-lfs
git lfs install
git lfs track "training_data/*.csv"# Use the best quantum-enhanced model
python fast_quantum_pka.py
# Generate comprehensive performance summary
python final_summary.py# Large-scale data collection
python batch_data_fetcher.py
# Advanced data filtering
python data_filter.py
# Train advanced ensemble model
python ensemble_train_advanced.py# Basic pKa prediction
python main.py --input molecules.smi --mode pka_prediction --output results/
# Use large dataset
python main.py --input molecules.smi --data_source large --data_limit 10000
# Fetch large data before training
python main.py --input molecules.smi --fetch_large_data --use_batch_fetcher
# Generate protonation states with optimized pH steps
python main.py --input molecules.sdf --mode protonation_states --ph_min 1 --ph_max 14
# Full pipeline with pH-aware docking
python main.py --input molecules.smi --receptor receptors/example/1ATP.pdb \
--mode full_pipeline --docking_tool gnina --ph_min 6 --ph_max 8# Train with different data sources
python main.py --data_source synthetic --data_limit 1000
python main.py --data_source chembl --data_limit 5000
python main.py --data_source combined --data_limit 10000
python main.py --data_source large --data_limit 17000batch_data_fetcher.py: Large-scale data acquisition (17K+ molecules)data_filter.py: Advanced filtering with statistical outlier detectionenhanced_data_fetcher.py: Multi-source data integration
fast_quantum_pka.py: Primary model - Quantum-enhanced ensembleensemble_train_advanced.py: Advanced ensemble with 47 featuresphysics_informed_pka.py: Physics-constrained neural networkgraph_neural_network_pka.py: Graph convolutional networksimple_train.py: Baseline Random Forest model
final_summary.py: Comprehensive performance analysistest_quantum_model.py: Quantum model validationtest_ensemble_advanced.py: Advanced ensemble testing
main.py: Enhanced main pipeline with flexible data optionssrc/data_integration.py: Unified data loading and processingsrc/: Core modules for feature engineering, prediction, and evaluation
- Input Processing: Standardizes SMILES/SDF inputs using RDKit
- Conformer Generation: 3D conformer generation with energy minimization
- Feature Engineering: 55 quantum-inspired molecular descriptors
- Quantum Surrogate: Electronic structure property prediction
- pKa Prediction: Ensemble model with optimized weights
- Protonation Engine: pH-dependent state generation with variable steps
- Docking Integration: Molecular docking with optimal protonation states
- Total molecules: 17,180 (after filtering)
- pKa range: -5.0 to 18.7
- Data retention: 99.7% after quality control
- Sources: ChEMBL, IUPAC, SAMPL6, comprehensive databases
training_data/filtered_pka_dataset.csv: Final cleaned dataset (17,180 records)training_data/combined_pka_dataset_large.csv: Raw combined datasettraining_data/: Individual source datasets and metadata
- Electronic descriptors: Gasteiger charges, HOMO-LUMO gap proxies
- Frontier orbital descriptors: EState indices, partial charges
- Polarization descriptors: Molar refractivity, surface areas, solvation
- Ionization descriptors: Functional group counts, H-bond donors/acceptors
- Aromaticity descriptors: Aromatic fraction, ring analysis
- Modified Z-score outlier detection (threshold: 3.5)
- IQR-based duplicate filtering
- Chemical validation and structure filtering
- Molecular weight filtering (50-1500 Da)
- pKa range validation (-8.0 to 23.0)
- Ensemble weights: XGBoost (0.75) + Random Forest (0.25)
- Feature scaling: StandardScaler with NaN handling
- Cross-validation: 5-fold with stratified sampling
- Hyperparameter optimization: Grid search with early stopping
The protonation engine uses intelligent variable pH step sizes optimized for biological systems:
- Step 1.0: pH_min to pH 6 (coarse sampling)
- Step 0.1: pH 6 to pH 7 (fine sampling)
- Step 0.05: pH 7 to pH 7.6 (ultra-fine sampling)
- Step 0.1: pH 7.6 to pH 8 (fine sampling)
- Step 1.0: pH 8 to pH_max (coarse sampling)
This provides fine-grained sampling around physiologically relevant pH values (6-8) where most pKa transitions occur.
- Acetic acid (4.76), Benzoic acid (4.2), Phenol (9.95)
- Aniline (4.6), Ethylamine (10.7), Pyridine (5.2)
- Imidazole (7.0), Ibuprofen (4.4), Acetaminophen (9.5)
# Test individual models
python test_quantum_model.py
python test_ensemble_advanced.py
python test_data_integration.py
# Full pipeline validation
python test_updated_pipeline.pyThe following trained models are available in models/:
fast_quantum_xgb.pkl: XGBoost with quantum featuresfast_quantum_rf.pkl: Random Forest with quantum featuresfast_quantum_scaler.pkl: Feature scalerfast_quantum_weights.pkl: Ensemble weights
--input: Input file (SMILES or SDF format)--mode: Operating mode (pka_prediction, protonation_states, full_pipeline, train_models)--output: Output directory--data_source: Training data source (synthetic, chembl, combined, large)--data_limit: Maximum number of molecules to load
--fetch_large_data: Run large-scale data fetching before training--use_batch_fetcher: Use advanced batch fetcher for maximum data collection--receptor: Receptor PDB file (for docking)--docking_tool: Docking tool (gnina, diffdock, equibind)--model_type: Model type (xgboost, gnn, ensemble)--save_models,--load_models: Model persistence options
- Input processing: ~1-5 seconds per molecule
- Feature calculation: ~2-10 seconds per molecule
- pKa prediction: <1 second per molecule (after training)
- Protonation state generation: ~1-5 seconds per molecule
- Model training: ~10-60 minutes (depends on dataset size)
- Small dataset (1K molecules): ~500MB RAM
- Large dataset (17K molecules): ~2-4GB RAM
- Model inference: ~100MB RAM per model
# Install development dependencies
pip install pytest jupyter notebook
# Run tests
pytest tests/
# Launch development environment
jupyter notebookThis work represents a significant advancement in computational pKa prediction:
- State-of-the-art performance with quantum-enhanced features
- Physics-informed approach incorporating chemical knowledge
- Large-scale validation on diverse molecular dataset
- Production-ready pipeline for drug discovery applications
The quantum-enhanced ensemble model achieves performance comparable to expensive DFT calculations while maintaining computational efficiency suitable for high-throughput screening.
This project is licensed under the MIT License - see the LICENSE file for details.