ChemData is a pipeline for processing and analyzing chemical compound data, with a focus on psychopharmacological compounds. It combines data from BindingDB with web-enriched information, patent data, and machine learning predictions.
BindingDB Integration
- Automated data processing
- Structure validation and standardization
- Property calculation
- Binding data analysis
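As a rough illustration of the binding data analysis step, the sketch below reads a BindingDB-style TSV export and converts Ki values to pKi. The column names (`Ligand SMILES`, `Target Name`, `Ki (nM)`) and all function names are assumptions made for this example; the pipeline's own implementation may differ.

```python
import csv
import io
import math
from typing import Iterator, Optional


def parse_affinity_nm(raw: str) -> Optional[float]:
    """Parse an affinity cell such as '12.5', '>10000' or '' into nanomolar.

    Qualifiers like '>' or '<' are dropped here for simplicity; a real
    pipeline would likely keep them as censoring information.
    """
    text = raw.strip().lstrip("<>=~ ")
    if not text:
        return None
    try:
        value = float(text)
    except ValueError:
        return None
    return value if value > 0 else None


def ki_to_pki(ki_nm: float) -> float:
    """Convert Ki in nM to pKi = -log10(Ki in mol/L) = 9 - log10(Ki in nM)."""
    return 9.0 - math.log10(ki_nm)


def iter_binding_rows(handle) -> Iterator[dict]:
    """Yield simplified records for rows that carry a usable Ki value."""
    reader = csv.DictReader(handle, delimiter="\t")
    for row in reader:
        ki_nm = parse_affinity_nm(row.get("Ki (nM)") or "")
        if ki_nm is None:
            continue  # skip rows without a parseable Ki
        yield {
            "smiles": row.get("Ligand SMILES") or "",
            "target": row.get("Target Name") or "",
            "pki": round(ki_to_pki(ki_nm), 2),
        }


if __name__ == "__main__":
    # Tiny in-memory sample; the second row has no Ki and is skipped.
    sample = (
        "Ligand SMILES\tTarget Name\tKi (nM)\n"
        "CCO\tExample target\t10\n"
        "CCN\tExample target\t\n"
    )
    for record in iter_binding_rows(io.StringIO(sample)):
        print(record)
```

Working in pKi rather than raw nM keeps values on a log scale, which is usually easier to compare across targets.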
Additional Data Sources
- ChEMBL API integration
- PubChem data harvesting
- Swiss* services (SwissTargetPrediction, SwissADME)
- Patent database search and analysis
Community Data
- PsychonautWiki API integration
- Erowid experience reports
- TripSit factsheets
- Reddit discussions (r/researchchemicals, r/nootropics)
- Twitter mentions and trends
- Bluesky integration
Binding Predictions
- Target-specific models
- Cross-target interactions
- Binding site prediction
- Uncertainty estimation
Activity Classification
- Mechanism of action
- Effect classification
- Duration prediction
- Structure-activity relationships
BBB Permeability Prediction
- Core fingerprint-based prediction
- Transporter analysis (P-gp, BCRP)
- ML model integration
- Web data enrichment
- Comprehensive validation suite
Safety Assessment
- Toxicity prediction
- Drug interaction risks
- Side effect profiles
- Abuse potential
Structure Analysis
- 2D/3D conformer generation
- Pharmacophore detection
- Similarity search
- Substructure analysis
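To make the similarity search idea concrete, here is a minimal sketch of Tanimoto similarity over fingerprints represented as sets of "on" bit indices. It assumes fingerprints have already been computed elsewhere (for example with RDKit); the function names and the toy fingerprints are hypothetical and not taken from the project.

```python
from typing import Dict, FrozenSet, List, Tuple


def tanimoto(a: FrozenSet[int], b: FrozenSet[int]) -> float:
    """Tanimoto coefficient: |A ∩ B| / |A ∪ B| for two sets of on-bits."""
    if not a and not b:
        return 0.0  # convention chosen here for two empty fingerprints
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def most_similar(
    query: FrozenSet[int],
    library: Dict[str, FrozenSet[int]],
    top_k: int = 3,
) -> List[Tuple[str, float]]:
    """Rank library entries by similarity to the query, best first."""
    scored = [(name, tanimoto(query, fp)) for name, fp in library.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


if __name__ == "__main__":
    # Toy fingerprints with made-up bit indices.
    library = {
        "compound_a": frozenset({1, 2, 3}),
        "compound_b": frozenset({2, 3, 4}),
        "compound_c": frozenset({7, 8}),
    }
    for name, score in most_similar(frozenset({1, 2, 3}), library):
        print(f"{name}: {score:.2f}")
```

A linear scan like this is fine for small libraries; larger collections would typically need an index or a vectorized implementation.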
Property Calculation
- Physicochemical properties
- Drug-likeness scores
- ADMET predictions
- Blood-brain barrier penetration
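One common drug-likeness score is Lipinski's rule of five. The sketch below counts rule violations from precomputed descriptors; the `Properties` class and function names are hypothetical, and the descriptor values in the usage example are invented for illustration rather than taken from any real compound.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Properties:
    """Precomputed descriptors, e.g. as produced by a cheminformatics toolkit."""

    molecular_weight: float  # g/mol
    logp: float              # octanol-water partition coefficient
    h_bond_donors: int
    h_bond_acceptors: int


def lipinski_violations(props: Properties) -> List[str]:
    """Return the rule-of-five criteria that the compound violates."""
    violations = []
    if props.molecular_weight > 500:
        violations.append("molecular weight > 500")
    if props.logp > 5:
        violations.append("logP > 5")
    if props.h_bond_donors > 5:
        violations.append("H-bond donors > 5")
    if props.h_bond_acceptors > 10:
        violations.append("H-bond acceptors > 10")
    return violations


def is_drug_like(props: Properties, max_violations: int = 1) -> bool:
    """Treat a compound as drug-like if it breaks at most one criterion."""
    return len(lipinski_violations(props)) <= max_violations


if __name__ == "__main__":
    example = Properties(molecular_weight=650.0, logp=6.2,
                         h_bond_donors=2, h_bond_acceptors=8)
    print(lipinski_violations(example))
    print(is_drug_like(example))
```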
Literature Mining
- PubMed integration
- Patent analysis
- Citation tracking
- Regulatory status
Compound Browser
- Advanced search and filtering
- Structure visualization
- Activity data display
- Prediction visualization
Detail Views
- Chemical properties
- Binding profiles
- Safety information
- Community data
- Patent references
Export System
- Flexible column selection
- Custom filtering
- Multiple formats
- Batch processing
Prerequisites
- Python 3.8 or higher
- Docker and Docker Compose
- RDKit
- OpenBabel
- PyTorch (optional, for ML features)
- PostgreSQL
- Redis
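Before installing, it can help to check which of the tools listed above are already available. The following is a small sketch using only the standard library; the import names `rdkit`, `openbabel`, and `torch` are the usual Python module names for those packages, and the script does not check PostgreSQL or Redis, on the assumption that they run as Docker Compose services.

```python
import importlib.util
import shutil
import sys
from typing import Dict


def check_prerequisites() -> Dict[str, bool]:
    """Report which prerequisites appear to be available on this machine."""
    return {
        "python>=3.8": sys.version_info >= (3, 8),
        # Executables are looked up on PATH.
        "docker": shutil.which("docker") is not None,
        # Python packages are detected without importing them.
        "rdkit": importlib.util.find_spec("rdkit") is not None,
        "openbabel": importlib.util.find_spec("openbabel") is not None,
        "torch (optional)": importlib.util.find_spec("torch") is not None,
    }


if __name__ == "__main__":
    for name, available in check_prerequisites().items():
        print(f"{name:18s} {'ok' if available else 'missing'}")
```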
- Clone the repository:
git clone https://github.com/yourusername/chemdata.git
cd chemdata
- Create and activate a virtual environment:
python -m venv .venv
source .venv/bin/activate # Linux/macOS
# or
.venv\Scripts\activate # Windows
- Install dependencies:
./scripts/setup_dev.sh
- Set up environment variables:
cp .env.example .env
# Edit .env with your configuration
- Start the services:
docker-compose up -d
- Run the pipeline:
./scripts/run_pipeline.sh
- Access the web interface:
http://localhost:8000
# Start development container
docker-compose up -d dev
# Enter development shell
docker-compose exec dev bash
# Install development dependencies
./scripts/setup_dev.sh --dev
# Run all tests
docker-compose run --rm test
# Run specific tests
docker-compose run --rm test pytest path/to/test.py
# Run tests with coverage
docker-compose run --rm test pytest --cov=binding_data_processor
# Run linters
pre-commit run --all-files
# Run type checking
mypy binding_data_processor
# Run security checks
bandit -r binding_data_processor
# Build documentation
cd docs
make html
Process compounds from BindingDB:
python -m binding_data_processor.cli process-compounds \
--input bindingdb.tsv \
--output results/ \
--enable-ml \
--enable-web \
--enable-social
Run the web interface:
streamlit run examples/web_app/app.py
Use the pipeline from Python:
from binding_data_processor.pipeline import ProcessingPipeline
from binding_data_processor.pipeline.config import ProcessingConfig

# Create pipeline with ML predictions, web enrichment, and social monitoring enabled
pipeline = ProcessingPipeline(
    config=ProcessingConfig(
        use_ml_predictions=True,
        use_web_enrichment=True,
        use_social_monitoring=True,
    )
)

# Process compounds
compounds = pipeline.process_compounds(
    input_file="bindingdb.tsv",
    output_dir="results/",
)

# Use BBB predictor
from binding_data_processor.processors.psychopharm.predictors.bbb import (
    BBBPredictorWebEnriched,
)

predictor = BBBPredictorWebEnriched(
    model_dir="models/bbb",
    cache_dir="cache",
)

# Assumption: process_compounds returns an indexable collection of compounds.
compound = compounds[0]
result = predictor.predict(compound)
print(f"BBB Class: {result.value}")
print(f"Confidence: {result.confidence:.2f}")
print("\nSupporting Data:")
for key, value in result.supporting_data.items():
    print(f"  {key}: {value}")
# Process BindingDB data
./scripts/process_bindingdb.sh \
--input data/raw/BindingDB_All.tsv \
--output data/processed/compounds.tsv \
--workers 4 \
--batch-size 100
# Enrich compounds
./scripts/enrich_compounds.sh \
--input data/processed/compounds.tsv \
--output data/enriched/compounds.tsv \
--workers 4 \
--batch-size 100 \
--rate-limit 2 \
--sources "chembl,pubchem,swiss,community,social"
# Analyze compounds
./scripts/analyze_compounds.sh \
--input data/enriched/compounds.tsv \
--output data/analyzed/compounds.tsv \
--patent-search \
--structure-analysis \
--property-calculation
# Generate report
./scripts/generate_report.sh \
--input data/analyzed/compounds.tsv \
--output-dir reports \
--format html \
--include-plots
binding_data_processor/
├── data_sources/          # Data source integrations
│   ├── bindingdb.py       # BindingDB processing
│   ├── chembl.py          # ChEMBL API client
│   └── pubchem.py         # PubChem integration
├── models/                # Data models and ML
│   ├── compound/          # Compound data models
│   └── psychopharm/       # Psychopharm models
├── pipeline/              # Processing pipeline
│   ├── base.py            # Pipeline coordination
│   ├── ml.py              # ML predictions
│   └── web.py             # Web enrichment
├── processors/            # Data processors
│   ├── structure/         # Structure processing
│   ├── patent/            # Patent analysis
│   └── psychopharm/       # Psychopharm analysis
├── web_enrichment/        # Web data enrichment
│   ├── manager.py         # Enrichment coordination
│   ├── swiss/             # Swiss tools integration
│   └── community/         # Community data sources
└── web/                   # Web interface
    ├── api/               # REST API endpoints
    ├── components/        # UI components
    └── pages/             # Web pages
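The enrichment script shown earlier takes `--batch-size` and `--rate-limit` flags. The sketch below shows one way batching and request throttling could work together. It assumes `--rate-limit 2` means at most two requests per second, which the README does not state, and the names `batched` and `RateLimiter` are hypothetical rather than part of the project.

```python
import time
from typing import Callable, Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(items: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of up to batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batch: List[T] = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # final, possibly shorter batch


class RateLimiter:
    """Space out calls so that at most max_per_second happen each second."""

    def __init__(
        self,
        max_per_second: float,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self._min_interval = 1.0 / max_per_second
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = 0.0

    def wait(self) -> None:
        """Block until the next call is allowed, then reserve the slot after it."""
        now = self._clock()
        if now < self._next_allowed:
            self._sleep(self._next_allowed - now)
            now = self._next_allowed
        self._next_allowed = now + self._min_interval


if __name__ == "__main__":
    limiter = RateLimiter(max_per_second=2)
    for batch in batched(["cmpd-1", "cmpd-2", "cmpd-3"], batch_size=2):
        limiter.wait()  # throttle before each (hypothetical) remote request
        print("would enrich:", batch)
```

The clock and sleep functions are injectable so the limiter can be tested without real delays.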
The application can be configured through environment variables or a .env file:
# Data directories
CHEMDATA_DATA_DIR=./data
CHEMDATA_CACHE_DIR=./cache
CHEMDATA_LOG_DIR=./logs
CHEMDATA_OUTPUT_DIR=./output
CHEMDATA_MODEL_DIR=./models
# API credentials
REDDIT_CLIENT_ID=your_client_id
REDDIT_CLIENT_SECRET=your_client_secret
TWITTER_API_KEY=your_api_key
TWITTER_API_SECRET=your_api_secret
# Database
POSTGRES_USER=chemdata
POSTGRES_PASSWORD=chemdata
POSTGRES_DB=chemdata
POSTGRES_HOST=postgres
# Redis
REDIS_HOST=redis
REDIS_PORT=6379
# Web server
FLASK_APP=binding_data_processor.web.app
FLASK_ENV=development
FLASK_DEBUG=1
Docker Compose services:
- web: Web application and API
- worker: Background task worker
- redis: Cache and message broker
- postgres: Database
- dev: Development environment
- test: Test runner
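As an illustration of how the environment variables above might be consumed, here is a minimal sketch that parses `.env`-style text and builds a typed settings object. The `Settings` class, `parse_env_file`, and `load_settings` are hypothetical names, the defaults simply mirror the sample values shown in this README, and the project may well use a dedicated library for this instead.

```python
import os
from dataclasses import dataclass
from pathlib import Path
from typing import Dict, Mapping


@dataclass(frozen=True)
class Settings:
    data_dir: Path
    cache_dir: Path
    postgres_host: str
    redis_host: str
    redis_port: int


def parse_env_file(text: str) -> Dict[str, str]:
    """Parse simple KEY=VALUE lines, ignoring blanks and '#' comments."""
    values: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build Settings from a mapping, falling back to the sample defaults."""
    return Settings(
        data_dir=Path(env.get("CHEMDATA_DATA_DIR", "./data")),
        cache_dir=Path(env.get("CHEMDATA_CACHE_DIR", "./cache")),
        postgres_host=env.get("POSTGRES_HOST", "postgres"),
        redis_host=env.get("REDIS_HOST", "redis"),
        redis_port=int(env.get("REDIS_PORT", "6379")),
    )


if __name__ == "__main__":
    # Values from a .env file are overridden by real environment variables.
    file_values = parse_env_file("# Redis\nREDIS_HOST=localhost\nREDIS_PORT=6380\n")
    print(load_settings({**file_values, **os.environ}))
```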
See CONTRIBUTING.md for contribution guidelines.
This project is licensed under the MIT License - see LICENSE for details.
- BindingDB for providing compound data
- ChEMBL for their comprehensive API
- RDKit team for cheminformatics tools
- Open source community for various libraries used
- PsychonautWiki and Erowid for community data
- Swiss Institute of Bioinformatics for web services
- Patent offices for making data publicly accessible
If you use this software in your research, please cite:
@software{chemdata2024,
  author = {anomium},
  title = {ChemData: A Comprehensive Pipeline for Psychoactive Compound Analysis},
  year = {2024},
  url = {https://github.com/anomium/chemdata}
}