Prescribing genes, not just drugs.
GeneRx is an open-source pipeline that answers a deceptively simple question in drug discovery:
For a given disease, which genes make the best therapeutic targets — and what drugs already hit them?
Instead of starting from a drug and asking what it treats, GeneRx starts from the gene. It cross-references three public biomedical databases to rank genes by their therapeutic potential, surface the drugs that modulate them, and present everything in an interactive dashboard — the way a precision medicine workflow should work.
Traditional drug discovery asks: "Does this molecule work?" GeneRx asks first: "Is this gene worth targeting?"
The pipeline integrates:
- Open Targets — disease-gene association scores backed by genetics, literature, and clinical evidence
- ChEMBL — bioactivity data: which compounds hit which proteins, how potently, and how far they've gone in clinical trials
- UniProt — protein-level annotation: function, subcellular location, known disease links
The result is a ranked leaderboard of genes — scored by evidence, druggability, and pharmacological richness — with direct links to the drugs that target them.
- How it works
- Features
- Architecture
- Repository Layout
- Quickstart
- Configuration
- Scoring
- Data Schemas
- Troubleshooting
- Roadmap
- Contributing
- License
Given one or more disease EFO IDs (e.g. EFO_0000400 for diabetes mellitus):
- Fetch disease-target associations from Open Targets (GraphQL, paginated)
- Resolve top genes to UniProt accessions
- Enrich with ChEMBL targets, bioactivities, and clinical drug indications
- Annotate proteins with UniProt functional data
- Load raw records into DuckDB (bronze layer)
- Normalize and join evidence into analytical entities (silver layer)
- Compute composite biomarker scores and assign tiers (gold layer)
- Explore results in a Streamlit dashboard
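The first step above pages through association results until enough targets have been collected. A minimal sketch of that loop, assuming a hypothetical `fetch_page(page_index, page_size)` callable that wraps the real GraphQL request. The names `iter_associations` and `fake_fetch`, and the fake rows, are illustrative and not taken from extract/opentargets.py:

```python
from typing import Callable, Iterator

# Hypothetical page fetcher: (page_index, page_size) -> list of row dicts.
PageFetcher = Callable[[int, int], list[dict]]


def iter_associations(
    fetch_page: PageFetcher, page_size: int = 50, max_targets: int = 200
) -> Iterator[dict]:
    """Yield rows page by page until the source runs dry or max_targets is hit."""
    yielded = 0
    page_index = 0
    while yielded < max_targets:
        rows = fetch_page(page_index, page_size)
        if not rows:
            break  # empty page: nothing left to fetch
        for row in rows:
            if yielded >= max_targets:
                return
            yield row
            yielded += 1
        if len(rows) < page_size:
            break  # short page: this was the last one
        page_index += 1


# Stand-in for a real API call: serves 120 fake association rows.
_FAKE_ROWS = [{"gene": f"GENE{i}", "score": round(1 - i / 200, 3)} for i in range(120)]


def fake_fetch(page_index: int, page_size: int) -> list[dict]:
    start = page_index * page_size
    return _FAKE_ROWS[start:start + page_size]


top = list(iter_associations(fake_fetch, page_size=50, max_targets=75))
print(len(top), top[0]["gene"], top[-1]["gene"])  # 75 GENE0 GENE74
```

Passing the fetcher in as a callable keeps the paging logic testable offline; in this sketch, `max_targets` plays the role that OT_MAX_TARGETS plays in the configuration.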
- Gene-centric drug discovery — start from the target, not the molecule
- Composite scoring: OT evidence + druggability + pharmacological richness
- Tier classification (Tier 1 / 2 / 3) for fast prioritization
- Interactive dashboard: leaderboard, drug-potency scatter, clinical phase distribution, gene detail with radar chart
- Retry-aware API clients with pagination and batching
- Modular Python codebase — easy to extend with new data sources
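To illustrate what "retry-aware clients with pagination and batching" can look like, here is a minimal sketch of exponential-backoff retries plus a batching helper. `with_retries`, `batched`, and the choice of retryable exceptions are assumptions made for this example; the clients in extract/ may be implemented differently:

```python
import time
from typing import Callable, Iterator, Sequence, TypeVar

T = TypeVar("T")


def with_retries(
    call: Callable[[], T],
    attempts: int = 4,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run call(); on a transient error wait base_delay * 2**attempt and retry."""
    for attempt in range(attempts):
        try:
            return call()
        except (ConnectionError, TimeoutError):
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)
    raise ValueError("attempts must be >= 1")


def batched(items: Sequence[T], size: int) -> Iterator[Sequence[T]]:
    """Split items into consecutive chunks of at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Demo: a call that fails twice, then succeeds. Delays are recorded, not slept.
calls = {"n": 0}


def flaky() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient failure")
    return "ok"


delays: list[float] = []
result = with_retries(flaky, sleep=delays.append)
print(result, delays)  # ok [0.5, 1.0]
print(list(batched(["P1", "P2", "P3", "P4", "P5"], 2)))
# [['P1', 'P2'], ['P3', 'P4'], ['P5']]
```

Injecting `sleep` makes the backoff schedule observable in tests without actually waiting.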
Main orchestrator: biomarker_pipeline/pipeline.py
Extract → Load bronze → Transform silver → Score gold → Dashboard
| Stage | Module | Source |
|---|---|---|
| Extract | extract/opentargets.py | Open Targets GraphQL |
| Extract | extract/chembl.py | ChEMBL REST API |
| Extract | extract/uniprot.py | UniProt REST API |
| Load | load/duckdb_loader.py | DuckDB |
| Transform | transform/normalizer.py | DuckDB SQL |
| Score | transform/scorer.py | DuckDB SQL |
| Dashboard | dashboard/app.py | Streamlit + Plotly |
.
├─ biomarker_pipeline/
│ ├─ config.py # API URLs, DuckDB path, scoring weights
│ ├─ models.py
│ ├─ pipeline.py # Main orchestrator
│ ├─ requirements.txt
│ ├─ extract/
│ │ ├─ opentargets.py
│ │ ├─ chembl.py
│ │ └─ uniprot.py
│ ├─ load/
│ │ └─ duckdb_loader.py
│ ├─ transform/
│ │ ├─ normalizer.py
│ │ └─ scorer.py
│ └─ dashboard/
│ ├─ app.py
│ └─ queries.py
└─ README.md
- Python 3.10+ (3.11 recommended)
- Internet access to Open Targets, ChEMBL, and UniProt
Create a virtual environment:
cd biomarker_pipeline
python -m venv .venv
Activate it:
- Windows PowerShell: .venv\Scripts\Activate.ps1
- macOS/Linux: source .venv/bin/activate
Install dependencies:
pip install -r requirements.txt
Run the pipeline:
python pipeline.py --diseases EFO_0000400 EFO_0001378
This creates data/biomarker.duckdb with bronze, silver, and gold tables populated.
Launch the dashboard:
streamlit run dashboard/app.py
Open http://localhost:8501.
Edit biomarker_pipeline/config.py:
| Parameter | Description |
|---|---|
| DUCKDB_PATH | Path to the DuckDB file |
| OT_MAX_TARGETS | Max genes fetched from Open Targets |
| CHEMBL_MAX_GENES | Max genes enriched via ChEMBL |
| MIN_PCHEMBL_ACTIVE | Minimum pChEMBL to consider a compound active |
| WEIGHT_OT_OVERALL | Weight for OT score in composite (default 0.50) |
| WEIGHT_DRUGGABILITY | Weight for druggability score (default 0.30) |
| WEIGHT_EVIDENCE_RICH | Weight for evidence richness (default 0.20) |
For faster local runs, reduce OT_MAX_TARGETS and CHEMBL_MAX_GENES.
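For readers unfamiliar with the MIN_PCHEMBL_ACTIVE cutoff: pChEMBL is the negative base-10 logarithm of a molar potency value (IC50, Ki, Kd, and similar), so higher means more potent. The sketch below converts nanomolar potencies and applies a threshold. The value 6.0 (1 µM) is an assumption chosen for illustration, not necessarily the default in config.py, and the helper names are hypothetical:

```python
import math

MIN_PCHEMBL_ACTIVE = 6.0  # assumed example value: 6.0 corresponds to 1 µM


def pchembl_from_nm(value_nm: float) -> float:
    """Convert a potency in nanomolar to pChEMBL: -log10(molar) = 9 - log10(nM)."""
    if value_nm <= 0:
        raise ValueError("potency must be positive")
    return 9.0 - math.log10(value_nm)


def is_active(pchembl: float, threshold: float = MIN_PCHEMBL_ACTIVE) -> bool:
    """A compound counts as active when its pChEMBL meets the threshold."""
    return pchembl >= threshold


for nm in (10.0, 100.0, 10_000.0):
    p = pchembl_from_nm(nm)
    print(f"{nm:>8.0f} nM -> pChEMBL {p:.1f} active={is_active(p)}")
```

Raising the threshold keeps only more potent compounds, which shrinks the set of drug-gene links that reach the silver layer.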
Composite score formula (transform/scorer.py):
composite = 0.50 × OT_score + 0.30 × druggability + 0.20 × evidence_richness
Tier thresholds:
| Tier | Condition |
|---|---|
| Tier 1 | composite >= 0.7 |
| Tier 2 | composite >= 0.4 |
| Tier 3 | otherwise |
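The formula and thresholds above can be mirrored in a few lines of Python. This is only a sketch: the pipeline computes scores in DuckDB SQL inside transform/scorer.py, and the function names here are hypothetical. It also assumes each component score is already normalized to the range 0 to 1:

```python
# Default weights as listed in the configuration table.
WEIGHT_OT_OVERALL = 0.50
WEIGHT_DRUGGABILITY = 0.30
WEIGHT_EVIDENCE_RICH = 0.20


def composite_score(ot_score: float, druggability: float, evidence_richness: float) -> float:
    """Weighted sum of the three component scores (each assumed to be in [0, 1])."""
    return (
        WEIGHT_OT_OVERALL * ot_score
        + WEIGHT_DRUGGABILITY * druggability
        + WEIGHT_EVIDENCE_RICH * evidence_richness
    )


def assign_tier(composite: float) -> str:
    """Map a composite score to a tier using the documented thresholds."""
    if composite >= 0.7:
        return "Tier 1"
    if composite >= 0.4:
        return "Tier 2"
    return "Tier 3"


# Made-up component scores for three imaginary genes.
examples = {
    "GENE_A": (0.9, 0.8, 0.5),  # 0.45 + 0.24 + 0.10 = 0.79
    "GENE_B": (0.6, 0.3, 0.2),  # 0.30 + 0.09 + 0.04 = 0.43
    "GENE_C": (0.3, 0.1, 0.0),  # 0.15 + 0.03 + 0.00 = 0.18
}
for gene, parts in examples.items():
    score = composite_score(*parts)
    print(gene, round(score, 2), assign_tier(score))
```

Because the default weights sum to 1.0, the composite stays in the same 0 to 1 range as its inputs, which is what makes fixed tier thresholds meaningful. If you change the weights in config.py, keeping that sum at 1.0 preserves the interpretation of the thresholds.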
DuckDB stores data in three layers:
| Layer | Tables |
|---|---|
| bronze | ot_associations, chembl_activities, chembl_drug_indications, chembl_targets, uniprot_proteins |
| silver | biomarker_candidates, drug_biomarker_links |
| gold | biomarker_scores, disease_summary |
DDL defined in load/duckdb_loader.py.
Dashboard says "Database not found"
Run the pipeline first.
API requests are slow or time out
Reduce OT_MAX_TARGETS and CHEMBL_MAX_GENES in config.py, or retry later.
PowerShell blocks environment activation
Set-ExecutionPolicy -Scope Process -ExecutionPolicy Bypass
- Unit and integration tests
- Structured logging and run metadata
- API response caching
- Improved cross-source entity matching (gene ID resolution)
- CI pipeline
- Export to CSV / Excel from dashboard
Contributions welcome.
- Fork the repository
- Create a feature branch
- Keep commits focused
- Open a pull request with motivation, summary, and validation notes
For broad proposals, open an issue first.
Released under the MIT License.
