bmascat/generx

GeneRx

Prescribing genes, not just drugs.

GeneRx is an open-source pipeline that answers a deceptively simple question in drug discovery:

For a given disease, which genes make the best therapeutic targets — and what drugs already hit them?

Instead of starting from a drug and asking what it treats, GeneRx starts from the gene. It cross-references three public biomedical databases to rank genes by their therapeutic potential, surface the drugs that modulate them, and present everything in an interactive dashboard — the way a precision medicine workflow should work.


The idea

Traditional drug discovery asks: "Does this molecule work?" GeneRx asks first: "Is this gene worth targeting?"

The pipeline integrates:

  • Open Targets — disease-gene association scores backed by genetics, literature, and clinical evidence
  • ChEMBL — bioactivity data: which compounds hit which proteins, how potently, and how far they've gone in clinical trials
  • UniProt — protein-level annotation: function, subcellular location, known disease links

The result is a ranked leaderboard of genes — scored by evidence, druggability, and pharmacological richness — with direct links to the drugs that target them.


How it works

Given one or more disease EFO IDs (e.g. EFO_0000400 for type 2 diabetes):

  1. Fetch disease-target associations from Open Targets (GraphQL, paginated)
  2. Resolve top genes to UniProt accessions
  3. Enrich with ChEMBL targets, bioactivities, and clinical drug indications
  4. Annotate proteins with UniProt functional data
  5. Load raw records into DuckDB (bronze layer)
  6. Normalize and join evidence into analytical entities (silver layer)
  7. Compute composite biomarker scores and assign tiers (gold layer)
  8. Explore results in a Streamlit dashboard
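Step 1 reads associations page by page. The loop can be sketched as below, assuming a hypothetical fetch_page(index, size) callable that returns one page of rows; the real client in extract/opentargets.py may be organized differently, and fake_fetch stands in for the network call.

```python
from __future__ import annotations

from typing import Callable, Iterator, Optional


def iter_associations(
    fetch_page: Callable[[int, int], list],
    page_size: int = 100,
    max_rows: Optional[int] = None,
) -> Iterator[dict]:
    """Yield rows page by page until a short page or max_rows is reached."""
    index = 0
    yielded = 0
    while max_rows is None or yielded < max_rows:
        rows = fetch_page(index, page_size)
        for row in rows:
            yield row
            yielded += 1
            if max_rows is not None and yielded >= max_rows:
                return
        if len(rows) < page_size:
            return  # a short (or empty) page means there is nothing left
        index += 1


# Stand-in for a real API call: serves slices of an in-memory list.
_FAKE_ROWS = [{"target": f"GENE{i}", "score": 1.0 - i / 10} for i in range(7)]


def fake_fetch(index: int, size: int) -> list:
    return _FAKE_ROWS[index * size:(index + 1) * size]


top5 = list(iter_associations(fake_fetch, page_size=3, max_rows=5))
everything = list(iter_associations(fake_fetch, page_size=3))
```

Passing max_rows plays the role that OT_MAX_TARGETS plays in the configuration: it caps how many genes are pulled before the slower enrichment steps run.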

Features

  • Gene-centric drug discovery — start from the target, not the molecule
  • Composite scoring: OT evidence + druggability + pharmacological richness
  • Tier classification (Tier 1 / 2 / 3) for fast prioritization
  • Interactive dashboard: leaderboard, drug-potency scatter, clinical phase distribution, gene detail with radar chart
  • Retry-aware API clients with pagination and batching
  • Modular Python codebase — easy to extend with new data sources
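As an illustration of what "retry-aware" and "batching" can mean in practice, here is a minimal standard-library sketch. The names with_retries and batched are hypothetical and are not taken from the GeneRx source; the exponential backoff schedule is an assumption as well.

```python
import time
from typing import Callable, Iterator, Sequence, TypeVar

T = TypeVar("T")


def with_retries(call: Callable[[], T], attempts: int = 3,
                 base_delay: float = 0.5, sleep=time.sleep) -> T:
    """Run call(); on failure wait base_delay * 2**attempt, then try again."""
    for attempt in range(attempts):
        try:
            return call()
        except Exception:  # a real client would catch narrower error types
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)
    raise ValueError("attempts must be >= 1")


def batched(items: Sequence[T], size: int) -> Iterator[Sequence[T]]:
    """Split items into consecutive chunks of at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Demo: a call that fails twice, then succeeds. Delays are recorded
# instead of slept so the example runs instantly.
calls = {"n": 0}


def flaky() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"


delays: list = []
result = with_retries(flaky, attempts=3, sleep=delays.append)
chunks = list(batched(["A1", "A2", "A3", "A4", "A5"], 2))
```

Batching keeps each request small (for example, a handful of accessions per call), and the backoff gives a rate-limited public API time to recover between attempts.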

Architecture

Main orchestrator: biomarker_pipeline/pipeline.py

Extract  →  Load bronze  →  Transform silver  →  Score gold  →  Dashboard

Stage      Module                     Source
Extract    extract/opentargets.py     Open Targets GraphQL
Extract    extract/chembl.py          ChEMBL REST API
Extract    extract/uniprot.py         UniProt REST API
Load       load/duckdb_loader.py      DuckDB
Transform  transform/normalizer.py    DuckDB SQL
Score      transform/scorer.py        DuckDB SQL
Dashboard  dashboard/app.py           Streamlit + Plotly
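The stage chain above can be pictured as a list of functions run in order, each receiving the run context and returning it updated. This is only a sketch of the pattern: run_stages and the placeholder stage functions are hypothetical, and pipeline.py may wire the modules together differently.

```python
from __future__ import annotations

from typing import Callable

Stage = Callable[[dict], dict]  # takes the run context, returns it updated


def run_stages(stages: list[tuple[str, Stage]], context: dict) -> dict:
    """Run each stage in order, recording which ones completed."""
    for name, stage in stages:
        context = stage(context)
        context.setdefault("completed", []).append(name)
    return context


# Placeholder stages: each one only passes a fake row count along.
def extract(ctx: dict) -> dict:
    return {**ctx, "raw_rows": 3}


def load_bronze(ctx: dict) -> dict:
    return {**ctx, "bronze_rows": ctx["raw_rows"]}


def transform_silver(ctx: dict) -> dict:
    return {**ctx, "silver_rows": ctx["bronze_rows"]}


def score_gold(ctx: dict) -> dict:
    return {**ctx, "gold_rows": ctx["silver_rows"]}


result = run_stages(
    [
        ("extract", extract),
        ("load_bronze", load_bronze),
        ("transform_silver", transform_silver),
        ("score_gold", score_gold),
    ],
    {"diseases": ["EFO_0000400"]},
)
```

Keeping stages as plain functions makes it straightforward to add a new data source: write one more extractor and insert it into the list.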

Repository Layout

.
├─ biomarker_pipeline/
│  ├─ config.py           # API URLs, DuckDB path, scoring weights
│  ├─ models.py
│  ├─ pipeline.py         # Main orchestrator
│  ├─ requirements.txt
│  ├─ extract/
│  │  ├─ opentargets.py
│  │  ├─ chembl.py
│  │  └─ uniprot.py
│  ├─ load/
│  │  └─ duckdb_loader.py
│  ├─ transform/
│  │  ├─ normalizer.py
│  │  └─ scorer.py
│  └─ dashboard/
│     ├─ app.py
│     └─ queries.py
└─ README.md

Quickstart

Requirements

  • Python 3.10+ (3.11 recommended)
  • Internet access to Open Targets, ChEMBL, and UniProt

Install

cd biomarker_pipeline
python -m venv .venv

Activate:

  • Windows PowerShell: .venv\Scripts\Activate.ps1
  • macOS/Linux: source .venv/bin/activate

Install dependencies:

pip install -r requirements.txt

Run the pipeline

python pipeline.py --diseases EFO_0000400 EFO_0001378

This creates data/biomarker.duckdb with bronze, silver, and gold tables populated.
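The --diseases flag takes one or more space-separated EFO IDs. A command line of that shape can be parsed with argparse as sketched below; build_parser is a hypothetical name, and marking the flag as required is an assumption rather than something confirmed by pipeline.py.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        prog="pipeline.py",
        description="Rank therapeutic target genes for one or more diseases.",
    )
    parser.add_argument(
        "--diseases",
        nargs="+",          # one or more values after the flag
        required=True,      # assumption: the real CLI may define a default
        metavar="EFO_ID",
        help="Disease EFO IDs, e.g. EFO_0000400",
    )
    return parser


# Parse the same arguments as the command shown above.
args = build_parser().parse_args(["--diseases", "EFO_0000400", "EFO_0001378"])
disease_ids = args.diseases
```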

Run the dashboard

streamlit run dashboard/app.py

Open http://localhost:8501.


Configuration

Edit biomarker_pipeline/config.py:

Parameter              Description
DUCKDB_PATH            Path to the DuckDB file
OT_MAX_TARGETS         Max genes fetched from Open Targets
CHEMBL_MAX_GENES       Max genes enriched via ChEMBL
MIN_PCHEMBL_ACTIVE     Minimum pChEMBL to consider a compound active
WEIGHT_OT_OVERALL      Weight for OT score in composite (default 0.50)
WEIGHT_DRUGGABILITY    Weight for druggability score (default 0.30)
WEIGHT_EVIDENCE_RICH   Weight for evidence richness (default 0.20)

For faster local runs, reduce OT_MAX_TARGETS and CHEMBL_MAX_GENES.
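To make MIN_PCHEMBL_ACTIVE concrete: pChEMBL is the negative base-10 logarithm of a molar potency value (IC50, Ki, and similar), so higher means more potent. The sketch below converts nanomolar values and applies a threshold. The value 6.0 (equivalent to 1 µM) is illustrative only and is not taken from config.py; pchembl_from_nm and is_active are hypothetical helper names.

```python
import math
from typing import Optional

MIN_PCHEMBL_ACTIVE = 6.0  # illustrative value only; see config.py for the real one


def pchembl_from_nm(value_nm: float) -> float:
    """Convert a potency in nanomolar to the -log10(molar) scale."""
    if value_nm <= 0:
        raise ValueError("potency must be positive")
    return 9.0 - math.log10(value_nm)  # 1 nM = 1e-9 M


def is_active(pchembl: Optional[float],
              threshold: float = MIN_PCHEMBL_ACTIVE) -> bool:
    """Treat missing pChEMBL values as inactive."""
    return pchembl is not None and pchembl >= threshold


potent = pchembl_from_nm(10)      # a 10 nM compound
weak = pchembl_from_nm(50_000)    # a 50 µM compound
flags = [is_active(potent), is_active(weak), is_active(None)]
```

Raising the threshold keeps only more potent compounds, which shrinks the set of drug-gene links that feed into the scores.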


Scoring

Composite score formula (transform/scorer.py):

composite = 0.50 × OT_score + 0.30 × druggability + 0.20 × evidence_richness

Tier thresholds:

Tier     Condition
Tier 1   composite >= 0.7
Tier 2   0.4 <= composite < 0.7
Tier 3   composite < 0.4
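The formula and thresholds can be mirrored in a few lines of Python. The actual scorer runs as DuckDB SQL in transform/scorer.py, so this is only an equivalent sketch; it assumes each component score is already normalized to the range 0 to 1, and the function names are hypothetical.

```python
import math

# Default weights from the configuration table.
WEIGHT_OT_OVERALL = 0.50
WEIGHT_DRUGGABILITY = 0.30
WEIGHT_EVIDENCE_RICH = 0.20

# The weights should sum to 1 so the composite stays on a 0-1 scale.
assert math.isclose(
    WEIGHT_OT_OVERALL + WEIGHT_DRUGGABILITY + WEIGHT_EVIDENCE_RICH, 1.0
)


def composite_score(ot_score: float, druggability: float,
                    evidence_richness: float) -> float:
    """Weighted sum of the three component scores (each assumed in [0, 1])."""
    return (
        WEIGHT_OT_OVERALL * ot_score
        + WEIGHT_DRUGGABILITY * druggability
        + WEIGHT_EVIDENCE_RICH * evidence_richness
    )


def assign_tier(composite: float) -> str:
    """Map a composite score to a tier using the documented thresholds."""
    if composite >= 0.7:
        return "Tier 1"
    if composite >= 0.4:
        return "Tier 2"
    return "Tier 3"


strong = composite_score(0.9, 0.8, 0.5)   # roughly 0.79
middling = composite_score(0.6, 0.3, 0.2)  # roughly 0.43
tiers = [assign_tier(strong), assign_tier(middling), assign_tier(0.1)]
```

Because Open Targets evidence carries half the weight, a gene with weak association evidence cannot reach Tier 1 on druggability and pharmacological richness alone: with ot_score at 0.3, the composite tops out at 0.65.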

Data Schemas

DuckDB stores data in three layers:

Layer    Tables
bronze   ot_associations, chembl_activities, chembl_drug_indications, chembl_targets, uniprot_proteins
silver   biomarker_candidates, drug_biomarker_links
gold     biomarker_scores, disease_summary

DDL defined in load/duckdb_loader.py.


Troubleshooting

Dashboard says "Database not found"

Run the pipeline first. The dashboard needs the DuckDB file that the pipeline creates (see DUCKDB_PATH in config.py).

API requests are slow or time out

Reduce OT_MAX_TARGETS and CHEMBL_MAX_GENES in config.py, or retry later.

PowerShell blocks environment activation

Set-ExecutionPolicy -Scope Process -ExecutionPolicy Bypass

Roadmap

  • Unit and integration tests
  • Structured logging and run metadata
  • API response caching
  • Improved cross-source entity matching (gene ID resolution)
  • CI pipeline
  • Export to CSV / Excel from dashboard

Contributing

Contributions welcome.

  1. Fork the repository
  2. Create a feature branch
  3. Keep commits focused
  4. Open a pull request with motivation, summary, and validation notes

For broad proposals, open an issue first.


License

Released under the MIT License.
