GeneRx

Prescribing genes, not just drugs.

GeneRx is an open-source pipeline that answers a deceptively simple question in drug discovery:

For a given disease, which genes make the best therapeutic targets — and what drugs already hit them?

Instead of starting from a drug and asking what it treats, GeneRx starts from the gene. It cross-references three public biomedical databases to rank genes by their therapeutic potential, surface the drugs that modulate them, and present everything in an interactive dashboard — the way a precision medicine workflow should work.

The idea

Traditional drug discovery asks: "Does this molecule work?" GeneRx asks first: "Is this gene worth targeting?"

The pipeline integrates:

Open Targets — disease-gene association scores backed by genetics, literature, and clinical evidence
ChEMBL — bioactivity data: which compounds hit which proteins, how potently, and how far they've gone in clinical trials
UniProt — protein-level annotation: function, subcellular location, known disease links

The result is a ranked leaderboard of genes — scored by evidence, druggability, and pharmacological richness — with direct links to the drugs that target them.

How it works

Given one or more disease EFO IDs (e.g. EFO_0000400 for diabetes mellitus):

Fetch disease-target associations from Open Targets (GraphQL, paginated)
Resolve top genes to UniProt accessions
Enrich with ChEMBL targets, bioactivities, and clinical drug indications
Annotate proteins with UniProt functional data
Load raw records into DuckDB (bronze layer)
Normalize and join evidence into analytical entities (silver layer)
Compute composite biomarker scores and assign tiers (gold layer)
Explore results in a Streamlit dashboard
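The paginated fetch in the first step can be sketched as follows. This is a minimal illustration, not the code in extract/opentargets.py: `collect_associations`, `fetch_page`, and the `(index, size)` page shape are hypothetical names, and the fake page source stands in for the real GraphQL request so the example runs offline.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]


def collect_associations(
    fetch_page: Callable[[int, int], List[Row]],
    max_targets: int,
    page_size: int = 50,
) -> List[Row]:
    """Pull pages until max_targets rows are collected or a short page signals the end."""
    rows: List[Row] = []
    index = 0
    while len(rows) < max_targets:
        page = fetch_page(index, page_size)
        rows.extend(page)
        if len(page) < page_size:  # short or empty page: nothing left to fetch
            break
        index += 1
    return rows[:max_targets]


# Hypothetical in-memory data replacing the network call.
_FAKE = [{"symbol": f"GENE{i}", "score": round(1 - i / 200, 3)} for i in range(120)]


def fake_fetch_page(index: int, size: int) -> List[Row]:
    start = index * size
    return _FAKE[start:start + size]


top = collect_associations(fake_fetch_page, max_targets=75)
print(len(top), top[0]["symbol"], top[-1]["symbol"])  # 75 GENE0 GENE74
```

Passing the page fetcher in as a callable keeps the loop testable without network access; a real fetcher would send the query and page variables to the Open Targets endpoint.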

Features

Gene-centric drug discovery — start from the target, not the molecule
Composite scoring: OT evidence + druggability + pharmacological richness
Tier classification (Tier 1 / 2 / 3) for fast prioritization
Interactive dashboard: leaderboard, drug-potency scatter, clinical phase distribution, gene detail with radar chart
Retry-aware API clients with pagination and batching
Modular Python codebase — easy to extend with new data sources
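As a rough sketch of what "retry-aware with batching" can mean in practice, the helpers below retry a call with exponential backoff and split an ID list into fixed-size batches. The names `with_retries` and `batched` are hypothetical and are not taken from the GeneRx clients; the injected `sleep` parameter is only there so the example runs instantly.

```python
import time
from typing import Callable, Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def with_retries(
    call: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run call(); on failure wait base_delay * 2**n, then try again, up to `attempts` tries."""
    for attempt in range(attempts):
        try:
            return call()
        except Exception:  # a real client would catch only timeouts and 5xx-style errors
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)
    raise ValueError("attempts must be >= 1")


def batched(items: Sequence[T], size: int) -> Iterator[List[T]]:
    """Yield consecutive chunks of at most `size` items."""
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


# Demo: a lookup that fails twice before succeeding.
calls = {"n": 0}


def flaky_lookup() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient failure")
    return "ok"


delays: List[float] = []
result = with_retries(flaky_lookup, attempts=4, sleep=delays.append)
batches = list(batched(["P1", "P2", "P3", "P4", "P5"], 2))
print(result, delays, batches)
```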

Architecture

Main orchestrator: biomarker_pipeline/pipeline.py

Extract  →  Load bronze  →  Transform silver  →  Score gold  →  Dashboard

Stage	Module	Source
Extract	`extract/opentargets.py`	Open Targets GraphQL
Extract	`extract/chembl.py`	ChEMBL REST API
Extract	`extract/uniprot.py`	UniProt REST API
Load	`load/duckdb_loader.py`	DuckDB
Transform	`transform/normalizer.py`	DuckDB SQL
Score	`transform/scorer.py`	DuckDB SQL
Dashboard	`dashboard/app.py`	Streamlit + Plotly

Repository Layout

.
├─ biomarker_pipeline/
│  ├─ config.py           # API URLs, DuckDB path, scoring weights
│  ├─ models.py
│  ├─ pipeline.py         # Main orchestrator
│  ├─ requirements.txt
│  ├─ extract/
│  │  ├─ opentargets.py
│  │  ├─ chembl.py
│  │  └─ uniprot.py
│  ├─ load/
│  │  └─ duckdb_loader.py
│  ├─ transform/
│  │  ├─ normalizer.py
│  │  └─ scorer.py
│  └─ dashboard/
│     ├─ app.py
│     └─ queries.py
└─ README.md

Quickstart

Requirements

Python 3.10+ (3.11 recommended)
Internet access to Open Targets, ChEMBL, and UniProt

Install

cd biomarker_pipeline
python -m venv .venv

Activate:

Windows PowerShell: .venv\Scripts\Activate.ps1
macOS/Linux: source .venv/bin/activate

pip install -r requirements.txt

Run the pipeline

python pipeline.py --diseases EFO_0000400 EFO_0001378

This creates data/biomarker.duckdb with bronze, silver, and gold tables populated.

Run the dashboard

streamlit run dashboard/app.py

Open http://localhost:8501.

Configuration

Edit biomarker_pipeline/config.py:

Parameter	Description
`DUCKDB_PATH`	Path to the DuckDB file
`OT_MAX_TARGETS`	Max genes fetched from Open Targets
`CHEMBL_MAX_GENES`	Max genes enriched via ChEMBL
`MIN_PCHEMBL_ACTIVE`	Minimum pChEMBL to consider a compound active
`WEIGHT_OT_OVERALL`	Weight for OT score in composite (default 0.50)
`WEIGHT_DRUGGABILITY`	Weight for druggability score (default 0.30)
`WEIGHT_EVIDENCE_RICH`	Weight for evidence richness (default 0.20)

For faster local runs, reduce OT_MAX_TARGETS and CHEMBL_MAX_GENES.
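A reduced configuration for a quick local run might look like the sketch below. Only the three weight defaults and the database path come from this README; the values for `OT_MAX_TARGETS`, `CHEMBL_MAX_GENES`, and `MIN_PCHEMBL_ACTIVE` are assumptions chosen for illustration, and `check_weights`, `pchembl_from_nm`, and `is_active` are hypothetical helpers, not part of config.py.

```python
import math
from pathlib import Path

DUCKDB_PATH = Path("data/biomarker.duckdb")

# Assumed values for a fast run; the shipped defaults may differ.
OT_MAX_TARGETS = 50
CHEMBL_MAX_GENES = 20
MIN_PCHEMBL_ACTIVE = 6.0  # assumed cutoff; pChEMBL 6 corresponds to 1 micromolar

# Documented defaults.
WEIGHT_OT_OVERALL = 0.50
WEIGHT_DRUGGABILITY = 0.30
WEIGHT_EVIDENCE_RICH = 0.20


def check_weights() -> None:
    """Assumption: weights should sum to 1 so the composite stays on the inputs' scale."""
    total = WEIGHT_OT_OVERALL + WEIGHT_DRUGGABILITY + WEIGHT_EVIDENCE_RICH
    if not math.isclose(total, 1.0):
        raise ValueError(f"scoring weights sum to {total}, expected 1.0")


def pchembl_from_nm(value_nm: float) -> float:
    """pChEMBL is -log10 of the molar potency; for nanomolar input that is 9 - log10(nM)."""
    return 9.0 - math.log10(value_nm)


def is_active(pchembl: float) -> bool:
    return pchembl >= MIN_PCHEMBL_ACTIVE


check_weights()
print(pchembl_from_nm(100.0), is_active(pchembl_from_nm(100.0)))
```

If you change one weight, adjust the others so the composite remains comparable with the tier thresholds.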

Scoring

Composite score formula (transform/scorer.py):

composite = 0.50 × OT_score + 0.30 × druggability + 0.20 × evidence_richness

Tier thresholds:

Tier	Condition
Tier 1	`composite >= 0.7`
Tier 2	`0.4 <= composite < 0.7`
Tier 3	otherwise
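The formula and thresholds above translate directly into a few lines of Python. The actual scorer runs as DuckDB SQL, so this is only a mirror of the arithmetic for checking values by hand; the function names are hypothetical, and the example inputs assume each component is already scaled to the range 0 to 1.

```python
from typing import Tuple

DEFAULT_WEIGHTS: Tuple[float, float, float] = (0.50, 0.30, 0.20)


def composite_score(
    ot_score: float,
    druggability: float,
    evidence_richness: float,
    weights: Tuple[float, float, float] = DEFAULT_WEIGHTS,
) -> float:
    """Weighted sum of the three components, using the documented default weights."""
    w_ot, w_drug, w_rich = weights
    return w_ot * ot_score + w_drug * druggability + w_rich * evidence_richness


def assign_tier(composite: float) -> str:
    """Apply the documented thresholds: >= 0.7 is Tier 1, >= 0.4 is Tier 2, else Tier 3."""
    if composite >= 0.7:
        return "Tier 1"
    if composite >= 0.4:
        return "Tier 2"
    return "Tier 3"


# Made-up component scores for three hypothetical genes.
examples = {
    "GENE_A": (0.9, 0.8, 0.6),  # 0.45 + 0.24 + 0.12 = 0.81
    "GENE_B": (0.6, 0.4, 0.3),  # 0.30 + 0.12 + 0.06 = 0.48
    "GENE_C": (0.3, 0.1, 0.2),  # 0.15 + 0.03 + 0.04 = 0.22
}
for gene, parts in examples.items():
    score = composite_score(*parts)
    print(gene, round(score, 2), assign_tier(score))
```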

Data Schemas

DuckDB stores data in three layers:

Layer	Tables
`bronze`	`ot_associations`, `chembl_activities`, `chembl_drug_indications`, `chembl_targets`, `uniprot_proteins`
`silver`	`biomarker_candidates`, `drug_biomarker_links`
`gold`	`biomarker_scores`, `disease_summary`

The DDL for all three layers is defined in load/duckdb_loader.py.
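To illustrate how a silver table can be derived from bronze records, here is a small sketch. It uses `sqlite3` from the standard library as a stand-in for DuckDB so it runs without extra dependencies, and it flattens schema names into table prefixes. All column names, the join key, and the activity cutoff of 6.0 are assumptions; the real definitions live in the loader and normalizer modules.

```python
import sqlite3

con = sqlite3.connect(":memory:")

# Bronze: raw records as fetched (hypothetical columns).
con.executescript("""
CREATE TABLE bronze_ot_associations (disease_id TEXT, gene_symbol TEXT, ot_score REAL);
CREATE TABLE bronze_chembl_activities (gene_symbol TEXT, molecule_id TEXT, pchembl REAL);
""")
con.executemany(
    "INSERT INTO bronze_ot_associations VALUES (?, ?, ?)",
    [("EFO_0000400", "GENE_A", 0.82), ("EFO_0000400", "GENE_B", 0.35)],
)
con.executemany(
    "INSERT INTO bronze_chembl_activities VALUES (?, ?, ?)",
    [("GENE_A", "MOL_1", 7.4), ("GENE_A", "MOL_2", 5.1), ("GENE_B", "MOL_3", 6.2)],
)

# Silver: one row per disease-gene pair with its count of active compounds.
con.execute("""
CREATE TABLE silver_biomarker_candidates AS
SELECT a.disease_id, a.gene_symbol, a.ot_score,
       COUNT(c.molecule_id) AS n_active
FROM bronze_ot_associations AS a
LEFT JOIN bronze_chembl_activities AS c
  ON c.gene_symbol = a.gene_symbol AND c.pchembl >= 6.0
GROUP BY a.disease_id, a.gene_symbol, a.ot_score
""")

rows = con.execute(
    "SELECT gene_symbol, ot_score, n_active "
    "FROM silver_biomarker_candidates ORDER BY ot_score DESC"
).fetchall()
print(rows)
```

The LEFT JOIN keeps genes that have no active compounds, so they still appear in the candidate table with a count of zero.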

Troubleshooting

Dashboard says "Database not found"

Run the pipeline first. The dashboard reads data/biomarker.duckdb, which is only created by a pipeline run.

API requests are slow or time out

Reduce OT_MAX_TARGETS and CHEMBL_MAX_GENES in config.py, or retry later.

PowerShell blocks environment activation

Allow scripts for the current PowerShell session only, then activate again:

Set-ExecutionPolicy -Scope Process -ExecutionPolicy Bypass

Roadmap

Unit and integration tests
Structured logging and run metadata
API response caching
Improved cross-source entity matching (gene ID resolution)
CI pipeline
Export to CSV / Excel from dashboard

Contributing

Contributions welcome.

Fork the repository
Create a feature branch
Keep commits focused
Open a pull request with motivation, summary, and validation notes

For broad proposals, open an issue first.

License

Released under the MIT License.
