Add PyEnsemblAdapter as offline alternative to MartsAdapter #106

@jonasscheid

Problem

The MartsAdapter relies on live HTTP queries to Ensembl BioMart, which is a known source of flakiness — server outages, rate limiting, and malformed responses (see #103). This is especially problematic for downstream tools like nf-core/epitopeprediction, where epaa.py depends on MartsAdapter for two calls:

  1. get_transcript_information() — CDS, strand, gene name retrieval (blocking — no peptides without it)
  2. get_protein_ids_from_transcripts() — transcript → protein/RefSeq/UniProt ID mapping (annotation)

While #103 proposes migrating to the Ensembl REST API, that still requires live network queries and is subject to rate limits (15 req/s).

Proposal

Add a PyEnsemblAdapter that implements ADBAdapter using PyEnsembl (by OpenVax). PyEnsembl downloads Ensembl GTF + FASTA files once and indexes them into a local SQLite database — all subsequent queries are entirely offline.

This covers the critical path:

ADBAdapter method               PyEnsembl equivalent
------------------------------  --------------------------------------------------------
get_transcript_information()    transcript.coding_sequence, transcript.strand, gene.name
get_transcript_sequence()       transcript.coding_sequence
get_product_sequence()          genome.protein_sequence()
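
To make the mapping above concrete, here is a minimal sketch of what the adapter could look like. It is not a finished implementation: the class name `PyEnsemblAdapterSketch`, the `from_release` helper, and the plain string keys in the returned dict are placeholders of mine. A real version would subclass epytope's ADBAdapter and use whatever field constants and signatures that interface defines. The genome object is injected so the class can be exercised without PyEnsembl installed.

```python
class PyEnsemblAdapterSketch:
    """Offline lookups backed by a PyEnsembl genome object (illustrative only)."""

    def __init__(self, genome):
        # `genome` is expected to behave like a pyensembl.EnsemblRelease instance.
        self.genome = genome

    @classmethod
    def from_release(cls, release, species="human"):
        # Hypothetical convenience constructor; PyEnsembl is imported lazily
        # so this module still loads when the optional dependency is missing.
        from pyensembl import EnsemblRelease

        genome = EnsemblRelease(release=release, species=species)
        genome.download()  # one-time fetch of the GTF + FASTA files
        genome.index()     # builds the local database used for later queries
        return cls(genome)

    def get_transcript_information(self, transcript_id):
        transcript = self.genome.transcript_by_id(transcript_id)
        # Placeholder keys; the real adapter would use epytope's own field names.
        return {
            "sequence": transcript.coding_sequence,
            "gene": transcript.gene_name,
            "strand": transcript.strand,
        }

    def get_transcript_sequence(self, transcript_id):
        return self.genome.transcript_by_id(transcript_id).coding_sequence

    def get_product_sequence(self, protein_id):
        return self.genome.protein_sequence(protein_id)
```

Something like `PyEnsemblAdapterSketch.from_release(75)` would trigger the one-time download and indexing; after that every call is a local lookup. Handling of non-coding or incomplete transcripts is deliberately left out of this sketch.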

What it does not cover:

  • Cross-database ID mapping (Ensembl → RefSeq/UniProt) — but this is already being addressed in nf-core/epitopeprediction via VCF/VEP CSQ extraction

Advantages over REST/BioMart

  • No network dependency after the initial download, which eliminates the number-one failure mode
  • Reproducible — pinned to a specific Ensembl release
  • Fast — local SQLite queries instead of HTTP round-trips
  • Non-breaking — new adapter class alongside existing MartsAdapter

Scope

A new epytope/IO/PyEnsemblAdapter.py implementing the three ADBAdapter abstract methods:

  • get_product_sequence()
  • get_transcript_sequence()
  • get_transcript_information()

PyEnsembl would be an optional dependency (e.g. pip install epytope[pyensembl]).
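
One way to keep the dependency optional at runtime is an import guard that fails with an install hint. The helper name `require_optional` is hypothetical, and the extra name is taken from the pip command above; the extra itself would still need to be declared in the package metadata.

```python
import importlib


def require_optional(module_name, extra):
    """Import an optional dependency, or raise with a hint on how to install it."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"{module_name} is required for this adapter; "
            f"install it with: pip install epytope[{extra}]"
        ) from exc
```

The adapter module could call `require_optional("pyensembl", "pyensembl")` when the adapter is first constructed, so that importing epytope itself never fails for users who only need MartsAdapter.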

Notes

  • PyEnsembl supports Ensembl releases 54–114, human + mouse + other species
  • Initial download is ~1–2 GB per release/species (one-time cost, cacheable in containers/CI)
  • Users need to match Ensembl release to genome build (e.g. GRCh37 → release 75, GRCh38 → release 110+)
  • This complements #103 (Consider migrating MartsAdapter from BioMart to Ensembl REST API): PyEnsembl for the offline-capable subset, REST API for cross-references if needed
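
To reduce the chance of a release/build mismatch, the adapter could ship a small lookup seeded with the example pairings from the notes above. This is only a sketch: `DEFAULT_RELEASE_BY_BUILD` and `release_for_build` are hypothetical names, and the two entries are the examples given in this issue rather than an exhaustive or authoritative table.

```python
# Example pairings from the notes above; 110 stands in for "110+" on GRCh38.
DEFAULT_RELEASE_BY_BUILD = {"GRCh37": 75, "GRCh38": 110}


def release_for_build(build):
    """Return a default Ensembl release for a genome build name."""
    try:
        return DEFAULT_RELEASE_BY_BUILD[build]
    except KeyError:
        known = ", ".join(sorted(DEFAULT_RELEASE_BY_BUILD))
        raise ValueError(
            f"Unknown genome build {build!r}; known builds: {known}"
        ) from None
```

Failing loudly on an unknown build seems safer than silently falling back to the latest release, since a wrong release would give coordinates that do not match the user's variants.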

Metadata

Labels

enhancement: New feature or request
