feat(g2p): add gget g2p module for the Genomics 2 Proteins portal (#138) #220
Merged
New module querying the Genomics 2 Proteins (G2P) portal (https://g2p.broadinstitute.org/) for residue-level protein structure/function annotations. The API serves TSV, parsed into a pandas DataFrame.

- gget.g2p(gene, uniprot_id, resource='features'|'map'|'alignment', isoform=None, save=False, verbose=True)
- 'features': per-residue table (AlphaFold pLDDT, UniProt sites, pockets, PTMs); 'map': gene->transcript->isoform->structure identifiers; 'alignment': residue-level isoform alignment
- CLI parser + dispatch in main.py, export in __init__.py, G2P_API in constants.py
- Tests (live integration + network-free validation) + docs (g2p.md, updates.md)

Resolves scverse#138

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
lauraluebbert added a commit that referenced this pull request on Jun 21, 2026
Dev -> main: scverse packaging modernization (#215), gget g2p module (#220), scverse URL migration (#219), tests/coverage badges, CI consolidation (single pytest_results.txt, dynamic latest-Python gating), gget search NaN→None normalization, gget mutate pyarrow-empty-slice guard, scanpy>=1.10 pin in the cellxgene extra, gdrive backup overwrite, and updates.md notes.

Conflicts resolved:
- README.md: kept dev's dynamic tests/coverage badges; dropped the stale static `Coverage-83%` badge.
- docs/src/en/introduction.md: removed duplicate `# Welcome!` heading and whitespace-only conflict block.
- docs/src/es/introduction.md: same; removed duplicate `# ¡Bienvenidos!`.
- tests/pytest_results_py3.12.txt: accepted dev's delete (superseded by tests/pytest_results.txt under the new CI consolidation).
lauraluebbert added a commit that referenced this pull request on Jun 21, 2026
PR #220 inserted a new import block before the existing alphabetical list without removing the originals further down, leaving six modules (alphafold/archs4/enrichr/gpt/pdb/setup) imported twice. Harmless at runtime (Python deduplicates), but trips ruff F811 once the pre-commit hooks land. Restore alphabetical order with g2p in its slot.
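The duplicate-import situation described in this commit can be illustrated with a small standard-library check. This is only a rough sketch: it is not how ruff implements F811, the helper name `find_duplicate_imports` is hypothetical, and the import lines in `example` are an illustrative excerpt rather than the actual contents of `gget/__init__.py`.

```python
import ast
from collections import Counter


def find_duplicate_imports(source):
    """Return names bound by more than one top-level import statement."""
    tree = ast.parse(source)
    bound = Counter()
    for node in tree.body:  # top-level statements only
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            for alias in node.names:
                # The name that ends up bound in the module namespace
                name = alias.asname or alias.name.split(".")[0]
                bound[name] += 1
    return sorted(name for name, count in bound.items() if count > 1)


# Hypothetical excerpt: a new block added on top of an existing list
example = """\
from .gget_alphafold import alphafold
from .gget_g2p import g2p
from .gget_pdb import pdb
from .gget_alphafold import alphafold
from .gget_pdb import pdb
"""

print(find_duplicate_imports(example))  # ['alphafold', 'pdb']
```

The second import of each name simply rebinds it to the already-loaded object, which is why the duplication has no runtime effect and only surfaces through linting.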
Summary
Resolves #138. Adds a new module `gget g2p` that queries the Genomics 2 Proteins (G2P) portal (Broad Institute; Kwon, Safer, Nguyen et al., Nature Methods 2024) to link genes/proteins to residue-level structural and functional annotations.

The G2P REST API serves tab-separated values, which the module parses into a pandas DataFrame, fitting gget's "query a database → DataFrame in one line" idiom (cf. `gget pdb`, `gget bgee`).

What it does
gget.g2p(gene, uniprot_id, resource="features"|"map"|"alignment", isoform=None, save=False, verbose=True)
- `features` (default): per-residue feature table (AlphaFold pLDDT, UniProt sites, secondary structure, predicted pockets, PTMs, …), 140+ columns.
- `map`: gene → transcript → protein isoform → structure map (UniProt / Ensembl / RefSeq / PDB identifiers).
- `alignment`: residue-level sequence alignment between two isoforms (requires `isoform`; `uniprot_id` is the canonical isoform).

Changes
- `gget/gget_g2p.py`: new module (direct REST via `requests`, TSV → DataFrame; no new heavy dependency, does not vendor the `g2papi` client).
- `gget/main.py`: `parser_g2p` + dispatch (positional `gene`, `-u/--uniprot_id`, `-r/--resource`, `-i/--isoform`, `-o/--out`, `-csv`, `-q`).
- `gget/__init__.py`: export `g2p`; `gget/constants.py`: `G2P_API`.
- `tests/test_g2p.py`: live integration tests (assert on stable columns / identifiers, since the feature table is wide and its values can change) + network-free argument-validation tests.
- `docs/src/en/g2p.md` + `docs/src/en/updates.md`.

Testing
All tests pass locally (Python 3.11), and the CLI was exercised for all three resources:
Notes
- `g2papi` README suggested JSON, but the API actually returns TSV; verified and handled accordingly).
- `uniprot_id` is required; the help text and error message point users to `gget info` to find it. Auto-resolving UniProt IDs from a gene symbol could be a follow-up.
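
The argument rules described in this PR (a required `uniprot_id`, three allowed `resource` values, and `isoform` required for `alignment`) can be checked before any network request is made. The following is a minimal sketch under those stated rules, not the code in `gget/gget_g2p.py`; the helper name `validate_g2p_args`, the constant `VALID_RESOURCES`, and the error messages are all hypothetical.

```python
VALID_RESOURCES = ("features", "map", "alignment")


def validate_g2p_args(uniprot_id, resource="features", isoform=None):
    """Hypothetical helper: reject bad arguments without touching the network."""
    if not uniprot_id:
        # Assumed wording; the PR only says the message points to 'gget info'
        raise ValueError(
            "A UniProt ID is required; 'gget info' can be used to look it up."
        )
    if resource not in VALID_RESOURCES:
        raise ValueError(
            f"resource must be one of {VALID_RESOURCES}, got {resource!r}."
        )
    if resource == "alignment" and isoform is None:
        raise ValueError("resource='alignment' requires an isoform.")
    return resource


# Example: each invalid call fails fast with a ValueError
for kwargs in (
    {"uniprot_id": None},
    {"uniprot_id": "P38398", "resource": "structure"},
    {"uniprot_id": "P38398", "resource": "alignment"},
):
    try:
        validate_g2p_args(**kwargs)
    except ValueError as err:
        print(f"rejected {kwargs}: {err}")
```

Because checks like these never reach the API, they are the kind of behavior a network-free test can assert on reliably, while assertions on returned values are better limited to stable columns and identifiers.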