Consolidate chemical mappings into unified utility #530
Merged
realmarcin merged 7 commits into master on Mar 22, 2026
Creates a single unified chemical mapping resource that consolidates all KG-Microbe chemical mapping sources into mappings/unified_chemical_mappings.tsv.gz.

Consolidated sources (6):
- mappings/chemical_mappings.tsv (KEGG/BacDive)
- data/raw/compound_mappings_strict*.tsv (MediaDive ingredients)
- kg_microbe/transform_utils/bacdive/metabolite_mapping.json
- kg_microbe/transform_utils/ontologies/xrefs/chebi_xrefs.tsv
- kg_microbe/transform_utils/madin_etal/chebi_manual_annotation.tsv

Results:
- 164,702 unique ChEBI IDs
- 461,443 total synonyms (including 459,958 from ChEBI ontology)
- 405,870 cross-references (KEGG, CAS, PubChem, etc.)
- Canonical names, formulas, and source attribution

Features:
- ChEBI ontology synonyms (IUPAC names, multilingual variants)
- Deduplication by ChEBI ID and normalized chemical name
- Comprehensive cross-reference collection
- Reproducible via scripts/consolidate_chemical_mappings.py

File stored as .tsv.gz (4.7 MB compressed, 53 MB uncompressed)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
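The deduplication step described in this commit (merging by ChEBI ID and by normalized chemical name) might look roughly like the sketch below. It is not the code in scripts/consolidate_chemical_mappings.py: the function names `normalize_name` and `consolidate`, the record layout, and the normalization rule (lowercase, trimmed, whitespace collapsed) are all assumptions made for illustration.

```python
import re
from collections import defaultdict


def normalize_name(name: str) -> str:
    """Assumed normalization: lowercase, trim, collapse internal whitespace."""
    return re.sub(r"\s+", " ", name.strip().lower())


def consolidate(records):
    """Merge (chebi_id, name, source) tuples into one entry per ChEBI ID.

    Synonyms are stored in normalized form, so spelling variants that
    differ only in case or spacing collapse into a single synonym.
    """
    merged = defaultdict(lambda: {"synonyms": set(), "sources": set()})
    for chebi_id, name, source in records:
        entry = merged[chebi_id]
        entry["synonyms"].add(normalize_name(name))
        entry["sources"].add(source)
    return dict(merged)


# Illustrative input only; the source labels are hypothetical.
records = [
    ("CHEBI:17234", "Glucose", "kegg"),
    ("CHEBI:17234", "  glucose ", "mediadive"),
    ("CHEBI:15377", "Water", "bacdive"),
]
unified = consolidate(records)
```

Keeping the set of contributing sources next to each entry is one simple way to retain the source attribution the commit message mentions.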
- Script now exports directly to .tsv.gz format
- Uses pandas compression='gzip' parameter
- Removed uncompressed .tsv file (53 MB)
- Output file: unified_chemical_mappings.tsv.gz (8.4 MB)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
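For readers without pandas at hand, a gzipped TSV can also be written and read back with the standard library alone. The sketch below swaps in `gzip` and `csv` for the pandas `compression='gzip'` export that the commit actually uses; the helper names, the file name, and the column names `chebi_id` and `canonical_name` are hypothetical.

```python
import csv
import gzip
import os
import tempfile


def write_tsv_gz(path, header, rows):
    """Write rows as a gzip-compressed, tab-separated file."""
    with gzip.open(path, "wt", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle, delimiter="\t")
        writer.writerow(header)
        writer.writerows(rows)


def read_tsv_gz(path):
    """Read a gzip-compressed TSV into a list of dicts keyed by the header."""
    with gzip.open(path, "rt", newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))


# Round trip through a temporary file (hypothetical columns and file name).
tmp_dir = tempfile.mkdtemp()
out_path = os.path.join(tmp_dir, "example_mappings.tsv.gz")
write_tsv_gz(out_path, ["chebi_id", "canonical_name"], [["CHEBI:15377", "water"]])
loaded = read_tsv_gz(out_path)
```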
Migrated all transforms to use a single unified chemical mapping resource (unified_chemical_mappings.tsv.gz) via a new shared utility, eliminating duplicate mapping files and centralizing ChEBI ID lookups.

Changes:
- Created kg_microbe/utils/chemical_mapping_utils.py with ChemicalMappingLoader class
- Added 43 comprehensive unit tests
- Migrated BacDive transform to use unified mappings with legacy fallback
- Migrated MediaDive transform to use unified mappings with legacy fallback
- Migrated CTD transform to use unified mappings (fixed CAS-RN lookup bug)
- Removed ChEBI xref generation from Ontologies transform
- Removed obsolete mapping files:
  * data/raw/compound_mappings_strict.tsv
  * data/raw/compound_mappings_strict_hydrate.tsv
  * kg_microbe/transform_utils/bacdive/metabolite_mapping.json
  * kg_microbe/transform_utils/ontologies/xrefs/chebi_xrefs.tsv

Benefits:
- Single source of truth for chemical mappings (164,705 ChEBI entries)
- Fixed CTD CAS-RN lookups (previously broken due to prefix mismatch)
- Consistent ChEBI ID resolution across all transforms
- Improved maintainability and reduced code duplication
- All transforms maintain backward compatibility

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
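Two ideas from this commit, the prefix mismatch behind the CAS-RN bug and the legacy fallback, can be illustrated with a small sketch. The exact prefixes involved in the real bug are not stated in this PR, so the alias table below (`CAS-RN`, `casrn`, `cas`) is an assumption, and `normalize_xref` and `find_chebi` are hypothetical helpers rather than the ChemicalMappingLoader API.

```python
def normalize_xref(xref: str) -> str:
    """Rewrite assumed prefix variants to one canonical, lowercase prefix."""
    prefix, _, local_id = xref.partition(":")
    aliases = {"cas-rn": "cas", "casrn": "cas"}  # assumed variants
    prefix = prefix.strip().lower()
    return f"{aliases.get(prefix, prefix)}:{local_id.strip()}"


def find_chebi(xref, unified_index, legacy_index=None):
    """Look up the unified index first, then an optional legacy index."""
    key = normalize_xref(xref)
    hit = unified_index.get(key)
    if hit is None and legacy_index is not None:
        hit = legacy_index.get(key)
    return hit


# Illustrative indices; both sides are normalized with the same function,
# so a query written as "CAS-RN:..." still matches a key stored as "cas:...".
unified_index = {normalize_xref("cas:7732-18-5"): "CHEBI:15377"}
legacy_index = {normalize_xref("CAS-RN:64-17-5"): "CHEBI:16236"}

water = find_chebi("CAS-RN:7732-18-5", unified_index)
ethanol = find_chebi("cas:64-17-5", unified_index, legacy_index)
```

Normalizing keys at both index-build time and query time is what keeps a prefix mismatch from silently producing no hits.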
Pull request overview
This PR centralizes chemical entity resolution by introducing a shared ChemicalMappingLoader backed by a unified, gzipped mapping file, and migrates multiple transforms away from transform-specific mapping artifacts.
Changes:
- Added kg_microbe/utils/chemical_mapping_utils.py with module-cached loading + lookup APIs (name/formula/xref) and a ChemicalMappingLoader wrapper.
- Migrated BacDive, MediaDive, CTD, and Ontologies transform logic to use the shared loader (and removed legacy ChEBI xref generation in Ontologies).
- Added consolidation/cleanup scripts and documentation describing the unified mapping resource, plus comprehensive unit tests.
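Module-cached loading, as mentioned in the overview, can be sketched as follows. This is an assumption-laden illustration, not the contents of chemical_mapping_utils.py: `load_mappings` and `_CACHED_ROWS` are hypothetical names, and only `_CACHED_PATH` is a name taken from this PR. The cache is reused only when the same file is requested again.

```python
import csv
import gzip
from pathlib import Path
from typing import List, Optional

# Module-level cache: parsed rows plus the path they were loaded from.
_CACHED_ROWS: Optional[List[dict]] = None
_CACHED_PATH: Optional[Path] = None


def load_mappings(mappings_path) -> List[dict]:
    """Return parsed rows, re-reading only when a different file is requested."""
    global _CACHED_ROWS, _CACHED_PATH
    path = Path(mappings_path).resolve()
    if _CACHED_ROWS is None or _CACHED_PATH != path:
        with gzip.open(path, "rt", newline="", encoding="utf-8") as handle:
            _CACHED_ROWS = list(csv.DictReader(handle, delimiter="\t"))
        _CACHED_PATH = path
    return _CACHED_ROWS
```

Comparing the requested path with the cached one avoids returning stale data when a caller, such as a unit test, points the loader at a different mapping file.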
Reviewed changes
Copilot reviewed 13 out of 15 changed files in this pull request and generated 11 comments.
| File | Description |
|---|---|
| kg_microbe/utils/chemical_mapping_utils.py | Implements unified mapping loader + lookup APIs with module-level caching and indices. |
| tests/test_chemical_mapping_utils.py | Adds unit tests covering normalization, caching, and lookup behaviors. |
| kg_microbe/transform_utils/bacdive/bacdive.py | Uses unified mappings for name→ChEBI lookups with legacy fallback. |
| kg_microbe/transform_utils/mediadive/mediadive.py | Replaces strict compound TSV lookups with unified chemical mapping lookups (plus legacy fallback). |
| kg_microbe/transform_utils/ctd/ctd.py | Replaces ChEBI xref TSV loading with unified xref lookup for CAS→ChEBI. |
| kg_microbe/transform_utils/ontologies/ontologies_transform.py | Removes ChEBI xref generation path; retains UPA/MONDO xref generation. |
| kg_microbe/transform_utils/constants.py | Removes CHEBI_XREFS_FILEPATH constant and reformats several constants. |
| scripts/consolidate_chemical_mappings.py | New script to build unified mapping artifact from multiple sources. |
| scripts/cleanup_old_chemical_mappings.sh | Removes obsolete mapping files after migration. |
| mappings/README.md | Documents unified chemical mapping resource (needs filename/example fixes). |
| mappings/CONSOLIDATION_SUMMARY.md | Consolidation summary (needs filename/example fixes). |
| CHEMICAL_MAPPING_MIGRATION_SUMMARY.md | Migration writeup and operational notes. |
| .gitignore | Ignores uncompressed mappings/unified_chemical_mappings.tsv. |
| kg_microbe/transform_utils/bacdive/metabolite_mapping.json | Deleted legacy BacDive metabolite mapping file. |
Updated mappings/README.md and mappings/CONSOLIDATION_SUMMARY.md to correctly reference unified_chemical_mappings.tsv.gz (gzipped) instead of .tsv.

Changes:
- Updated filename references to include .gz extension
- Fixed all usage examples to use gunzip -c for reading gzipped file
- Clarified that the unified mapping file is gzipped

Addresses Copilot review feedback on PR #530.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Addresses all code quality issues identified in PR #530:

1. **consolidate_chemical_mappings.py**:
   - Fix docstring to specify .tsv.gz output format
   - Fix double prefix bug: kegg.compound:cpd:C11141 → kegg.compound:C11141
   - Fix CHEBI:CHEBI: double prefix in hydrate xrefs
   - Fix merge_duplicates_by_name() to actually delete merged entries
2. **chemical_mapping_utils.py**:
   - Fix caching to check mappings_path parameter
   - Optimize get_*() helpers from O(n) to O(1) using .loc
   - Add _CACHED_PATH global to track loaded file
3. **bacdive.py**:
   - Eliminate duplicate _lookup_chebi_by_name() calls
   - Cache ChEBI ID lookups in keyword processing loop

All tests pass (43/43 chemical_mapping_utils tests).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
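The double-prefix fixes listed above (kegg.compound:cpd:C11141 → kegg.compound:C11141 and CHEBI:CHEBI: → CHEBI:) could be handled by a helper along these lines. This is a sketch under assumptions, not the fix as committed: `clean_curie` and the `REDUNDANT_PREFIXES` table are hypothetical, and only the two cases named in the commit message are covered.

```python
# Assumed table: for each CURIE prefix, inner prefixes that are redundant.
REDUNDANT_PREFIXES = {
    "kegg.compound": ("cpd:",),
    "CHEBI": ("CHEBI:",),
}


def clean_curie(curie: str) -> str:
    """Strip redundant inner prefixes from the local part of a CURIE."""
    prefix, sep, local_id = curie.partition(":")
    if not sep:
        return curie  # not a CURIE; leave untouched
    for extra in REDUNDANT_PREFIXES.get(prefix, ()):
        while local_id.startswith(extra):
            local_id = local_id[len(extra):]
    return f"{prefix}:{local_id}"


fixed_kegg = clean_curie("kegg.compound:cpd:C11141")
fixed_chebi = clean_curie("CHEBI:CHEBI:15377")
```

Already clean identifiers pass through unchanged, so the helper is safe to apply to every cross-reference during consolidation.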
- Add blank line before BacDiveTransform class docstring
- Add blank line before ChemicalMappingLoader class docstring

Fixes D203 ruff rule violations in CI checks.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
realmarcin added a commit that referenced this pull request Apr 7, 2026
Incorporates upstream changes from master branch:
- MediaDive parallel download improvements (PR #527)
- Chemical mappings updates (PR #530)
- Add User-Agent header to MediaDive API requests
- Add session parameter to get_json_from_api for connection reuse

Merge conflict resolved in mediadive_bulk_download.py:
- Accepted incoming changes from master (added _make_session helper and session parameter to get_json_from_api)
- These changes enable proper User-Agent identification and improve connection reuse for parallel downloads

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Summary
Consolidates 6 distributed chemical mapping files into a single unified resource (mappings/unified_chemical_mappings.tsv.gz) and migrates all transforms to use a shared utility for chemical entity lookups.
Key Changes
Created kg_microbe/utils/chemical_mapping_utils.py with ChemicalMappingLoader class:
- find_chebi_by_name(), find_chebi_by_formula(), find_chebi_by_xref()
Migrated 4 transforms to use unified mappings:
- BacDive: metabolite_mapping.json (197 entries)
- MediaDive: compound_mappings_strict*.tsv (2 files)
- CTD: chebi_xrefs.tsv (389K entries) + fixed CAS-RN lookup bug
- Ontologies: removed ChEBI xref generation
Removed obsolete files:
- data/raw/compound_mappings_strict.tsv
- data/raw/compound_mappings_strict_hydrate.tsv
- kg_microbe/transform_utils/bacdive/metabolite_mapping.json
- kg_microbe/transform_utils/ontologies/xrefs/chebi_xrefs.tsv
Benefits
✅ Single source of truth - 164,705 ChEBI entries in one file
✅ Bug fix - CTD CAS-RN lookups now work (previously broken)
✅ Consistent ChEBI resolution across all transforms
✅ Improved maintainability - centralized chemical mapping
✅ Backward compatibility - all transforms maintain legacy fallbacks
✅ Zero regressions - all tests pass
Files Changed
New:
- kg_microbe/utils/chemical_mapping_utils.py (390 lines)
- tests/test_chemical_mapping_utils.py (367 lines, 43 tests)
- scripts/cleanup_old_chemical_mappings.sh
- CHEMICAL_MAPPING_MIGRATION_SUMMARY.md (full technical documentation)
Modified:
- kg_microbe/transform_utils/bacdive/bacdive.py
- kg_microbe/transform_utils/mediadive/mediadive.py
- kg_microbe/transform_utils/ctd/ctd.py
- kg_microbe/transform_utils/ontologies/ontologies_transform.py
- kg_microbe/transform_utils/constants.py
Deleted:
- kg_microbe/transform_utils/bacdive/metabolite_mapping.json
Total: 10 files changed, 1,244 insertions(+), 312 deletions(-)
Test Plan
chemical_mapping_utils.py
Additional Verification
Documentation
Full technical documentation is available in CHEMICAL_MAPPING_MIGRATION_SUMMARY.md.
🤖 Generated with Claude Code