Consolidate chemical mappings into unified utility #530

Merged
realmarcin merged 7 commits into master from chemical_mappings
Mar 22, 2026

@realmarcin
Collaborator

Summary

Consolidates 6 distributed chemical mapping files into a single unified resource (mappings/unified_chemical_mappings.tsv.gz) and migrates all transforms to use a shared utility for chemical entity lookups.

Key Changes

  • Created kg_microbe/utils/chemical_mapping_utils.py with ChemicalMappingLoader class

    • Functions: find_chebi_by_name(), find_chebi_by_formula(), find_chebi_by_xref()
    • Module-level caching for performance
    • 43 comprehensive unit tests
  • Migrated 4 transforms to use unified mappings:

    • BacDive: Replaced metabolite_mapping.json (197 entries)
    • MediaDive: Replaced compound_mappings_strict*.tsv (2 files)
    • CTD: Replaced chebi_xrefs.tsv (389K entries) + fixed CAS-RN lookup bug
    • Ontologies: Removed ChEBI xref generation code
  • Removed obsolete files:

    • data/raw/compound_mappings_strict.tsv
    • data/raw/compound_mappings_strict_hydrate.tsv
    • kg_microbe/transform_utils/bacdive/metabolite_mapping.json
    • kg_microbe/transform_utils/ontologies/xrefs/chebi_xrefs.tsv
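
The lookup pattern listed under Key Changes can be sketched roughly as follows. This is a minimal illustration, not the code in kg_microbe/utils/chemical_mapping_utils.py. Only the three function names come from this PR; the column names (chebi_id, canonical_name, formula, xrefs), the pipe-delimited xref field, and the normalize_name helper are assumptions made for the example. It uses the standard library rather than pandas so that it runs on its own.

```python
import csv
import gzip
from pathlib import Path
from typing import Dict, Optional

# Module-level cache: the mapping file is parsed once per process.
_INDEX_CACHE: Optional[Dict[str, Dict[str, str]]] = None


def normalize_name(name: str) -> str:
    """Lowercase and collapse whitespace (hypothetical normalization rule)."""
    return " ".join(name.lower().split())


def load_indices(path: Path) -> Dict[str, Dict[str, str]]:
    """Build name/formula/xref -> ChEBI ID indices, reusing the cache if set."""
    global _INDEX_CACHE
    if _INDEX_CACHE is not None:
        return _INDEX_CACHE
    indices: Dict[str, Dict[str, str]] = {"name": {}, "formula": {}, "xref": {}}
    with gzip.open(path, "rt", encoding="utf-8", newline="") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            chebi_id = row["chebi_id"]  # assumed column name
            key = normalize_name(row["canonical_name"])  # assumed column name
            indices["name"].setdefault(key, chebi_id)
            if row["formula"]:
                indices["formula"].setdefault(row["formula"], chebi_id)
            # Assumed layout: xrefs stored as a pipe-delimited list of CURIEs.
            for xref in filter(None, row["xrefs"].split("|")):
                indices["xref"].setdefault(xref, chebi_id)
    _INDEX_CACHE = indices
    return indices


def find_chebi_by_name(name: str, path: Path) -> Optional[str]:
    return load_indices(path)["name"].get(normalize_name(name))


def find_chebi_by_formula(formula: str, path: Path) -> Optional[str]:
    return load_indices(path)["formula"].get(formula)


def find_chebi_by_xref(xref: str, path: Path) -> Optional[str]:
    return load_indices(path)["xref"].get(xref)
```

Note that a cache this simple ignores which path was requested. A later commit in this PR addresses that in the real utility by tracking the loaded file.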

Benefits

Single source of truth - 164,705 ChEBI entries in one file
Bug fix - CTD CAS-RN lookups now work (previously broken)
Consistent ChEBI resolution across all transforms
Improved maintainability - centralized chemical mapping
Backward compatibility - all transforms maintain legacy fallbacks
Zero regressions - all tests pass
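
The "legacy fallback" behaviour mentioned above might look something like the sketch below. The function name resolve_chebi, its parameters, and the order of the two lookups are assumptions for illustration; the PR text only states that the transforms keep a legacy fallback alongside the unified mappings.

```python
from typing import Callable, Mapping, Optional


def resolve_chebi(
    name: str,
    unified_lookup: Callable[[str], Optional[str]],
    legacy_mapping: Mapping[str, str],
) -> Optional[str]:
    """Try the unified mapping first, then the transform's legacy table."""
    chebi_id = unified_lookup(name)
    if chebi_id is not None:
        return chebi_id
    # Hypothetical fallback: a plain name -> ChEBI ID table kept by the transform.
    return legacy_mapping.get(name)
```

Passing the unified lookup in as a callable keeps the sketch independent of how the unified file is loaded.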

Files Changed

New:

  • kg_microbe/utils/chemical_mapping_utils.py (390 lines)
  • tests/test_chemical_mapping_utils.py (367 lines, 43 tests)
  • scripts/cleanup_old_chemical_mappings.sh
  • CHEMICAL_MAPPING_MIGRATION_SUMMARY.md (full technical documentation)

Modified:

  • kg_microbe/transform_utils/bacdive/bacdive.py
  • kg_microbe/transform_utils/mediadive/mediadive.py
  • kg_microbe/transform_utils/ctd/ctd.py
  • kg_microbe/transform_utils/ontologies/ontologies_transform.py
  • kg_microbe/transform_utils/constants.py

Deleted:

  • kg_microbe/transform_utils/bacdive/metabolite_mapping.json

Total: 10 files changed, 1,244 insertions(+), 312 deletions(-)

Test Plan

  • Unit tests: 43/43 tests pass for chemical_mapping_utils.py
  • Transform tests: All pass (BacDive, MediaDive, CTD, Ontologies)
  • Code formatting: Black applied to all files
  • Linting: Ruff checks pass (1 pre-existing error in unrelated file)
  • Integration: All transforms successfully use unified mappings
  • Backward compatibility: Legacy fallbacks maintained
  • Bug fix verified: CTD CAS-RN lookups now work correctly

Additional Verification

# Test chemical mapping utility
poetry run pytest tests/test_chemical_mapping_utils.py -v

# Test transforms
poetry run pytest tests/test_transform_class.py -v

# Run quality checks
poetry run tox

Documentation

Full technical documentation available in CHEMICAL_MAPPING_MIGRATION_SUMMARY.md including:

  • Architecture details
  • Migration pattern for future transforms
  • Usage examples
  • Team collaboration summary

🤖 Generated with Claude Code

realmarcin and others added 3 commits March 16, 2026 19:46
Creates a single unified chemical mapping resource that consolidates all
KG-Microbe chemical mapping sources into mappings/unified_chemical_mappings.tsv.gz

Consolidated sources (6):
- mappings/chemical_mappings.tsv (KEGG/BacDive)
- data/raw/compound_mappings_strict*.tsv (MediaDive ingredients; 2 files)
- kg_microbe/transform_utils/bacdive/metabolite_mapping.json
- kg_microbe/transform_utils/ontologies/xrefs/chebi_xrefs.tsv
- kg_microbe/transform_utils/madin_etal/chebi_manual_annotation.tsv

Results:
- 164,702 unique ChEBI IDs
- 461,443 total synonyms (including 459,958 from ChEBI ontology)
- 405,870 cross-references (KEGG, CAS, PubChem, etc.)
- Canonical names, formulas, and source attribution

Features:
- ChEBI ontology synonyms (IUPAC names, multilingual variants)
- Deduplication by ChEBI ID and normalized chemical name
- Comprehensive cross-reference collection
- Reproducible via scripts/consolidate_chemical_mappings.py

File stored as .tsv.gz (4.7 MB compressed, 53 MB uncompressed)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- Script now exports directly to .tsv.gz format
- Uses pandas compression='gzip' parameter
- Removed uncompressed .tsv file (53 MB)
- Output file: unified_chemical_mappings.tsv.gz (8.4 MB)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
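
The commit above does the export through pandas' compression='gzip' option. As a dependency-free illustration of the same round trip, the sketch below writes and re-reads a gzipped TSV with the standard library only. The helper names and any column names passed in are placeholders, not the real schema of unified_chemical_mappings.tsv.gz.

```python
import csv
import gzip
from pathlib import Path
from typing import Dict, List


def write_tsv_gz(rows: List[Dict[str, str]], path: Path) -> None:
    """Write non-empty rows straight to a gzipped TSV, with no .tsv intermediate."""
    with gzip.open(path, "wt", encoding="utf-8", newline="") as handle:
        writer = csv.DictWriter(
            handle, fieldnames=list(rows[0]), delimiter="\t", lineterminator="\n"
        )
        writer.writeheader()
        writer.writerows(rows)


def read_tsv_gz(path: Path) -> List[Dict[str, str]]:
    """Read the gzipped TSV back into a list of dictionaries."""
    with gzip.open(path, "rt", encoding="utf-8", newline="") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))
```

Opening the file in text mode ("wt" / "rt") lets the csv module work on it exactly as it would on an uncompressed file.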
Migrated all transforms to use a single unified chemical mapping resource
(unified_chemical_mappings.tsv.gz) via a new shared utility, eliminating
duplicate mapping files and centralizing ChEBI ID lookups.

Changes:
- Created kg_microbe/utils/chemical_mapping_utils.py with ChemicalMappingLoader class
- Added 43 comprehensive unit tests
- Migrated BacDive transform to use unified mappings with legacy fallback
- Migrated MediaDive transform to use unified mappings with legacy fallback
- Migrated CTD transform to use unified mappings (fixed CAS-RN lookup bug)
- Removed ChEBI xref generation from Ontologies transform
- Removed obsolete mapping files:
  * data/raw/compound_mappings_strict.tsv
  * data/raw/compound_mappings_strict_hydrate.tsv
  * kg_microbe/transform_utils/bacdive/metabolite_mapping.json
  * kg_microbe/transform_utils/ontologies/xrefs/chebi_xrefs.tsv

Benefits:
- Single source of truth for chemical mappings (164,705 ChEBI entries)
- Fixed CTD CAS-RN lookups (previously broken due to prefix mismatch)
- Consistent ChEBI ID resolution across all transforms
- Improved maintainability and reduced code duplication
- All transforms maintain backward compatibility

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
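
The commit message attributes the broken CTD CAS-RN lookups to a prefix mismatch but does not spell out the prefixes involved. The sketch below shows one general way to make such a lookup insensitive to prefix spelling. The prefix list and all function names are hypothetical and are not taken from ctd.py.

```python
from typing import Dict, Optional

# Hypothetical prefix spellings; the real ones in CTD and the mapping file may differ.
_CAS_PREFIXES = ("cas-rn:", "cas_rn:", "casrn:", "cas:")


def canonical_cas(value: str) -> str:
    """Return the bare CAS number with any known prefix removed."""
    text = value.strip()
    lowered = text.lower()
    for prefix in _CAS_PREFIXES:
        if lowered.startswith(prefix):
            return text[len(prefix):].strip()
    return text


def build_cas_index(xref_to_chebi: Dict[str, str]) -> Dict[str, str]:
    """Re-key CAS xrefs by bare CAS number so prefix spelling no longer matters."""
    index: Dict[str, str] = {}
    for xref, chebi_id in xref_to_chebi.items():
        if xref.strip().lower().startswith(_CAS_PREFIXES):
            index[canonical_cas(xref)] = chebi_id
    return index


def lookup_cas(cas_value: str, index: Dict[str, str]) -> Optional[str]:
    """Look up a CAS number regardless of which prefix the caller used."""
    return index.get(canonical_cas(cas_value))
```

Normalizing both the index keys and the query through the same function is what removes the mismatch: neither side needs to know the other's prefix convention.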
Contributor

Copilot AI left a comment

Pull request overview

This PR centralizes chemical entity resolution by introducing a shared ChemicalMappingLoader backed by a unified, gzipped mapping file, and migrates multiple transforms away from transform-specific mapping artifacts.

Changes:

  • Added kg_microbe/utils/chemical_mapping_utils.py with module-cached loading + lookup APIs (name/formula/xref) and a ChemicalMappingLoader wrapper.
  • Migrated BacDive, MediaDive, CTD, and Ontologies transform logic to use the shared loader (and removed legacy ChEBI xref generation in Ontologies).
  • Added consolidation/cleanup scripts and documentation describing the unified mapping resource, plus comprehensive unit tests.

Reviewed changes

Copilot reviewed 13 out of 15 changed files in this pull request and generated 11 comments.

| File | Description |
| --- | --- |
| kg_microbe/utils/chemical_mapping_utils.py | Implements unified mapping loader + lookup APIs with module-level caching and indices. |
| tests/test_chemical_mapping_utils.py | Adds unit tests covering normalization, caching, and lookup behaviors. |
| kg_microbe/transform_utils/bacdive/bacdive.py | Uses unified mappings for name→ChEBI lookups with legacy fallback. |
| kg_microbe/transform_utils/mediadive/mediadive.py | Replaces strict compound TSV lookups with unified chemical mapping lookups (plus legacy fallback). |
| kg_microbe/transform_utils/ctd/ctd.py | Replaces ChEBI xref TSV loading with unified xref lookup for CAS→ChEBI. |
| kg_microbe/transform_utils/ontologies/ontologies_transform.py | Removes ChEBI xref generation path; retains UPA/MONDO xref generation. |
| kg_microbe/transform_utils/constants.py | Removes CHEBI_XREFS_FILEPATH constant and reformats several constants. |
| scripts/consolidate_chemical_mappings.py | New script to build unified mapping artifact from multiple sources. |
| scripts/cleanup_old_chemical_mappings.sh | Removes obsolete mapping files after migration. |
| mappings/README.md | Documents unified chemical mapping resource (needs filename/example fixes). |
| mappings/CONSOLIDATION_SUMMARY.md | Consolidation summary (needs filename/example fixes). |
| CHEMICAL_MAPPING_MIGRATION_SUMMARY.md | Migration writeup and operational notes. |
| .gitignore | Ignores uncompressed mappings/unified_chemical_mappings.tsv. |
| kg_microbe/transform_utils/bacdive/metabolite_mapping.json | Deleted legacy BacDive metabolite mapping file. |

Comment thread scripts/consolidate_chemical_mappings.py Outdated
Comment thread kg_microbe/utils/chemical_mapping_utils.py Outdated
Comment thread kg_microbe/utils/chemical_mapping_utils.py
Comment thread mappings/README.md
Comment thread kg_microbe/utils/chemical_mapping_utils.py
Comment thread scripts/consolidate_chemical_mappings.py
Comment thread mappings/README.md Outdated
Comment thread mappings/CONSOLIDATION_SUMMARY.md Outdated
Comment thread kg_microbe/transform_utils/bacdive/bacdive.py
Comment thread scripts/consolidate_chemical_mappings.py Outdated
realmarcin and others added 4 commits March 21, 2026 21:29
Updated mappings/README.md and mappings/CONSOLIDATION_SUMMARY.md to correctly
reference unified_chemical_mappings.tsv.gz (gzipped) instead of .tsv.

Changes:
- Updated filename references to include .gz extension
- Fixed all usage examples to use gunzip -c for reading gzipped file
- Clarified that the unified mapping file is gzipped

Addresses Copilot review feedback on PR #530.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Addresses all code quality issues identified in PR #530:

1. **consolidate_chemical_mappings.py**:
   - Fix docstring to specify .tsv.gz output format
   - Fix double prefix bug: kegg.compound:cpd:C11141 → kegg.compound:C11141
   - Fix CHEBI:CHEBI: double prefix in hydrate xrefs
   - Fix merge_duplicates_by_name() to actually delete merged entries

2. **chemical_mapping_utils.py**:
   - Fix caching to check mappings_path parameter
   - Optimize get_*() helpers from O(n) to O(1) using .loc
   - Add _CACHED_PATH global to track loaded file

3. **bacdive.py**:
   - Eliminate duplicate _lookup_chebi_by_name() calls
   - Cache ChEBI ID lookups in keyword processing loop
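
The double-prefix fixes in item 1 (kegg.compound:cpd:C11141 → kegg.compound:C11141 and CHEBI:CHEBI:) can be illustrated with a small helper. This is a sketch of the general idea under assumed names; make_curie and its extra_tags parameter are hypothetical and do not appear in consolidate_chemical_mappings.py.

```python
from typing import Iterable


def make_curie(prefix: str, local_id: str, extra_tags: Iterable[str] = ()) -> str:
    """Join prefix and local ID, dropping redundant leading tags from the ID."""
    # Tags that must not be repeated in front of the local ID: the prefix
    # itself plus any source-specific tags the caller names (e.g. "cpd").
    redundant = {prefix.lower(), *(tag.lower() for tag in extra_tags)}
    local = local_id.strip()
    while ":" in local:
        head, rest = local.split(":", 1)
        if head.lower() not in redundant:
            break
        local = rest
    return f"{prefix}:{local}"
```

For instance, make_curie("kegg.compound", "cpd:C11141", extra_tags=("cpd",)) returns "kegg.compound:C11141", and make_curie("CHEBI", "CHEBI:17234") returns "CHEBI:17234".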

All tests pass (43/43 chemical_mapping_utils tests).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
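
The caching fix in item 2 of the commit above, remembering which file the cache was built from and replacing row scans with keyed lookups, can be sketched as below. The commit does this with pandas and .loc; the dictionary here is a standard-library stand-in for the same keyed-lookup idea. _CACHED_PATH is named in the commit message, while _CACHED_ROWS, get_canonical_name, and the column names are assumptions.

```python
import csv
import gzip
from pathlib import Path
from typing import Dict, Optional

_CACHED_PATH: Optional[Path] = None           # which file the cache was built from
_CACHED_ROWS: Dict[str, Dict[str, str]] = {}  # ChEBI ID -> row (hypothetical layout)


def load_mappings(mappings_path: Path) -> Dict[str, Dict[str, str]]:
    """Return rows keyed by ChEBI ID, reloading only when the path changes."""
    global _CACHED_PATH, _CACHED_ROWS
    resolved = Path(mappings_path).resolve()
    if _CACHED_PATH == resolved:
        return _CACHED_ROWS
    with gzip.open(resolved, "rt", encoding="utf-8", newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        rows = {row["chebi_id"]: row for row in reader}  # assumed column name
    _CACHED_PATH, _CACHED_ROWS = resolved, rows
    return rows


def get_canonical_name(chebi_id: str, mappings_path: Path) -> Optional[str]:
    """Keyed lookup by ChEBI ID instead of scanning every row."""
    row = load_mappings(mappings_path).get(chebi_id)
    return row["canonical_name"] if row else None  # assumed column name
```

Comparing the resolved path before returning the cache means that a caller who passes a different mappings file gets that file's data rather than whatever was loaded first.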
- Add blank line before BacDiveTransform class docstring
- Add blank line before ChemicalMappingLoader class docstring

Fixes D203 ruff rule violations in CI checks.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@realmarcin realmarcin merged commit 88d5b0f into master Mar 22, 2026
3 checks passed
@realmarcin realmarcin deleted the chemical_mappings branch March 22, 2026 06:37
realmarcin added a commit that referenced this pull request Apr 7, 2026
Incorporates upstream changes from master branch:
- MediaDive parallel download improvements (PR #527)
- Chemical mappings updates (PR #530)
- Add User-Agent header to MediaDive API requests
- Add session parameter to get_json_from_api for connection reuse

Merge conflict resolved in mediadive_bulk_download.py:
- Accepted incoming changes from master (added _make_session helper
  and session parameter to get_json_from_api)
- These changes enable proper User-Agent identification and improve
  connection reuse for parallel downloads

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

2 participants