
wirthal1990-tech/USDA-Phytochemical-Database-JSON



Ethno-API — USDA Phytochemical & Ethnobotanical Database, Enriched v2.4.0

Ethno-API v2.4.0 is a cleaned and enriched phytochemical data-engineering project derived from the USDA Dr. Duke source data. It contains 76,907 records, 2,313 plant species, 24,746 unique chemical entities, and a 16-field public schema with PubMed, ClinicalTrials.gov, ChEMBL, PatentsView, PubChem CID/SMILES, and partner-assisted CID/IUPAC resolution fields.
No medical claims. No pharmaceutical validation. No safety or efficacy claims. Research and retrieval use only.



Public Facts

| Metric | Value |
| --- | --- |
| Dataset version | v2.4.0 |
| Records | 76,907 |
| Plant species | 2,313 |
| Unique chemical entities | 24,746 |
| Public schema fields | 16 |
| Partner-assisted CID/IUPAC resolution records | 1,197 |
| Public sample format | JSON, Parquet |
| Public sample size | 400 rows |
| DOI | 10.5281/zenodo.19660107 |

What This Is

Ethno-API is a machine-readable, enriched version of the USDA Dr. Duke phytochemical and ethnobotanical data, flattened into a practical JSON/Parquet structure for:

  • data rescue and normalization demonstrations
  • retrieval / RAG ingestion experiments
  • phytochemical and natural-products data prototypes
  • QA-gated identifier workflows
  • data-product portfolio proof for client projects

It is not a medical product, not a clinical validation layer, and not evidence that any compound is safe, effective, therapeutically useful, or suitable for use.

Scope Boundary

| Area | What it means |
| --- | --- |
| Data rescue | Source records were normalized into a flatter analytical structure. |
| Enrichment | External identifier and evidence-proxy fields were added where available. |
| QA gating | Identifier consistency checks were applied to reduce obvious export noise. |
| RAG-readiness | The data is structured for retrieval experiments and vector-database ingestion. |
| Scientific review scope | Dominic Fagan (BSc Chemistry) contributed scientific review and partner-assisted identifier resolution. |
| Explicit limitation | The scientific review does not validate biological activity, safety, efficacy, dosage relevance, clinical utility, the RAG bridge, or the reverse-SMILES QA system. |

Distribution

| Channel | URL |
| --- | --- |
| Website | https://ethno-api.com |
| GitHub | https://github.com/wirthal1990-tech/USDA-Phytochemical-Database-JSON |
| Hugging Face | https://huggingface.co/datasets/wirthal1990-tech/USDA-Phytochemical-Database-JSON |
| Kaggle | https://www.kaggle.com/datasets/alexanderwirth/usda-phytochemical-database-json |
| Zenodo DOI | https://doi.org/10.5281/zenodo.19660107 |

Public Schema — v2.4.0, 16 Fields

| Field | Type | Coverage | Notes |
| --- | --- | --- | --- |
| chemical | string | 100% | USDA compound label, normalized for flat-table use |
| plant_species | string | 100% | Latin binomial species name |
| application | string / null | partial | Source application / activity context where present |
| dosage | string / null | partial | Source dosage or concentration text where present; not usage guidance |
| pubmed_mentions_2026 | integer | enrichment layer | PubMed mention-count snapshot |
| clinical_trials_count_2026 | integer | enrichment layer | ClinicalTrials.gov study-count snapshot; not clinical validation |
| chembl_bioactivity_count | integer | enrichment layer | ChEMBL bioactivity measurement count |
| patent_count_since_2020 | integer / float | enrichment layer | PatentsView / patent-density feature |
| pubchem_cid | integer / null | approx. 75–82% depending on export/audit view | PubChem CID from enrichment pipeline |
| canonical_smiles | string / null | 57,757 records with SMILES | Canonical SMILES retrieved via PubChem |
| compound_type | string | 100% | Compound classification used for filtering |
| patent_count_method | string | 100% | Method label for patent-count derivation |
| partner_cid | integer / null | 1,197 records | Partner-assisted PubChem CID resolution |
| inchi_key | string / null | subset | Partner-assisted InChIKey / identifier resolution |
| iupac_verified | string / bool / null | subset | Partner-assisted identifier-verification state |
| partner_match_method | string / null | subset | Match method used in partner-resolution file |

Partner-resolution file: exports/iupac_cid_resolutions.json — 1,197 partner-assisted CID/IUPAC resolution records.
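
A minimal sketch of how a consumer might check that a loaded record matches the 16-field public schema. The field names and the four 100%-coverage fields come from the schema table; the `check_record` helper and the placeholder record values are hypothetical and are not part of the repository.

```python
# Field names copied from the v2.4.0 public schema table.
PUBLIC_FIELDS = (
    "chemical", "plant_species", "application", "dosage",
    "pubmed_mentions_2026", "clinical_trials_count_2026",
    "chembl_bioactivity_count", "patent_count_since_2020",
    "pubchem_cid", "canonical_smiles", "compound_type",
    "patent_count_method", "partner_cid", "inchi_key",
    "iupac_verified", "partner_match_method",
)

# Fields the schema table lists at 100% coverage.
REQUIRED_NON_NULL = (
    "chemical", "plant_species", "compound_type", "patent_count_method",
)


def check_record(record: dict) -> list:
    """Return a list of problems; an empty list means the record passes."""
    problems = []
    for field in PUBLIC_FIELDS:
        if field not in record:
            problems.append(f"missing field: {field}")
    for field in record:
        if field not in PUBLIC_FIELDS:
            problems.append(f"unexpected field: {field}")
    for field in REQUIRED_NON_NULL:
        if field in record and record[field] in (None, ""):
            problems.append(f"empty required field: {field}")
    return problems


# Placeholder record with made-up values, for illustration only.
example = {field: None for field in PUBLIC_FIELDS}
example.update({
    "chemical": "EXAMPLE-COMPOUND",
    "plant_species": "Genus species",
    "compound_type": "example-type",
    "patent_count_method": "example-method",
})
print(check_record(example))  # []

# Same record with one required value emptied and one field dropped.
broken = dict(example, chemical=None)
del broken["inchi_key"]
print(check_record(broken))
```

The check only covers field presence and the four always-populated fields; type checks are left out because several fields are documented with more than one type.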


QA Pipeline

  1. Normalize USDA source records into a flat analytical schema.
  2. Add external enrichment layers from PubMed, ClinicalTrials.gov, ChEMBL, PatentsView, and PubChem.
  3. Retrieve PubChem CID and canonical SMILES where available.
  4. Add compound classification and patent-count method fields.
  5. Add partner-assisted CID/IUPAC resolution subset.
  6. Run reverse-SMILES QA as a downstream identifier-consistency gate.
  7. Export JSON and Parquet samples for analysis and retrieval experiments.

Reverse-SMILES QA Audit — v2.4.0

| Verdict | Count | Interpretation |
| --- | --- | --- |
| validated | 11,981 | Strict round-trip pass |
| plausible | 8,370 | Pass with caveats |
| review_required | 37,361 | Visible but not auto-trusted |
| invalidated | 45 | Failed validation; excluded from default retrieval-ready export |
| insufficient_data | 19,150 | No SMILES available |
| Total | 76,907 | Full v2.4.0 input set |

Default retrieval-ready rule: exclude invalidated and insufficient_data. Keep review_required visible, but do not auto-trust it.

Export-eligible records by the stated QA rule: 57,712
validated + plausible + review_required = 11,981 + 8,370 + 37,361 = 57,712
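
The arithmetic behind the export-eligible count can be reproduced directly from the audit counts. The sketch below also shows how the default rule might be applied per record; the `qa_verdict` field name is an assumption, since the 16-field public schema does not list a verdict column.

```python
# Verdict counts copied from the v2.4.0 reverse-SMILES QA audit table.
VERDICT_COUNTS = {
    "validated": 11_981,
    "plausible": 8_370,
    "review_required": 37_361,
    "invalidated": 45,
    "insufficient_data": 19_150,
}

# Default retrieval-ready rule: these two verdicts are excluded.
EXCLUDED_VERDICTS = frozenset({"invalidated", "insufficient_data"})


def export_eligible_count(counts: dict) -> int:
    """Sum the counts of every verdict that the default rule keeps."""
    return sum(n for verdict, n in counts.items()
               if verdict not in EXCLUDED_VERDICTS)


def is_retrieval_ready(record: dict) -> bool:
    """Apply the default rule to one record.

    'qa_verdict' is a hypothetical field name used only for this sketch.
    """
    return record.get("qa_verdict") not in EXCLUDED_VERDICTS


total = sum(VERDICT_COUNTS.values())
eligible = export_eligible_count(VERDICT_COUNTS)
print(total)     # 76907
print(eligible)  # 57712
```

Note that `is_retrieval_ready` keeps `review_required` records, matching the rule that they stay visible; whether to trust them is a separate decision for the consumer.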

Known v2.4.0 QA fixes include thiol-false-alcohol detection, non-carboxylic-acid classifier correction, and strictest-verdict-wins CID tainting logic.
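
The README does not spell out how strictest-verdict-wins CID tainting works. One plausible reading, sketched here under stated assumptions, is that every record sharing a PubChem CID inherits the strictest verdict seen for that CID. The severity ordering, the `qa_verdict` field name, and the `apply_strictest_verdict` helper are all assumptions made for illustration and may differ from the project's actual logic.

```python
# Assumed severity ordering (higher = stricter). Not taken from the source.
SEVERITY = {
    "validated": 0,
    "plausible": 1,
    "review_required": 2,
    "invalidated": 3,
}


def apply_strictest_verdict(records: list) -> list:
    """Return copies of the records where all records sharing a CID
    carry the strictest verdict observed for that CID.

    Records without a CID, or with a verdict outside SEVERITY
    (for example 'insufficient_data'), are copied unchanged.
    """
    strictest = {}
    for record in records:
        cid = record.get("pubchem_cid")
        verdict = record.get("qa_verdict")
        if cid is None or verdict not in SEVERITY:
            continue
        current = strictest.get(cid)
        if current is None or SEVERITY[verdict] > SEVERITY[current]:
            strictest[cid] = verdict

    result = []
    for record in records:
        cid = record.get("pubchem_cid")
        verdict = record.get("qa_verdict")
        if cid in strictest and verdict in SEVERITY:
            result.append({**record, "qa_verdict": strictest[cid]})
        else:
            result.append(dict(record))
    return result


# Toy records with made-up CIDs and labels.
toy = [
    {"chemical": "A", "pubchem_cid": 101, "qa_verdict": "validated"},
    {"chemical": "B", "pubchem_cid": 101, "qa_verdict": "invalidated"},
    {"chemical": "C", "pubchem_cid": 202, "qa_verdict": "plausible"},
    {"chemical": "D", "pubchem_cid": None, "qa_verdict": "insufficient_data"},
]
for row in apply_strictest_verdict(toy):
    print(row["chemical"], row["qa_verdict"])
```

In this reading, record A is downgraded to `invalidated` because it shares CID 101 with a failed record, while C and D are untouched.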


Repository Contents

/
├── README.md
├── METHODOLOGY.md
├── MANIFEST.md
├── LICENSE
├── assets/
│   └── ethno-api-readme-banner.svg
├── exports/
│   ├── ethno_api_v2.4.0_sample.json
│   ├── ethno_api_v2.4.0_sample.parquet
│   ├── ethno_api_schema.json
│   └── iupac_cid_resolutions.json
└── notebooks/
    └── quickstart.ipynb

Quickstart

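import pandas as pd

# Load the public sample (pandas needs pyarrow or fastparquet to read Parquet).
df = pd.read_parquet("exports/ethno_api_v2.4.0_sample.parquet")

# Sample dimensions, then distinct compounds and species in the sample.
print(df.shape)
print(df["chemical"].nunique())
print(df["plant_species"].nunique())

# Preview the core identifier columns.
preview_columns = [
    "chemical",
    "plant_species",
    "pubchem_cid",
    "canonical_smiles",
]

print(df[preview_columns].head())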

The public repository contains sample exports and documentation. The full licensed export package is distributed through Ethno-API.
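
Because identifier coverage is partial, it can help to measure it on whatever export is in hand. The following is a dependency-free sketch; it assumes the JSON sample is a single array of record objects, which the README does not confirm, and the `load_records` and `field_coverage` helpers are hypothetical.

```python
import json
from pathlib import Path


def load_records(path: str) -> list:
    """Load records from a JSON file.

    Assumes the file holds one JSON array of objects; raises otherwise.
    """
    with Path(path).open(encoding="utf-8") as handle:
        data = json.load(handle)
    if not isinstance(data, list):
        raise ValueError("expected a JSON array of record objects")
    return data


def field_coverage(records: list, fields: list) -> dict:
    """Return the share of records with a non-null value for each field."""
    total = len(records)
    if total == 0:
        return {field: 0.0 for field in fields}
    return {
        field: sum(r.get(field) is not None for r in records) / total
        for field in fields
    }


# Toy records with made-up values, used so the sketch runs without the file.
toy_records = [
    {"pubchem_cid": 1, "canonical_smiles": "C"},
    {"pubchem_cid": 2, "canonical_smiles": None},
    {"pubchem_cid": None, "canonical_smiles": None},
    {"pubchem_cid": 3},
]
coverage = field_coverage(toy_records, ["pubchem_cid", "canonical_smiles"])
print(coverage)  # {'pubchem_cid': 0.75, 'canonical_smiles': 0.25}

# With the repository checked out, the same function could be applied as:
# records = load_records("exports/ethno_api_v2.4.0_sample.json")
# print(field_coverage(records, ["pubchem_cid", "canonical_smiles", "inchi_key"]))
```

Coverage measured on the 400-row sample will not necessarily match the full-dataset figures in the schema table.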


Use Cases

| Use case | Fit |
| --- | --- |
| Retrieval experiments | Build test corpora for vector search and RAG pipelines. |
| Data cleaning demos | Show normalization, enrichment, and QA-gating workflows. |
| Natural-products data prototypes | Explore structured phytochemical source data. |
| Portfolio proof | Demonstrate data rescue → enrichment → QA → export architecture. |
| Review workflows | Inspect identifier-resolution and QA-gate logic on a bounded public sample. |
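
For the retrieval use case, each record usually has to be turned into a text chunk plus metadata before it is embedded. The sketch below shows one possible shape; the `record_to_document` helper, the text template, and the ID scheme are assumptions for illustration and are not the project's RAG bridge.

```python
import hashlib


def record_to_document(record: dict) -> dict:
    """Turn one flat record into an {'id', 'text', 'metadata'} document.

    The dosage field is deliberately left out of the text, since the schema
    marks it as source text and not usage guidance.
    """
    chemical = record["chemical"]
    species = record["plant_species"]
    application = record.get("application") or ""

    lines = [f"Compound: {chemical}", f"Plant species: {species}"]
    if record.get("compound_type"):
        lines.append(f"Compound type: {record['compound_type']}")
    if application:
        lines.append(f"Source application context: {application}")

    # Deterministic ID from the fields that identify the row in this sketch.
    key = f"{chemical}|{species}|{application}"
    doc_id = hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]

    metadata = {
        name: record.get(name)
        for name in ("pubchem_cid", "canonical_smiles", "pubmed_mentions_2026")
    }
    return {"id": doc_id, "text": "\n".join(lines), "metadata": metadata}


# Placeholder record with made-up values.
sample = {
    "chemical": "EXAMPLE-COMPOUND",
    "plant_species": "Genus species",
    "application": None,
    "compound_type": "example-type",
    "pubchem_cid": None,
    "canonical_smiles": None,
    "pubmed_mentions_2026": 0,
}
doc = record_to_document(sample)
print(doc["id"])
print(doc["text"])
```

Keeping identifiers and counts in `metadata` and out of the embedded text lets a vector store filter on them without affecting similarity scores. If two rows share the same compound, species, and application text, they will receive the same ID under this scheme.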

Limitations — Read Before Use

  • No pharmaceutical validation. This dataset does not confirm biological activity, safety, efficacy, dosage relevance, or clinical utility.
  • No medical claims. Nothing in this repository constitutes medical advice, treatment guidance, product guidance, or usage recommendation.
  • Source dependency. Baseline relationships reflect USDA/source-database records and may contain historical terminology, sparse annotations, or context limitations.
  • Coverage gaps. Not all records have PubChem CID, SMILES, InChIKey, dosage/source-concentration text, or activity context.
  • Partner validation is scoped. The 1,197 partner-assisted records are identifier-resolution work, not full pharmacological validation.
  • QA verdicts are not clinical verdicts. Reverse-SMILES QA checks identifier consistency, not biological truth.
  • Snapshot. v2.4.0 reflects a point-in-time enrichment snapshot. External databases can change.

Why This Exists

Ethno-API was built as a proof-of-concept for AI-orchestrated data productisation: taking messy public scientific source data, normalizing it, enriching it with modern identifiers and evidence-proxy fields, adding QA gates, and preparing it for retrieval workflows.

The domain is phytochemistry. The transferable capability is:

data rescue → enrichment → QA gating → retrieval-ready export

Credits

| Role | Contributor |
| --- | --- |
| Data pipeline and project architecture | Alexander Wirth |
| Scientific review / partner-resolution contribution | Dominic Fagan (BSc Chemistry) |

License

Public sample files in this repository: CC BY 4.0 unless otherwise stated.
Full commercial dataset: separate Ethno-API license terms.
Code snippets / methodology scripts: MIT only where explicitly marked.

Please attribute:

@misc{ethno_api_v24_2026,
  title     = {USDA Phytochemical \& Ethnobotanical Database -- Enriched v2.4.0},
  author    = {Wirth, Alexander},
  year      = {2026},
  publisher = {Ethno-API},
  url       = {https://ethno-api.com},
  doi       = {10.5281/zenodo.19660107},
  note      = {76,907 records, 24,746 unique chemical entities, 2,313 plant species}
}
