Ethno-API — USDA Phytochemical & Ethnobotanical Database, Enriched v2.4.0

Ethno-API v2.4.0 is a cleaned and enriched phytochemical data-engineering project derived from the USDA Dr. Duke source data. It contains 76,907 records covering 2,313 plant species and 24,746 unique chemical entities. Its 16-field public schema includes enrichment fields from PubMed, ClinicalTrials.gov, ChEMBL, PatentsView, and PubChem (CID and SMILES), plus partner-assisted CID/IUPAC resolution fields.
No medical claims. No pharmaceutical validation. No safety or efficacy claims. Research and retrieval use only.

Public Facts

Metric	Value
Dataset version	v2.4.0
Records	76,907
Plant species	2,313
Unique chemical entities	24,746
Public schema fields	16
Partner-assisted CID/IUPAC resolution records	1,197
Public sample format	JSON, Parquet
Public sample size	400 rows
DOI	10.5281/zenodo.19660107

What This Is

Ethno-API is a machine-readable, enriched version of the USDA Dr. Duke phytochemical and ethnobotanical data, flattened into a practical JSON/Parquet structure for:

- data rescue and normalization demonstrations
- retrieval / RAG ingestion experiments
- phytochemical and natural-products data prototypes
- QA-gated identifier workflows
- data-product portfolio proof for client projects

It is not a medical product, not a clinical validation layer, and not evidence that any compound is safe, effective, therapeutically useful, or suitable for use.

Scope Boundary

Area	What it means
Data rescue	Source records were normalized into a flatter analytical structure.
Enrichment	External identifier and evidence-proxy fields were added where available.
QA gating	Identifier consistency checks were applied to reduce obvious export noise.
RAG-readiness	The data is structured for retrieval experiments and vector-database ingestion.
Scientific review scope	Dominic Fagan (BSc Chemistry) contributed scientific review and partner-assisted identifier resolution.
Explicit limitation	The scientific review does not validate biological activity, safety, efficacy, dosage relevance, clinical utility, the RAG bridge, or the reverse-SMILES QA system.

Distribution

Channel	URL
Website	https://ethno-api.com
GitHub	https://github.com/wirthal1990-tech/USDA-Phytochemical-Database-JSON
Hugging Face	https://huggingface.co/datasets/wirthal1990-tech/USDA-Phytochemical-Database-JSON
Kaggle	https://www.kaggle.com/datasets/alexanderwirth/usda-phytochemical-database-json
Zenodo DOI	https://doi.org/10.5281/zenodo.19660107

Public Schema — v2.4.0, 16 Fields

Field	Type	Coverage	Notes
`chemical`	string	100%	USDA compound label, normalized for flat-table use
`plant_species`	string	100%	Latin binomial species name
`application`	string / null	partial	Source application / activity context where present
`dosage`	string / null	partial	Source dosage or concentration text where present; not usage guidance
`pubmed_mentions_2026`	integer	enrichment layer	PubMed mention-count snapshot
`clinical_trials_count_2026`	integer	enrichment layer	ClinicalTrials.gov study-count snapshot; not clinical validation
`chembl_bioactivity_count`	integer	enrichment layer	ChEMBL bioactivity measurement count
`patent_count_since_2020`	integer / float	enrichment layer	PatentsView / patent-density feature
`pubchem_cid`	integer / null	approx. 75–82% depending on export/audit view	PubChem CID from enrichment pipeline
`canonical_smiles`	string / null	57,757 records with SMILES	Canonical SMILES retrieved via PubChem
`compound_type`	string	100%	Compound classification used for filtering
`patent_count_method`	string	100%	Method label for patent-count derivation
`partner_cid`	integer / null	1,197 records	Partner-assisted PubChem CID resolution
`inchi_key`	string / null	subset	Partner-assisted InChIKey / identifier resolution
`iupac_verified`	string / bool / null	subset	Partner-assisted identifier-verification state
`partner_match_method`	string / null	subset	Match method used in partner-resolution file

Partner-resolution file: exports/iupac_cid_resolutions.json — 1,197 partner-assisted CID/IUPAC resolution records.
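
The field list above can double as a quick structural check before a sample is loaded into another system. The following is a minimal sketch, assuming each record is a flat JSON object keyed by the 16 public field names. The helper name `check_record_fields` and the choice of which fields must be non-null are illustrative and not part of the repository.

```python
# Sketch only: field names come from the schema table; the helper is hypothetical.
EXPECTED_FIELDS = (
    "chemical",
    "plant_species",
    "application",
    "dosage",
    "pubmed_mentions_2026",
    "clinical_trials_count_2026",
    "chembl_bioactivity_count",
    "patent_count_since_2020",
    "pubchem_cid",
    "canonical_smiles",
    "compound_type",
    "patent_count_method",
    "partner_cid",
    "inchi_key",
    "iupac_verified",
    "partner_match_method",
)

# Fields the schema table lists at 100% coverage, so a null would be unexpected.
REQUIRED_NON_NULL = (
    "chemical",
    "plant_species",
    "compound_type",
    "patent_count_method",
)


def check_record_fields(record):
    """Return a list of problems; an empty list means the record passes."""
    problems = []
    missing = [f for f in EXPECTED_FIELDS if f not in record]
    extra = [f for f in record if f not in EXPECTED_FIELDS]
    if missing:
        problems.append(f"missing fields: {missing}")
    if extra:
        problems.append(f"unexpected fields: {extra}")
    for field in REQUIRED_NON_NULL:
        if field in record and record[field] is None:
            problems.append(f"{field} is null but documented at 100% coverage")
    return problems
```

Running the check over every record in a sample and collecting the non-empty results gives a short report of structural drift. It says nothing about whether the values themselves are correct.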

QA Pipeline

1. Normalize USDA source records into a flat analytical schema.
2. Add external enrichment layers from PubMed, ClinicalTrials.gov, ChEMBL, PatentsView, and PubChem.
3. Retrieve PubChem CID and canonical SMILES where available.
4. Add compound classification and patent-count method fields.
5. Add partner-assisted CID/IUPAC resolution subset.
6. Run reverse-SMILES QA as a downstream identifier-consistency gate.
7. Export JSON and Parquet samples for analysis and retrieval experiments.

Reverse-SMILES QA Audit — v2.4.0

Verdict	Count	Interpretation
`validated`	11,981	Strict round-trip pass
`plausible`	8,370	Pass with caveats
`review_required`	37,361	Visible but not auto-trusted
`invalidated`	45	Failed validation; excluded from default retrieval-ready export
`insufficient_data`	19,150	No SMILES available
Total	76,907	Full v2.4.0 input set

Default retrieval-ready rule: exclude `invalidated` and `insufficient_data`. Keep `review_required` visible, but do not auto-trust it.

Export-eligible records by the stated QA rule: 57,712
validated + plausible + review_required = 11,981 + 8,370 + 37,361 = 57,712
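
The arithmetic behind the export-eligible figure can be reproduced from the audit table. The sketch below works on the published verdict counts only: a per-record verdict column is not among the 16 public fields, so no column name is assumed. The function name `is_retrieval_ready` is illustrative.

```python
# Verdict counts copied from the Reverse-SMILES QA Audit table (v2.4.0).
VERDICT_COUNTS = {
    "validated": 11_981,
    "plausible": 8_370,
    "review_required": 37_361,
    "invalidated": 45,
    "insufficient_data": 19_150,
}

# Verdicts dropped by the default retrieval-ready rule.
EXCLUDED = {"invalidated", "insufficient_data"}


def is_retrieval_ready(verdict):
    """Apply the default rule: keep everything except the excluded verdicts."""
    return verdict not in EXCLUDED


total = sum(VERDICT_COUNTS.values())
eligible = sum(n for v, n in VERDICT_COUNTS.items() if is_retrieval_ready(v))

print(total)     # 76907
print(eligible)  # 57712
```

The same predicate could be applied row by row wherever a verdict label is available. Note that `review_required` passes the filter, so downstream consumers still need their own trust policy for it.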

Known v2.4.0 QA fixes include thiol-false-alcohol detection, non-carboxylic-acid classifier correction, and strictest-verdict-wins CID tainting logic.
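
The source names the strictest-verdict-wins rule but does not spell it out. One plausible reading is that when several records share a PubChem CID, every record in that group receives the strictest verdict found in the group. The sketch below illustrates that reading only. The severity order, the `qa_verdict` key, and the function name `taint_by_cid` are all assumptions, not the project's implementation.

```python
# ASSUMPTION: severity order from least to most strict; the source publishes none.
SEVERITY = ["validated", "plausible", "review_required", "invalidated"]
RANK = {verdict: i for i, verdict in enumerate(SEVERITY)}


def taint_by_cid(records):
    """Return copies of records where each CID group carries its strictest verdict.

    Records without a CID, or with a verdict outside SEVERITY
    (for example insufficient_data), are left unchanged.
    """
    # First pass: find the strictest ranked verdict per CID.
    strictest = {}
    for rec in records:
        cid = rec.get("pubchem_cid")
        verdict = rec.get("qa_verdict")  # hypothetical key, not a public field
        if cid is None or verdict not in RANK:
            continue
        current = strictest.get(cid)
        if current is None or RANK[verdict] > RANK[current]:
            strictest[cid] = verdict

    # Second pass: propagate that verdict to every ranked record in the group.
    result = []
    for rec in records:
        out = dict(rec)
        if rec.get("pubchem_cid") in strictest and rec.get("qa_verdict") in RANK:
            out["qa_verdict"] = strictest[rec["pubchem_cid"]]
        result.append(out)
    return result
```

Under this reading a single failed record is enough to downgrade every record that resolves to the same CID, which is conservative by design.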

Repository Contents

/
├── README.md
├── METHODOLOGY.md
├── MANIFEST.md
├── LICENSE
├── assets/
│   └── ethno-api-readme-banner.svg
├── exports/
│   ├── ethno_api_v2.4.0_sample.json
│   ├── ethno_api_v2.4.0_sample.parquet
│   ├── ethno_api_schema.json
│   └── iupac_cid_resolutions.json
└── notebooks/
    └── quickstart.ipynb

Quickstart

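import pandas as pd

# Load the public sample (reading Parquet requires pyarrow or fastparquet).
df = pd.read_parquet("exports/ethno_api_v2.4.0_sample.parquet")

# Basic shape and cardinality checks.
print(df.shape)
print(df["chemical"].nunique())
print(df["plant_species"].nunique())

# Preview the core identifier columns.
preview_columns = [
    "chemical",
    "plant_species",
    "pubchem_cid",
    "canonical_smiles",
]

print(df[preview_columns].head())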

The public repository contains sample exports and documentation. The full licensed export package is distributed through Ethno-API.
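
Because several schema fields are nullable, a coverage check is a useful first step after loading a sample. The following is a minimal standard-library sketch for the JSON export. It assumes the file is a top-level list of flat record objects, which the source does not confirm, and the helper names are illustrative.

```python
import json
import math
from pathlib import Path


def _is_present(value):
    """Treat None and float NaN as missing; everything else counts as present."""
    if value is None:
        return False
    if isinstance(value, float) and math.isnan(value):
        return False
    return True


def field_coverage(records, fields):
    """Return the fraction of records with a present value for each field."""
    total = len(records)
    if total == 0:
        return {field: 0.0 for field in fields}
    return {
        field: sum(1 for rec in records if _is_present(rec.get(field))) / total
        for field in fields
    }


def load_sample(path):
    """Load the JSON sample; ASSUMES a top-level list of record objects."""
    with Path(path).open(encoding="utf-8") as handle:
        return json.load(handle)
```

A typical call would be `field_coverage(load_sample("exports/ethno_api_v2.4.0_sample.json"), ["pubchem_cid", "canonical_smiles", "partner_cid"])`. Coverage measured on the 400-row sample need not match the figures quoted for the full dataset.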

Use Cases

Use case	Fit
Retrieval experiments	Build test corpora for vector search and RAG pipelines.
Data cleaning demos	Show normalization, enrichment, and QA-gating workflows.
Natural-products data prototypes	Explore structured phytochemical source data.
Portfolio proof	Demonstrate data rescue → enrichment → QA → export architecture.
Review workflows	Inspect identifier-resolution and QA-gate logic on a bounded public sample.
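
For the retrieval use case, each flat record has to become a text chunk plus metadata before it can be embedded. The sketch below shows one way to do that using field names from the public schema. The function name `record_to_document`, the text template, and the choice of metadata keys are assumptions rather than the project's own ingestion format.

```python
def record_to_document(record):
    """Build one retrieval document (text plus metadata) from a flat record."""
    parts = [
        f"Compound: {record['chemical']}",
        f"Plant species: {record['plant_species']}",
    ]
    # Optional source fields are included only when present and non-empty.
    if record.get("application"):
        parts.append(f"Source activity context: {record['application']}")
    if record.get("compound_type"):
        parts.append(f"Compound type: {record['compound_type']}")

    # Identifiers and counts travel as metadata so they can be filtered on,
    # rather than being embedded as free text.
    metadata = {
        key: record.get(key)
        for key in ("pubchem_cid", "canonical_smiles", "pubmed_mentions_2026")
    }
    return {"text": ". ".join(parts) + ".", "metadata": metadata}
```

The `dosage` field is left out of the generated text on purpose in this sketch, since the schema notes that it is source text and not usage guidance. The resulting dictionaries can be passed to whichever embedding and vector-store client a given experiment uses.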

Limitations — Read Before Use

No pharmaceutical validation. This dataset does not confirm biological activity, safety, efficacy, dosage relevance, or clinical utility.
No medical claims. Nothing in this repository constitutes medical advice, treatment guidance, product guidance, or usage recommendation.
Source dependency. Baseline relationships reflect USDA/source-database records and may contain historical terminology, sparse annotations, or context limitations.
Coverage gaps. Not all records have PubChem CID, SMILES, InChIKey, dosage/source-concentration text, or activity context.
Partner validation is scoped. The 1,197 partner-assisted records are identifier-resolution work, not full pharmacological validation.
QA verdicts are not clinical verdicts. Reverse-SMILES QA checks identifier consistency, not biological truth.
Snapshot. v2.4.0 reflects a point-in-time enrichment snapshot. External databases can change.

Why This Exists

Ethno-API was built as a proof-of-concept for AI-orchestrated data productization: taking messy public scientific source data, normalizing it, enriching it with modern identifiers and evidence-proxy fields, adding QA gates, and preparing it for retrieval workflows.

The domain is phytochemistry. The transferable capability is:

data rescue → enrichment → QA gating → retrieval-ready export

Credits

Role	Contributor
Data pipeline and project architecture	Alexander Wirth
Scientific review / partner-resolution contribution	Dominic Fagan (BSc Chemistry)

License

Public sample files in this repository: CC BY 4.0 unless otherwise stated.
Full commercial dataset: separate Ethno-API license terms.
Code snippets / methodology scripts: MIT only where explicitly marked.

Please attribute:

@misc{ethno_api_v24_2026,
  title     = {USDA Phytochemical \& Ethnobotanical Database -- Enriched v2.4.0},
  author    = {Wirth, Alexander},
  year      = {2026},
  publisher = {Ethno-API},
  url       = {https://ethno-api.com},
  doi       = {10.5281/zenodo.19660107},
  note      = {76,907 records, 24,746 unique chemical entities, 2,313 plant species}
}

