Ethno-API v2.4.0 is a cleaned and enriched phytochemical data-engineering project derived from the USDA Dr. Duke source data. It contains 76,907 records, 2,313 plant species, 24,746 unique chemical entities, and a 16-field public schema with PubMed, ClinicalTrials.gov, ChEMBL, PatentsView, PubChem CID/SMILES, and partner-assisted CID/IUPAC resolution fields.
No medical claims. No pharmaceutical validation. No safety or efficacy claims. Research and retrieval use only.
| Metric | Value |
|---|---|
| Dataset version | v2.4.0 |
| Records | 76,907 |
| Plant species | 2,313 |
| Unique chemical entities | 24,746 |
| Public schema fields | 16 |
| Partner-assisted CID/IUPAC resolution records | 1,197 |
| Public sample format | JSON, Parquet |
| Public sample size | 400 rows |
| DOI | 10.5281/zenodo.19660107 |
Ethno-API is a machine-readable, enriched version of the USDA Dr. Duke phytochemical and ethnobotanical data, flattened into a practical JSON/Parquet structure for:
- data rescue and normalization demonstrations
- retrieval / RAG ingestion experiments
- phytochemical and natural-products data prototypes
- QA-gated identifier workflows
- data-product portfolio proof for client projects
It is not a medical product, not a clinical validation layer, and not evidence that any compound is safe, effective, therapeutically useful, or suitable for use.
| Area | What it means |
|---|---|
| Data rescue | Source records were normalized into a flatter analytical structure. |
| Enrichment | External identifier and evidence-proxy fields were added where available. |
| QA gating | Identifier consistency checks were applied to reduce obvious export noise. |
| RAG-readiness | The data is structured for retrieval experiments and vector-database ingestion. |
| Scientific review scope | Dominic Fagan (BSc Chemistry) contributed scientific review and partner-assisted identifier resolution. |
| Explicit limitation | The scientific review does not validate biological activity, safety, efficacy, dosage relevance, clinical utility, the RAG bridge, or the reverse-SMILES QA system. |
| Field | Type | Coverage | Notes |
|---|---|---|---|
| `chemical` | string | 100% | USDA compound label, normalized for flat-table use |
| `plant_species` | string | 100% | Latin binomial species name |
| `application` | string / null | partial | Source application / activity context where present |
| `dosage` | string / null | partial | Source dosage or concentration text where present; not usage guidance |
| `pubmed_mentions_2026` | integer | enrichment layer | PubMed mention-count snapshot |
| `clinical_trials_count_2026` | integer | enrichment layer | ClinicalTrials.gov study-count snapshot; not clinical validation |
| `chembl_bioactivity_count` | integer | enrichment layer | ChEMBL bioactivity measurement count |
| `patent_count_since_2020` | integer / float | enrichment layer | PatentsView / patent-density feature |
| `pubchem_cid` | integer / null | approx. 75–82% depending on export/audit view | PubChem CID from enrichment pipeline |
| `canonical_smiles` | string / null | 57,757 records with SMILES | Canonical SMILES retrieved via PubChem |
| `compound_type` | string | 100% | Compound classification used for filtering |
| `patent_count_method` | string | 100% | Method label for patent-count derivation |
| `partner_cid` | integer / null | 1,197 records | Partner-assisted PubChem CID resolution |
| `inchi_key` | string / null | subset | Partner-assisted InChIKey / identifier resolution |
| `iupac_verified` | string / bool / null | subset | Partner-assisted identifier-verification state |
| `partner_match_method` | string / null | subset | Match method used in partner-resolution file |
Partner-resolution file: exports/iupac_cid_resolutions.json — 1,197 partner-assisted CID/IUPAC resolution records.
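The Coverage column above can be spot-checked on the public sample. The following is a minimal sketch, not the project's own tooling: the field names are copied from the schema table, while `missing_fields` and `coverage` are hypothetical helpers, and records are assumed to be plain dictionaries in which a missing value is `None`.

```python
from typing import Any, Iterable, Mapping

# The 16 public field names, copied from the schema table.
EXPECTED_FIELDS = (
    "chemical", "plant_species", "application", "dosage",
    "pubmed_mentions_2026", "clinical_trials_count_2026",
    "chembl_bioactivity_count", "patent_count_since_2020",
    "pubchem_cid", "canonical_smiles", "compound_type",
    "patent_count_method", "partner_cid", "inchi_key",
    "iupac_verified", "partner_match_method",
)


def missing_fields(record: Mapping[str, Any]) -> list[str]:
    """Return schema fields that are absent from a record (hypothetical helper)."""
    return [name for name in EXPECTED_FIELDS if name not in record]


def coverage(records: Iterable[Mapping[str, Any]], field: str) -> float:
    """Fraction of records whose `field` is present and not None (hypothetical helper)."""
    rows = list(records)
    if not rows:
        return 0.0
    filled = sum(1 for row in rows if row.get(field) is not None)
    return filled / len(rows)
```

If the JSON sample loads as a list of objects (an assumption about the file layout), `coverage(records, "pubchem_cid")` gives the non-null share for that field. A 400-row sample will not necessarily reproduce the full-dataset coverage figures.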
- Normalize USDA source records into a flat analytical schema.
- Add external enrichment layers from PubMed, ClinicalTrials.gov, ChEMBL, PatentsView, and PubChem.
- Retrieve PubChem CID and canonical SMILES where available.
- Add compound classification and patent-count method fields.
- Add partner-assisted CID/IUPAC resolution subset.
- Run reverse-SMILES QA as a downstream identifier-consistency gate.
- Export JSON and Parquet samples for analysis and retrieval experiments.
| Verdict | Count | Interpretation |
|---|---|---|
| `validated` | 11,981 | Strict round-trip pass |
| `plausible` | 8,370 | Pass with caveats |
| `review_required` | 37,361 | Visible but not auto-trusted |
| `invalidated` | 45 | Failed validation; excluded from default retrieval-ready export |
| `insufficient_data` | 19,150 | No SMILES available |
| Total | 76,907 | Full v2.4.0 input set |
Default retrieval-ready rule: exclude invalidated and insufficient_data. Keep review_required visible, but do not auto-trust it.
Export-eligible records by the stated QA rule: 57,712 (validated + plausible + review_required = 11,981 + 8,370 + 37,361).
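The default retrieval-ready rule and its arithmetic can be sketched as follows. The verdict counts are copied from the QA table; the record key `qa_verdict` is an assumed name (the 16-field public schema does not list a verdict column), and `retrieval_ready` is a hypothetical helper rather than the project's export code.

```python
# Verdict counts copied from the QA table.
VERDICT_COUNTS = {
    "validated": 11_981,
    "plausible": 8_370,
    "review_required": 37_361,
    "invalidated": 45,
    "insufficient_data": 19_150,
}

# Verdicts kept by the default retrieval-ready rule.
ALLOWED = {"validated", "plausible", "review_required"}


def export_eligible_total(counts):
    """Sum the counts of all verdicts the default rule keeps."""
    return sum(n for verdict, n in counts.items() if verdict in ALLOWED)


def retrieval_ready(records, verdict_key="qa_verdict"):
    """Keep records with an allowed verdict; `qa_verdict` is an assumed key name."""
    return [r for r in records if r.get(verdict_key) in ALLOWED]


print(sum(VERDICT_COUNTS.values()))           # 76907
print(export_eligible_total(VERDICT_COUNTS))  # 57712
```

Filtering on an allow-list rather than an exclude-list means a record with a missing or unknown verdict is dropped instead of silently passing the gate. Records marked `review_required` pass this filter, so downstream code still needs to treat them as untrusted.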
Known v2.4.0 QA fixes include thiol-false-alcohol detection, non-carboxylic-acid classifier correction, and strictest-verdict-wins CID tainting logic.
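The source names the strictest-verdict-wins CID tainting logic without specifying it. One possible reading, sketched below under loud assumptions, is that every record sharing a PubChem CID inherits the strictest verdict seen for that CID. The severity order, the `qa_verdict` key, and the `taint_by_cid` helper are all hypothetical.

```python
# Assumed severity order, least to most strict; the source does not spell it out.
SEVERITY = {
    "validated": 0,
    "plausible": 1,
    "review_required": 2,
    "invalidated": 3,
}


def taint_by_cid(records, cid_key="pubchem_cid", verdict_key="qa_verdict"):
    """Return copies of records where each CID carries its strictest verdict."""
    records = list(records)

    # First pass: find the strictest ranked verdict per CID.
    strictest = {}
    for record in records:
        cid = record.get(cid_key)
        verdict = record.get(verdict_key)
        if cid is None or verdict not in SEVERITY:
            continue  # no CID, or a verdict outside the ranking
        current = strictest.get(cid)
        if current is None or SEVERITY[verdict] > SEVERITY[current]:
            strictest[cid] = verdict

    # Second pass: propagate it without mutating the input records.
    tainted = []
    for record in records:
        updated = dict(record)
        cid = record.get(cid_key)
        if cid in strictest and record.get(verdict_key) in SEVERITY:
            updated[verdict_key] = strictest[cid]
        tainted.append(updated)
    return tainted
```

Under this reading, one `invalidated` record downgrades every other record with the same CID, which is conservative by design. Records without a CID, or with a verdict outside the ranking such as `insufficient_data`, pass through unchanged.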
/
├── README.md
├── METHODOLOGY.md
├── MANIFEST.md
├── LICENSE
├── assets/
│   └── ethno-api-readme-banner.svg
├── exports/
│   ├── ethno_api_v2.4.0_sample.json
│   ├── ethno_api_v2.4.0_sample.parquet
│   ├── ethno_api_schema.json
│   └── iupac_cid_resolutions.json
└── notebooks/
    └── quickstart.ipynb
import pandas as pd

df = pd.read_parquet("exports/ethno_api_v2.4.0_sample.parquet")

print(df.shape)
print(df["chemical"].nunique())
print(df["plant_species"].nunique())

preview_columns = [
    "chemical",
    "plant_species",
    "pubchem_cid",
    "canonical_smiles",
]
print(df[preview_columns].head())

The public repository contains sample exports and documentation. The full licensed export package is distributed through Ethno-API.
| Use case | Fit |
|---|---|
| Retrieval experiments | Build test corpora for vector search and RAG pipelines. |
| Data cleaning demos | Show normalization, enrichment, and QA-gating workflows. |
| Natural-products data prototypes | Explore structured phytochemical source data. |
| Portfolio proof | Demonstrate data rescue → enrichment → QA → export architecture. |
| Review workflows | Inspect identifier-resolution and QA-gate logic on a bounded public sample. |
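For the retrieval use case, each flat record can be turned into a short text passage plus metadata before embedding. This is a minimal sketch, not the project's RAG bridge: `record_to_document` is a hypothetical helper, the output shape (`text` plus `metadata`) is an assumption rather than any particular vector database's format, and the example values are placeholders, not dataset rows.

```python
def record_to_document(record):
    """Render one flat record as a text passage with metadata (hypothetical helper)."""
    lines = [
        f"Compound: {record['chemical']}",
        f"Plant species: {record['plant_species']}",
    ]
    # Optional fields are added only when the source record has them.
    if record.get("application"):
        lines.append(f"Source application context: {record['application']}")
    if record.get("canonical_smiles"):
        lines.append(f"SMILES: {record['canonical_smiles']}")

    metadata = {
        "pubchem_cid": record.get("pubchem_cid"),
        "compound_type": record.get("compound_type"),
    }
    return {"text": "\n".join(lines), "metadata": metadata}


# Placeholder record for illustration only; not taken from the dataset.
example = {
    "chemical": "EXAMPLE-COMPOUND",
    "plant_species": "Genus species",
    "application": None,
    "canonical_smiles": "CCO",
    "pubchem_cid": None,
    "compound_type": "example-type",
}
print(record_to_document(example)["text"])
```

The sketch deliberately leaves `dosage` out of the passage text, since the schema marks that field as source text and not usage guidance. Keeping identifiers in `metadata` rather than in the text allows filtering without affecting the embedding.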
- No pharmaceutical validation. This dataset does not confirm biological activity, safety, efficacy, dosage relevance, or clinical utility.
- No medical claims. Nothing in this repository constitutes medical advice, treatment guidance, product guidance, or usage recommendation.
- Source dependency. Baseline relationships reflect USDA/source-database records and may contain historical terminology, sparse annotations, or context limitations.
- Coverage gaps. Not all records have PubChem CID, SMILES, InChIKey, dosage/source-concentration text, or activity context.
- Partner validation is scoped. The 1,197 partner-assisted records are identifier-resolution work, not full pharmacological validation.
- QA verdicts are not clinical verdicts. Reverse-SMILES QA checks identifier consistency, not biological truth.
- Snapshot. v2.4.0 reflects a point-in-time enrichment snapshot. External databases can change.
Ethno-API was built as a proof-of-concept for AI-orchestrated data productisation: taking messy public scientific source data, normalizing it, enriching it with modern identifiers and evidence-proxy fields, adding QA gates, and preparing it for retrieval workflows.
The domain is phytochemistry. The transferable capability is:
data rescue → enrichment → QA gating → retrieval-ready export
| Role | Contributor |
|---|---|
| Data pipeline and project architecture | Alexander Wirth |
| Scientific review / partner-resolution contribution | Dominic Fagan (BSc Chemistry) |
- Public sample files in this repository: CC BY 4.0 unless otherwise stated.
- Full commercial dataset: separate Ethno-API license terms.
- Code snippets / methodology scripts: MIT only where explicitly marked.
Please attribute:
@misc{ethno_api_v24_2026,
  title = {USDA Phytochemical \& Ethnobotanical Database -- Enriched v2.4.0},
  author = {Wirth, Alexander},
  year = {2026},
  publisher = {Ethno-API},
  url = {https://ethno-api.com},
  doi = {10.5281/zenodo.19660107},
  note = {76,907 records, 24,746 unique chemical entities, 2,313 plant species}
}