Align public-repo metadata/docs with verifiable sample artifacts and v2.4.0 schema semantics #2

Draft
Copilot wants to merge 3 commits into main from copilot/full-repo-scan-and-fixes

Copilot AI commented Apr 20, 2026

This PR resolves cross-file inconsistencies across documentation and manifest metadata by reconciling claims against artifacts that actually exist in the public repository. It standardizes versioning, schema semantics, sample/file integrity data, and quickstart references into a single internally consistent source of truth.

  • Scope + source-of-truth normalization

    • Reworked MANIFEST_v2.json to explicitly cover the public sample repo scope (not commercial full-release binaries).
    • Replaced non-verifiable full-package entries with verifiable sample artifacts (ethno_sample_400.*, quickstart.ipynb) and updated SHA-256/byte-size metadata.
  • Schema/version consistency across docs

    • Synchronized README.md, METHODOLOGY.md, UPDATE_POLICY.md, llms.txt, and quickstart.ipynb on:
      • current version (v2.4.0)
      • 16-column schema representation
      • field semantics for partner_cid, inchi_key, iupac_verified, partner_match_method
    • Removed/rewrote contradictory statements (e.g., stale version references, non-existent manifest sections, mismatched field counts).
  • Sample-vs-doc alignment

    • Updated README front matter and tables to reflect actual sample shape/types/null behavior from ethno_sample_400.json / .parquet.
    • Clarified logical-vs-serialization behavior where needed (notably, inchi_key is currently all-null in the sample parquet).
  • Quickstart/documentation corrections

    • Updated notebook examples/citation text to match current schema (16-column) and valid split usage (train).
    • Updated noise exclusion header context to match current release lineage while preserving audited historical notes.
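
The sample-vs-doc alignment described above can be spot-checked with a short script. What follows is a minimal sketch, not the check used in this PR. It assumes that ethno_sample_400.json is a JSON array of flat record objects (the PR does not state the layout). The names `summarize_records`, `check_sample`, and `EXPECTED_COLUMNS` are illustrative and do not exist in the repository.

```python
import json
from pathlib import Path

EXPECTED_COLUMNS = 16  # column count this PR states for the v2.4.0 schema


def summarize_records(records):
    """Return (sorted column names, columns that are null in every record)."""
    columns = set()
    for record in records:
        columns.update(record.keys())
    all_null = [
        name
        for name in sorted(columns)
        if all(record.get(name) is None for record in records)
    ]
    return sorted(columns), all_null


def check_sample(path, expected_columns=EXPECTED_COLUMNS):
    """Load a JSON sample (assumed: a list of objects) and compare its shape to the docs."""
    records = json.loads(Path(path).read_text(encoding="utf-8"))
    columns, all_null = summarize_records(records)
    return {
        "rows": len(records),
        "column_count_ok": len(columns) == expected_columns,
        "columns": columns,
        "all_null_columns": all_null,
    }
```

Calling `check_sample("ethno_sample_400.json")` from the repository root would report whether the JSON sample exposes 16 columns and which fields are entirely null. The PR only states that inchi_key is all-null in the parquet sample, so the JSON result may differ.
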
{
  "manifest_scope": "public_repository_sample",
  "files": {
    "json_sample": {
      "filename": "ethno_sample_400.json",
      "size_bytes": 226228
    },
    "parquet_sample": {
      "filename": "ethno_sample_400.parquet",
      "size_bytes": 33485
    }
  }
}
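
Entries like these can be verified mechanically against the files on disk. The following is a minimal sketch, assuming the manifest layout shown in the excerpt (`files`, then one entry per artifact with `filename` and `size_bytes`). The `sha256` key is an assumption: the PR mentions SHA-256 metadata, but the excerpt does not show the field name. `verify_manifest` is a hypothetical helper, not part of the repository.

```python
import hashlib
import json
from pathlib import Path


def verify_manifest(manifest_path, root="."):
    """Compare each manifest entry with the file on disk; return mismatch messages."""
    manifest = json.loads(Path(manifest_path).read_text(encoding="utf-8"))
    problems = []
    for key, entry in manifest.get("files", {}).items():
        path = Path(root) / entry["filename"]
        if not path.is_file():
            problems.append(f"{key}: {entry['filename']} is missing")
            continue
        data = path.read_bytes()
        if "size_bytes" in entry and len(data) != entry["size_bytes"]:
            problems.append(f"{key}: size {len(data)} != {entry['size_bytes']}")
        # "sha256" is an assumed field name; entries without it skip the digest check.
        expected = entry.get("sha256")
        if expected is not None and hashlib.sha256(data).hexdigest() != expected.lower():
            problems.append(f"{key}: SHA-256 mismatch")
    return problems
```

An empty list from `verify_manifest("MANIFEST_v2.json")` would mean every listed file exists and matches its recorded size (and digest, where one is present).
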
