feat(corpus): expand seed.yaml via marketplace importer (issue #99 slice 1b)#117

Merged
jdutton merged 5 commits into main from feat/corpus-seed-1b
Jun 5, 2026

@ejdutton ejdutton commented Jun 3, 2026

Collaborator

Summary

What this PR changes

| File | Change |
| --- | --- |
| corpus/seed.yaml | 9 → 238 entries; new provenance header |
| packages/dev-tools/src/import-marketplace.ts (new) | The importer (bun run import-marketplace) |
| packages/dev-tools/test/import-marketplace.test.ts (new) | 25 unit tests covering each mapping primitive |
| package.json | Adds the import-marketplace script alias |
| CHANGELOG.md | One entry under ### Internal |

Mapping rules

| Field | Rule |
| --- | --- |
| bucket | official for all imports (both catalogs are anthropics-curated marketplaces; bucket is the reporting posture per slice 1a, not code provenance). |
| confidence | URL-based, not author-based (40% of upstream entries have no author). String-shape sources → first-party; object-shape sources where the resolved GitHub owner is anthropics → first-party; everything else → curated. Override: knowledge-work ./partner-built/ → curated. |
| maturity | production for all. Defer maturity heuristics to a later slice. |
| name | Upstream name verbatim, with knowledge-work entries prefixed knowledge-work- (collision-avoidance per slice 1a). Names containing characters outside [A-Za-z0-9_-] are regex-munged (1 munge in the current run: wordpress.com → wordpress-com). |
| source URL | Composed per the 5 upstream source shapes (string, git-subdir ± ref, url ± path, github). 108+ entries don't ship with commit pinning because the upstream manifest doesn't give us a ref for them and the audit clone's --depth 1 mode can't take raw SHAs as refs; they track each external repo's default branch HEAD. Sha-pinning is a followup. |
| Preservation | Read the existing seed.yaml; any entry whose source URL isn't going to be regenerated this run is treated as hand-curated and re-emitted verbatim. (Today: the 2 VAT-owned entries.) Throws loudly if a preserved entry carries a validation: block, because stringifyEntries doesn't serialize those yet. |
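
The name and confidence rules above can be sketched roughly as follows. This is a minimal illustration, not the importer's code: the `UpstreamSource` shape, the `url` field name, and the `githubOwner` helper are assumptions, and the ./partner-built/ override is applied to every string source here rather than only to knowledge-work entries.

```typescript
// Hypothetical sketch of two mapping primitives. Shapes and helper names
// are assumptions made for illustration, not the importer's actual code.

type Confidence = "first-party" | "curated";

// Upstream `source` is either a plain string or an object carrying a URL.
// The object field name `url` is an assumption.
type UpstreamSource = string | { url?: string };

/** Replace every character outside [A-Za-z0-9_-] with a hyphen. */
function mungeName(name: string): string {
  return name.replace(/[^A-Za-z0-9_-]/g, "-");
}

/** Extract the owner from an https://github.com/<owner>/<repo> URL, if any. */
function githubOwner(url: string): string | undefined {
  const match = /^https:\/\/github\.com\/([^/]+)\//.exec(url);
  return match?.[1];
}

function deriveConfidence(source: UpstreamSource): Confidence {
  if (typeof source === "string") {
    // Override: partner-built paths are curated even though string-shaped.
    return source.startsWith("./partner-built/") ? "curated" : "first-party";
  }
  // Object-shaped: first-party only when the resolved owner is anthropics.
  // A missing URL falls through to curated.
  const owner = source.url ? githubOwner(source.url) : undefined;
  return owner === "anthropics" ? "first-party" : "curated";
}
```

Keying confidence on the URL rather than the author field means entries with no author still get a deterministic value.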

Ratification asks

1. Bucket mapping — everything official

Slice 1a established bucket as the reporting posture: official entries allow named findings, community is aggregate-only in followup work. Per that framing, every imported entry is bucket: official since both catalogs live under anthropics/. The confidence axis carries the third-party-but-vetted distinction.

Push back if you'd rather external_plugins/-equivalents (url/github-shape entries pointing at non-anthropics repos) go to bucket: community instead. Easy schema-free flip if you do.

2. Source-URL deduplication — drops 35 presentation-name aliases

The upstream catalogs intentionally list the same plugin under multiple presentation names. Across the two catalogs, 33 source URLs are claimed by 2+ upstream entries (~25% of raw imports). Examples: data, data-engineering, and astronomer-data-agents all resolve to github.com/astronomer/agents.git. The knowledge-work catalog turns out to be ~50% mirror entries of the official catalog (30 of 60).

loadSeedFile() treats source as the unique key (it throws on dupes), so we have to deduplicate somewhere. This PR dedupes by source URL — preserved entries always win; within imports, alphabetical-first-name wins. The 35 dropped aliases are documented in the importer's stdout summary on each run.
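
A minimal sketch of that dedup rule, under stated assumptions: the `Entry` shape and the `{ kept, dropped }` return value are hypothetical, and sorting inside the function is a choice made for this sketch.

```typescript
// Hypothetical entry shape; the real PluginEntry carries more fields.
interface Entry {
  name: string;
  source: string;
}

/**
 * Dedupe by source URL. Preserved entries always win; among imported
 * entries, the alphabetically first name wins.
 */
function combineAndDedupe(
  preserved: Entry[],
  imported: Entry[],
): { kept: Entry[]; dropped: Entry[] } {
  const bySource = new Map<string, Entry>();
  for (const entry of preserved) bySource.set(entry.source, entry);

  // Sort a copy by name so the winner does not depend on input order.
  const sorted = [...imported].sort((a, b) =>
    a.name < b.name ? -1 : a.name > b.name ? 1 : 0,
  );

  const dropped: Entry[] = [];
  for (const entry of sorted) {
    if (bySource.has(entry.source)) dropped.push(entry);
    else bySource.set(entry.source, entry);
  }
  return { kept: [...bySource.values()], dropped };
}
```

With the three astronomer aliases as input, this keeps astronomer-data-agents and reports data and data-engineering as dropped.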

Trade-off to flag: a user who installs data from the Claude Code marketplace UI and then reads a future corpus-scan report won't find data in it — only astronomer-data-agents. We've effectively decided "the seed represents unique audit targets" rather than "the seed represents every plugin a user can install." That's a real semantic shift. Alternatives if you'd rather not ship that:

  • (a) Change loadSeedFile to allow duplicate sources (schema-style cascade)
  • (b) Maintain a separate name → canonical-source aliasing map for report rendering
  • (c) Different dedup priority (e.g., shorter name wins, manifest-order wins)

Happy to take this in a different direction.

Counts at import time

Upstream marketplaces change from minute to minute; these counts are a snapshot:

  • anthropics/claude-plugins-official @ 6f90371 — 211 upstream entries → 206 kept
  • anthropics/knowledge-work-plugins @ 8785e40 — 60 upstream entries → 30 kept
  • Preserved (hand-curated): 2 (vibe-agent-toolkit, vibe-validate)
  • Total in seed.yaml: 238
  • Aliases dropped by source-URL dedup: 35

Followups (out of scope)

  • Sha-pin per-entry source URLs (slice 1b uses #main: for catalog-internal entries and omits the ref for external pointers; the manifest carries SHAs we voluntarily ignore until git-url-clone.ts learns to clone-then-checkout). Could fold into slice 2a or its own micro-slice.
  • Drift-detection: a --check mode for the importer that exits non-zero if corpus/seed.yaml would change. Lets CI catch silent drift between commits and upstream.
  • Fix the stale docs/superpowers/specs/2026-05-01-corpus-scan-phase-1-design.md reference in the seed header (was a forward reference to a doc that doesn't exist; my header rewrite drops it).
  • Validation-block serialization in stringifyEntries — currently the importer throws if a preserved entry carries one. The first slice that adds a validation: override will need to extend the stringifier.
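
As a rough illustration of the drift-detection followup, a --check mode could reduce to a pure comparison like the one below. Everything here is hypothetical: the function names are invented, and comparing only the content below the leading `#` comment header is an assumption of this sketch, not something the PR specifies.

```typescript
/** Drop the leading `#` comment lines (the provenance header) from YAML text. */
function stripHeader(yamlText: string): string {
  const lines = yamlText.split("\n");
  let i = 0;
  while (i < lines.length && (lines[i] ?? "").startsWith("#")) i++;
  return lines.slice(i).join("\n");
}

/** Exit code for a hypothetical --check run: 0 when nothing would change, 1 otherwise. */
function driftExitCode(existing: string, regenerated: string): 0 | 1 {
  return stripHeader(existing) === stripHeader(regenerated) ? 0 : 1;
}
```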

Test plan

  • bun run validate (full — 3.5 min, all phases green)
  • Unit tests cover each mapping primitive (mungeName, composeSourceUrl × 6 shapes, deriveConfidence × 5 cases, mapEntry, combineAndDedupe × 5 cases, partitionPreserved × 3 cases) — 25 tests
  • End-to-end: ran bun run import-marketplace, loaded the output via the canonical loadSeedFile, verified 238 entries pass schema + source-uniqueness + name-uniqueness checks
  • Idempotency: re-running with no upstream changes produces byte-identical output (verified via a second run reading the file the first run produced)

🤖 Generated with Claude Code

ejdutton and others added 2 commits June 3, 2026 15:32
Imports all unique plugins from anthropics/claude-plugins-official
(205 of 209 raw entries kept) and anthropics/knowledge-work-plugins
(30 of 60 — the latter is ~50% mirror entries of the former) via a new
committed script at packages/dev-tools/src/import-marketplace.ts
(bun run import-marketplace). Maps each upstream entry to a
PluginEntry, deduplicates by source URL (preserved VAT-owned entries
always win; otherwise alphabetical-first-name wins), and rewrites
corpus/seed.yaml with a provenance header. URL composition handles all
five upstream source shapes; confidence is URL-based (anthropics owner
→ first-party, else curated; ./partner-built/ override → curated).
Issue #99 slice 1b — follows PR #111 (slice 1a).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replaces the hardcoded PRESERVED_ENTRIES list with a structural
partition: read corpus/seed.yaml at the start of the import, treat as
"preserved" any entry whose source URL isn't one the importer would
generate this run. Fixes the self-review issue that the hardcoded list
would silently erase a third hand-added entry or any validation
override on a kept catalog entry.

Throws explicitly if a preserved entry carries a validation block,
since stringifyEntries doesn't yet serialize them. Slice 1b has none.

combineAndDedupe now takes preserved as an explicit parameter rather
than closing over module state — easier to test, no hidden dep.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@ejdutton ejdutton self-assigned this Jun 3, 2026
codecov Bot commented Jun 3, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 81.78%. Comparing base (97e0903) to head (80b12b9).

@@           Coverage Diff           @@
##             main     #117   +/-   ##
=======================================
  Coverage   81.78%   81.78%           
=======================================
  Files         223      223           
  Lines       17135    17135           
  Branches     3326     3326           
=======================================
  Hits        14014    14014           
  Misses       3121     3121           

@jdutton jdutton left a comment

Owner

@ejdutton — thorough multi-pass review (general / tests / error-handling / type-design / comments). Nice work overall: conventions-clean (Postel's law correct, safePath/safeExecSync right, no pre-1.0 cruft), mapping primitives well-tested against real upstream shapes, and no happy-path correctness bugs. Requesting changes for two items below; the rest are fair game for a follow-up.

✅ Cleared a suspected bug first: the composeSourceUrl empty-ref output (...git#:plugin) was traced through parseGitUrl → splitFragment → buildParsed, and it parses correctly (empty ref dropped, subpath preserved). Not a bug. But nothing tests that cross-package contract (see S3).
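
To make the empty-ref behaviour concrete, here is a small sketch of splitting a `#ref:subpath` fragment. It is not the repository's parseGitUrl or splitFragment: the function name, return shape, and example URLs are all hypothetical.

```typescript
interface SplitUrl {
  base: string;
  ref?: string;
  subpath?: string;
}

/**
 * Split "<base>#<ref>:<subpath>" into its parts. Empty parts are omitted,
 * so "#:plugin" yields a subpath with no ref.
 */
function splitRefAndSubpath(url: string): SplitUrl {
  const hashIndex = url.indexOf("#");
  if (hashIndex === -1) return { base: url };

  const base = url.slice(0, hashIndex);
  const fragment = url.slice(hashIndex + 1);
  const colonIndex = fragment.indexOf(":");
  const ref = colonIndex === -1 ? fragment : fragment.slice(0, colonIndex);
  const subpath = colonIndex === -1 ? "" : fragment.slice(colonIndex + 1);

  return {
    base,
    ...(ref ? { ref } : {}),
    ...(subpath ? { subpath } : {}),
  };
}
```

A round-trip test of the kind S3 asks for would feed composed URLs through the real parser and assert on the same fields.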

On your two ratification asks: both are product calls, not code issues, and the code implements each cleanly — I'm fine with everything→official bucket (confidence carries the third-party-but-vetted axis) and with source-URL dedup (seed = unique audit targets). Will reply separately if I want the alias-aliasing map later; ship as-is.


🔴 Fix before merge

C1 — No guard against catastrophic seed shrinkage (silent data loss). writeFileSync overwrites seed.yaml unconditionally. If a catalog temporarily serves plugins: [] or a handful (mid-deploy, bad upstream commit), every step "succeeds" — an empty array is schema-valid — and the seed silently collapses 238 → ~32, committed on a green ✓ Wrote. This is the one place the script can produce a quietly-wrong committed artifact. Assert each catalog returned >0 plugins, and refuse a >20% drop vs. the existing file unless --allow-shrink.

C2: Type lie in partitionPreserved casts, colliding with the no-public-shaming gate. e.bucket as PluginEntry['bucket'] forces the wide ExistingPluginEntry (community/listed/experimental) into the narrow PluginEntry (official/production). Harmless at runtime today (only the 2 first-party VAT entries preserved; stringifyEntries writes the real value), but the in-memory type now misrepresents itself, and bucket: official vs community is exactly the "named findings vs aggregate-only" gate. The moment a community entry is preserved, any code trusting the static type is silently wrong. Cheap fix: widen the importer's pipeline type to match canonical PluginEntrySchema; mapEntry still assigns narrow literals, and the three as casts just disappear. (Flagged independently by both the general and type-design passes.)


🟡 Important (fold in if convenient — all a few lines)

I1 — Idempotency claim is false across days + generated header oversells pinning. buildHeader embeds new Date() + volatile catalog SHAs, so "byte-identical re-run" holds only same-day with unchanged HEADs — a future --check drift mode would false-fail on the date alone. Separately, the header written into seed.yaml says "SHAs reflect upstream state" and "each entry can carry a validation: block" — but entries track #main:/default-branch HEAD (no per-entry SHA pinning) and the importer throws rather than emit a validation block. The generated file documents capabilities its own generator won't produce. Narrow both claims.

I2 — fetchCatalogSha trusts .trim().slice(0,7) with no validation. A gh deprecation/update notice on stdout, or empty output from a jq miss, silently writes a garbage or blank provenance SHA into the header — no crash, no downstream guard. Assert /^[0-9a-f]{40}$/ before slicing.

I3 — Fetch/parse errors lose all context. fetchManifest runs JSON.parse(ghFetch(...)) with no catalog identity, so an empty/non-JSON/wrong-shape exit-0 body surfaces as a bare SyntaxError with no idea which catalog. And the top-level catch logs only err.message, discarding CommandExecutionError.stderr (the actual "HTTP 403: rate limit" / "not authenticated" reason) and Zod .issues. Wrap parses with catalog context; print err.stderr in the catch.


🟢 Suggestions (follow-up OK)

  • S1 — ~/code/vat-issue-99-slice-1b-plan.md in the module header (line 9) is a dead pointer into a personal home dir — unfollowable by anyone else, and per our no-spec-commits convention it'll never be in the tree. Inline the rules (the deriveConfidence/composeSourceUrl JSDoc already do most of it) and cite issue #99 by number.
  • S2 — Model source as a Zod discriminated union on source.source (git-subdir/url/github, each .passthrough()). Deletes every (src as Record<string, unknown>)['ref'] cast and turns the unknown-discriminator throw into a compiler-checked exhaustiveness guard.
  • S3 — Add a round-trip test feeding composeSourceUrl outputs through the real parseGitUrl (the empty-ref #:path form is load-bearing but protected by nothing across the package boundary), plus a golden/parse-back test for stringifyEntries+buildHeader (the tool's actual output, currently verified only by manual runs).
  • S4 — Untested branches: deriveConfidence no-url→curated fallthrough (a silent posture downgrade), assertUniqueNames duplicate-munged-name throw, and loadExistingSeed's custom missing-file message. One-liner each.
  • S5 — combineAndDedupe's "alphabetical-first wins" determinism lives in the caller's .sort(), not the function. Accurate today; a future caller forgetting the sort silently churns the seed. Sort defensively inside, or assert sorted input.
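
The S2 idea can be illustrated without Zod as a plain TypeScript discriminated union with a `never` exhaustiveness guard; that is a stand-in for the Zod schema suggested above, not an equivalent. Only the discriminator values come from this review; every other field name is an assumption.

```typescript
// Hypothetical object-shaped sources, discriminated on `source`.
// Every field other than `source` is an assumption made for this sketch.
type ObjectSource =
  | { source: "git-subdir"; url: string; path: string; ref?: string }
  | { source: "url"; url: string; path?: string }
  | { source: "github"; repo: string };

/** Summarise a source; the compiler checks that every variant is handled. */
function describeSource(src: ObjectSource): string {
  switch (src.source) {
    case "git-subdir":
      return `git-subdir ${src.url} path=${src.path} ref=${src.ref ?? "(default)"}`;
    case "url":
      return `url ${src.url} path=${src.path ?? "(root)"}`;
    case "github":
      return `github ${src.repo}`;
    default: {
      // Adding a new variant without a case makes this assignment fail to compile.
      const unreachable: never = src;
      throw new Error(`unknown source shape: ${JSON.stringify(unreachable)}`);
    }
  }
}
```

Because each branch narrows `src`, no `Record<string, unknown>` cast is needed to read optional fields such as `ref`.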

Acceptable debt (no action): the hand-mirrored PluginEntry ↔ PluginEntrySchema is deliberate, to avoid a dev-tools→cli dep, and loadSeedFile validates downstream. A test asserting the two enum sets match would close the residual drift risk cheaply.


TL;DR: land C1 + C2 (both small; C1 is the only path to a silently-corrupted committed seed), fold in I1–I3 if you're in there anyway, defer the S-items. 🤖 Reviewed with Claude Code.

ejdutton and others added 2 commits June 5, 2026 13:10
C1: refuse to overwrite seed.yaml when either upstream catalog returned
0 plugins, or when the new entry count would drop more than 20% vs.
the existing seed. New --allow-shrink flag bypasses both for the rare
case where shrinkage is real (mid-deploy push, etc.).

C2: widen the in-memory PluginEntry to match canonical PluginEntrySchema
(bucket includes community; confidence includes listed; maturity
includes experimental/example). This removes three "as PluginEntry[...]"
casts in partitionPreserved that silently coerced wider preserved
entries into the narrow shape — a footgun at the public-shaming gate
the moment any community entry got preserved. mapEntry still emits the
narrow literals for freshly-mapped upstream entries.

I1: narrow the generated seed.yaml header — drop the per-entry
`validation:` claim (the importer throws on validation blocks today)
and rewrite the SHA language to clarify that entry `source` URLs pin
a fragment ref (typically default branch), not a per-entry commit SHA.
The catalog SHAs in the header are this run's audit provenance, not
per-entry pinning. Also narrow the module-level "byte-identical" claim,
which holds only same-day with unchanged upstream HEADs.

I2: validate fetchCatalogSha output as 40-char hex before slicing, so
a `gh` deprecation notice or `--jq` miss can't silently write garbage
provenance into the header.

I3: preserve fetch/parse error context. fetchManifest wraps JSON.parse
and Zod schema errors with catalog identity and (for JSON) a body
preview. Top-level catch prints CommandExecutionError.stderr (the real
"HTTP 403 rate-limit" / "not authenticated" reason) and ZodError
issue details for errors raised outside fetchManifest.

S-items deferred per reviewer TL;DR. Committed seed.yaml header patched
in place to match the new buildHeader output (without re-running the
import, since upstream drifted to a duplicate-name state since the PR
opened — unrelated to this fix).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@ejdutton

ejdutton commented Jun 5, 2026

Collaborator Author

@jdutton — review fixes landed in b02b058. All five gates addressed (C1, C2, I1, I2, I3); S-items deferred per your TL;DR.

C1: corpus/seed.yaml is now guarded by two new exported helpers wired into run():

  • assertCatalogNonEmpty(catalog, count, allowShrink) — throws if either upstream catalog returns 0 plugins.
  • assertNoCatastrophicShrinkage(existing, new, allowShrink) — throws when the new count would drop more than 20% vs. the existing seed. Bootstrap (existing === 0) is always allowed.
  • New --allow-shrink CLI flag bypasses both gates for the rare real-shrinkage case; parseRunArgs is exported and tested.
  • Both new helpers covered: 20% boundary exact-allowed, >20% throws, bypass with flag, error message carries triage values. 14 new tests, 37 total in this file.
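
The two guards described above might look roughly like this. The helper names and parameter order follow the comment; the bodies, error messages, and integer-percentage comparison are assumptions (the second parameter is called `newCount` here because `new` is a reserved word).

```typescript
const MAX_SHRINK_PERCENT = 20;

/** Refuse a catalog that returned no plugins, unless shrinkage is explicitly allowed. */
function assertCatalogNonEmpty(catalog: string, count: number, allowShrink: boolean): void {
  if (allowShrink || count > 0) return;
  throw new Error(
    `${catalog} returned 0 plugins; refusing to overwrite seed.yaml (pass --allow-shrink to override)`,
  );
}

/** Refuse a drop of more than 20% versus the existing seed. Bootstrap (0 existing) is allowed. */
function assertNoCatastrophicShrinkage(
  existingCount: number,
  newCount: number,
  allowShrink: boolean,
): void {
  if (allowShrink || existingCount === 0) return;
  const dropped = existingCount - newCount;
  // Integer comparison keeps the exact-20% boundary free of float rounding.
  if (dropped * 100 > existingCount * MAX_SHRINK_PERCENT) {
    throw new Error(
      `seed would shrink from ${existingCount} to ${newCount} entries ` +
        `(more than ${MAX_SHRINK_PERCENT}%); pass --allow-shrink to override`,
    );
  }
}
```

Under these assumptions a collapse from 238 to 32 entries throws, while a drop from 100 to exactly 80 passes.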

C2: PluginEntry now matches the canonical PluginEntrySchema (bucket: 'official' | 'community', confidence: 'first-party' | 'curated' | 'listed', maturity: 'production' | 'experimental' | 'example'). The three as PluginEntry[...] casts in partitionPreserved are gone. mapEntry still emits the narrow official / production literals; nothing changes at runtime, but the in-memory type now matches the file's stated bucket so the public-shaming gate trusts what it sees.
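
A sketch of what the widened type buys, under assumptions: the three unions are the ones listed above, while the `name` and `source` fields, the `mapEntry` signature, and the `allowsNamedFindings` gate are hypothetical stand-ins.

```typescript
type Bucket = "official" | "community";
type Confidence = "first-party" | "curated" | "listed";
type Maturity = "production" | "experimental" | "example";

interface PluginEntry {
  name: string; // assumed field
  source: string; // assumed field
  bucket: Bucket;
  confidence: Confidence;
  maturity: Maturity;
}

/** Freshly mapped upstream entries still get the narrow literals (signature assumed). */
function mapEntry(
  name: string,
  source: string,
  confidence: "first-party" | "curated",
): PluginEntry {
  return { name, source, bucket: "official", confidence, maturity: "production" };
}

/** Hypothetical gate: only official entries may appear in named findings. */
function allowsNamedFindings(entry: PluginEntry): boolean {
  return entry.bucket === "official";
}

// A preserved hand-curated entry keeps its real bucket with no `as` cast,
// so the gate sees "community" in the type as well as at runtime.
const preserved: PluginEntry = {
  name: "example-community-plugin",
  source: "https://example.com/plugin.git",
  bucket: "community",
  confidence: "listed",
  maturity: "experimental",
};
```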

I1: buildHeader no longer claims per-entry validation: blocks (the importer throws on them) and the SHA language is rewritten: the catalog SHAs in the header are the audit provenance of this importer run, while entry source URLs pin a fragment ref (typically the default branch), not a per-entry commit SHA. The module-level "byte-identical across runs" comment is narrowed too. The committed corpus/seed.yaml header is patched in place to match; I couldn't re-run the importer for this PR because upstream knowledge-work drifted to a duplicate-name state (knowledge-work-ai-firstify) since this PR opened, which is correctly caught by assertUniqueNames but would expand scope here. A follow-up import will refresh the file naturally.

I2: fetchCatalogSha now asserts /^[0-9a-f]{40}$/ before .slice(0,7) and includes the first 200 chars of an unexpected response in the error message.
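
The validation step can be sketched as a pure function over the raw command output. The function name and error wording are hypothetical; the regex, the 7-character slice, and the 200-character preview follow the description above.

```typescript
/** Validate raw output as a full 40-char hex SHA, then return its 7-char prefix. */
function shortCatalogSha(raw: string): string {
  const sha = raw.trim();
  if (!/^[0-9a-f]{40}$/.test(sha)) {
    // Show a bounded preview so a stray notice or empty output is easy to triage.
    throw new Error(
      `expected a 40-char hex commit SHA, got: ${JSON.stringify(sha.slice(0, 200))}`,
    );
  }
  return sha.slice(0, 7);
}
```

Validating the full SHA before slicing means an empty string or a warning line fails loudly instead of becoming header provenance.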

I3: fetchManifest wraps JSON.parse and ManifestSchema.parse with catalog identity (owner/name) and, for JSON, a body preview. The top-level catch surfaces CommandExecutionError.stderr (the real rate-limit / auth message) and prints ZodError.issues for errors raised outside fetchManifest (e.g. loadExistingSeed).
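
A rough sketch of parse errors that carry catalog identity. To stay dependency-free it swaps the Zod ManifestSchema for a hand-written shape check, so it is an illustration rather than the PR's implementation. The function name, the `Manifest` shape, and the 200-character preview length are assumptions.

```typescript
interface Manifest {
  plugins: unknown[];
}

/** Parse a manifest body, attaching the catalog identity to any failure. */
function parseManifestBody(catalog: string, body: string): Manifest {
  let parsed: unknown;
  try {
    parsed = JSON.parse(body);
  } catch (err) {
    const reason = err instanceof Error ? err.message : String(err);
    throw new Error(
      `${catalog}: manifest is not valid JSON (${reason}); ` +
        `body starts with: ${JSON.stringify(body.slice(0, 200))}`,
    );
  }
  // Stand-in for schema validation: require a top-level `plugins` array.
  if (
    typeof parsed !== "object" ||
    parsed === null ||
    !Array.isArray((parsed as { plugins?: unknown }).plugins)
  ) {
    throw new Error(`${catalog}: manifest has no "plugins" array`);
  }
  return parsed as Manifest;
}
```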

Local validation green across all three phases on the merged tree. Integration test audit-packaging-shapes flaked once at 60s and passed on retry — surfaced as a non-introduced flake in vibe-validate's flake-detection output. Unrelated to this change.

🤖 Generated with Claude Code

@jdutton jdutton left a comment

Owner

✅ Approving — all requested changes landed and verified against the diff (commit b02b058a).

Blockers cleared:

  • C1 — silent seed shrinkage now guarded: assertCatalogNonEmpty (refuses a 0-plugin catalog) + assertNoCatastrophicShrinkage (refuses a >20% drop vs. the existing seed), both wired into run() before the write, with --allow-shrink as the deliberate escape hatch. Well unit-tested — exact-20% boundary, fractionally-above, bootstrap existing=0, bypass, and error-message content all covered.
  • C2 — the as PluginEntry[...] type lie is gone: PluginEntry is widened to the full canonical union and the three casts in partitionPreserved are now direct assignments, while mapEntry still emits the narrow literals. The static type at the bucket gate is now honest.

Important items folded in: I1 (header/JSDoc claims narrowed; committed seed.yaml header matches the new buildHeader() output), I2 (fetchCatalogSha validates 40-char hex before slicing), I3 (fetch/parse errors now carry catalog identity + CommandExecutionError.stderr / Zod issues).

CI is green on the new head (ubuntu + windows, Sonar, codecov patch & project). S-items deferred as agreed.

One non-blocking note for next time: the commit mentions upstream has drifted to a duplicate-name state, so a future clean re-import will exercise the name-munging path rather than reproduce this seed byte-for-byte — worth a glance when you next regenerate. Ship it.

Resolves the CHANGELOG.md [Unreleased] conflict against #120 (linkAuth,
now on main). Beyond the union, consolidated the two redundant
empirical-compat-harness Internal bullets (v1 scaffold + v2 foundations,
same subsystem/same release cycle) into one coherent entry preserving
all facts, and ordered Internal as features → compat-empirical cluster.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@sonarqubecloud

sonarqubecloud Bot commented Jun 5, 2026


@jdutton jdutton merged commit 5d6f2ff into main Jun 5, 2026
7 checks passed
@jdutton jdutton deleted the feat/corpus-seed-1b branch June 5, 2026 19:46