feat(corpus): expand seed.yaml via marketplace importer (issue #99 slice 1b) #117
Imports all unique plugins from anthropics/claude-plugins-official (205 of 209 raw entries kept) and anthropics/knowledge-work-plugins (30 of 60 — the latter is ~50% mirror entries of the former) via a new committed script at packages/dev-tools/src/import-marketplace.ts (bun run import-marketplace). Maps each upstream entry to a PluginEntry, deduplicates by source URL (preserved VAT-owned entries always win; otherwise alphabetical-first-name wins), and rewrites corpus/seed.yaml with a provenance header. URL composition handles all five upstream source shapes; confidence is URL-based (anthropics owner → first-party, else curated; ./partner-built/ override → curated). Issue #99 slice 1b — follows PR #111 (slice 1a). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
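The URL-based confidence rule described in this commit can be sketched roughly as follows. This is a simplified illustration, not the importer's code: the `UpstreamSource` shape, the `githubOwner` helper, and the choice to apply the `./partner-built/` override to every string-shape source are assumptions made for the example.

```typescript
type Confidence = 'first-party' | 'curated';

// Hypothetical, simplified upstream source: either a catalog-relative path
// string or an object that may carry a repository URL.
type UpstreamSource = string | { url?: string };

/** Extract the owner from an https://github.com/<owner>/<repo> URL, if present. */
function githubOwner(url: string): string | undefined {
  const match = /^https:\/\/github\.com\/([^/]+)\//.exec(url);
  return match?.[1];
}

function deriveConfidence(source: UpstreamSource): Confidence {
  if (typeof source === 'string') {
    // Override: partner-built plugins are vetted third-party code.
    return source.startsWith('./partner-built/') ? 'curated' : 'first-party';
  }
  // No URL to inspect: fall through to the more cautious value.
  if (source.url === undefined) return 'curated';
  return githubOwner(source.url) === 'anthropics' ? 'first-party' : 'curated';
}
```

Keeping the rule a pure function of the source makes each branch, including the no-URL fallthrough, easy to cover with a one-line test.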
Replaces the hardcoded PRESERVED_ENTRIES list with a structural partition: read corpus/seed.yaml at the start of the import, treat as "preserved" any entry whose source URL isn't one the importer would generate this run. Fixes the self-review issue that the hardcoded list would silently erase a third hand-added entry or any validation override on a kept catalog entry. Throws explicitly if a preserved entry carries a validation block, since stringifyEntries doesn't yet serialize them. Slice 1b has none. combineAndDedupe now takes preserved as an explicit parameter rather than closing over module state — easier to test, no hidden dep. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
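A minimal sketch of the structural partition described above, assuming entries are identified by their `source` URL. The `SeedEntry` shape and the `generatedSources` parameter are hypothetical; only the rule itself (keep what this run would not regenerate, throw on a `validation` block) comes from the commit message.

```typescript
interface SeedEntry {
  name: string;
  source: string;
  validation?: unknown; // hypothetical shape; only its presence matters here
}

/**
 * Return the existing entries that this run will NOT regenerate.
 * Those are treated as hand-curated and re-emitted verbatim.
 */
function partitionPreserved(
  existing: readonly SeedEntry[],
  generatedSources: ReadonlySet<string>,
): SeedEntry[] {
  const preserved = existing.filter((entry) => !generatedSources.has(entry.source));
  for (const entry of preserved) {
    if (entry.validation !== undefined) {
      // The serializer cannot write validation blocks yet, so fail loudly
      // instead of silently dropping the override.
      throw new Error(
        `Preserved entry "${entry.name}" carries a validation block that cannot be serialized yet`,
      );
    }
  }
  return preserved;
}
```

Because the function takes both inputs as parameters, a test can exercise it without touching the file system or module state.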
Codecov Report
✅ All modified and coverable lines are covered by tests.
Additional details and impacted files
@@ Coverage Diff @@
## main #117 +/- ##
=======================================
Coverage 81.78% 81.78%
=======================================
Files 223 223
Lines 17135 17135
Branches 3326 3326
=======================================
Hits 14014 14014
Misses 3121 3121
jdutton
left a comment
@ejdutton — thorough multi-pass review (general / tests / error-handling / type-design / comments). Nice work overall: conventions-clean (Postel's law correct, safePath/safeExecSync right, no pre-1.0 cruft), mapping primitives well-tested against real upstream shapes, and no happy-path correctness bugs. Requesting changes for two items below; the rest are fair game for a follow-up.
✅ Cleared a suspected bug first: the composeSourceUrl empty-ref output (...git#:plugin) was traced through parseGitUrl → splitFragment → buildParsed — it parses correctly (empty ref dropped, subpath preserved). Not a bug. But nothing tests that cross-package contract (see S3).
On your two ratification asks: both are product calls, not code issues, and the code implements each cleanly — I'm fine with everything→official bucket (confidence carries the third-party-but-vetted axis) and with source-URL dedup (seed = unique audit targets). Will reply separately if I want the alias-aliasing map later; ship as-is.
🔴 Fix before merge
C1 — No guard against catastrophic seed shrinkage (silent data loss). writeFileSync overwrites seed.yaml unconditionally. If a catalog temporarily serves plugins: [] or a handful (mid-deploy, bad upstream commit), every step "succeeds" — an empty array is schema-valid — and the seed silently collapses 238 → ~32, committed on a green ✓ Wrote. This is the one place the script can produce a quietly-wrong committed artifact. Assert each catalog returned >0 plugins, and refuse a >20% drop vs. the existing file unless --allow-shrink.
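A minimal sketch of the guard C1 asks for, assuming the importer knows the existing and new entry counts before it writes. The function names match those used later in this thread; the bodies, the `allowShrink` parameter, and the integer threshold arithmetic are illustrative assumptions rather than the PR's code.

```typescript
// Refuse a drop of MORE than 20%; an exact 20% drop is still allowed.
const MAX_SHRINK_PERCENT = 20;

function assertCatalogNonEmpty(catalog: string, pluginCount: number, allowShrink = false): void {
  if (allowShrink) return;
  if (pluginCount === 0) {
    throw new Error(
      `${catalog} returned 0 plugins; refusing to overwrite seed.yaml (pass --allow-shrink to override)`,
    );
  }
}

function assertNoCatastrophicShrinkage(
  existingCount: number,
  newCount: number,
  allowShrink = false,
): void {
  // Bootstrap case: an empty existing seed gives nothing to compare against.
  if (allowShrink || existingCount === 0) return;
  const lost = existingCount - newCount;
  // Integer comparison avoids floating-point surprises at the exact boundary.
  if (lost * 100 > existingCount * MAX_SHRINK_PERCENT) {
    throw new Error(
      `seed.yaml would shrink from ${existingCount} to ${newCount} entries ` +
        `(more than ${MAX_SHRINK_PERCENT}%); pass --allow-shrink if this is intended`,
    );
  }
}
```

Both checks would need to run before the write so that a failed assertion leaves the committed file untouched.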
C2 — Type lie in partitionPreserved casts, colliding with the no-public-shaming gate. e.bucket as PluginEntry['bucket'] forces the wide ExistingPluginEntry (community/listed/experimental) into the narrow PluginEntry (official/production). Harmless at runtime today (only the 2 first-party VAT entries preserved; stringifyEntries writes the real value), but the in-memory type now misrepresents itself — and bucket: official vs community is exactly the "named findings vs aggregate-only" gate. The moment a community entry is preserved, any code trusting the static type is silently wrong. Cheap fix: widen the importer's pipeline type to match canonical PluginEntrySchema — mapEntry still assigns narrow literals, the three as casts just disappear. (Flagged independently by both the general and type-design passes.)
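To illustrate the fix suggested in C2, here is a small sketch in which the pipeline type carries the full unions, so preserved entries need no cast while freshly mapped entries still get the narrow literals. The union members are taken from this thread; `preserve` and `allowsNamedFindings` are hypothetical helpers added for the example, and the `mapEntry` signature is simplified.

```typescript
type Bucket = 'official' | 'community';
type Confidence = 'first-party' | 'curated' | 'listed';
type Maturity = 'production' | 'experimental' | 'example';

interface PluginEntry {
  name: string;
  source: string;
  bucket: Bucket;
  confidence: Confidence;
  maturity: Maturity;
}

// Freshly mapped upstream entries still receive only the narrow literals.
function mapEntry(
  name: string,
  source: string,
  confidence: 'first-party' | 'curated',
): PluginEntry {
  return { name, source, bucket: 'official', confidence, maturity: 'production' };
}

// Preserved entries are copied as-is; with the wide type no `as` cast is needed.
function preserve(existing: PluginEntry): PluginEntry {
  return { ...existing };
}

// Hypothetical gate: only official entries may appear in named findings.
function allowsNamedFindings(entry: PluginEntry): boolean {
  return entry.bucket === 'official';
}
```

With the wide type, a preserved `community` entry reaches the gate with its real bucket, so the static type and the runtime value agree.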
🟡 Important (fold in if convenient — all a few lines)
I1 — Idempotency claim is false across days + generated header oversells pinning. buildHeader embeds new Date() + volatile catalog SHAs, so "byte-identical re-run" holds only same-day with unchanged HEADs — a future --check drift mode would false-fail on the date alone. Separately, the header written into seed.yaml says "SHAs reflect upstream state" and "each entry can carry a validation: block" — but entries track #main:/default-branch HEAD (no per-entry SHA pinning) and the importer throws rather than emit a validation block. The generated file documents capabilities its own generator won't produce. Narrow both claims.
I2 — fetchCatalogSha trusts .trim().slice(0,7) with no validation. A gh deprecation/update notice on stdout, or empty output from a jq miss, silently writes a garbage or blank provenance SHA into the header — no crash, no downstream guard. Assert /^[0-9a-f]{40}$/ before slicing.
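The check suggested in I2 could look roughly like this. `shortSha` and its parameters are hypothetical names; the sketch assumes the raw `gh` stdout is already available as a string.

```typescript
const FULL_SHA = /^[0-9a-f]{40}$/;

/**
 * Validate raw command output as a full commit SHA, then shorten it.
 * Anything else (empty output, a notice printed to stdout) is rejected.
 */
function shortSha(rawStdout: string, catalog: string): string {
  const sha = rawStdout.trim();
  if (!FULL_SHA.test(sha)) {
    // Preview only the first 80 characters so a long notice stays readable.
    const preview = JSON.stringify(sha.slice(0, 80));
    throw new Error(`Expected a 40-char hex commit SHA for ${catalog}, got ${preview}`);
  }
  return sha.slice(0, 7);
}
```

Validating before slicing matters because `slice(0, 7)` succeeds on any string, including an empty one.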
I3 — Fetch/parse errors lose all context. fetchManifest runs JSON.parse(ghFetch(...)) with no catalog identity, so an empty/non-JSON/wrong-shape exit-0 body surfaces as a bare SyntaxError with no idea which catalog. And the top-level catch logs only err.message, discarding CommandExecutionError.stderr (the actual "HTTP 403: rate limit" / "not authenticated" reason) and Zod .issues. Wrap parses with catalog context; print err.stderr in the catch.
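One way to keep the context I3 asks for, as a hedged sketch. `parseManifest` and `describeError` are hypothetical helpers; the only assumption about the command error is that it exposes a string `stderr` property, as the comment above implies.

```typescript
/** Parse a manifest body, attaching the catalog name to any parse failure. */
function parseManifest(catalog: string, body: string): unknown {
  try {
    return JSON.parse(body);
  } catch (err) {
    const reason = err instanceof Error ? err.message : String(err);
    const preview = JSON.stringify(body.slice(0, 120));
    throw new Error(
      `Failed to parse marketplace.json from ${catalog}: ${reason} (body starts with ${preview})`,
    );
  }
}

/** Build a log message that keeps stderr when the error carries one. */
function describeError(err: unknown): string {
  if (!(err instanceof Error)) return String(err);
  const { stderr } = err as Error & { stderr?: unknown };
  if (typeof stderr === 'string' && stderr.trim().length > 0) {
    return `${err.message}\n${stderr.trim()}`;
  }
  return err.message;
}
```

A top-level catch that logs `describeError(err)` instead of `err.message` would surface the underlying HTTP or authentication reason alongside the generic failure text.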
🟢 Suggestions (follow-up OK)
- S1 — ~/code/vat-issue-99-slice-1b-plan.md in the module header (line 9) is a dead pointer into a personal home dir — unfollowable by anyone else, and per our no-spec-commits convention it'll never be in the tree. Inline the rules (the deriveConfidence / composeSourceUrl JSDoc already do most of it) and cite issue #99 by number.
- S2 — Model source as a Zod discriminated union on source.source (git-subdir / url / github, each .passthrough()). Deletes every (src as Record<string, unknown>)['ref'] cast and turns the unknown-discriminator throw into a compiler-checked exhaustiveness guard.
- S3 — Add a round-trip test feeding composeSourceUrl outputs through the real parseGitUrl (the empty-ref #:path form is load-bearing but protected by nothing across the package boundary), plus a golden/parse-back test for stringifyEntries + buildHeader (the tool's actual output, currently verified only by manual runs).
- S4 — Untested branches: deriveConfidence no-url → curated fallthrough (a silent posture downgrade), assertUniqueNames duplicate-munged-name throw, and loadExistingSeed's custom missing-file message. One-liner each.
- S5 — combineAndDedupe's "alphabetical-first wins" determinism lives in the caller's .sort(), not the function. Accurate today; a future caller forgetting the sort silently churns the seed. Sort defensively inside, or assert sorted input.
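The exhaustiveness guard mentioned in S2 can be illustrated with a plain TypeScript discriminated union. Note that this swaps in native union types instead of Zod so the example stays dependency-free. The field names (`url`, `path`, `ref`, `repo`) and the URL layouts are illustrative assumptions, not the upstream manifest schema or the importer's real mapping.

```typescript
// Hypothetical source shapes, discriminated on the `source` field.
type Source =
  | { source: 'git-subdir'; url: string; path: string; ref?: string }
  | { source: 'url'; url: string; path?: string }
  | { source: 'github'; repo: string };

function composeSourceUrlSketch(src: Source): string {
  switch (src.source) {
    case 'git-subdir':
      // An absent ref yields the empty-ref "#:path" fragment form.
      return `${src.url}#${src.ref ?? ''}:${src.path}`;
    case 'url':
      return src.path === undefined ? src.url : `${src.url}#:${src.path}`;
    case 'github':
      return `https://github.com/${src.repo}.git`;
    default: {
      // If a new shape is added to Source, this assignment stops compiling.
      const unreachable: never = src;
      throw new Error(`Unknown source shape: ${JSON.stringify(unreachable)}`);
    }
  }
}
```

Each branch reads its fields without a cast, and forgetting to handle a new shape becomes a compile error rather than a runtime throw.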
Acceptable debt (no action): the hand-mirrored PluginEntry ↔ PluginEntrySchema — deliberate to avoid a dev-tools→cli dep, and loadSeedFile validates downstream. A test asserting the two enum sets match would close the residual drift risk cheaply.
TL;DR: land C1 + C2 (both small; C1 is the only path to a silently-corrupted committed seed), fold in I1–I3 if you're in there anyway, defer the S-items. 🤖 Reviewed with Claude Code.
C1: refuse to overwrite seed.yaml when either upstream catalog returned 0 plugins, or when the new entry count would drop more than 20% vs. the existing seed. New --allow-shrink flag bypasses both for the rare case where shrinkage is real (mid-deploy push, etc.).
C2: widen the in-memory PluginEntry to match canonical PluginEntrySchema (bucket includes community; confidence includes listed; maturity includes experimental/example). This removes three "as PluginEntry[...]" casts in partitionPreserved that silently coerced wider preserved entries into the narrow shape — a footgun at the public-shaming gate the moment any community entry got preserved. mapEntry still emits the narrow literals for freshly-mapped upstream entries.
I1: narrow the generated seed.yaml header — drop the per-entry `validation:` claim (the importer throws on validation blocks today) and rewrite the SHA language to clarify that entry `source` URLs pin a fragment ref (typically default branch), not a per-entry commit SHA. The catalog SHAs in the header are this run's audit provenance, not per-entry pinning. Also narrow the module-level "byte-identical" claim, which holds only same-day with unchanged upstream HEADs.
I2: validate fetchCatalogSha output as 40-char hex before slicing, so a `gh` deprecation notice or `--jq` miss can't silently write garbage provenance into the header.
I3: preserve fetch/parse error context. fetchManifest wraps JSON.parse and Zod schema errors with catalog identity and (for JSON) a body preview. Top-level catch prints CommandExecutionError.stderr (the real "HTTP 403 rate-limit" / "not authenticated" reason) and ZodError issue details for errors raised outside fetchManifest.
S-items deferred per reviewer TL;DR.
Committed seed.yaml header patched in place to match the new buildHeader output (without re-running the import, since upstream drifted to a duplicate-name state since the PR opened — unrelated to this fix).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@jdutton — review fixes landed in b02b058. All five gates addressed (C1, C2, I1, I2, I3); S-items deferred per your TL;DR.
Local validation green across all three phases on the merged tree.
🤖 Generated with Claude Code
jdutton
left a comment
✅ Approving — all requested changes landed and verified against the diff (commit b02b058a).
Blockers cleared:
- C1 — silent seed shrinkage now guarded: assertCatalogNonEmpty (refuses a 0-plugin catalog) + assertNoCatastrophicShrinkage (refuses a >20% drop vs. the existing seed), both wired into run() before the write, with --allow-shrink as the deliberate escape hatch. Well unit-tested — exact-20% boundary, fractionally-above, bootstrap existing=0, bypass, and error-message content all covered.
- C2 — the as PluginEntry[...] type lie is gone: PluginEntry is widened to the full canonical union and the three casts in partitionPreserved are now direct assignments, while mapEntry still emits the narrow literals. The static type at the bucket gate is now honest.
Important items folded in: I1 (header/JSDoc claims narrowed; committed seed.yaml header matches the new buildHeader() output), I2 (fetchCatalogSha validates 40-char hex before slicing), I3 (fetch/parse errors now carry catalog identity + CommandExecutionError.stderr / Zod issues).
CI is green on the new head (ubuntu + windows, Sonar, codecov patch & project). S-items deferred as agreed.
One non-blocking note for next time: the commit mentions upstream has drifted to a duplicate-name state, so a future clean re-import will exercise the name-munging path rather than reproduce this seed byte-for-byte — worth a glance when you next regenerate. Ship it.
Resolves the CHANGELOG.md [Unreleased] conflict against #120 (linkAuth, now on main). Beyond the union, consolidated the two redundant empirical-compat-harness Internal bullets (v1 scaffold + v2 foundations, same subsystem/same release cycle) into one coherent entry preserving all facts, and ordered Internal as features → compat-empirical cluster. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>



Summary
corpus/seed.yaml grows from 9 → ~238 entries via a new committed importer at packages/dev-tools/src/import-marketplace.ts (bun run import-marketplace). The script fetches .claude-plugin/marketplace.json from both anthropics catalogs and rewrites the seed.
What this PR changes
- corpus/seed.yaml
- packages/dev-tools/src/import-marketplace.ts (new): the importer (bun run import-marketplace)
- packages/dev-tools/test/import-marketplace.test.ts (new)
- package.json: import-marketplace script alias
- CHANGELOG.md: ### Internal
Mapping rules
- bucket: official for all imports (both catalogs are anthropics-curated marketplaces — bucket is the reporting posture per slice 1a, not code provenance).
- confidence: URL-based (… author). String-shape sources → first-party; object-shape sources where the resolved GitHub owner is anthropics → first-party; everything else → curated. Override: knowledge-work ./partner-built/ → curated.
- maturity: production for all. Defer maturity heuristics to a later slice.
- name: name verbatim, with knowledge-work entries prefixed knowledge-work- (collision-avoidance per slice 1a). Names containing characters outside [A-Za-z0-9_-] are regex-munged (1 munge in the current run: wordpress.com → wordpress-com).
- source URL: composed from all five upstream source shapes (git-subdir ± ref, url ± path, github). 108+ entries don't ship with commit pinning because the upstream manifest doesn't give us a ref for them and the audit clone's --depth 1 mode can't take raw SHAs as refs; they track each external repo's default branch HEAD. SHA-pinning is a followup.
- Preserved entries: any existing entry whose source URL isn't going to be regenerated this run is treated as hand-curated and re-emitted verbatim. (Today: the 2 VAT-owned entries.) Throws loudly if a preserved entry carries a validation: block — stringifyEntries doesn't serialize those yet.
Ratification asks
1. Bucket mapping — everything official
Slice 1a established bucket as the reporting posture — official entries allow named findings, community is aggregate-only in followup work. Per that framing, every imported entry is bucket: official since both catalogs live under anthropics/. The confidence axis carries the third-party-but-vetted distinction.
Push back if you'd rather external_plugins/-equivalents (url/github-shape entries pointing at non-anthropics repos) go to bucket: community instead. Easy schema-free flip if you do.
2. Source-URL deduplication — drops 35 presentation-name aliases
The upstream catalogs intentionally list the same plugin under multiple presentation names. Across the two catalogs, 33 source URLs are claimed by 2+ upstream entries (~25% of raw imports). Examples: data, data-engineering, and astronomer-data-agents all resolve to github.com/astronomer/agents.git. The knowledge-work catalog turns out to be ~50% mirror entries of the official catalog (30 of 60).
loadSeedFile() treats source as the unique key (it throws on dupes), so we have to deduplicate somewhere. This PR dedupes by source URL — preserved entries always win; within imports, alphabetical-first-name wins. The 35 dropped aliases are documented in the importer's stdout summary on each run.
Trade-off to flag: a user who installs data from the Claude Code marketplace UI and then reads a future corpus-scan report won't find data in it — only astronomer-data-agents. We've effectively decided "the seed represents unique audit targets" rather than "the seed represents every plugin a user can install." That's a real semantic shift. Alternatives if you'd rather not ship that:
- Change loadSeedFile to allow duplicate sources (schema-style cascade)
- A name → canonical-source aliasing map for report rendering
Happy to take this in a different direction.
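The source-URL deduplication from ask 2 can be sketched as follows. This is an illustration under assumptions: the `Entry` shape and the `{ kept, dropped }` return value are hypothetical, and the sketch sorts internally (the variant suggested in S5 of the review) instead of relying on the caller.

```typescript
interface Entry {
  name: string;
  source: string;
}

function combineAndDedupe(
  preserved: readonly Entry[],
  imported: readonly Entry[],
): { kept: Entry[]; dropped: Entry[] } {
  const bySource = new Map<string, Entry>();
  // Preserved entries claim their source URL first, so they always win.
  for (const entry of preserved) bySource.set(entry.source, entry);

  // Sort a copy by name (code-unit order) so the winner is deterministic
  // regardless of upstream ordering.
  const sorted = [...imported].sort((a, b) =>
    a.name < b.name ? -1 : a.name > b.name ? 1 : 0,
  );

  const dropped: Entry[] = [];
  for (const entry of sorted) {
    if (bySource.has(entry.source)) dropped.push(entry);
    else bySource.set(entry.source, entry);
  }
  return { kept: Array.from(bySource.values()), dropped };
}
```

With the three aliases from the example above, `astronomer-data-agents` sorts first and is kept, while `data` and `data-engineering` end up in `dropped`, which is the list a run summary could print.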
Counts at import time
Upstream marketplaces update minutely; these are a snapshot:
- anthropics/claude-plugins-official@6f90371 — 211 upstream entries → 206 kept
- anthropics/knowledge-work-plugins@8785e40 — 60 upstream entries → 30 kept
Followups (out of scope)
- SHA-pinned source URLs (slice 1b uses #main: for catalog-internal entries and omits the ref for external pointers; the manifest carries SHAs we voluntarily ignore until git-url-clone.ts learns to clone-then-checkout). Could fold into slice 2a or its own micro-slice.
- A --check mode for the importer that exits non-zero if corpus/seed.yaml would change. Lets CI catch silent drift between commits and upstream.
- The docs/superpowers/specs/2026-05-01-corpus-scan-phase-1-design.md reference in the seed header (was a forward reference to a doc that doesn't exist; my header rewrite drops it).
- validation: serialization in stringifyEntries — currently the importer throws if a preserved entry carries one. The first slice that adds a validation: override will need to extend the stringifier.
Test plan
- bun run validate (full — 3.5 min, all phases green)
- Unit tests (mungeName, composeSourceUrl × 6 shapes, deriveConfidence × 5 cases, mapEntry, combineAndDedupe × 5 cases, partitionPreserved × 3 cases) — 25 tests
- Ran bun run import-marketplace, loaded the output via the canonical loadSeedFile, verified 238 entries pass schema + source-uniqueness + name-uniqueness checks
🤖 Generated with Claude Code