jdutton · Jun 5, 2026 · Jun 3, 2026
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -10,8 +10,8 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 ### Internal
 
 - **linkAuth pure engine foundation (issue #113, slice 1).** Adds a config-driven engine for authenticated external URL resolution, scoped to the pure-logic layer with no consumer wiring yet (the `ExternalLinkValidator` integration and the `LINK_AUTH_*` `CODE_REGISTRY` entries are slice 2; the content-fetch primitive is slice 3). New `link-auth/` module under `@vibe-agent-toolkit/utils` with eight files: `transforms.ts` (closed allowlist — `base64url`, `urlencode`, `lower` — with `Object.hasOwn`-based prototype-chain defense), `template.ts` (tiny `${name}` / `${transform(name)}` renderer, deliberately separate from the Handlebars renderer in `utils/template.ts`), `rewrite.ts` (ordered `when → vars → to` pipeline with fragment/query stripping per design §5.2), `build-headers.ts` (header rendering plus structural `Authorization` redaction), `select-provider.ts` (host-glob matching via picomatch with `excludeHost`), `expand-macro.ts` (YAML loader + deep-merge expander), `resolve-token.ts` (ordered env / `safeExecResult`-backed argv-command sources, first-non-empty wins, no shell), and `resolve.ts` (the public `resolveAuthenticatedUrl(url, config)` entry returning one of `{fetchUrl, headers}` / `{outcome: 'unsupported'}` / `{outcome: 'unverified', reason}`). Ships the `github` and `sharepoint` macros as a YAML data asset (`src/link-auth/macros.yaml`), with a new cross-platform `packages/dev-tools/src/copy-yaml-assets.ts` post-build step bundling `.yaml` into `dist/` — first YAML-asset shipping pattern in the utils package. Adds `yaml` as a utils dependency. Companion Zod schema in `@vibe-agent-toolkit/resources` (`src/schemas/link-auth.ts`) validates the `resources.linkAuth` config block (strict; accepts either `{ use: <macro>, ...overrides }` or full inline providers), wired as an optional field on `ResourcesConfigSchema`. 
140 unit tests in utils + 29 schema tests in resources, all pure-logic with no network or filesystem dependencies; security-load-bearing tests pin the closed-allowlist guarantee, the `${__proto__}` bypass defense, the token-never-leaks-into-Authorization invariant, and `shell: false` literal-argv handling.
-- **Empirical compatibility harness (`packages/dev-tools/src/compat-empirical/`).** Per-#100 research scaffold: a CLI (`predict`/`run`/`judge`/`report`/`all`) that runs candidate skills against claude-code, claude-cowork, and claude-chat, then joins VAT's static predictions with deterministic runtime observations and an LLM-judge semantic read into a reality-vs-prediction matrix. The output is an evidence artifact a follow-up PR will draw on to propose detector improvements, each citing specific (skill, runtime) cells. No detector code or `RUNTIME_PROFILES` changes here. Lives in the private `@vibe-agent-toolkit/dev-tools` package no adopter-facing surface.
-- **Empirical compat harness v2 (`packages/dev-tools/src/compat-empirical/`).** Foundations PR per [the v2 design](./docs/research/2026-05-23-compat-empirical-harness-v2-design.md). Probe coverage: multi-prompt + repeat-N with adaptive N=3→N=5 extension, mandatory positive+negative prompt pairing per corpus entry, and negative-prompt agreement inversion so false-positive triggers surface as `vat-optimistic`. Evidence quality: deterministic class widened from 6 to 9 values (splitting `error` into `install-failed`/`runtime-error`, `not-invoked` into `not-invoked-engaged`/`not-invoked-empty`, adding `refused`), judge prompt rewritten to v2 with a `refused` verdict. Report fidelity: coverage stats, per-bucket headline (own/official/community × ran/agree/optimistic/pessimistic/gray-zone), gray-zone (mixed-signal) and high-variance subsections, per-attempt variance rendered inline (`runtime-error (2/3) / failed (3/3)`). Judge replay: persisted `judge-calls/<skillId>-<promptId>-<target>-<attemptIdx>.json` artifacts plus a new `re-judge` subcommand that re-executes them against an optionally different model or freshly-edited system prompt without re-spending operator hours on the runtime side. Two PR-#108 deferred bug fixes also landed: `git fetch --tags --force` before named-ref fetch (annotated tag refresh) and `setup()` teardown-first idempotency for the manual driver. Still private to `@vibe-agent-toolkit/dev-tools`; corpus authoring, first real run, and the docs deliverable are the downstream work.
+- **Corpus seed expanded from 9 → 237 entries via a new committed importer at `packages/dev-tools/src/import-marketplace.ts` (`bun run import-marketplace [--allow-shrink]`).** The script fetches `.claude-plugin/marketplace.json` from `anthropics/claude-plugins-official` (205 of 209 raw entries kept) and `anthropics/knowledge-work-plugins` (30 of 60 — the knowledge-work catalog turns out to be ≈50% mirror entries of the official catalog) via `gh api`, maps each upstream entry to a `PluginEntry`, deduplicates by `source` URL (preserved VAT-owned entries always win; otherwise alphabetical-first-name wins within each duplicate cluster), and rewrites `corpus/seed.yaml`. Mapping rules: `bucket: official` uniformly (both catalogs are anthropics-curated marketplaces — `bucket` is the *reporting posture* per slice 1a, not code provenance); `confidence: first-party` for catalog-internal string sources and `github.com/anthropics/...` object sources, else `curated`; the `./partner-built/` knowledge-work convention overrides to `curated`; `maturity: production` for all entries. URL composition handles all five upstream source shapes (string, `git-subdir` ± `ref`, `url` ± `path`, `github`), throwing on unknown discriminators. The seven sample entries from slice 1a are regenerated from upstream manifests on every re-import. Re-import safety: the importer refuses to overwrite `corpus/seed.yaml` if either upstream catalog returned 0 plugins or the new entry count would drop more than 20% vs. the existing seed; `--allow-shrink` bypasses both gates for the rare case where shrinkage is real. The generated `seed.yaml` header dropped its earlier per-entry `validation:` claim (the importer throws on validation blocks today) and now states explicitly that entry `source` URLs pin a fragment ref (typically the default branch), not a per-entry commit SHA — the catalog SHAs in the header are this run's audit provenance. Issue #99 slice 1b — follows the schema change from PR #111 (slice 1a).
+- **Empirical compatibility harness (`packages/dev-tools/src/compat-empirical/`).** Per-#100 research scaffold for measuring skill compatibility across `claude-code`, `claude-cowork`, and `claude-chat`: a CLI (`predict`/`run`/`judge`/`report`/`all`) that joins VAT's static predictions with deterministic runtime observations and an LLM-judge semantic read into a reality-vs-prediction matrix — an evidence artifact for proposing detector improvements that each cite specific (skill, runtime) cells. Probe coverage: multi-prompt + repeat-N with adaptive N=3→N=5 extension, mandatory positive+negative prompt pairing per corpus entry, and negative-prompt agreement inversion so false-positive triggers surface as `vat-optimistic`. Evidence quality: the deterministic class is widened from 6 to 9 values (splitting `error` into `install-failed`/`runtime-error`, `not-invoked` into `not-invoked-engaged`/`not-invoked-empty`, adding `refused`), with a v2 judge prompt that adds a `refused` verdict. Report fidelity: coverage stats, per-bucket headline (own/official/community × ran/agree/optimistic/pessimistic/gray-zone), gray-zone (mixed-signal) and high-variance subsections, and per-attempt variance rendered inline (`runtime-error (2/3) / failed (3/3)`). Judge replay persists `judge-calls/<skillId>-<promptId>-<target>-<attemptIdx>.json` artifacts that a new `re-judge` subcommand re-executes against an optionally different model or freshly-edited system prompt — without re-spending operator hours on the runtime side. Also landed: `git fetch --tags --force` before named-ref fetch (annotated tag refresh) and `setup()` teardown-first idempotency for the manual driver. No detector code or `RUNTIME_PROFILES` changes; lives entirely in the private `@vibe-agent-toolkit/dev-tools` package with no adopter-facing surface. Design: [the v2 harness design](./docs/research/2026-05-23-compat-empirical-harness-v2-design.md). 
Corpus authoring, the first real run, and the docs deliverable are the downstream work.
 - **Cowork driver spike.** Added [`docs/contributing/cowork-driver-spike.md`](docs/contributing/cowork-driver-spike.md) — a time-boxed investigation (per §4a of the harness v2 design) of whether `claude-cowork` can be driven programmatically by the empirical compat harness today. Verdict: **not feasible**; cowork is a Claude Desktop app product with no public API/CLI surface. The `claude-cowork` runtime stays on `scripted-assisted` until Anthropic ships a Cowork CLI mode, Sessions API, or documented filesystem-import path. Adjacent finding (not a cowork replacement): the public-beta Skills API (`POST /v1/skills` + `container.skills[]` on `/v1/messages`) supports a fully-automatable *new* runtime — captured in the spike doc as a potential follow-up, gated on a separate design decision.
 - **Subscription-only compat harness billing.** The harness now bills a Claude Pro/Max subscription instead of the API: both token-consuming surfaces (the `claude-code` runtime driver and the LLM judge) route through one shared `claude` CLI invoker (`runtimes/shared/claude-cli.ts`) that injects the operator's `CLAUDE_CODE_OAUTH_TOKEN` and deletes every API credential from the child env, so the CLI cannot fall back to API billing. The operator's own token is sourced at preflight — env var if set, otherwise an interactive prompt — so a run only ever spends the operator's personal plan. The judge was migrated off `@anthropic-ai/sdk` (dependency removed) onto the CLI, parsing a strict JSON verdict with one retry instead of the SDK's forced-tool call (`judge-system.md` now asks for a JSON object). `RunMetadata` gains `authMode` and the report methodology discloses subscription auth + parsed-not-forced verdicts. Premise (zero API billing under the OAuth token) still pending the manual smoke test.
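
The credential scrubbing described in the subscription-billing entry can be sketched as a pure function that builds the child environment. This is a minimal illustration, not the code in `runtimes/shared/claude-cli.ts`: the function name `buildSubscriptionEnv` and the contents of `API_CREDENTIAL_VARS` are assumptions, since the changelog says only that "every API credential" is deleted and does not list the variable names.

```typescript
// Hypothetical list of API credential variables to strip. The changelog does
// not enumerate them; these two names are assumptions for illustration.
const API_CREDENTIAL_VARS: readonly string[] = [
  "ANTHROPIC_API_KEY",
  "ANTHROPIC_AUTH_TOKEN",
];

type Env = Record<string, string | undefined>;

// Returns a copy of the parent environment with the OAuth token injected and
// every listed API credential removed, so a child process that receives this
// environment has no API key to fall back on.
function buildSubscriptionEnv(parentEnv: Env, oauthToken: string): Env {
  if (oauthToken.trim() === "") {
    throw new Error("CLAUDE_CODE_OAUTH_TOKEN must be non-empty");
  }
  // Copy first so the caller's environment object is never mutated.
  const childEnv: Env = { ...parentEnv };
  for (const name of API_CREDENTIAL_VARS) {
    delete childEnv[name];
  }
  childEnv["CLAUDE_CODE_OAUTH_TOKEN"] = oauthToken;
  return childEnv;
}
```

Under these assumptions, the returned object would be passed as the `env` option when spawning the `claude` CLI. Building the environment from a copy, rather than editing `process.env` in place, keeps the scrubbing local to the child process and makes the invariant easy to unit-test.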