Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
4 changes: 2 additions & 2 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -10,8 +10,8 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
### Internal

- **linkAuth pure engine foundation (issue #113, slice 1).** Adds a config-driven engine for authenticated external URL resolution, scoped to the pure-logic layer with no consumer wiring yet (the `ExternalLinkValidator` integration and the `LINK_AUTH_*` `CODE_REGISTRY` entries are slice 2; the content-fetch primitive is slice 3). New `link-auth/` module under `@vibe-agent-toolkit/utils` with eight files: `transforms.ts` (closed allowlist — `base64url`, `urlencode`, `lower` — with `Object.hasOwn`-based prototype-chain defense), `template.ts` (tiny `${name}` / `${transform(name)}` renderer, deliberately separate from the Handlebars renderer in `utils/template.ts`), `rewrite.ts` (ordered `when → vars → to` pipeline with fragment/query stripping per design §5.2), `build-headers.ts` (header rendering plus structural `Authorization` redaction), `select-provider.ts` (host-glob matching via picomatch with `excludeHost`), `expand-macro.ts` (YAML loader + deep-merge expander), `resolve-token.ts` (ordered env / `safeExecResult`-backed argv-command sources, first-non-empty wins, no shell), and `resolve.ts` (the public `resolveAuthenticatedUrl(url, config)` entry returning one of `{fetchUrl, headers}` / `{outcome: 'unsupported'}` / `{outcome: 'unverified', reason}`). Ships the `github` and `sharepoint` macros as a YAML data asset (`src/link-auth/macros.yaml`), with a new cross-platform `packages/dev-tools/src/copy-yaml-assets.ts` post-build step bundling `.yaml` into `dist/` — first YAML-asset shipping pattern in the utils package. Adds `yaml` as a utils dependency. Companion Zod schema in `@vibe-agent-toolkit/resources` (`src/schemas/link-auth.ts`) validates the `resources.linkAuth` config block (strict; accepts either `{ use: <macro>, ...overrides }` or full inline providers), wired as an optional field on `ResourcesConfigSchema`. 140 unit tests in utils + 29 schema tests in resources, all pure-logic with no network or filesystem dependencies; security-load-bearing tests pin the closed-allowlist guarantee, the `${__proto__}` bypass defense, the token-never-leaks-into-Authorization invariant, and `shell: false` literal-argv handling.
- **Empirical compatibility harness (`packages/dev-tools/src/compat-empirical/`).** Per-#100 research scaffold: a CLI (`predict`/`run`/`judge`/`report`/`all`) that runs candidate skills against claude-code, claude-cowork, and claude-chat, then joins VAT's static predictions with deterministic runtime observations and an LLM-judge semantic read into a reality-vs-prediction matrix. The output is an evidence artifact a follow-up PR will draw on to propose detector improvements, each citing specific (skill, runtime) cells. No detector code or `RUNTIME_PROFILES` changes here. Lives in the private `@vibe-agent-toolkit/dev-tools` package no adopter-facing surface.
- **Empirical compat harness v2 (`packages/dev-tools/src/compat-empirical/`).** Foundations PR per [the v2 design](./docs/research/2026-05-23-compat-empirical-harness-v2-design.md). Probe coverage: multi-prompt + repeat-N with adaptive N=3→N=5 extension, mandatory positive+negative prompt pairing per corpus entry, and negative-prompt agreement inversion so false-positive triggers surface as `vat-optimistic`. Evidence quality: deterministic class widened from 6 to 9 values (splitting `error` into `install-failed`/`runtime-error`, `not-invoked` into `not-invoked-engaged`/`not-invoked-empty`, adding `refused`), judge prompt rewritten to v2 with a `refused` verdict. Report fidelity: coverage stats, per-bucket headline (own/official/community × ran/agree/optimistic/pessimistic/gray-zone), gray-zone (mixed-signal) and high-variance subsections, per-attempt variance rendered inline (`runtime-error (2/3) / failed (3/3)`). Judge replay: persisted `judge-calls/<skillId>-<promptId>-<target>-<attemptIdx>.json` artifacts plus a new `re-judge` subcommand that re-executes them against an optionally different model or freshly-edited system prompt without re-spending operator hours on the runtime side. Two PR-#108 deferred bug fixes also landed: `git fetch --tags --force` before named-ref fetch (annotated tag refresh) and `setup()` teardown-first idempotency for the manual driver. Still private to `@vibe-agent-toolkit/dev-tools`; corpus authoring, first real run, and the docs deliverable are the downstream work.
- **Corpus seed expanded from 9 → 237 entries via a new committed importer at `packages/dev-tools/src/import-marketplace.ts` (`bun run import-marketplace [--allow-shrink]`).** The script fetches `.claude-plugin/marketplace.json` from `anthropics/claude-plugins-official` (205 of 209 raw entries kept) and `anthropics/knowledge-work-plugins` (30 of 60 — the knowledge-work catalog turns out to be ≈50% mirror entries of the official catalog) via `gh api`, maps each upstream entry to a `PluginEntry`, deduplicates by `source` URL (preserved VAT-owned entries always win; otherwise alphabetical-first-name wins within each duplicate cluster), and rewrites `corpus/seed.yaml`. Mapping rules: `bucket: official` uniformly (both catalogs are anthropics-curated marketplaces — `bucket` is the *reporting posture* per slice 1a, not code provenance); `confidence: first-party` for catalog-internal string sources and `github.com/anthropics/...` object sources, else `curated`; the `./partner-built/` knowledge-work convention overrides to `curated`; `maturity: production` for all entries. URL composition handles all five upstream source shapes (string, `git-subdir` ± `ref`, `url` ± `path`, `github`), throwing on unknown discriminators. The seven sample entries from slice 1a are regenerated from upstream manifests on every re-import. Re-import safety: the importer refuses to overwrite `corpus/seed.yaml` if either upstream catalog returned 0 plugins or the new entry count would drop more than 20% vs. the existing seed; `--allow-shrink` bypasses both gates for the rare case where shrinkage is real. The generated `seed.yaml` header dropped its earlier per-entry `validation:` claim (the importer throws on validation blocks today) and now states explicitly that entry `source` URLs pin a fragment ref (typically the default branch), not a per-entry commit SHA — the catalog SHAs in the header are this run's audit provenance. Issue #99 slice 1b — follows the schema change from PR #111 (slice 1a).
- **Empirical compatibility harness (`packages/dev-tools/src/compat-empirical/`).** Per-#100 research scaffold for measuring skill compatibility across `claude-code`, `claude-cowork`, and `claude-chat`: a CLI (`predict`/`run`/`judge`/`report`/`all`) that joins VAT's static predictions with deterministic runtime observations and an LLM-judge semantic read into a reality-vs-prediction matrix — an evidence artifact for proposing detector improvements that each cite specific (skill, runtime) cells. Probe coverage: multi-prompt + repeat-N with adaptive N=3→N=5 extension, mandatory positive+negative prompt pairing per corpus entry, and negative-prompt agreement inversion so false-positive triggers surface as `vat-optimistic`. Evidence quality: the deterministic class is widened from 6 to 9 values (splitting `error` into `install-failed`/`runtime-error`, `not-invoked` into `not-invoked-engaged`/`not-invoked-empty`, adding `refused`), with a v2 judge prompt that adds a `refused` verdict. Report fidelity: coverage stats, per-bucket headline (own/official/community × ran/agree/optimistic/pessimistic/gray-zone), gray-zone (mixed-signal) and high-variance subsections, and per-attempt variance rendered inline (`runtime-error (2/3) / failed (3/3)`). Judge replay persists `judge-calls/<skillId>-<promptId>-<target>-<attemptIdx>.json` artifacts that a new `re-judge` subcommand re-executes against an optionally different model or freshly-edited system prompt — without re-spending operator hours on the runtime side. Also landed: `git fetch --tags --force` before named-ref fetch (annotated tag refresh) and `setup()` teardown-first idempotency for the manual driver. No detector code or `RUNTIME_PROFILES` changes; lives entirely in the private `@vibe-agent-toolkit/dev-tools` package with no adopter-facing surface. Design: [the v2 harness design](./docs/research/2026-05-23-compat-empirical-harness-v2-design.md). Corpus authoring, the first real run, and the docs deliverable are the downstream work.
- **Cowork driver spike.** Added [`docs/contributing/cowork-driver-spike.md`](docs/contributing/cowork-driver-spike.md) — a time-boxed investigation (per §4a of the harness v2 design) of whether `claude-cowork` can be driven programmatically by the empirical compat harness today. Verdict: **not feasible**; cowork is a Claude Desktop app product with no public API/CLI surface. The `claude-cowork` runtime stays on `scripted-assisted` until Anthropic ships a Cowork CLI mode, Sessions API, or documented filesystem-import path. Adjacent finding (not a cowork replacement): the public-beta Skills API (`POST /v1/skills` + `container.skills[]` on `/v1/messages`) supports a fully-automatable *new* runtime — captured in the spike doc as a potential follow-up, gated on a separate design decision.
- **Subscription-only compat harness billing.** The harness now bills a Claude Pro/Max subscription instead of the API: both token-consuming surfaces (the `claude-code` runtime driver and the LLM judge) route through one shared `claude` CLI invoker (`runtimes/shared/claude-cli.ts`) that injects the operator's `CLAUDE_CODE_OAUTH_TOKEN` and deletes every API credential from the child env, so the CLI cannot fall back to API billing. The operator's own token is sourced at preflight — env var if set, otherwise an interactive prompt — so a run only ever spends the operator's personal plan. The judge was migrated off `@anthropic-ai/sdk` (dependency removed) onto the CLI, parsing a strict JSON verdict with one retry instead of the SDK's forced-tool call (`judge-system.md` now asks for a JSON object). `RunMetadata` gains `authMode` and the report methodology discloses subscription auth + parsed-not-forced verdicts. Premise (zero API billing under the OAuth token) still pending the manual smoke test.

Expand Down
Loading
Loading