Empirical Chat | Cowork | Code compat research

## Type

**Research / evidence-producing.** This issue does not ship detector code or runtime-profile changes. It ships an **empirical compat report** — a reality-vs-prediction matrix backed by actually running candidate skills in each runtime — that becomes the evidence base for a separate detector-improvement PR.

## Background

VAT ships a v1 compat framework (added in 0.1.31, refactored to an evidence substrate in 0.1.32):

- `CAPABILITY_LOCAL_SHELL`, `CAPABILITY_EXTERNAL_CLI`, `CAPABILITY_BROWSER_AUTH` (info — observations)
- `COMPAT_TARGET_INCOMPATIBLE`, `COMPAT_TARGET_NEEDS_REVIEW` (warning — verdicts)
- `COMPAT_TARGET_UNDECLARED` (info)
- `RUNTIME_PROFILES` table in `@vibe-agent-toolkit/claude-marketplace` declaring what each runtime provides

The framework is **theoretical** — built from runtime docs and our intuition. It has never been validated against actual runtime behavior on a meaningful sample. That is what this issue closes.

## Goal

Pressure-test VAT's compat detection against real candidate skills running in Chat, Cowork, and Claude Code. Use the findings to refine **what "compatible with Chat | Cowork | Code" actually means** and how VAT should detect and message it.

The output is evidence and documentation, not code changes. A follow-up PR proposes detector improvements, citing this issue's evidence.

## Tasks

- [ ] **Pick 15–25 candidate skills/plugins** spanning the capability surface: pure-prose skills, shell-using skills, MCP-using skills, browser-auth flows, network-heavy skills. Mix of official, community, and our own adopters.
- [ ] **Run each one in all three runtimes** (Chat, Cowork, Code) where the runtime permits installation. Document what actually happens:
  - Does it install? Does it trigger? Does it complete?
  - When it fails, what is the failure mode? (Silent skip? Error? Hang? Wrong output?)
- [ ] **Compare to VAT's prediction.** For each (skill × runtime) pair, record:
  - VAT verdict: `expected` / `INCOMPATIBLE` / `NEEDS_REVIEW` / `UNDECLARED`
  - Reality: works / partially works / fails-loudly / fails-silently
- [ ] **Build a confusion matrix.** True positives, false positives, false negatives. This is the empirical evidence base for either expanding `RUNTIME_PROFILES`, adding new `CAPABILITY_*` detectors, or demoting/promoting verdict severities.
- [ ] **Document the failure taxonomy.** What does "incompatible" actually look like in each runtime? Write this up as `docs/runtime-compatibility-empirical.md` — a companion to `docs/skill-quality-and-compatibility.md` grounded in observation, not theory. Identify which patterns deserve their own `CAPABILITY_*` detector vs. which are too rare to merit one.
- [ ] **Propose detector improvements** as a follow-up PR list. Each proposed change cites the specific (skill, runtime) evidence that justifies it. Honor the rule-addition bar in `docs/validation-rule-design.md` — corpus evidence first.
- [ ] **Feed findings back into the sister scanner issue.** Any pattern identified here that should become a recurring detector flows into the scanner's report schema. Any skill tested here is a candidate for the official/reviewed corpus seed.

## Non-goals

- No code changes to detectors in the first pass — evidence first, then a separate proposal PR
- No compat-badge work (separate effort, blocked on this stream)
- No grading or ranking of the tested skills
- No `RUNTIME_PROFILES` edits in this issue — those land in the detector-improvement follow-up

## Acceptance criteria

- A committed empirical compat report covering ≥15 skills × 3 runtimes with a reality-vs-prediction matrix
- ≥3 actionable detector-improvement proposals, each backed by named evidence (for skills in the official bucket) or pattern descriptions (for community-bucket skills)
- `docs/runtime-compatibility-empirical.md` lands as a companion to the existing stance doc

## Reading order

1. `docs/skill-quality-and-compatibility.md` — current stance
2. `docs/validation-codes.md` (compat section) — current verdict semantics
3. `packages/claude-marketplace/src/runtime-profiles.ts` — what we currently claim each runtime supports
4. `packages/agent-skills/src/evidence/` — the detector machinery the evidence will eventually inform

## Discipline reminders

- Two-bucket discipline applies here too: named findings for official/reviewed skills, aggregate-only for community
- Evidence before rules; humble defaults; configurability-first
- Non-judgment language — describe patterns, not authors
- `[vat: ...]` attribution prefix on every surface-level finding in the report

## Sister issue

#99 — community corpus scanner foundation — produces a complementary evidence base (rule-firing reality across the broader corpus). Findings from this issue should land **before** #99's triage doc so triage can incorporate compat-verdict reality alongside rule-firing reality.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Empirical Chat | Cowork | Code compat research #100

Type

Background

Goal

Tasks

Non-goals

Acceptance criteria

Reading order

Discipline reminders

Sister issue

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Empirical Chat | Cowork | Code compat research #100

Description

Type

Background

Goal

Tasks

Non-goals

Acceptance criteria

Reading order

Discipline reminders

Sister issue

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions