Skip to content

Empirical Chat | Cowork | Code compat research #100

@jdutton

Description

@jdutton

Type

Research / evidence-producing. This issue does not ship detector code or runtime-profile changes. It ships an empirical compat report — a reality-vs-prediction matrix backed by actually running candidate skills in each runtime — that becomes the evidence base for a separate detector-improvement PR.

Background

VAT ships a v1 compat framework (added in 0.1.31, refactored to an evidence substrate in 0.1.32):

  • CAPABILITY_LOCAL_SHELL, CAPABILITY_EXTERNAL_CLI, CAPABILITY_BROWSER_AUTH (info — observations)
  • COMPAT_TARGET_INCOMPATIBLE, COMPAT_TARGET_NEEDS_REVIEW (warning — verdicts)
  • COMPAT_TARGET_UNDECLARED (info)
  • RUNTIME_PROFILES table in @vibe-agent-toolkit/claude-marketplace declaring what each runtime provides

The framework is theoretical — built from runtime docs and our intuition. It has never been validated against actual runtime behavior on a meaningful sample. That is what this issue closes.

Goal

Pressure-test VAT's compat detection against real candidate skills running in Chat, Cowork, and Claude Code. Use the findings to refine what "compatible with Chat | Cowork | Code" actually means and how VAT should detect and message it.

The output is evidence and documentation, not code changes. A follow-up PR proposes detector improvements, citing this issue's evidence.

Tasks

  • Pick 15–25 candidate skills/plugins spanning the capability surface: pure-prose skills, shell-using skills, MCP-using skills, browser-auth flows, network-heavy skills. Mix of official, community, and our own adopters.
  • Run each one in all three runtimes (Chat, Cowork, Code) where the runtime permits installation. Document what actually happens:
    • Does it install? Does it trigger? Does it complete?
    • When it fails, what is the failure mode? (Silent skip? Error? Hang? Wrong output?)
  • Compare to VAT's prediction. For each (skill × runtime) pair, record:
    • VAT verdict: expected / INCOMPATIBLE / NEEDS_REVIEW / UNDECLARED
    • Reality: works / partially works / fails-loudly / fails-silently
  • Build a confusion matrix. True positives, false positives, false negatives. This is the empirical evidence base for either expanding RUNTIME_PROFILES, adding new CAPABILITY_* detectors, or demoting/promoting verdict severities.
  • Document the failure taxonomy. What does "incompatible" actually look like in each runtime? Write this up as docs/runtime-compatibility-empirical.md — a companion to docs/skill-quality-and-compatibility.md grounded in observation, not theory. Identify which patterns deserve their own CAPABILITY_* detector vs. which are too rare to merit one.
  • Propose detector improvements as a follow-up PR list. Each proposed change cites the specific (skill, runtime) evidence that justifies it. Honor the rule-addition bar in docs/validation-rule-design.md — corpus evidence first.
  • Feed findings back into the sister scanner issue. Any pattern identified here that should become a recurring detector flows into the scanner's report schema. Any skill tested here is a candidate for the official/reviewed corpus seed.

Non-goals

  • No code changes to detectors in the first pass — evidence first, then a separate proposal PR
  • No compat-badge work (separate effort, blocked on this stream)
  • No grading or ranking of the tested skills
  • No RUNTIME_PROFILES edits in this issue — those land in the detector-improvement follow-up

Acceptance criteria

  • A committed empirical compat report covering ≥15 skills × 3 runtimes with a reality-vs-prediction matrix
  • ≥3 actionable detector-improvement proposals, each backed by named evidence (for skills in the official bucket) or pattern descriptions (for community-bucket skills)
  • docs/runtime-compatibility-empirical.md lands as a companion to the existing stance doc

Reading order

  1. docs/skill-quality-and-compatibility.md — current stance
  2. docs/validation-codes.md (compat section) — current verdict semantics
  3. packages/claude-marketplace/src/runtime-profiles.ts — what we currently claim each runtime supports
  4. packages/agent-skills/src/evidence/ — the detector machinery the evidence will eventually inform

Discipline reminders

  • Two-bucket discipline applies here too: named findings for official/reviewed skills, aggregate-only for community
  • Evidence before rules; humble defaults; configurability-first
  • Non-judgment language — describe patterns, not authors
  • [vat: ...] attribution prefix on every surface-level finding in the report

Sister issue

#99 — community corpus scanner foundation — produces a complementary evidence base (rule-firing reality across the broader corpus). Findings from this issue should land before #99's triage doc so triage can incorporate compat-verdict reality alongside rule-firing reality.

Metadata

Metadata

Labels

documentationImprovements or additions to documentationenhancementNew feature or requesthelp wantedExtra attention is needed

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions