Type
Research / evidence-producing. This issue does not ship detector code or runtime-profile changes. It ships an empirical compat report — a reality-vs-prediction matrix backed by actually running candidate skills in each runtime — that becomes the evidence base for a separate detector-improvement PR.
Background
VAT ships a v1 compat framework (added in 0.1.31, refactored to an evidence substrate in 0.1.32):
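CAPABILITY_LOCAL_SHELL, CAPABILITY_EXTERNAL_CLI, CAPABILITY_BROWSER_AUTH (info: observations)
COMPAT_TARGET_INCOMPATIBLE, COMPAT_TARGET_NEEDS_REVIEW (warning: verdicts)
COMPAT_TARGET_UNDECLARED (info)
RUNTIME_PROFILES table in @vibe-agent-toolkit/claude-marketplace declaring what each runtime provides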
The framework is theoretical: it was built from runtime docs and our intuition, and it has never been validated against actual runtime behavior on a meaningful sample. This issue closes that gap.
Goal
Pressure-test VAT's compat detection against real candidate skills running in Chat, Cowork, and Claude Code. Use the findings to refine what "compatible with Chat | Cowork | Code" actually means and how VAT should detect and message it.
The output is evidence and documentation, not code changes. A follow-up PR proposes detector improvements, citing this issue's evidence.
Tasks
Pick 15–25 candidate skills/plugins spanning the capability surface: pure-prose skills, shell-using skills, MCP-using skills, browser-auth flows, network-heavy skills. Mix of official, community, and our own adopters.
Run each one in all three runtimes (Chat, Cowork, Code) where the runtime permits installation. Document what actually happens:
Does it install? Does it trigger? Does it complete?
When it fails, what is the failure mode? (Silent skip? Error? Hang? Wrong output?)
Compare to VAT's prediction. For each (skill × runtime) pair, record:
VAT verdict: expected / INCOMPATIBLE / NEEDS_REVIEW / UNDECLARED
Reality: works / partially works / fails-loudly / fails-silently
Build a confusion matrix. True positives, false positives, false negatives. This is the empirical evidence base for either expanding RUNTIME_PROFILES, adding new CAPABILITY_* detectors, or demoting/promoting verdict severities.
Document the failure taxonomy. What does "incompatible" actually look like in each runtime? Write this up as docs/runtime-compatibility-empirical.md — a companion to docs/skill-quality-and-compatibility.md grounded in observation, not theory. Identify which patterns deserve their own CAPABILITY_* detector vs. which are too rare to merit one.
Propose detector improvements as a follow-up PR list. Each proposed change cites the specific (skill, runtime) evidence that justifies it. Honor the rule-addition bar in docs/validation-rule-design.md — corpus evidence first.
Feed findings back into the sister scanner issue. Any pattern identified here that should become a recurring detector flows into the scanner's report schema. Any skill tested here is a candidate for the official/reviewed corpus seed.
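The per-pair record and the confusion matrix described in the tasks above could be sketched roughly as follows. This is an illustration only, not VAT code: the `Observation` type, the `Cell` labels, and the `classify` and `tally` helpers are hypothetical names. The sketch also makes assumptions the report itself would need to settle: that "expected" means VAT raised no compat concern, that NEEDS_REVIEW counts as a flagged prediction, that anything other than "works" counts as a real problem, and that UNDECLARED pairs are left unscored.

```typescript
// Hypothetical shapes for illustration only; none of these names exist in VAT.
type Runtime = "chat" | "cowork" | "code";
type Prediction = "expected" | "INCOMPATIBLE" | "NEEDS_REVIEW" | "UNDECLARED";
type Reality = "works" | "partially-works" | "fails-loudly" | "fails-silently";

interface Observation {
  skill: string;
  runtime: Runtime;
  prediction: Prediction;
  reality: Reality;
}

type Cell =
  | "truePositive"
  | "falsePositive"
  | "falseNegative"
  | "trueNegative"
  | "unscored";

// Assumption: "positive" means VAT flagged a problem (INCOMPATIBLE or NEEDS_REVIEW),
// and any reality other than "works" counts as a real problem.
function classify(o: Observation): Cell {
  if (o.prediction === "UNDECLARED") return "unscored"; // no prediction to score
  const flagged = o.prediction !== "expected";
  const broken = o.reality !== "works";
  if (flagged && broken) return "truePositive";
  if (flagged && !broken) return "falsePositive";
  if (!flagged && broken) return "falseNegative";
  return "trueNegative";
}

// Count how many (skill x runtime) observations fall into each cell.
function tally(observations: Observation[]): Record<Cell, number> {
  const counts: Record<Cell, number> = {
    truePositive: 0,
    falsePositive: 0,
    falseNegative: 0,
    trueNegative: 0,
    unscored: 0,
  };
  for (const o of observations) counts[classify(o)] += 1;
  return counts;
}
```

Keeping `reality` at four values in the record, and collapsing it to a boolean only inside `classify`, preserves the failure-mode detail needed for the failure taxonomy while still yielding a simple matrix.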
Non-goals
No code changes to detectors in the first pass — evidence first, then a separate proposal PR
No compat-badge work (separate effort, blocked on this stream)
No grading or ranking of the tested skills
No RUNTIME_PROFILES edits in this issue — those land in the detector-improvement follow-up
Acceptance criteria
A committed empirical compat report covering ≥15 skills × 3 runtimes with a reality-vs-prediction matrix
≥3 actionable detector-improvement proposals, each backed by named evidence (for skills in the official bucket) or pattern descriptions (for community-bucket skills)
docs/runtime-compatibility-empirical.md lands as a companion to the existing stance doc
Reading order
docs/skill-quality-and-compatibility.md — current stance
docs/validation-codes.md (compat section) — current verdict semantics
packages/claude-marketplace/src/runtime-profiles.ts — what we currently claim each runtime supports
packages/agent-skills/src/evidence/ — the detector machinery the evidence will eventually inform
Discipline reminders
Two-bucket discipline applies here too: named findings for official/reviewed skills, aggregate-only for community
Evidence before rules; humble defaults; configurability-first
Non-judgment language — describe patterns, not authors
[vat: ...] attribution prefix on every surface-level finding in the report
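The two-bucket discipline could be applied mechanically when the report is rendered. A minimal sketch, assuming hypothetical `Finding` and `renderFindings` names and a free-text `pattern` label per finding: official/reviewed findings keep the skill name, while community findings are collapsed into per-pattern counts. The contents of the [vat: ...] prefix are not specified here, so the sketch leaves the prefix out.

```typescript
// Hypothetical types for illustration; not part of VAT.
type Bucket = "official" | "community";

interface Finding {
  skill: string;
  bucket: Bucket;
  pattern: string; // short description of the observed pattern, not of the author
}

// Official findings are reported by name; community findings only as counts per pattern.
function renderFindings(findings: Finding[]): string[] {
  const lines: string[] = [];
  const communityCounts: Record<string, number> = {};
  for (const f of findings) {
    if (f.bucket === "official") {
      lines.push(`${f.skill}: ${f.pattern}`);
    } else {
      communityCounts[f.pattern] = (communityCounts[f.pattern] ?? 0) + 1;
    }
  }
  for (const pattern of Object.keys(communityCounts)) {
    lines.push(`community (n=${communityCounts[pattern]}): ${pattern}`);
  }
  return lines;
}
```

Because community skill names never reach the output array, a report built from `renderFindings` cannot name a community skill by accident.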
Sister issue
#99, the community corpus scanner foundation, produces a complementary evidence base: rule-firing reality across the broader corpus. Findings from this issue should land before #99's triage doc so triage can incorporate compat-verdict reality alongside rule-firing reality.