The pipeline (brainstorming → prototype? → writing-plans → @tdd per task → verification → /check → @code-review, with @architect inserted for cross-cutting work and @debug prepended for bugs) is sound, and the AGENTS.md summary plus the build-agent prompt's "enforcement posture" list give it two independent anchors. Specific gaps, roughly in pipeline order:
- No execution/orchestration skill between plan and tasks. writing-plans ends by offering "task-by-task via @tdd (recommended)" or "inline execution," but no skill defines either mode: what the parent does between tasks, what review happens per task (spec-compliance vs code-quality), what to do when a task's tests won't go green after N attempts, when to halt vs re-plan, and how context is managed across a 15-task plan. Superpowers ships this as two skills (executing-plans for batch-with-checkpoints, subagent-driven-development for dispatch-with-two-stage-review). It is the single most consequential content gap in your harness, because it is where multi-hour autonomous runs either hold together or wander. Direct content port, no mechanics involved (§2.1).
- Enforcement is prompt-level only. The hard-gate lives in a skill the model must choose to load, backed by the build prompt. Nothing structural prevents a fresh or degraded session from editing source directly: the hooks only catch lint/format/secrets at commit time, which is after the damage is done. Superpowers attacks this with (a) a session-start hook that injects the using-superpowers bootstrap (and re-injects after compaction) and (b) that bootstrap's rationalization red-flags table ("'This is just a simple question' → Questions are tasks; 'I'll just do this one thing first' → Check BEFORE doing anything"), which is empirically the highest-leverage anti-drift text in the ecosystem. Your build prompt covers the triggers; it doesn't inoculate against the rationalizations. Port the table into the build prompt or a tiny always-loaded doc (low effort), and consider the session-start injection via an OpenCode plugin later (§2.1).
- The /check coverage gate is under-specified. It reads "Gate: ≥ 80% line coverage on changed files," but pest --coverage reports a per-file total for every covered file and has no notion of which files changed. Computing the gate as written requires intersecting git diff --name-only with the coverage report, which the command never spells out. An agent will either approximate (global ≥80%) or hand-wave. Spell out the mechanics (e.g., --coverage --min=80 for the global floor plus an explicit per-changed-file table the agent assembles from git diff + the coverage output) or relax the wording to match what the tool measures.
- Verification vs /check duplication is acceptable but should be named. Both run lint + tests; one is per-task, one is pre-push. A single sentence in each ("verification is the per-task gate; /check is the aggregate pre-push gate; both run because task-level green can rot by push time") prevents a future "optimization" that removes one.
- The handoff/resume loop is complete and good. The combination of /handoff, the context-management doc's degradation thresholds, and the "read the handoff and continue" resume protocol is better than most comparison repos. The context-management doc's rewind-over-correct and compact-with-a-hint guidance is state of the art.
- Eval framework is honest scaffolding, but its runner is fictional. opencode eval is not an OpenCode command; the README says execution is "pending API access," but the command block reads as if it exists. For 🌍 credibility, either mark it clearly aspirational, or (better, §2.1/§2.5) replace it with a real scripts/run-evals.sh that drives opencode run non-interactively against each case and greps for the expected behaviors. Superpowers' tests/ tree (which includes an opencode/ suite) is a working reference implementation you can crib directly.
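
To make the orchestration gap concrete, here is a minimal sketch of the decisions such a skill would have to pin down: a retry budget per task, a per-task review step, and a halt rule. Every name in it (execute_plan, run_task, review, max_attempts) is hypothetical, and the policy shown (three attempts, then halt and hand back to the human) is one possible choice, not what Superpowers or your harness prescribes.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class Outcome(Enum):
    DONE = "done"   # every task passed implementation and review
    HALT = "halt"   # a task exhausted its attempts; stop and re-plan


@dataclass
class TaskResult:
    task: str
    attempts: int
    passed: bool


def execute_plan(
    tasks: list[str],
    run_task: Callable[[str], bool],   # hypothetical: dispatch one task (e.g. to @tdd), True if tests are green
    review: Callable[[str], bool],     # hypothetical: per-task review, True if accepted
    max_attempts: int = 3,             # assumed retry budget ("N attempts")
) -> tuple[Outcome, list[TaskResult]]:
    """Run tasks in order; halt at the first task that cannot be completed."""
    results: list[TaskResult] = []
    for task in tasks:
        attempts = 0
        passed = False
        while attempts < max_attempts and not passed:
            attempts += 1
            # Review only runs when the task itself reports green tests.
            passed = run_task(task) and review(task)
        results.append(TaskResult(task, attempts, passed))
        if not passed:
            # Do not continue onto tasks that may depend on this one.
            return Outcome.HALT, results
    return Outcome.DONE, results
```

A real skill would also say what the parent records between tasks and when context is compacted; the sketch leaves those out on purpose.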
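
For the /check coverage gate, the intersection of git diff with the coverage report could look roughly like the sketch below. It assumes the test run can emit a Clover XML report (PHPUnit's --coverage-clover option, which Pest is assumed to pass through), that origin/main is the comparison base, and that changed files missing from the report count as failures. The function names are illustrative, not part of your harness.

```python
import os
import subprocess
import xml.etree.ElementTree as ET


def changed_files(base: str = "origin/main") -> list[str]:
    """List PHP files changed relative to `base` (the base branch is an assumption)."""
    out = subprocess.run(
        ["git", "diff", "--name-only", f"{base}...HEAD"],
        check=True, capture_output=True, text=True,
    ).stdout
    return [line for line in out.splitlines() if line.endswith(".php")]


def coverage_by_file(clover_xml: str, repo_root: str) -> dict[str, float]:
    """Map repo-relative path -> line coverage percent from a Clover report."""
    result: dict[str, float] = {}
    for node in ET.fromstring(clover_xml).iter("file"):
        metrics = node.find("metrics")  # direct child only: the file-level totals
        if metrics is None:
            continue
        total = int(metrics.get("statements", "0"))
        covered = int(metrics.get("coveredstatements", "0"))
        name = node.get("name", "")
        # Clover usually records absolute paths; git reports repo-relative ones.
        rel = os.path.relpath(name, repo_root) if os.path.isabs(name) else name
        rel = rel.replace(os.sep, "/")
        # A file with no executable statements is treated as fully covered.
        result[rel] = 100.0 * covered / total if total else 100.0
    return result


def gate(changed: list[str], coverage: dict[str, float], minimum: float = 80.0):
    """Return one (path, percent_or_None, passed) row per changed file."""
    rows = []
    for path in changed:
        pct = coverage.get(path)
        rows.append((path, pct, pct is not None and pct >= minimum))
    return rows
```

The rows returned by gate are the per-changed-file table the finding asks for. Whether files absent from the report (tests, config) should fail or be excluded is a policy choice the command text would need to state.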
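
For the eval runner, the loop the finding describes (run each case through opencode run, then grep the output for expected behaviors) might be sketched as follows, here in Python rather than the shell script named above. The EvalCase shape, the regex-based checks, and the exact way the prompt is passed to opencode run are assumptions; the runner is injectable so the checking logic can be exercised without the CLI.

```python
import re
import subprocess
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class EvalCase:
    name: str
    prompt: str
    expect: list[str] = field(default_factory=list)  # regexes that must appear
    forbid: list[str] = field(default_factory=list)  # regexes that must not appear


def run_opencode(prompt: str, timeout: int = 600) -> str:
    """Run one prompt non-interactively; the argument shape is an assumption."""
    proc = subprocess.run(
        ["opencode", "run", prompt],
        capture_output=True, text=True, timeout=timeout,
    )
    return proc.stdout + proc.stderr


def evaluate(case: EvalCase, runner: Callable[[str], str] = run_opencode) -> list[str]:
    """Return failure messages for one case; an empty list means it passed."""
    output = runner(case.prompt)
    failures = [
        f"{case.name}: missing /{p}/"
        for p in case.expect if not re.search(p, output, re.IGNORECASE)
    ]
    failures += [
        f"{case.name}: forbidden /{p}/ present"
        for p in case.forbid if re.search(p, output, re.IGNORECASE)
    ]
    return failures
```

Grepping transcripts is a coarse check, so cases are best written around behaviors that leave an unambiguous textual trace, such as a skill being loaded before any edit.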