Summary
Two policy changes to the gen-plan command (commands/gen-plan.md) so generated plans match a human-in-the-loop, low-ceremony workflow.
1. No go/no-go gates — humans judge
Generated plans should not contain automated go/no-go gates, pass/fail thresholds, or stopping rules that decide success based on hitting a number. Quantitative metrics should be recorded as reference targets and measured evidence; a human reviews the evidence and makes every accept / proceed / pivot decision.
Affected today:
- Step 3 (Confirm Quantitative Metrics) currently frames each metric as a "hard requirement that must be achieved for the implementation to be considered successful" vs. an optimization trend — this produces go/no-go gates in the acceptance criteria.
- Rule 7 (TDD-Style Tests) says tests "enable deterministic verification", which for quantitative/statistical criteria reads as an auto pass/fail gate.
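To make the distinction concrete, here is a minimal sketch of what "reference target plus measured evidence" could look like as data, as opposed to a gate. Everything in it is hypothetical: `MetricEvidence`, its fields, and `render_report` are names invented for illustration and are not part of gen-plan.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MetricEvidence:
    """Hypothetical record: a metric with a reference target, never a gate."""
    name: str
    reference_target: float                # context for the reviewer only
    measured: Optional[float] = None       # filled in after the run
    human_decision: Optional[str] = None   # "accept" / "proceed" / "pivot", set by a person


def render_report(m: MetricEvidence) -> str:
    """Format the evidence for review; deliberately returns no pass/fail verdict."""
    measured = "not yet measured" if m.measured is None else f"{m.measured:g}"
    decision = m.human_decision or "pending human review"
    return (f"{m.name}: measured {measured} "
            f"(reference target {m.reference_target:g}); decision: {decision}")
```

The point of the sketch is that nothing compares `measured` against `reference_target`; the only field that carries a decision is written by a human.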
2. Lightweight statistics — p < 0.05 is enough
For any statistical comparison, a single significance test at p < 0.05 is sufficient. The command should NOT require or generate:
- bootstrap confidence intervals (per-item resampling, 10000 draws, 95% CI)
- per-item McNemar tests
- minimum-effect-size sidebars (e.g. "+1pp")
- separately reported robustness seeds
...unless the user explicitly asks for that extra rigor.
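For illustration, a single significance test can be very small. The sketch below uses an exact two-sided sign test on paired per-item scores, which is my choice for the example; the proposal does not name a specific test, and `sign_test_p_value` is a hypothetical helper, not something gen-plan provides.

```python
from math import comb


def sign_test_p_value(baseline, candidate):
    """Exact two-sided sign test on paired scores; ties are dropped."""
    wins = sum(c > b for b, c in zip(baseline, candidate))
    losses = sum(c < b for b, c in zip(baseline, candidate))
    n = wins + losses
    if n == 0:
        return 1.0  # no item differs, so there is no evidence either way
    k = min(wins, losses)
    # Probability of a split at least this lopsided under a fair coin, one tail.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

A plan would then report the resulting p-value next to the 0.05 reference level and leave the accept / proceed / pivot call to the reviewer, rather than branching on it.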
Proposed changes
- Reword Step 3 so metrics are reference targets / evidence for human judgment, never auto-gates; add the p < 0.05 statistics note.
- Amend Rule 7 to clarify that quantitative/statistical criteria describe how to measure and report results, and are not an automatic pass/fail gate.
- Add Rule 15 (No Go/No-Go Gates — Human Judgment) and Rule 16 (Lightweight Statistics).
The same policy should be mirrored into the Codex-side copy of the skill (skills/humanize-gen-plan/SKILL.md) for consistency.
Notes
I've applied this locally to validate it; happy to send a PR if the direction looks good.