Whole-skill outcome-eval harness + map-task body hardening + skill eval-sets #160

Merged
azalio merged 6 commits into main from organ-pipe-pyrite
Jun 5, 2026

azalio (Owner) commented Jun 5, 2026

What

Adds a whole-skill optimization capability (measuring and improving a skill's body, i.e. its SKILL.md logic, not just its trigger description) and applies it to the pilot skill map-task. Approach B (harness measures → human edits), body-only mutation.

Harness + fixtures (reusable flow for any skill)

  • tests/skills_eval/whole_skill/spike_runner.py: seeds an isolated temp directory (.claude + .map/scripts + fixture repo, git init), runs claude -p "/map-task ST-001", and scores the outcome with a hybrid metric: deterministic gates (scope fidelity via git diff, task-pass via the fixture's tests) plus a trace-cited LLM judge. QUALITY = gate·(0.5 + 0.5·judge). Scoring is expected_outcome-aware (complete|blocked).
  • Fixtures under tests/skills_eval/fixtures/whole_skill/: a scope-trap and an impossible/blocker mini-repo (each a real git project + committed .map/ plan/blueprint).
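The hybrid metric above can be sketched in a few lines. This is a minimal illustration, not the code in spike_runner.py: the function name, the parameter names, the choice to combine the two gates with a logical AND, and the clamping of the judge score to [0, 1] are all assumptions.

```python
def quality(scope_ok: bool, task_ok: bool, judge: float) -> float:
    """QUALITY = gate * (0.5 + 0.5 * judge), following the formula in this PR.

    Assumptions (not confirmed by the source): the gate is 1 only when both
    deterministic checks pass, and the judge score is normalized to [0, 1].
    """
    gate = 1.0 if (scope_ok and task_ok) else 0.0
    judge = min(1.0, max(0.0, judge))  # clamp a noisy judge score
    return gate * (0.5 + 0.5 * judge)
```

Under these assumptions a failed gate zeroes the score regardless of the judge, while a passing gate can never score below 0.5, so the judge only modulates the upper half of the range.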

Key empirical finding (2 fixtures, 12 runs, 2 llm-council consults)

Generic scope/blocker prose in a thin-orchestration body is low-leverage: Body-Good and Body-Bad (rules stripped) both scored QUALITY 1.0, because those behaviors are enforced by the shared actor/monitor agents, not the body. The body's real levers are sequencing, context relay, retry/termination, and the final report contract. (Moving scope/correctness requires optimizing the shared agent prompts, which is outside this PR's body-only scope.)

map-task body hardening (body-owned surfaces) + regression proof

  • Formalized the Outcome Report (required fields) and added the previously-missing BLOCKED outcome report; explicit retry-exhaustion / impossible-in-scope termination (stop + report blocker, never fake-complete or expand scope).
  • Fixed real defects: dead "What this command CANNOT do" reference, placeholder example, awkward artifact section.
  • Regression-proved: QUALITY 1.0 on both fixtures ×3 after the edit (no outcome regression). Honest claim: cleaner/more-complete body, not a coding-quality gain.
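The termination rule above (stop and report a blocker, never fake-complete) might look roughly like the following. This is a hedged sketch only: OutcomeReport, its fields, run_subtask, and max_retries are hypothetical names, and the PR does not list the report's actual required fields.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class OutcomeReport:
    # Hypothetical fields; the real report schema lives in the map-task body.
    status: str          # "COMPLETE" or "BLOCKED"
    subtask_id: str
    attempts: int
    blocker: str = ""    # expected to be non-empty when status == "BLOCKED"


def run_subtask(
    subtask_id: str,
    attempt: Callable[[int], bool],
    max_retries: int = 3,
) -> OutcomeReport:
    """Retry until an attempt passes; on exhaustion, stop and report a blocker."""
    for n in range(1, max_retries + 1):
        if attempt(n):
            return OutcomeReport("COMPLETE", subtask_id, n)
    # Retries exhausted: report BLOCKED instead of claiming completion.
    return OutcomeReport(
        "BLOCKED",
        subtask_id,
        max_retries,
        blocker=f"retries exhausted after {max_retries} attempts",
    )
```

The point of the sketch is that both exits produce a report, so a run that cannot finish in scope still ends with a structured outcome.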

Also

  • docs/whole-skill-optimization-{notes,flow}.md (method + the reusable flow for other skills) and docs/SKILL-EVAL.md (description-tuning engine guide) + USAGE.md pointer.
  • 13 description-optimize eval-sets for the remaining /map-* skills.
  • Tooling hygiene: exclude the intentionally-broken whole_skill fixture mini-repos from pytest/ruff/pyright/mypy.

Validation

make check: 2257 passed, 3 skipped; ruff/mypy/pyright clean; check-render byte-identical. Whole-skill measurements run via claude -p (subscription auth).

🤖 Generated with Claude Code

azalio and others added 6 commits June 5, 2026 03:32
…rdening + skill eval-sets

Whole-skill optimization (pilot: map-task), Approach B (measure -> human-edit), body-only:

- Harness: tests/skills_eval/whole_skill/spike_runner.py - seeds an isolated temp
  (.claude + .map/scripts + fixture repo, git init), runs `claude -p /map-task ST-001`
  with hybrid scoring (deterministic scope/task gates from git diff + a trace-cited LLM
  judge; QUALITY = gate*(0.5+0.5*judge)); expected_outcome-aware (complete|blocked).
- Fixtures: scope-trap (F2) + impossible/blocker (F3) under
  tests/skills_eval/fixtures/whole_skill/ (real mini-repos + committed .map plan/blueprint).
- Finding (2 fixtures, 12 runs, 2 llm-council consults): generic scope/blocker PROSE in a
  thin-orchestration body is LOW-LEVERAGE - body-good == body-bad == QUALITY 1.0, because the
  shared actor/monitor agents enforce those behaviors. The body's real lever is sequencing,
  context relay, retry/termination, and the final report contract.
- map-task body hardening (body-owned surfaces): formalized the Outcome Report (COMPLETE +
  a new BLOCKED schema with required fields), explicit retry-exhaustion/impossible-in-scope
  termination; fixed a dead reference, a placeholder example, and an awkward artifact section.
  Regression-proved: QUALITY 1.0 on F1+F3 (no outcome regression). Honest claim: cleaner/more
  complete body, no regression - not a coding-quality gain (that needs the shared agent prompts).
- Docs: docs/whole-skill-optimization-{notes,flow}.md (method + reusable flow for other skills);
  docs/SKILL-EVAL.md (description-tuning engine guide) + USAGE.md pointer.
- 13 description-optimize eval-sets for the remaining /map-* skills (tests/skills_eval/fixtures/).
- Tooling hygiene: exclude the whole_skill fixture mini-repos from pytest/ruff/pyright/mypy
  (they are intentionally-broken seeded repos).

make check: 2257 passed, 3 skipped; ruff/mypy/pyright clean; check-render byte-identical.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ow-leverage (body AND actor)

Extend the whole-skill harness with `--degrade {body,actor,monitor}` and add a strong
scope-pressure fixture (the obvious one-line fix is out-of-scope), then ablate the ACTOR prompt:

- spike_runner.py: `--degrade` targets which prompt the 'bad' variant degrades; `_degrade_actor`
  strips actor.md's Mutation Boundary section + the quick-ref NEVER-scope clause (seed-only);
  `_degrade_monitor` stub (best-effort).
- Fixture tests/skills_eval/fixtures/whole_skill/map_task_scope_pressure: RATE trap — changing the
  shared RATE in config.py is the tempting out-of-scope fix; the in-scope fix is a surcharge in utils.py.

Result (current actor vs --degrade actor, 3 runs each): BOTH kept scope perfectly (config.py never
touched; only utils.py edited). The QUALITY delta (0.80 vs 1.00) is judge NOISE (inverted; the judge
penalized the current actor for lacking verbose scope-reasoning prose despite perfect actual scope).

Consolidated across 3 ablations (body + actor, 18 runs): prose-level scope discipline is low-leverage;
scope is governed by the blueprint affected_files contract + base-model competence + the mechanical
mutation-boundary/test-gate/monitor. Methodology note recorded: for scope, trust the deterministic
gate — the scope_discipline judge dimension is verbosity-biased. Full log in
docs/whole-skill-optimization-notes.md.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
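The ablation in the commit above relies on two mechanics: a `--degrade` switch and a helper that strips a section out of a prompt file. A minimal sketch of both follows. It is not the code in spike_runner.py: strip_section, build_parser, the default value of `--degrade`, and the assumption that the section is delimited by Markdown headings are all illustrative.

```python
import argparse
import re


def strip_section(markdown: str, heading: str) -> str:
    """Drop a heading and its content, up to the next heading of the same or higher level."""
    out = []
    skip_level = None  # heading level currently being skipped, if any
    for line in markdown.splitlines(keepends=True):
        m = re.match(r"^(#{1,6})\s+(.*?)\s*$", line)
        if m:
            level, title = len(m.group(1)), m.group(2)
            if skip_level is not None and level <= skip_level:
                skip_level = None  # the stripped section has ended
            if skip_level is None and title == heading:
                skip_level = level
                continue  # drop the heading line itself
        if skip_level is None:
            out.append(line)
    return "".join(out)


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="whole-skill ablation sketch")
    # The choices come from the commit message; the default is an assumption.
    parser.add_argument("--degrade", choices=["body", "actor", "monitor"], default="body")
    return parser
```

One known simplification: a line starting with `#` inside a fenced code block would be treated as a heading, so a real implementation would need to track fences.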
…on_boundary

Option B (verify/strengthen the mechanical scope lever). The mutation-boundary
validator is correct + warn-only by design (strict via MAP_STRICT_SCOPE), but its
tests only exercised committed/staged extra files. Add a regression test proving a
NEW out-of-scope file the actor creates but never `git add`s (porcelain '??') is
still flagged as unexpected/warning — the real-world scope-leak case.

Notes: documented the lever verification + the strengthening options (default-strict
vs warn->actor-feedback vs single-subtask-strict) for a policy decision.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
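The untracked-file case from the commit above can be illustrated with a small parser over `git status --porcelain` output. This is a sketch under stated assumptions, not the validate_mutation_boundary implementation: unexpected_files is a hypothetical helper, and it takes the porcelain text as a string so it can be exercised without a repository.

```python
from typing import Iterable, List


def unexpected_files(porcelain: str, affected_files: Iterable[str]) -> List[str]:
    """Return changed or untracked paths that fall outside the declared scope.

    `porcelain` is the text of `git status --porcelain` (v1 format): two status
    characters, one space, then the path. Untracked files carry the status '??'.
    """
    allowed = set(affected_files)
    found = []
    for line in porcelain.splitlines():
        if len(line) < 4:
            continue  # too short to hold "XY path"
        status, path = line[:2], line[3:]
        if status[0] in "RC" and " -> " in path:
            path = path.split(" -> ", 1)[1]  # rename/copy: keep the new path
        if path not in allowed:
            found.append(path)
    return found
```

Two caveats for real use: git collapses an untracked directory into a single `dir/` entry unless `--untracked-files=all` is passed, and paths containing special characters are quoted in the output, which this sketch does not unquote.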
…leaks (self-healing)

Option (ii) for strengthening the mechanical scope lever. Previously a non-strict
scope leak detected by validate_mutation_boundary only produced a warn-log and the
MONITOR gate passed silently (hard-block only under MAP_STRICT_SCOPE). Now, at
validate_step("2.4"), a `warning` routes back to the Actor as feedback the FIRST
time it is seen per subtask (valid=False + actionable "revert the out-of-scope
changes OR escalate for a contract update"), so the actor self-corrects in the
existing retry loop — without a hard block.

- StepState.scope_feedback_subtasks (persisted, to_dict/from_dict) bounds the nudge
  to ONCE per subtask, so a persistent false positive (affected_files drift) cannot
  burn the retry budget — after the single nudge the gate passes.
- Strict-mode (MAP_STRICT_SCOPE=1) hard-reject path is unchanged.
- Edited the templates_src .jinja source and re-rendered (.map/scripts + templates).
- Tests: test_warning_routes_feedback_to_actor_once (orchestrator, incl. once-guard
  pass-through) + test_warning_on_untracked_new_out_of_scope_file (validator).

make check: 2259 passed, 3 skipped; check-render byte-identical.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
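The once-per-subtask nudge described in the commit above can be sketched as follows. Only StepState, scope_feedback_subtasks, and to_dict/from_dict are names from the commit; gate_scope, its parameters, and the status strings are hypothetical, and the strict branch is a simplified stand-in for the unchanged MAP_STRICT_SCOPE path.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple


@dataclass
class StepState:
    # Subtasks that have already received their single scope nudge.
    scope_feedback_subtasks: Set[str] = field(default_factory=set)

    def to_dict(self) -> Dict[str, List[str]]:
        return {"scope_feedback_subtasks": sorted(self.scope_feedback_subtasks)}

    @classmethod
    def from_dict(cls, data: Dict[str, List[str]]) -> "StepState":
        return cls(set(data.get("scope_feedback_subtasks", [])))


def gate_scope(state: StepState, subtask_id: str, status: str, strict: bool) -> Tuple[bool, str]:
    """Return (valid, message) for a validator status of 'ok' or 'warning'."""
    if status != "warning":
        return True, ""
    if strict:
        return False, "out-of-scope changes rejected (strict mode)"
    if subtask_id in state.scope_feedback_subtasks:
        return True, ""  # already nudged once: let the gate pass
    state.scope_feedback_subtasks.add(subtask_id)
    return False, "revert the out-of-scope changes OR escalate for a contract update"
```

Persisting the set through to_dict/from_dict is what keeps the bound intact across a resumed run: a persistent false positive costs at most one retry per subtask.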
…ng an empty subtask)

Apply the warn->actor-feedback pattern to MONITOR correctness validation. The
unenforced gap: MONITOR closing a subtask that declares affected_files but changed
NOTHING (empty diff) — a subtask that "completes" having done nothing.

At validate_step("2.4"), reusing the validate_mutation_boundary report: if
`expected` (declared affected_files) is non-empty but `actual` (changed files) is
empty, route back to the Actor once (valid=False + "False-progress: implement the
change or report a blocker"), bounded by StepState.progress_feedback_subtasks so a
re-validate passes (no retry-burn). Reuses the existing scope git machinery; no new
diff logic.

- StepState.progress_feedback_subtasks (persisted, to_dict/from_dict).
- Test: test_false_progress_routes_feedback_when_nothing_changed (+ once-guard pass-through).
- Edited templates_src .jinja + re-rendered.

make check: 2260 passed, 3 skipped; check-render byte-identical.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
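The false-progress condition in the commit above reduces to a single predicate over the `expected` and `actual` file lists. The sketch below is illustrative only: check_false_progress and already_nudged are hypothetical names standing in for the orchestrator logic and StepState.progress_feedback_subtasks.

```python
from typing import Sequence, Set, Tuple


def check_false_progress(
    subtask_id: str,
    expected: Sequence[str],
    actual: Sequence[str],
    already_nudged: Set[str],
) -> Tuple[bool, str]:
    """Flag a subtask that declares affected_files but changed nothing, at most once."""
    if expected and not actual and subtask_id not in already_nudged:
        already_nudged.add(subtask_id)
        return False, "False-progress: implement the change or report a blocker"
    return True, ""
```

A subtask with no declared affected_files is never flagged, since an empty diff is a legitimate outcome there.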
map-skill-eval documented only trigger-description tuning. Add a section that
directs anyone improving a skill's BODY/logic (outcome quality) to the worked,
reusable flow + harness instead of starting from scratch:
docs/whole-skill-optimization-flow.md (+ notes), tests/skills_eval/whole_skill/
spike_runner.py, and the key finding (prose scope/correctness discipline is
low-leverage for thin orchestrators; the levers are the affected_files contract +
mechanical validators; prose pays off for report format + trigger description).

Edited the .jinja source + re-rendered. make check: 2260 passed, check-render byte-identical.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@azalio azalio merged commit effd649 into main Jun 5, 2026
6 checks passed