Whole-skill outcome-eval harness + map-task body hardening + skill eval-sets by azalio · Pull Request #160 · azalio/map-framework

azalio · 2026-06-05T00:33:31Z

What

Adds a whole-skill optimization capability (measuring and improving a skill's body, i.e. its SKILL.md logic, not just its trigger description) and applies it to the pilot skill map-task. Approach B (harness measures → human edits), body-only mutation.

Harness + fixtures (reusable flow for any skill)

tests/skills_eval/whole_skill/spike_runner.py — seeds an isolated temp (.claude + .map/scripts + fixture repo, git init), runs claude -p "/map-task ST-001", and scores the outcome with a hybrid metric: deterministic gates (scope fidelity via git diff, task-pass via the fixture's tests) + a trace-cited LLM judge. QUALITY = gate·(0.5 + 0.5·judge). expected_outcome-aware (complete|blocked).
Fixtures under tests/skills_eval/fixtures/whole_skill/: a scope-trap and an impossible/blocker mini-repo (each a real git project + committed .map/ plan/blueprint).
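
As a rough illustration of the hybrid metric above, the formula QUALITY = gate·(0.5 + 0.5·judge) can be sketched as follows. This is not the code in spike_runner.py; the function name `quality` is hypothetical, and the sketch assumes the gate is binary (all deterministic gates passed or not) and the judge score is normalized to [0, 1].

```python
def quality(gate_passed: bool, judge_score: float) -> float:
    """Combine a deterministic gate with an LLM-judge score.

    Assumption: gate is 0 or 1, judge_score lies in [0, 1].
    A failed gate zeroes the result; a passed gate guarantees at
    least 0.5, and the judge contributes the remaining 0.5.
    """
    if not 0.0 <= judge_score <= 1.0:
        raise ValueError("judge_score must be in [0, 1]")
    gate = 1.0 if gate_passed else 0.0
    return gate * (0.5 + 0.5 * judge_score)


if __name__ == "__main__":
    print(quality(True, 1.0))   # gate passed, perfect judge score
    print(quality(False, 1.0))  # gate failed: judge score is ignored
```

Under these assumptions the judge can only move a passing run between 0.5 and 1.0, so a gate failure always dominates the score.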

Key empirical finding (2 fixtures, 12 runs, 2 llm-council consults)

Generic scope/blocker prose in a thin-orchestration body is low-leverage: Body-Good vs Body-Bad (rules stripped) scored identically 1.0 — those behaviors are enforced by the shared actor/monitor agents, not the body. The body's real lever is sequencing, context relay, retry/termination, and the final report contract. (To move scope/correctness you must optimize the shared agent prompts — out of this PR's body-only scope.)

map-task body hardening (body-owned surfaces) + regression proof

Formalized the Outcome Report (required fields) and added the previously-missing BLOCKED outcome report; explicit retry-exhaustion / impossible-in-scope termination (stop + report blocker, never fake-complete or expand scope).
Fixed real defects: dead "What this command CANNOT do" reference, placeholder example, awkward artifact section.
Regression-proved: QUALITY 1.0 on both fixtures ×3 after the edit (no outcome regression). Honest claim: cleaner/more-complete body, not a coding-quality gain.

Also

docs/whole-skill-optimization-{notes,flow}.md (method + the reusable flow for other skills) and docs/SKILL-EVAL.md (description-tuning engine guide) + USAGE pointer.
13 description-optimize eval-sets for the remaining /map-* skills.
Tooling hygiene: exclude the intentionally-broken whole_skill fixture mini-repos from pytest/ruff/pyright/mypy.

Validation

make check: 2257 passed, 3 skipped; ruff/mypy/pyright clean; check-render byte-identical. Whole-skill measurements run via claude -p (subscription auth).

🤖 Generated with Claude Code

…rdening + skill eval-sets

Whole-skill optimization (pilot: map-task), Approach B (measure -> human-edit), body-only:

- Harness: tests/skills_eval/whole_skill/spike_runner.py - seeds an isolated temp (.claude + .map/scripts + fixture repo, git init), runs `claude -p /map-task ST-001` with hybrid scoring (deterministic scope/task gates from git diff + a trace-cited LLM judge; QUALITY = gate*(0.5+0.5*judge)); expected_outcome-aware (complete|blocked).
- Fixtures: scope-trap (F2) + impossible/blocker (F3) under tests/skills_eval/fixtures/whole_skill/ (real mini-repos + committed .map plan/blueprint).
- Finding (2 fixtures, 12 runs, 2 llm-council consults): generic scope/blocker PROSE in a thin-orchestration body is LOW-LEVERAGE - body-good == body-bad == QUALITY 1.0, because the shared actor/monitor agents enforce those behaviors. The body's real lever is sequencing, context relay, retry/termination, and the final report contract.
- map-task body hardening (body-owned surfaces): formalized the Outcome Report (COMPLETE + a new BLOCKED schema with required fields), explicit retry-exhaustion/impossible-in-scope termination; fixed a dead reference, a placeholder example, and an awkward artifact section. Regression-proved: QUALITY 1.0 on F1+F3 (no outcome regression). Honest claim: cleaner/more complete body, no regression - not a coding-quality gain (that needs the shared agent prompts).
- Docs: docs/whole-skill-optimization-{notes,flow}.md (method + reusable flow for other skills); docs/SKILL-EVAL.md (description-tuning engine guide) + USAGE.md pointer.
- 13 description-optimize eval-sets for the remaining /map-* skills (tests/skills_eval/fixtures/).
- Tooling hygiene: exclude the whole_skill fixture mini-repos from pytest/ruff/pyright/mypy (they are intentionally-broken seeded repos).

make check: 2257 passed, 3 skipped; ruff/mypy/pyright clean; check-render byte-identical.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…ow-leverage (body AND actor)

Extend the whole-skill harness with `--degrade {body,actor,monitor}` and add a strong scope-pressure fixture (the obvious one-line fix is out-of-scope), then ablate the ACTOR prompt:

- spike_runner.py: `--degrade` targets which prompt the 'bad' variant degrades; `_degrade_actor` strips actor.md's Mutation Boundary section + the quick-ref NEVER-scope clause (seed-only); `_degrade_monitor` stub (best-effort).
- Fixture tests/skills_eval/fixtures/whole_skill/map_task_scope_pressure: RATE trap - changing the shared RATE in config.py is the tempting out-of-scope fix; the in-scope fix is a surcharge in utils.py.

Result (current actor vs --degrade actor, 3 runs each): BOTH kept scope perfectly (config.py never touched; only utils.py edited). The QUALITY delta (0.80 vs 1.00) is judge NOISE (inverted; the judge penalized the current actor for lacking verbose scope-reasoning prose despite perfect actual scope).

Consolidated across 3 ablations (body + actor, 18 runs): prose-level scope discipline is low-leverage; scope is governed by the blueprint affected_files contract + base-model competence + the mechanical mutation-boundary/test-gate/monitor.

Methodology note recorded: for scope, trust the deterministic gate - the scope_discipline judge dimension is verbosity-biased. Full log in docs/whole-skill-optimization-notes.md.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
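
The deterministic scope gate that the methodology note recommends trusting can be pictured as a plain set comparison between the files a run changed and the blueprint's affected_files. The sketch below is illustrative only: `scope_gate` is a hypothetical name, not a function from spike_runner.py, and it assumes both inputs are already lists of repo-relative paths.

```python
def scope_gate(changed_files: list[str], affected_files: list[str]) -> tuple[bool, list[str]]:
    """Return (passed, out_of_scope_files).

    Assumption: the gate passes only when every changed file is
    declared in the subtask's affected_files contract.
    """
    out_of_scope = sorted(set(changed_files) - set(affected_files))
    return (not out_of_scope, out_of_scope)


if __name__ == "__main__":
    # Mirrors the RATE-trap fixture: utils.py is in scope, config.py is not.
    print(scope_gate(["utils.py"], ["utils.py"]))
    print(scope_gate(["utils.py", "config.py"], ["utils.py"]))
```

Because the result depends only on file paths, it is unaffected by how much scope reasoning the actor writes, which is the property the judge dimension lacked.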

…on_boundary

Option B (verify/strengthen the mechanical scope lever). The mutation-boundary validator is correct + warn-only by design (strict via MAP_STRICT_SCOPE), but its tests only exercised committed/staged extra files. Add a regression test proving a NEW out-of-scope file the actor creates but never `git add`s (porcelain '??') is still flagged as unexpected/warning - the real-world scope-leak case.

Notes: documented the lever verification + the strengthening options (default-strict vs warn->actor-feedback vs single-subtask-strict) for a policy decision.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
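
To show why the untracked case needs its own handling, here is a minimal sketch of separating `??` entries from tracked changes in `git status --porcelain` (v1) output. It is not the validator's actual code; `split_porcelain` is a hypothetical helper, and it assumes simple paths (no quoting of special characters).

```python
def split_porcelain(porcelain: str) -> dict[str, list[str]]:
    """Split `git status --porcelain` v1 output into tracked and untracked paths.

    Each line is two status characters, a space, then the path.
    Untracked files carry the status '??' and never show up in a
    diff of committed or staged content.
    """
    tracked: list[str] = []
    untracked: list[str] = []
    for line in porcelain.splitlines():
        if len(line) < 4:
            continue  # too short to hold "XY path"
        status, path = line[:2], line[3:]
        if status == "??":
            untracked.append(path)
        else:
            # Renames are reported as "old -> new"; keep the new path.
            tracked.append(path.split(" -> ")[-1])
    return {"tracked": tracked, "untracked": untracked}


if __name__ == "__main__":
    sample = " M utils.py\n?? scratch/notes.py\n"
    print(split_porcelain(sample))
```

By default git collapses an untracked directory into a single `?? dir/` entry; passing `--untracked-files=all` lists the individual files, which matters when comparing against a per-file contract.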

…leaks (self-healing)

Option (ii) for strengthening the mechanical scope lever. Previously a non-strict scope leak detected by validate_mutation_boundary only produced a warn-log and the MONITOR gate passed silently (hard-block only under MAP_STRICT_SCOPE). Now, at validate_step("2.4"), a `warning` routes back to the Actor as feedback the FIRST time it is seen per subtask (valid=False + actionable "revert the out-of-scope changes OR escalate for a contract update"), so the actor self-corrects in the existing retry loop, without a hard block.

- StepState.scope_feedback_subtasks (persisted, to_dict/from_dict) bounds the nudge to ONCE per subtask, so a persistent false positive (affected_files drift) cannot burn the retry budget: after the single nudge the gate passes.
- Strict-mode (MAP_STRICT_SCOPE=1) hard-reject path is unchanged.
- Edited the templates_src .jinja source and re-rendered (.map/scripts + templates).
- Tests: test_warning_routes_feedback_to_actor_once (orchestrator, incl. once-guard pass-through) + test_warning_on_untracked_new_out_of_scope_file (validator).

make check: 2259 passed, 3 skipped; check-render byte-identical.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
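
The once-per-subtask guard described above can be sketched roughly as below. This is a simplified stand-in, not the orchestrator's real StepState or validate_step: `StepStateSketch` and `validate_scope` are hypothetical names, the strict-mode message is invented for illustration, and only the field name scope_feedback_subtasks and the feedback text come from the commit message.

```python
from dataclasses import dataclass, field


@dataclass
class StepStateSketch:
    """Hypothetical stand-in holding the subtasks that were already nudged."""
    scope_feedback_subtasks: set[str] = field(default_factory=set)

    def to_dict(self) -> dict:
        return {"scope_feedback_subtasks": sorted(self.scope_feedback_subtasks)}

    @classmethod
    def from_dict(cls, data: dict) -> "StepStateSketch":
        return cls(set(data.get("scope_feedback_subtasks", [])))


def validate_scope(state: StepStateSketch, subtask_id: str,
                   unexpected: list[str], strict: bool = False) -> tuple[bool, str]:
    """Return (valid, feedback) for one validation attempt."""
    if not unexpected:
        return True, ""
    if strict:
        # Assumed strict behaviour: always reject (message is illustrative).
        return False, f"out-of-scope files: {sorted(unexpected)}"
    if subtask_id not in state.scope_feedback_subtasks:
        state.scope_feedback_subtasks.add(subtask_id)  # nudge only once
        return False, "revert the out-of-scope changes OR escalate for a contract update"
    return True, ""  # already nudged: pass so the retry budget is not burned


if __name__ == "__main__":
    state = StepStateSketch()
    print(validate_scope(state, "ST-001", ["config.py"]))
    print(validate_scope(state, "ST-001", ["config.py"]))
```

Persisting the set through to_dict/from_dict is what keeps the guard effective across a resumed workflow; an in-memory flag would nudge again after a restart.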

…ng an empty subtask)

Apply the warn->actor-feedback pattern to MONITOR correctness validation. The unenforced gap: MONITOR closing a subtask that declares affected_files but changed NOTHING (empty diff), i.e. a subtask that "completes" having done nothing. At validate_step("2.4"), reusing the validate_mutation_boundary report: if `expected` (declared affected_files) is non-empty but `actual` (changed files) is empty, route back to the Actor once (valid=False + "False-progress: implement the change or report a blocker"), bounded by StepState.progress_feedback_subtasks so a re-validate passes (no retry-burn). Reuses the existing scope git machinery; no new diff logic.

- StepState.progress_feedback_subtasks (persisted, to_dict/from_dict).
- Test: test_false_progress_routes_feedback_when_nothing_changed (+ once-guard pass-through).
- Edited templates_src .jinja + re-rendered.

make check: 2260 passed, 3 skipped; check-render byte-identical.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
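
The false-progress condition itself is small enough to state directly. The following sketch assumes the validator report exposes `expected` and `actual` as lists of paths, as the commit message describes; `is_false_progress` and `validate_progress` are hypothetical names, and a plain set stands in for StepState.progress_feedback_subtasks.

```python
def is_false_progress(expected: list[str], actual: list[str]) -> bool:
    """True when a subtask declares files to change but changed none."""
    return bool(expected) and not actual


def validate_progress(subtask_id: str, expected: list[str], actual: list[str],
                      nudged: set[str]) -> tuple[bool, str]:
    """Route feedback to the actor once per subtask, then let the gate pass."""
    if is_false_progress(expected, actual) and subtask_id not in nudged:
        nudged.add(subtask_id)
        return False, "False-progress: implement the change or report a blocker"
    return True, ""


if __name__ == "__main__":
    nudged: set[str] = set()
    print(validate_progress("ST-001", ["utils.py"], [], nudged))
    print(validate_progress("ST-001", ["utils.py"], [], nudged))
```

A subtask with an empty affected_files list is deliberately not flagged here, since an empty diff is then consistent with its contract.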

map-skill-eval documented only trigger-description tuning. Add a section that directs anyone improving a skill's BODY/logic (outcome quality) to the worked, reusable flow + harness instead of starting from scratch: docs/whole-skill-optimization-flow.md (+ notes), tests/skills_eval/whole_skill/spike_runner.py, and the key finding (prose scope/correctness discipline is low-leverage for thin orchestrators; the levers are the affected_files contract + mechanical validators; prose pays off for report format + trigger description).

Edited the .jinja source + re-rendered.

make check: 2260 passed, check-render byte-identical.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

azalio and others added 6 commits June 5, 2026 03:32

azalio merged commit effd649 into main Jun 5, 2026
6 checks passed

azalio merged 6 commits into main from organ-pipe-pyrite
