Whole-skill outcome-eval harness + map-task body hardening + skill eval-sets #160
Merged
…rdening + skill eval-sets
Whole-skill optimization (pilot: map-task), Approach B (measure -> human-edit), body-only:
- Harness: tests/skills_eval/whole_skill/spike_runner.py - seeds an isolated temp
(.claude + .map/scripts + fixture repo, git init), runs `claude -p /map-task ST-001`
with hybrid scoring (deterministic scope/task gates from git diff + a trace-cited LLM
judge; QUALITY = gate*(0.5+0.5*judge)); expected_outcome-aware (complete|blocked).
- Fixtures: scope-trap (F2) + impossible/blocker (F3) under
tests/skills_eval/fixtures/whole_skill/ (real mini-repos + committed .map plan/blueprint).
- Finding (2 fixtures, 12 runs, 2 llm-council consults): generic scope/blocker PROSE in a
thin-orchestration body is LOW-LEVERAGE - body-good == body-bad == QUALITY 1.0, because the
shared actor/monitor agents enforce those behaviors. The body's real lever is sequencing,
context relay, retry/termination, and the final report contract.
- map-task body hardening (body-owned surfaces): formalized the Outcome Report (COMPLETE +
a new BLOCKED schema with required fields), explicit retry-exhaustion/impossible-in-scope
termination; fixed a dead reference, a placeholder example, and an awkward artifact section.
Regression-proved: QUALITY 1.0 on F1+F3 (no outcome regression). Honest claim: cleaner/more
complete body, no regression - not a coding-quality gain (that needs the shared agent prompts).
- Docs: docs/whole-skill-optimization-{notes,flow}.md (method + reusable flow for other skills);
docs/SKILL-EVAL.md (description-tuning engine guide) + USAGE.md pointer.
- 13 description-optimize eval-sets for the remaining /map-* skills (tests/skills_eval/fixtures/).
- Tooling hygiene: exclude the whole_skill fixture mini-repos from pytest/ruff/pyright/mypy
(they are intentionally-broken seeded repos).
make check: 2257 passed, 3 skipped; ruff/mypy/pyright clean; check-render byte-identical.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
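The hybrid score quoted in the commit above, QUALITY = gate*(0.5+0.5*judge), can be sketched as follows. This is a minimal illustration, not the code in spike_runner.py: the function name `quality`, the two boolean gate inputs, and the assumption that the judge score is normalized to [0, 1] are all hypothetical.

```python
def quality(scope_ok: bool, task_ok: bool, judge: float) -> float:
    """Hybrid score: deterministic gates multiply a judge-weighted term.

    Assumptions (not confirmed by the source): both gates are booleans that
    are AND-ed into a single 0/1 gate, and `judge` is normalized to [0, 1].
    """
    if not 0.0 <= judge <= 1.0:
        raise ValueError("judge score must be in [0, 1]")
    # A failed gate zeroes the score regardless of what the judge says.
    gate = 1.0 if (scope_ok and task_ok) else 0.0
    # A passing run scores at least 0.5; the judge only moves the top half.
    return gate * (0.5 + 0.5 * judge)


if __name__ == "__main__":
    for args in [(True, True, 1.0), (True, True, 0.0), (False, True, 1.0)]:
        print(args, quality(*args))
```

Under these assumptions the judge can only move a passing run between 0.5 and 1.0, while a gate failure always yields 0, so the deterministic checks dominate the metric.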
…ow-leverage (body AND actor)
Extend the whole-skill harness with `--degrade {body,actor,monitor}` and add a strong
scope-pressure fixture (the obvious one-line fix is out-of-scope), then ablate the ACTOR prompt:
- spike_runner.py: `--degrade` targets which prompt the 'bad' variant degrades; `_degrade_actor`
strips actor.md's Mutation Boundary section + the quick-ref NEVER-scope clause (seed-only);
`_degrade_monitor` stub (best-effort).
- Fixture tests/skills_eval/fixtures/whole_skill/map_task_scope_pressure: RATE trap — changing the
shared RATE in config.py is the tempting out-of-scope fix; the in-scope fix is a surcharge in utils.py.
Result (current actor vs --degrade actor, 3 runs each): BOTH kept scope perfectly (config.py never
touched; only utils.py edited). The QUALITY delta (0.80 vs 1.00) is judge NOISE (inverted; the judge
penalized the current actor for lacking verbose scope-reasoning prose despite perfect actual scope).
Consolidated across 3 ablations (body + actor, 18 runs): prose-level scope discipline is low-leverage;
scope is governed by the blueprint affected_files contract + base-model competence + the mechanical
mutation-boundary/test-gate/monitor. Methodology note recorded: for scope, trust the deterministic
gate — the scope_discipline judge dimension is verbosity-biased. Full log in
docs/whole-skill-optimization-notes.md.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
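The `--degrade` ablation described above (pick which prompt the 'bad' variant degrades, then strip a section from the seeded copy) might look roughly like the sketch below. It is only an illustration: `build_parser`, `strip_section`, the `body` default, and the assumption that the Mutation Boundary section is introduced by a Markdown `#` heading are hypothetical and not taken from spike_runner.py.

```python
import argparse
import re

# Matches ATX-style Markdown headings such as "## Mutation Boundary".
_HEADING = re.compile(r"^(#{1,6})\s+(.*?)\s*$")


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="whole-skill ablation (sketch)")
    # The default value is an assumption made for this sketch.
    parser.add_argument("--degrade", choices=["body", "actor", "monitor"],
                        default="body",
                        help="which prompt the 'bad' variant degrades")
    return parser


def strip_section(markdown: str, title: str) -> str:
    """Drop the section headed `title`, including its nested subsections.

    Caveat: '#' lines inside fenced code blocks are treated as headings too.
    """
    kept = []
    skip_level = None  # heading level of the section currently being dropped
    for line in markdown.splitlines(keepends=True):
        match = _HEADING.match(line)
        if match:
            level = len(match.group(1))
            # A heading at the same or a higher level ends the dropped section.
            if skip_level is not None and level <= skip_level:
                skip_level = None
            if skip_level is None and match.group(2).lower() == title.lower():
                skip_level = level
                continue
        if skip_level is None:
            kept.append(line)
    return "".join(kept)
```

Because the degradation would be applied to the seeded temp copy only, the real prompt files stay untouched between runs.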
…on_boundary
Option B (verify/strengthen the mechanical scope lever). The mutation-boundary validator is
correct + warn-only by design (strict via MAP_STRICT_SCOPE), but its tests only exercised
committed/staged extra files. Add a regression test proving a NEW out-of-scope file the actor
creates but never `git add`s (porcelain '??') is still flagged as unexpected/warning: the
real-world scope-leak case.
Notes: documented the lever verification + the strengthening options (default-strict vs
warn->actor-feedback vs single-subtask-strict) for a policy decision.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
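The untracked-file case above can be illustrated by parsing `git status --porcelain` output, where a never-added file shows up with the `??` status. This is a sketch under stated assumptions, not the validator's implementation: `unexpected_paths` and `read_porcelain` are hypothetical names, and quoted or unusual path encodings are ignored.

```python
from __future__ import annotations

import subprocess


def read_porcelain(repo_dir: str) -> str:
    """Return porcelain v1 status, listing untracked files individually."""
    result = subprocess.run(
        ["git", "status", "--porcelain", "--untracked-files=all"],
        cwd=repo_dir, capture_output=True, text=True, check=True,
    )
    return result.stdout


def unexpected_paths(porcelain: str, affected_files: set[str]) -> list[str]:
    """Return changed or untracked paths that are not declared in scope."""
    leaks = []
    for line in porcelain.splitlines():
        if len(line) < 4:
            continue
        # Porcelain v1 lines are "XY <path>"; '??' marks an untracked file.
        status, path = line[:2], line[3:]
        # Renames and copies are reported as "old -> new"; keep the new path.
        if ("R" in status or "C" in status) and " -> " in path:
            path = path.split(" -> ", 1)[1]
        if path not in affected_files:
            leaks.append(path)
    return leaks


if __name__ == "__main__":
    sample = " M utils.py\n?? scratch/notes.py\n"
    print(unexpected_paths(sample, {"utils.py"}))
```

Passing `--untracked-files=all` matters in this sketch because, by default, git collapses a wholly untracked directory into a single `dir/` entry, which would hide the individual file names.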
…leaks (self-healing)
Option (ii) for strengthening the mechanical scope lever. Previously a non-strict
scope leak detected by validate_mutation_boundary only produced a warn-log and the
MONITOR gate passed silently (hard-block only under MAP_STRICT_SCOPE). Now, at
validate_step("2.4"), a `warning` routes back to the Actor as feedback the FIRST
time it is seen per subtask (valid=False + actionable "revert the out-of-scope
changes OR escalate for a contract update"), so the actor self-corrects in the
existing retry loop — without a hard block.
- StepState.scope_feedback_subtasks (persisted, to_dict/from_dict) bounds the nudge
to ONCE per subtask, so a persistent false positive (affected_files drift) cannot
burn the retry budget — after the single nudge the gate passes.
- Strict-mode (MAP_STRICT_SCOPE=1) hard-reject path is unchanged.
- Edited the templates_src .jinja source and re-rendered (.map/scripts + templates).
- Tests: test_warning_routes_feedback_to_actor_once (orchestrator, incl. once-guard
pass-through) + test_warning_on_untracked_new_out_of_scope_file (validator).
make check: 2259 passed, 3 skipped; check-render byte-identical.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
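The once-per-subtask nudge described in this commit can be sketched as below. Only the names `StepState`, `scope_feedback_subtasks`, `to_dict`, and `from_dict` come from the commit message; the function `gate_scope_warning`, its signature, and the message wording are hypothetical, and the real orchestrator state certainly carries more fields.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class StepState:
    # Subtasks that have already received their single scope nudge.
    scope_feedback_subtasks: set[str] = field(default_factory=set)

    def to_dict(self) -> dict:
        # Sorted list so the persisted form is JSON-friendly and stable.
        return {"scope_feedback_subtasks": sorted(self.scope_feedback_subtasks)}

    @classmethod
    def from_dict(cls, data: dict) -> "StepState":
        return cls(set(data.get("scope_feedback_subtasks", [])))


def gate_scope_warning(state: StepState, subtask_id: str,
                       unexpected: list[str], strict: bool = False) -> tuple[bool, str]:
    """Return (valid, feedback) for a scope report (hypothetical helper)."""
    if not unexpected:
        return True, ""
    listing = ", ".join(unexpected)
    if strict:
        # Strict mode keeps hard-rejecting on every validation.
        return False, f"out-of-scope changes rejected: {listing}"
    if subtask_id in state.scope_feedback_subtasks:
        # Already nudged once: pass so a false positive cannot burn retries.
        return True, ""
    state.scope_feedback_subtasks.add(subtask_id)
    return False, ("revert the out-of-scope changes OR escalate for a "
                   f"contract update: {listing}")
```

Persisting the set through `to_dict`/`from_dict` is what would let the once-guard survive a restart between the nudge and the re-validation.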
…ng an empty subtask)
Apply the warn->actor-feedback pattern to MONITOR correctness validation. The
unenforced gap: MONITOR closing a subtask that declares affected_files but changed
NOTHING (empty diff) — a subtask that "completes" having done nothing.
At validate_step("2.4"), reusing the validate_mutation_boundary report: if
`expected` (declared affected_files) is non-empty but `actual` (changed files) is
empty, route back to the Actor once (valid=False + "False-progress: implement the
change or report a blocker"), bounded by StepState.progress_feedback_subtasks so a
re-validate passes (no retry-burn). Reuses the existing scope git machinery; no new
diff logic.
- StepState.progress_feedback_subtasks (persisted, to_dict/from_dict).
- Test: test_false_progress_routes_feedback_when_nothing_changed (+ once-guard pass-through).
- Edited templates_src .jinja + re-rendered.
make check: 2260 passed, 3 skipped; check-render byte-identical.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
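The false-progress condition above reduces to a small predicate over the boundary report: declared files exist, but nothing changed. The sketch below assumes the report is a dict with `expected` and `actual` lists; that shape, and the names `is_false_progress` and `gate_false_progress`, are hypothetical.

```python
from __future__ import annotations


def is_false_progress(expected: list[str], actual: list[str]) -> bool:
    """True when a subtask declares affected_files but changed nothing."""
    return bool(expected) and not actual


def gate_false_progress(report: dict, subtask_id: str,
                        nudged: set[str]) -> tuple[bool, str]:
    """Return (valid, feedback); `nudged` stands in for the persisted set."""
    if not is_false_progress(report.get("expected", []), report.get("actual", [])):
        return True, ""
    if subtask_id in nudged:
        # Second sighting for this subtask: pass instead of looping.
        return True, ""
    nudged.add(subtask_id)
    return False, "False-progress: implement the change or report a blocker"
```

A subtask that declares no affected_files is deliberately left alone here, since an empty diff is then consistent with the contract.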
map-skill-eval documented only trigger-description tuning. Add a section that directs anyone
improving a skill's BODY/logic (outcome quality) to the worked, reusable flow + harness instead
of starting from scratch: docs/whole-skill-optimization-flow.md (+ notes),
tests/skills_eval/whole_skill/spike_runner.py, and the key finding (prose scope/correctness
discipline is low-leverage for thin orchestrators; the levers are the affected_files contract +
mechanical validators; prose pays off for report format + trigger description).
Edited the .jinja source + re-rendered.
make check: 2260 passed, check-render byte-identical.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
What
Adds a whole-skill optimization capability (measuring and improving a skill's body, i.e. its SKILL.md logic, not just its trigger description) and applies it to the pilot skill map-task. Approach B (harness measures → human edits), body-only mutation.
Harness + fixtures (reusable flow for any skill)
- `tests/skills_eval/whole_skill/spike_runner.py`: seeds an isolated temp directory (.claude + .map/scripts + fixture repo, `git init`), runs `claude -p "/map-task ST-001"`, and scores the outcome with a hybrid metric: deterministic gates (scope fidelity via `git diff`, task-pass via the fixture's tests) + a trace-cited LLM judge. QUALITY = gate·(0.5 + 0.5·judge). expected_outcome-aware (complete|blocked).
- `tests/skills_eval/fixtures/whole_skill/`: a scope-trap and an impossible/blocker mini-repo (each a real git project + committed .map plan/blueprint).
Key empirical finding (2 fixtures, 12 runs, 2 llm-council consults)
Generic scope/blocker prose in a thin-orchestration body is low-leverage: Body-Good vs Body-Bad (rules stripped) both scored 1.0, because those behaviors are enforced by the shared actor/monitor agents, not the body. The body's real lever is sequencing, context relay, retry/termination, and the final report contract. (To move scope/correctness you must optimize the shared agent prompts, which is outside this PR's body-only scope.)
map-task body hardening (body-owned surfaces) + regression proof
Also
- Docs: `docs/whole-skill-optimization-{notes,flow}.md` (method + the reusable flow for other skills) and `docs/SKILL-EVAL.md` (description-tuning engine guide) + USAGE pointer.
- 13 description-optimize eval-sets for the remaining `/map-*` skills.
- Tooling hygiene: exclude the `whole_skill` fixture mini-repos from pytest/ruff/pyright/mypy.
Validation
`make check`: 2257 passed, 3 skipped; ruff/mypy/pyright clean; `check-render` byte-identical. Whole-skill measurements run via `claude -p` (subscription auth).
🤖 Generated with Claude Code