Phase F: skill-eval description optimizer + HTML viewer (+ telegram isolation, applied map-plan description)#159
Merged
Merged
Conversation
… in eval_schema Contract-first typed structs for the description optimizer (AC-7): round-trip to_dict/from_dict mirroring EvalResultRecord's _MISSING-tolerant pattern, flat per-iteration token totals, overfit/selected flags, and per-iteration jsonl paths so the viewer renders without re-parsing. ProposerFn aliases the Callable[[str, list[EvalResultRecord]], str | None] proposer interface. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Default ProposerFn: propose_description() builds an argv list for `claude -p --output-format json` (never a shell string), sets MAP_INVOKED_BY on the subprocess env (INV-2/HC-7), reads only .result, and returns the stripped candidate text or None on non-zero exit / timeout / OSError / malformed JSON / empty .result. Untrusted eval-record text is passed as a discrete argv element — no shell interpolation. No anthropic import, no ANTHROPIC_API_KEY, no --model flag (INV-1/AC-10/D2). Adds the no-anthropic source-scan guard (test_inv1_no_anthropic_optimize.py, extended in ST-009). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ction optimize() runs an N-iteration loop (iter 0 = baseline) over a deterministic 60/40 split and selects the candidate that maximises held-out TEST pass-rate, never the overfit one (train↑/test↓ flagged overfit, structurally excluded from selection). Determinism without random/datetime: split_train_test keys a hashlib.sha256 ordering by seed (clock-free; run_ts supplied by caller). Each candidate is evaluated by re-seeding a throwaway .claude/ copy, patching ONLY the SKILL.md frontmatter description (fail-loud), and running runner.run_eval over train+test to per-iteration isolated jsonl paths (resume=False); the temp seed is removed in finally — production .claude/ and templates_src/ untouched. Proposer returning None records a proposal_failed iteration and the loop continues with baseline eligible; full-tie => baseline wins (no_improvement). Also: guard iterations>=1 (avoid latent IndexError) + tests for the fail-loud frontmatter raise paths; extend the no-anthropic guard to cover this module. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
render_html()/render_to_path() render a pure HTML report from the typed OptimizeResult (Producer-Owns-Parse, no dispatch/subprocess): one row per iteration with a difflib unified-diff vs the prior iteration, train/test pass-rates, token totals, the selected iteration marked, and overfit rows (train↑/test↓) highlighted red. Security: candidate_description is untrusted claude -p output, so the renderer uses jinja2 Environment(autoescape=True) plus explicit |e on every dynamic slot — an embedded <script> is escaped. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…er orchestration patch_skill_description() rewrites ONLY the YAML frontmatter `description:` block scalar of a SKILL.md(.jinja) — fail-loud (no partial write) on missing frontmatter/anchor, written as a `|-` block scalar that round-trips exactly. apply_optimized_description() edits the single-source templates_src/skills/<skill>/SKILL.md.jinja then re-renders both providers so generated trees stay byte-identical (INV-5/check-render), staging only the patched source + existing gate trees (scoped `git add --`, never `-A`, never commits). Path-safety (under templates_src, reject `.git/`) runs before any FS touch. Two distinct no-op messages (baseline wins vs winner==current); baseline never written back; skill-rules.json never auto-patched. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
`mapify skill-eval optimize <skill> --eval-set [--iterations --apply --open --dry-run]` drives the ST-003 optimizer (proposer + viewer + apply patcher), persisting OptimizeResult JSON + HTML under .map/eval-runs/<skill>/. Strict exit-code order: <5 entries -> Exit(2) (>=5 minimum) before any dispatcher; --dry-run prints the call budget + 'model: default (resolved by claude CLI)' and Exit(0) with zero quota; claude-absent -> Exit(1). run_ts is generated at the CLI boundary (clock-free core). `view <skill> [--result --open]` renders a stored result. --open is best-effort (never errors the run). Without --apply, nothing outside .map/ is touched. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ng note New map-plan (9), map-efficient (9), and larger map-debug (10) eval-sets for `skill-eval optimize`, each splittable 60/40 with n_test>=3 so the held-out test pass-rate is meaningful (a <5-entry set degenerates to a 0/1 coin-flip). README documents the purpose + >=8 sizing rationale. The 3-entry map_debug_eval_set.json smoke fixture is left untouched. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…docs Edited the .jinja sources (SKILL.md.jinja + skill-rules.json.jinja) to document `skill-eval optimize <skill> --eval-set --iterations --apply --open --dry-run` (anti-overfit held-out selection, propose-only default, two no-op cases) and `skill-eval view <skill> --result --open`, plus new skill-rules keywords/ intentPatterns; requires-cmd stays ["claude"]. Ran `make render-templates`; check-render + skills-consistency + template-render green. Also re-syncs the generated .map/scripts/map_orchestrator.py to its .jinja source: the gitignore-tolerance in record_subtask_result was already in templates_src but the committed generated file was stale (.map/ is not in check-render's gate, so the drift went uncaught). render-templates corrects it. Follow-up (noted): add .map/scripts/ to check-render's gate to catch such drift. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… green The INV-1/AC-10 AST guard now covers proposer, description_optimizer, viewer, and apply_patcher (no `import anthropic`, no ANTHROPIC_API_KEY). Integration gate verified: `make check` green (2247 passed), `make check-render` green, pyright 0/0/0 on every touched Phase F source + test file. uv-run confirmed to resolve to this worktree, so the green reflects the real Phase F code. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…tener The eval dispatcher's claude -p subprocess inherits the user's telegram-bridge plugin SessionStart hook, which injects an "always-listen — run `tg listen`" instruction. When the eval agent obeys it, `tg listen` blocks on the Telegram long-poll until the dispatch timeout, so a triggered-skill cell mis-records as a non-trigger (and the run wall-clock explodes — observed cells hitting the full 120s/3600s timeout). Fix: `_eval_subprocess_env(cwd)` sets TG_STATE_DIR to a config-less path inside the throwaway eval cwd. Any `tg listen`/`tg send` the agent runs inherits this env, finds no config.json, and exits immediately (`die`) instead of blocking — neutralising the hang. The operator's real ~/.claude/telegram config is never touched (per-subprocess override on a path removed with the temp cwd). Plugin hooks run in a restricted env so the cosmetic injection may still appear, but it is inert. MAP_INVOKED_BY guard preserved. Verified end-to-end: previously-hanging prompts now finish in ~80-160s through the real ClaudeSubprocessDispatcher with the skill still triggering. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…on cap to the spec (1024) Applies the optimizer's winning map-plan description (the candidate selected on held-out TEST pass-rate: 25%->50% vs baseline, no overfit) to the single-source .jinja and re-renders byte-identical trees. The winner is 622 chars, which the old self-imposed 250-char test rejected. That 250 was a transient Claude Code v2.1.86 listing cap — raised to 1536 in v2.1.105 and replaced by a usage-ranked listing budget in v2.1.129+. The Agent Skills spec maximum for `description` is 1024 chars, and the model loads the FULL description for triggering. So 250 was stale; the test now enforces the real 1024-char spec limit. The proposer is constrained to <=1024 chars (prompt instruction + hard rejection of an over-limit candidate) so the optimizer only ever proves a shippable description. Refs: anthropics/claude-code #40121, #47627; code.claude.com/docs/en/skills. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Phase F — Skill-Eval Description Optimizer + HTML Viewer (+ fixes)
Builds the optimization/reporting layer on top of the shipped F.1 eval engine, plus two fixes surfaced by a real end-to-end run.
What's in it
Phase F (9 subtasks, each Monitor-approved):
eval_schema.py:OptimizeIterationRecord/OptimizeResult/ProposerFn(round-trip to_dict/from_dict).proposer.py:claude -pdescription proposer (argv-list,MAP_INVOKED_BY, None on all failures).description_optimizer.py: anti-overfit optimizer — deterministic hashlib 60/40 split, selection by held-out TEST pass-rate (overfit candidates flagged and never selected), per-iterationresume=Falsejsonl isolation, candidate.claude/re-seed cleaned infinally. Clock-free (caller suppliesrun_ts).viewer.py: jinja2 HTML report —autoescape=True(XSS-safe), difflib per-iteration diff, overfit rows red.apply_patcher.py: fail-loud block-scalar frontmatter--applypatcher → re-render to byte-identical trees, scopedgit add, path-safe.optimize/viewCLI subcommands (exit2/0/1, dry-run budget,--openbest-effort, JSON+HTML artifacts).anthropicguard over all 4 new modules.Fix 1 — eval/telegram isolation (
f46e78d): the evalclaude -psubprocess inherited the user'stelegram-bridgeSessionStart hook (tg listen), which could block on the Telegram long-poll until the dispatch timeout (a triggered-skill cell then mis-records as a non-trigger). The dispatcher now setsTG_STATE_DIRto a config-less path so anytg/tg listenthe agent runs exits immediately instead of blocking. (Best-effort: the SessionStart injection still appears via the plugin hook env, so a perfectly-obedient agent can still spend a few turns on it; the hard block is removed.)Fix 2 — apply optimized map-plan description + correct the description cap (
b1c515f): a real optimize run on map-plan selected a candidate that doubled the held-out TEST pass-rate (25%→50%, no overfit); applied it. The winner is 622 chars, which the old self-imposed 250-char test rejected — but 250 was a transient Claude Code v2.1.86 cap (raised to 1536 in v2.1.105, replaced by a usage-ranked listing budget in v2.1.129+). The Agent Skills spec maximum is 1024 chars and the model loads the full description for triggering, so the test now enforces 1024 and the proposer caps candidates at 1024 (prompt instruction + hard rejection of over-limit proposals). Refs: anthropics/claude-code #40121, #47627; code.claude.com/docs/en/skills.Real run evidence (2 skills, not mocked)
Verification
make check(lint + pyright 0/0/0 + full pytest) andmake check-rendergreen on a clean run (2247→2256 tests as the suite grew across subtasks).🤖 Generated with Claude Code