Phase F: skill-eval description optimizer + HTML viewer (+ telegram isolation, applied map-plan description) by azalio · Pull Request #159 · azalio/map-framework

azalio · 2026-06-04T19:23:05Z

Phase F — Skill-Eval Description Optimizer + HTML Viewer (+ fixes)

Builds the optimization/reporting layer on top of the shipped F.1 eval engine, plus two fixes surfaced by a real end-to-end run.

What's in it

Phase F (9 subtasks, each Monitor-approved):

eval_schema.py: OptimizeIterationRecord / OptimizeResult / ProposerFn (round-trip to_dict/from_dict).
proposer.py: claude -p description proposer (argv-list, MAP_INVOKED_BY, None on all failures).
description_optimizer.py: anti-overfit optimizer — deterministic hashlib 60/40 split, selection by held-out TEST pass-rate (overfit candidates flagged and never selected), per-iteration resume=False jsonl isolation, candidate .claude/ re-seed cleaned in finally. Clock-free (caller supplies run_ts).
viewer.py: jinja2 HTML report — autoescape=True (XSS-safe), difflib per-iteration diff, overfit rows red.
apply_patcher.py: fail-loud block-scalar frontmatter --apply patcher → re-render to byte-identical trees, scoped git add, path-safe.
optimize / view CLI subcommands (exit 2/0/1, dry-run budget, --open best-effort, JSON+HTML artifacts).
3 optimizer eval-set fixtures (≥8 entries) + authoring note; no-anthropic guard over all 4 new modules.
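
The "deterministic hashlib 60/40 split" listed for description_optimizer.py can be pictured with the sketch below. It is an illustration, not the module's code. The name `split_train_test` appears later in the commit log, but the signature, the `seed:entry_id` key format, and the floor-based train size are assumptions made here.

```python
import hashlib


def split_train_test(entry_ids, seed):
    """Split ids into (train, test) with no random module and no clock.

    Assumption: the ordering key is sha256 over "seed:entry_id"; the real
    key format is not shown in this PR.
    """
    def key(entry_id):
        return hashlib.sha256(f"{seed}:{entry_id}".encode("utf-8")).hexdigest()

    # Sorting by digest gives an order that depends only on seed and ids,
    # not on the order the entries were listed in the eval-set file.
    ordered = sorted(entry_ids, key=key)

    # Assumption: train gets floor(60% of n), in integer arithmetic.
    n_train = (len(ordered) * 3) // 5
    return ordered[:n_train], ordered[n_train:]
```

Because the order depends only on the seed and the ids, re-running with the same seed reproduces the same held-out set, which is what makes per-iteration pass-rates comparable.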

Fix 1 — eval/telegram isolation (f46e78d): the eval claude -p subprocess inherited the user's telegram-bridge SessionStart hook (tg listen), which could block on the Telegram long-poll until the dispatch timeout (a triggered-skill cell then mis-records as a non-trigger). The dispatcher now sets TG_STATE_DIR to a config-less path so any tg/tg listen the agent runs exits immediately instead of blocking. (Best-effort: the SessionStart injection still appears via the plugin hook env, so a perfectly-obedient agent can still spend a few turns on it; the hard block is removed.)

Fix 2 — apply optimized map-plan description + correct the description cap (b1c515f): a real optimize run on map-plan selected a candidate that doubled the held-out TEST pass-rate (25%→50%, no overfit); applied it. The winner is 622 chars, which the old self-imposed 250-char test rejected — but 250 was a transient Claude Code v2.1.86 cap (raised to 1536 in v2.1.105, replaced by a usage-ranked listing budget in v2.1.129+). The Agent Skills spec maximum is 1024 chars and the model loads the full description for triggering, so the test now enforces 1024 and the proposer caps candidates at 1024 (prompt instruction + hard rejection of over-limit proposals). Refs: anthropics/claude-code #40121, #47627; code.claude.com/docs/en/skills.

Real run evidence (2 skills, not mocked)

map-plan: winner = iter 1 — baseline 20%/25% (train/test) → candidate 20%/50%, selected on held-out test, not overfit.
map-debug: no improvement (full tie → baseline retained), 25%.

Verification

make check (lint + pyright 0/0/0 + full pytest) and make check-render green on a clean run (2247→2256 tests as the suite grew across subtasks).
Final Verifier (Ralph loop): PASS.

Note for reviewers: on the author's machine, two timing-sensitive hook tests flake under heavy local load + an unusually large active-session transcript (test_hook_inventory_smoke::test_every_configured_hook_execs_via_shebang — resolve_session_id scanning a multi-MB transcript; test_safety_guardrails::test_execution_under_100ms — 116ms vs 100ms). Both are unrelated to this diff (it touches no hooks/memory) and pass in clean CI.

🤖 Generated with Claude Code

… in eval_schema Contract-first typed structs for the description optimizer (AC-7): round-trip to_dict/from_dict mirroring EvalResultRecord's _MISSING-tolerant pattern, flat per-iteration token totals, overfit/selected flags, and per-iteration jsonl paths so the viewer renders without re-parsing. ProposerFn aliases the Callable[[str, list[EvalResultRecord]], str | None] proposer interface. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
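
A rough sketch of a round-trip to_dict/from_dict record with a sentinel-tolerant loader follows. Only the class name, the round-trip requirement, and the ProposerFn shape come from the commit message. The field names and the exact `_MISSING` handling are hypothetical, chosen for illustration.

```python
from dataclasses import asdict, dataclass
from typing import Any, Callable, Optional

_MISSING = object()  # sentinel: "key absent", as opposed to an explicit None


@dataclass
class OptimizeIterationRecord:
    # Hypothetical fields; the real record carries more (tokens, jsonl paths).
    iteration: int
    candidate_description: Optional[str]
    train_pass_rate: float
    test_pass_rate: float
    overfit: bool = False
    selected: bool = False

    def to_dict(self) -> dict[str, Any]:
        return asdict(self)

    @classmethod
    def from_dict(cls, data: dict[str, Any]) -> "OptimizeIterationRecord":
        def get(name: str, default: Any = _MISSING) -> Any:
            value = data.get(name, _MISSING)
            if value is _MISSING:
                if default is _MISSING:
                    raise KeyError(f"missing required field: {name}")
                return default  # tolerate older payloads lacking this key
            return value

        return cls(
            iteration=get("iteration"),
            candidate_description=get("candidate_description", None),
            train_pass_rate=get("train_pass_rate"),
            test_pass_rate=get("test_pass_rate"),
            overfit=get("overfit", False),
            selected=get("selected", False),
        )


# Proposer interface as described above; a list of plain dicts stands in
# here for list[EvalResultRecord].
ProposerFn = Callable[[str, list[dict[str, Any]]], Optional[str]]
```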

Default ProposerFn: propose_description() builds an argv list for `claude -p --output-format json` (never a shell string), sets MAP_INVOKED_BY on the subprocess env (INV-2/HC-7), reads only .result, and returns the stripped candidate text or None on non-zero exit / timeout / OSError / malformed JSON / empty .result. Untrusted eval-record text is passed as a discrete argv element — no shell interpolation. No anthropic import, no ANTHROPIC_API_KEY, no --model flag (INV-1/AC-10/D2). Adds the no-anthropic source-scan guard (test_inv1_no_anthropic_optimize.py, extended in ST-009). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
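
The failure contract above (argv list, only `.result` is read, None on every failure) might look roughly like this. It is a sketch under assumptions: the parameter names, the default timeout, and the MAP_INVOKED_BY value are invented for illustration, and the real function also builds the prompt from eval records.

```python
import json
import os
import subprocess
from typing import Optional, Sequence


def propose_description(
    prompt: str,
    base_argv: Sequence[str] = ("claude", "-p", "--output-format", "json"),
    timeout_s: float = 120.0,  # hypothetical default
) -> Optional[str]:
    """Return the stripped candidate text, or None on any failure."""
    # The prompt is one discrete argv element: no shell, no interpolation.
    argv = [*base_argv, prompt]
    env = dict(os.environ)
    env["MAP_INVOKED_BY"] = "skill-eval-optimize"  # hypothetical value
    try:
        proc = subprocess.run(
            argv, env=env, capture_output=True, text=True, timeout=timeout_s
        )
    except (subprocess.TimeoutExpired, OSError):
        return None  # timeout, or the binary could not be started
    if proc.returncode != 0:
        return None
    try:
        payload = json.loads(proc.stdout)
    except json.JSONDecodeError:
        return None
    result = payload.get("result") if isinstance(payload, dict) else None
    if not isinstance(result, str) or not result.strip():
        return None  # missing, non-string, or empty .result
    return result.strip()
```

Making `base_argv` a parameter is a choice made for this sketch so the function can be exercised against a stand-in command without spending quota.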

…ction optimize() runs an N-iteration loop (iter 0 = baseline) over a deterministic 60/40 split and selects the candidate that maximises held-out TEST pass-rate, never the overfit one (train↑/test↓ flagged overfit, structurally excluded from selection). Determinism without random/datetime: split_train_test keys a hashlib.sha256 ordering by seed (clock-free; run_ts supplied by caller). Each candidate is evaluated by re-seeding a throwaway .claude/ copy, patching ONLY the SKILL.md frontmatter description (fail-loud), and running runner.run_eval over train+test to per-iteration isolated jsonl paths (resume=False); the temp seed is removed in finally — production .claude/ and templates_src/ untouched. Proposer returning None records a proposal_failed iteration and the loop continues with baseline eligible; full-tie => baseline wins (no_improvement). Also: guard iterations>=1 (avoid latent IndexError) + tests for the fail-loud frontmatter raise paths; extend the no-anthropic guard to cover this module. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
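
The selection rule described above can be sketched as below. The type and function names are hypothetical, and comparing each candidate against the baseline (rather than against the previous iteration) to decide "overfit" is an assumption. The pass-rates in the usage note are the map-plan figures reported in this PR.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IterationScore:
    iteration: int            # 0 is the baseline
    train_pass_rate: float
    test_pass_rate: float
    proposal_failed: bool = False


def is_overfit(candidate: IterationScore, baseline: IterationScore) -> bool:
    """Train went up while held-out test went down."""
    return (
        candidate.train_pass_rate > baseline.train_pass_rate
        and candidate.test_pass_rate < baseline.test_pass_rate
    )


def select_winner(scores: list[IterationScore]) -> IterationScore:
    if not scores or scores[0].iteration != 0:
        raise ValueError("scores must start with the baseline (iteration 0)")
    baseline = scores[0]
    best = baseline
    for candidate in scores[1:]:
        if candidate.proposal_failed or is_overfit(candidate, baseline):
            continue  # never eligible, whatever its numbers are
        # Strictly greater: on a tie the earlier entry stays, so a full
        # tie leaves the baseline as the winner.
        if candidate.test_pass_rate > best.test_pass_rate:
            best = candidate
    return best
```

With a baseline of 20%/25% (train/test) and a candidate at 20%/50%, `select_winner` returns the candidate; if every candidate ties the baseline on test, the baseline is returned.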

render_html()/render_to_path() render a pure HTML report from the typed OptimizeResult (Producer-Owns-Parse, no dispatch/subprocess): one row per iteration with a difflib unified-diff vs the prior iteration, train/test pass-rates, token totals, the selected iteration marked, and overfit rows (train↑/test↓) highlighted red. Security: candidate_description is untrusted claude -p output, so the renderer uses jinja2 Environment(autoescape=True) plus explicit |e on every dynamic slot — an embedded <script> is escaped. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
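
To show why escaping matters for the per-iteration diff, here is a standard-library sketch. It deliberately swaps jinja2's autoescape for `html.escape` so it runs without third-party packages; it is not the viewer's template code, and the function name and row markup are hypothetical.

```python
import difflib
import html


def render_iteration_row(iteration: int, previous: str, current: str, overfit: bool) -> str:
    """Render one table row: a unified diff vs the prior description."""
    diff_lines = difflib.unified_diff(
        previous.splitlines(),
        current.splitlines(),
        fromfile=f"iter-{iteration - 1}",
        tofile=f"iter-{iteration}",
        lineterm="",
    )
    # Descriptions are untrusted model output: escape before they reach markup.
    diff_html = html.escape("\n".join(diff_lines))
    row_class = "overfit" if overfit else "ok"  # a stylesheet can colour "overfit" red
    return (
        f'<tr class="{row_class}"><td>{iteration}</td>'
        f"<td><pre>{diff_html}</pre></td></tr>"
    )
```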

…er orchestration patch_skill_description() rewrites ONLY the YAML frontmatter `description:` block scalar of a SKILL.md(.jinja) — fail-loud (no partial write) on missing frontmatter/anchor, written as a `|-` block scalar that round-trips exactly. apply_optimized_description() edits the single-source templates_src/skills/<skill>/SKILL.md.jinja then re-renders both providers so generated trees stay byte-identical (INV-5/check-render), staging only the patched source + existing gate trees (scoped `git add --`, never `-A`, never commits). Path-safety (under templates_src, reject `.git/`) runs before any FS touch. Two distinct no-op messages (baseline wins vs winner==current); baseline never written back; skill-rules.json never auto-patched. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
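
A minimal sketch of a fail-loud frontmatter patch follows. It works on an in-memory string and returns a new string, so a raised error leaves nothing half-written. The signature is an assumption (the real patch_skill_description() operates on SKILL.md(.jinja) files), and the line-based parsing is a simplification that assumes top-level keys start in column 0.

```python
def patch_skill_description(text: str, new_description: str) -> str:
    """Replace only the frontmatter `description:` value with a `|-` block scalar."""
    lines = text.split("\n")
    if not lines or lines[0].rstrip() != "---":
        raise ValueError("missing frontmatter opening '---'")
    try:
        end = next(i for i in range(1, len(lines)) if lines[i].rstrip() == "---")
    except StopIteration:
        raise ValueError("missing frontmatter closing '---'") from None
    anchors = [i for i in range(1, end) if lines[i].startswith("description:")]
    if len(anchors) != 1:
        raise ValueError(f"expected exactly one description anchor, found {len(anchors)}")
    if not new_description.strip():
        raise ValueError("refusing to write an empty description")

    start = anchors[0]
    # The old value extends over following indented or blank lines, up to
    # the next top-level key or the closing '---'.
    stop = start + 1
    while stop < end and (lines[stop] == "" or lines[stop].startswith((" ", "\t"))):
        stop += 1

    block = ["description: |-"]
    block += ["  " + line for line in new_description.strip().split("\n")]
    return "\n".join(lines[:start] + block + lines[stop:])
```

Everything outside the `description:` span, including the body after the closing `---`, is carried over unchanged, and patching twice with the same text yields the same output.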

`mapify skill-eval optimize <skill> --eval-set [--iterations --apply --open --dry-run]` drives the ST-003 optimizer (proposer + viewer + apply patcher), persisting OptimizeResult JSON + HTML under .map/eval-runs/<skill>/. Strict exit-code order: <5 entries -> Exit(2) (>=5 minimum) before any dispatcher; --dry-run prints the call budget + 'model: default (resolved by claude CLI)' and Exit(0) with zero quota; claude-absent -> Exit(1). run_ts is generated at the CLI boundary (clock-free core). `view <skill> [--result --open]` renders a stored result. --open is best-effort (never errors the run). Without --apply, nothing outside .map/ is touched. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
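
The strict exit-code order can be read as a small decision function. This is a sketch: the function name and the injectable `which` parameter are hypothetical, and the real CLI raises Exit(...) rather than returning a code.

```python
import shutil
from typing import Callable, Optional

MIN_ENTRIES = 5  # ">=5 minimum" from the commit message


def optimize_exit_code(
    n_entries: int,
    dry_run: bool,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> Optional[int]:
    """Return the early exit code, or None when the run should proceed."""
    if n_entries < MIN_ENTRIES:
        return 2  # eval-set too small; checked before any dispatcher is built
    if dry_run:
        return 0  # only the call budget is printed; no quota is spent
    if which("claude") is None:
        return 1  # claude CLI not found on PATH
    return None
```

Checking the entry count first and the dry-run flag second means both of those paths work on a machine where claude is not installed.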

…ng note New map-plan (9), map-efficient (9), and larger map-debug (10) eval-sets for `skill-eval optimize`, each splittable 60/40 with n_test>=3 so the held-out test pass-rate is meaningful (a <5-entry set degenerates to a 0/1 coin-flip). README documents the purpose + >=8 sizing rationale. The 3-entry map_debug_eval_set.json smoke fixture is left untouched. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
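
The sizing rationale comes down to how coarse a pass-rate is for a given held-out size. The helper below is purely illustrative (its name is hypothetical) and takes the test-set size directly, making no claim about how the optimizer rounds the 60/40 split.

```python
from fractions import Fraction


def achievable_pass_rates(n_test: int) -> list[Fraction]:
    """Every pass-rate a held-out set of n_test entries can produce."""
    if n_test <= 0:
        return []
    return [Fraction(k, n_test) for k in range(n_test + 1)]
```

With a single held-out entry the only possible values are 0 and 1; with three entries there are four distinct values, so a candidate can beat the baseline by less than "all or nothing".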

…docs Edited the .jinja sources (SKILL.md.jinja + skill-rules.json.jinja) to document `skill-eval optimize <skill> --eval-set --iterations --apply --open --dry-run` (anti-overfit held-out selection, propose-only default, two no-op cases) and `skill-eval view <skill> --result --open`, plus new skill-rules keywords/intentPatterns; requires-cmd stays ["claude"]. Ran `make render-templates`; check-render + skills-consistency + template-render green. Also re-syncs the generated .map/scripts/map_orchestrator.py to its .jinja source: the gitignore-tolerance in record_subtask_result was already in templates_src but the committed generated file was stale (.map/ is not in check-render's gate, so the drift went uncaught). render-templates corrects it. Follow-up (noted): add .map/scripts/ to check-render's gate to catch such drift. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

… green The INV-1/AC-10 AST guard now covers proposer, description_optimizer, viewer, and apply_patcher (no `import anthropic`, no ANTHROPIC_API_KEY). Integration gate verified: `make check` green (2247 passed), `make check-render` green, pyright 0/0/0 on every touched Phase F source + test file. uv-run confirmed to resolve to this worktree, so the green reflects the real Phase F code. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
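
An AST-based guard of this kind can be sketched with the standard `ast` module. The function name and message format are hypothetical, and the real test may check more than this; the sketch only covers the two conditions named above.

```python
import ast


def anthropic_violations(source: str) -> list[str]:
    """Return human-readable violations found in one module's source."""
    found = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            for alias in node.names:
                # Compare the top-level package so "anthropic.types" is caught
                # but an unrelated "anthropic_like" package is not.
                if alias.name.split(".")[0] == "anthropic":
                    found.append(f"line {node.lineno}: import {alias.name}")
        elif isinstance(node, ast.ImportFrom):
            if node.level == 0 and (node.module or "").split(".")[0] == "anthropic":
                found.append(f"line {node.lineno}: from {node.module} import ...")
        elif isinstance(node, ast.Constant) and isinstance(node.value, str):
            if "ANTHROPIC_API_KEY" in node.value:
                found.append(f"line {node.lineno}: references ANTHROPIC_API_KEY")
    return found
```

A test can then read each guarded module's source and assert the returned list is empty.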

…tener The eval dispatcher's claude -p subprocess inherits the user's telegram-bridge plugin SessionStart hook, which injects an "always-listen — run `tg listen`" instruction. When the eval agent obeys it, `tg listen` blocks on the Telegram long-poll until the dispatch timeout, so a triggered-skill cell mis-records as a non-trigger (and the run wall-clock explodes — observed cells hitting the full 120s/3600s timeout). Fix: `_eval_subprocess_env(cwd)` sets TG_STATE_DIR to a config-less path inside the throwaway eval cwd. Any `tg listen`/`tg send` the agent runs inherits this env, finds no config.json, and exits immediately (`die`) instead of blocking — neutralising the hang. The operator's real ~/.claude/telegram config is never touched (per-subprocess override on a path removed with the temp cwd). Plugin hooks run in a restricted env so the cosmetic injection may still appear, but it is inert. MAP_INVOKED_BY guard preserved. Verified end-to-end: previously-hanging prompts now finish in ~80-160s through the real ClaudeSubprocessDispatcher with the skill still triggering. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
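
The per-subprocess override might look roughly like the sketch below. The directory name under the eval cwd and the MAP_INVOKED_BY value are assumptions, and the optional `base` parameter is added here only so the behaviour can be checked without touching the real environment.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def eval_subprocess_env(cwd: Path, base: Optional[Mapping[str, str]] = None) -> dict[str, str]:
    """Build the env for one eval subprocess without mutating the caller's env."""
    env = dict(os.environ if base is None else base)
    # Point TG_STATE_DIR at a path inside the throwaway cwd that holds no
    # config.json (hypothetical directory name). Nothing is created on disk.
    env["TG_STATE_DIR"] = str(cwd / ".tg-state-empty")
    env["MAP_INVOKED_BY"] = "skill-eval"  # hypothetical value; the guard stays set
    return env
```

Because the override lives in a copied mapping passed to one subprocess, the operator's own TG_STATE_DIR and telegram config are left as they were.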

…on cap to the spec (1024) Applies the optimizer's winning map-plan description (the candidate selected on held-out TEST pass-rate: 25%->50% vs baseline, no overfit) to the single-source .jinja and re-renders byte-identical trees. The winner is 622 chars, which the old self-imposed 250-char test rejected. That 250 was a transient Claude Code v2.1.86 listing cap — raised to 1536 in v2.1.105 and replaced by a usage-ranked listing budget in v2.1.129+. The Agent Skills spec maximum for `description` is 1024 chars, and the model loads the FULL description for triggering. So 250 was stale; the test now enforces the real 1024-char spec limit. The proposer is constrained to <=1024 chars (prompt instruction + hard rejection of an over-limit candidate) so the optimizer only ever proves a shippable description. Refs: anthropics/claude-code #40121, #47627; code.claude.com/docs/en/skills. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
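
The "hard rejection of an over-limit candidate" half of the cap can be sketched in a few lines. The function name is hypothetical, and measuring length in Python characters after stripping is an assumption about how the limit is counted.

```python
from typing import Optional

MAX_DESCRIPTION_CHARS = 1024  # Agent Skills spec maximum cited in this PR


def accept_candidate(candidate: Optional[str], limit: int = MAX_DESCRIPTION_CHARS) -> Optional[str]:
    """Return the stripped candidate, or None if it is empty or over the limit."""
    if candidate is None:
        return None
    text = candidate.strip()
    if not text or len(text) > limit:
        # Reject rather than truncate: a truncated description would be a
        # different text from the one that was evaluated.
        return None
    return text
```

A 622-char winner such as the one applied here passes, while anything longer than 1024 chars is treated like a failed proposal.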

azalio and others added 11 commits June 4, 2026 14:27

azalio merged commit 1b1d064 into main Jun 4, 2026
6 checks passed

azalio deleted the smoke-cobalt branch June 4, 2026 19:30
