Phase F: skill-eval description optimizer + HTML viewer (+ telegram isolation, applied map-plan description) #159

Merged
azalio merged 11 commits into main from smoke-cobalt
Jun 4, 2026

@azalio azalio commented Jun 4, 2026

Phase F — Skill-Eval Description Optimizer + HTML Viewer (+ fixes)

Builds the optimization/reporting layer on top of the shipped F.1 eval engine, plus two fixes surfaced by a real end-to-end run.

What's in it

Phase F (9 subtasks, each Monitor-approved):

  • eval_schema.py: OptimizeIterationRecord / OptimizeResult / ProposerFn (round-trip to_dict/from_dict).
  • proposer.py: claude -p description proposer (argv-list, MAP_INVOKED_BY, None on all failures).
  • description_optimizer.py: anti-overfit optimizer — deterministic hashlib 60/40 split, selection by held-out TEST pass-rate (overfit candidates flagged and never selected), per-iteration resume=False jsonl isolation, candidate .claude/ re-seed cleaned in finally. Clock-free (caller supplies run_ts).
  • viewer.py: jinja2 HTML report — autoescape=True (XSS-safe), difflib per-iteration diff, overfit rows red.
  • apply_patcher.py: fail-loud block-scalar frontmatter --apply patcher → re-render to byte-identical trees, scoped git add, path-safe.
  • optimize / view CLI subcommands (exit 2/0/1, dry-run budget, --open best-effort, JSON+HTML artifacts).
  • 3 optimizer eval-set fixtures (≥8 entries) + authoring note; no-anthropic guard over all 4 new modules.
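As a rough illustration of the deterministic hashlib 60/40 split mentioned in the list above, the sketch below orders entry ids by a seeded SHA-256 digest and cuts the ordering at 60%. The function name matches the `split_train_test` named in the commit log, but the signature, the `seed:id` key format, and the integer rounding rule are assumptions, not the shipped implementation.

```python
import hashlib


def split_train_test(
    ids: list[str], seed: str, train_pct: int = 60
) -> tuple[list[str], list[str]]:
    """Order ids by sha256(seed:id) and cut the ordering at train_pct percent."""

    def sort_key(entry_id: str) -> str:
        # Key format "seed:id" is an assumption; any stable encoding would do.
        return hashlib.sha256(f"{seed}:{entry_id}".encode("utf-8")).hexdigest()

    ordered = sorted(ids, key=sort_key)
    # Integer arithmetic avoids float rounding; the rounding rule is an assumption.
    n_train = (len(ordered) * train_pct) // 100
    return ordered[:n_train], ordered[n_train:]


entries = [f"case-{i}" for i in range(9)]
train, test = split_train_test(entries, seed="map-plan")
print(len(train), len(test))  # prints "5 4"
```

Because the ordering depends only on the seed and the ids, repeated runs produce the same split without touching `random` or the clock.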

Fix 1 — eval/telegram isolation (f46e78d): the eval claude -p subprocess inherited the user's telegram-bridge SessionStart hook (tg listen), which could block on the Telegram long-poll until the dispatch timeout (a triggered-skill cell then mis-records as a non-trigger). The dispatcher now sets TG_STATE_DIR to a config-less path so any tg/tg listen the agent runs exits immediately instead of blocking. (Best-effort: the SessionStart injection still appears via the plugin hook env, so a perfectly-obedient agent can still spend a few turns on it; the hard block is removed.)

Fix 2 — apply optimized map-plan description + correct the description cap (b1c515f): a real optimize run on map-plan selected a candidate that doubled the held-out TEST pass-rate (25%→50%, no overfit); applied it. The winner is 622 chars, which the old self-imposed 250-char test rejected — but 250 was a transient Claude Code v2.1.86 cap (raised to 1536 in v2.1.105, replaced by a usage-ranked listing budget in v2.1.129+). The Agent Skills spec maximum is 1024 chars and the model loads the full description for triggering, so the test now enforces 1024 and the proposer caps candidates at 1024 (prompt instruction + hard rejection of over-limit proposals). Refs: anthropics/claude-code #40121, #47627; code.claude.com/docs/en/skills.

Real run evidence (2 skills, not mocked)

  • map-plan: winner = iter 1 — baseline 20%/25% (train/test) → candidate 20%/50%, selected on held-out test, not overfit.
  • map-debug: no improvement (full tie → baseline retained), 25%.

Verification

  • make check (lint + pyright 0/0/0 + full pytest) and make check-render green on a clean run (2247→2256 tests as the suite grew across subtasks).
  • Final Verifier (Ralph loop): PASS.

Note for reviewers: on the author's machine, two timing-sensitive hook tests flake under heavy local load plus an unusually large active-session transcript: test_hook_inventory_smoke::test_every_configured_hook_execs_via_shebang (resolve_session_id scanning a multi-MB transcript) and test_safety_guardrails::test_execution_under_100ms (116ms vs 100ms). Both are unrelated to this diff (it touches no hooks/memory) and pass in clean CI.

🤖 Generated with Claude Code

azalio and others added 11 commits June 4, 2026 14:27
… in eval_schema

Contract-first typed structs for the description optimizer (AC-7): round-trip
to_dict/from_dict mirroring EvalResultRecord's _MISSING-tolerant pattern, flat
per-iteration token totals, overfit/selected flags, and per-iteration jsonl
paths so the viewer renders without re-parsing. ProposerFn aliases the
Callable[[str, list[EvalResultRecord]], str | None] proposer interface.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Default ProposerFn: propose_description() builds an argv list for
`claude -p --output-format json` (never a shell string), sets MAP_INVOKED_BY
on the subprocess env (INV-2/HC-7), reads only .result, and returns the
stripped candidate text or None on non-zero exit / timeout / OSError /
malformed JSON / empty .result. Untrusted eval-record text is passed as a
discrete argv element — no shell interpolation. No anthropic import, no
ANTHROPIC_API_KEY, no --model flag (INV-1/AC-10/D2). Adds the no-anthropic
source-scan guard (test_inv1_no_anthropic_optimize.py, extended in ST-009).
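The failure handling described above could look roughly like the following sketch. It builds an argv list for `claude -p ... --output-format json`, reads only `.result`, and returns None on every failure path. The `command` and `timeout_s` parameters and the `MAP_INVOKED_BY` value are hypothetical additions for illustration; the real `propose_description()` signature is not shown in this PR.

```python
from __future__ import annotations

import json
import os
import subprocess
from typing import Sequence


def propose_description(
    prompt: str,
    command: Sequence[str] = ("claude",),  # hypothetical parameter, eases testing
    timeout_s: float = 120.0,  # hypothetical default
) -> str | None:
    """Return the stripped candidate text, or None on any failure."""
    # The prompt is one discrete argv element: no shell, no interpolation.
    argv = [*command, "-p", prompt, "--output-format", "json"]
    # The env value is a placeholder; the source only says the variable is set.
    env = {**os.environ, "MAP_INVOKED_BY": "skill-eval-optimize"}
    try:
        proc = subprocess.run(
            argv, env=env, capture_output=True, text=True,
            timeout=timeout_s, check=False,
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    if proc.returncode != 0:
        return None
    try:
        payload = json.loads(proc.stdout)
    except json.JSONDecodeError:
        return None
    result = payload.get("result") if isinstance(payload, dict) else None
    if not isinstance(result, str) or not result.strip():
        return None
    return result.strip()
```

Collapsing every failure into None keeps the optimizer loop simple: it only has to distinguish "got a candidate" from "proposal failed".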

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ction

optimize() runs an N-iteration loop (iter 0 = baseline) over a deterministic
60/40 split and selects the candidate that maximises held-out TEST pass-rate,
never the overfit one (train↑/test↓ flagged overfit, structurally excluded
from selection). Determinism without random/datetime: split_train_test keys a
hashlib.sha256 ordering by seed (clock-free; run_ts supplied by caller). Each
candidate is evaluated by re-seeding a throwaway .claude/ copy, patching ONLY
the SKILL.md frontmatter description (fail-loud), and running runner.run_eval
over train+test to per-iteration isolated jsonl paths (resume=False); the temp
seed is removed in finally — production .claude/ and templates_src/ untouched.
Proposer returning None records a proposal_failed iteration and the loop
continues with baseline eligible; full-tie => baseline wins (no_improvement).
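A minimal sketch of the selection rule described above, assuming overfit is judged against the baseline (the source only says train↑/test↓, so the reference point is an assumption). The `Iteration` dataclass and `select` function are hypothetical stand-ins for the real OptimizeIterationRecord handling.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Iteration:
    index: int  # 0 is the baseline
    train_rate: float
    test_rate: float


def is_overfit(candidate: Iteration, baseline: Iteration) -> bool:
    # Assumption: train up and test down are both measured against the baseline.
    return (
        candidate.train_rate > baseline.train_rate
        and candidate.test_rate < baseline.test_rate
    )


def select(iterations: list[Iteration]) -> Iteration:
    """Pick the best held-out TEST pass-rate; ties keep the earlier iteration."""
    if not iterations:
        raise ValueError("need at least the baseline iteration")
    baseline = iterations[0]
    best = baseline
    for candidate in iterations[1:]:
        if is_overfit(candidate, baseline):
            continue  # excluded explicitly, never a selection candidate
        if candidate.test_rate > best.test_rate:  # strict: a full tie keeps baseline
            best = candidate
    return best
```

With the map-plan numbers from the real run (baseline 20%/25%, candidate 20%/50%), this picks iteration 1; with the map-debug full tie at 25% it keeps the baseline.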

Also: guard iterations>=1 (avoid latent IndexError) + tests for the fail-loud
frontmatter raise paths; extend the no-anthropic guard to cover this module.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
render_html()/render_to_path() render a pure HTML report from the typed
OptimizeResult (Producer-Owns-Parse, no dispatch/subprocess): one row per
iteration with a difflib unified-diff vs the prior iteration, train/test
pass-rates, token totals, the selected iteration marked, and overfit rows
(train↑/test↓) highlighted red. Security: candidate_description is untrusted
claude -p output, so the renderer uses jinja2 Environment(autoescape=True)
plus explicit |e on every dynamic slot — an embedded <script> is escaped.
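To make the escaping concern concrete, here is a small stdlib-only sketch of one report row. It is not the viewer's code: the source uses jinja2 with `Environment(autoescape=True)`, while this stand-in uses `html.escape` so the example has no third-party dependency. The `render_diff_row` name and the CSS class names are hypothetical.

```python
import difflib
import html


def render_diff_row(previous: str, candidate: str, overfit: bool) -> str:
    """Render one table row: an escaped unified diff, flagged when overfit."""
    diff_lines = difflib.unified_diff(
        previous.splitlines(),
        candidate.splitlines(),
        fromfile="previous",
        tofile="candidate",
        lineterm="",
    )
    # The candidate text is untrusted model output, so escape before embedding.
    escaped = html.escape("\n".join(diff_lines))
    css_class = "overfit" if overfit else "ok"  # hypothetical class names
    return f'<tr class="{css_class}"><td><pre>{escaped}</pre></td></tr>'


row = render_diff_row("Plan a task.", "<script>alert(1)</script>", overfit=True)
print("<script>" in row)  # prints "False"
```

The diff is computed on raw text and escaped once at the end, so markup inside a candidate description reaches the browser as text rather than as a live tag.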

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…er orchestration

patch_skill_description() rewrites ONLY the YAML frontmatter `description:` block
scalar of a SKILL.md(.jinja) — fail-loud (no partial write) on missing
frontmatter/anchor, written as a `|-` block scalar that round-trips exactly.
apply_optimized_description() edits the single-source
templates_src/skills/<skill>/SKILL.md.jinja then re-renders both providers so
generated trees stay byte-identical (INV-5/check-render), staging only the
patched source + existing gate trees (scoped `git add --`, never `-A`, never
commits). Path-safety (under templates_src, reject `.git/`) runs before any FS
touch. Two distinct no-op messages (baseline wins vs winner==current); baseline
never written back; skill-rules.json never auto-patched.
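The fail-loud frontmatter rewrite might be sketched as below. This is a simplified assumption of how such a patcher could work on plain text: it handles a single top-level `description:` key, writes a two-space-indented `|-` block scalar, and raises before producing any output when the frontmatter or anchor is missing. The `patch_description` name and the error messages are hypothetical; the real `patch_skill_description()` also deals with files and `.jinja` sources.

```python
import re


def patch_description(text: str, new_description: str) -> str:
    """Replace only the frontmatter description with a '|-' block scalar."""
    new_description = new_description.strip()
    if not new_description:
        raise ValueError("new description is empty")
    lines = text.split("\n")
    if lines[0].strip() != "---":
        raise ValueError("missing frontmatter opening '---'")
    closing = [i for i in range(1, len(lines)) if lines[i].strip() == "---"]
    if not closing:
        raise ValueError("missing frontmatter closing '---'")
    end = closing[0]
    anchors = [
        i for i in range(1, end) if re.match(r"description:(\s|$)", lines[i])
    ]
    if len(anchors) != 1:
        raise ValueError("expected exactly one top-level 'description:' key")
    start = anchors[0]
    stop = start + 1
    # Skip the old value's continuation lines (indented or blank).
    while stop < end and (lines[stop] == "" or lines[stop][0] in " \t"):
        stop += 1
    block = ["description: |-"]
    block += [f"  {line}" if line else "" for line in new_description.split("\n")]
    return "\n".join(lines[:start] + block + lines[stop:])
```

All validation happens before the new text is assembled, so a caller that writes the return value to disk never produces a partial file.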

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
`mapify skill-eval optimize <skill> --eval-set [--iterations --apply --open
--dry-run]` drives the ST-003 optimizer (proposer + viewer + apply patcher),
persisting OptimizeResult JSON + HTML under .map/eval-runs/<skill>/. Strict
exit-code order: <5 entries -> Exit(2) (>=5 minimum) before any dispatcher;
--dry-run prints the call budget + 'model: default (resolved by claude CLI)'
and Exit(0) with zero quota; claude-absent -> Exit(1). run_ts is generated at
the CLI boundary (clock-free core). `view <skill> [--result --open]` renders a
stored result. --open is best-effort (never errors the run). Without --apply,
nothing outside .map/ is touched.
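The strict exit-code order above can be read as a small decision function. The sketch below is an illustration only: `resolve_exit_code` and its parameters are hypothetical, and the real CLI raises its exits inline rather than through a helper like this.

```python
from __future__ import annotations

import shutil

MIN_ENTRIES = 5  # minimum eval-set size stated in the commit message


def resolve_exit_code(
    n_entries: int, dry_run: bool, claude_path: str | None
) -> int | None:
    """Return the early exit code, or None when the real run should proceed."""
    if n_entries < MIN_ENTRIES:
        return 2  # checked first, before any dispatcher is built
    if dry_run:
        return 0  # budget printed, zero quota spent
    if claude_path is None:
        return 1  # claude CLI not found on PATH
    return None


# Typical call site: shutil.which returns None when the binary is absent.
code = resolve_exit_code(9, dry_run=True, claude_path=shutil.which("claude"))
print(code)  # prints "0"
```

Checking the entry count first means an undersized eval-set fails the same way whether or not `claude` is installed.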

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ng note

New map-plan (9), map-efficient (9), and larger map-debug (10) eval-sets for
`skill-eval optimize`, each splittable 60/40 with n_test>=3 so the held-out
test pass-rate is meaningful (a <5-entry set degenerates to a 0/1 coin-flip).
README documents the purpose + >=8 sizing rationale. The 3-entry
map_debug_eval_set.json smoke fixture is left untouched.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…docs

Edited the .jinja sources (SKILL.md.jinja + skill-rules.json.jinja) to document
`skill-eval optimize <skill> --eval-set --iterations --apply --open --dry-run`
(anti-overfit held-out selection, propose-only default, two no-op cases) and
`skill-eval view <skill> --result --open`, plus new skill-rules keywords/
intentPatterns; requires-cmd stays ["claude"]. Ran `make render-templates`;
check-render + skills-consistency + template-render green.

Also re-syncs the generated .map/scripts/map_orchestrator.py to its .jinja
source: the gitignore-tolerance in record_subtask_result was already in
templates_src but the committed generated file was stale (.map/ is not in
check-render's gate, so the drift went uncaught). render-templates corrects it.
Follow-up (noted): add .map/scripts/ to check-render's gate to catch such drift.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… green

The INV-1/AC-10 AST guard now covers proposer, description_optimizer, viewer,
and apply_patcher (no `import anthropic`, no ANTHROPIC_API_KEY). Integration
gate verified: `make check` green (2247 passed), `make check-render` green,
pyright 0/0/0 on every touched Phase F source + test file. uv-run confirmed to
resolve to this worktree, so the green reflects the real Phase F code.
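An AST-based source scan of the kind described above could be sketched as follows. The `find_anthropic_violations` helper is hypothetical; the actual guard lives in test_inv1_no_anthropic_optimize.py and may check more or less than this.

```python
import ast


def find_anthropic_violations(source: str) -> list[str]:
    """Report anthropic imports and ANTHROPIC_API_KEY string literals."""
    violations: list[str] = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            for alias in node.names:
                if alias.name.split(".")[0] == "anthropic":
                    violations.append(f"line {node.lineno}: import {alias.name}")
        elif isinstance(node, ast.ImportFrom):
            # level == 0 restricts the check to absolute imports.
            if node.level == 0 and (node.module or "").split(".")[0] == "anthropic":
                violations.append(f"line {node.lineno}: from {node.module} import")
        elif isinstance(node, ast.Constant) and node.value == "ANTHROPIC_API_KEY":
            violations.append(f"line {node.lineno}: ANTHROPIC_API_KEY literal")
    return violations


dirty = "import os\nimport anthropic\nkey = os.environ['ANTHROPIC_API_KEY']\n"
print(len(find_anthropic_violations(dirty)))  # prints "2"
```

Walking the parsed tree rather than grepping the text avoids false positives from comments while still catching the import and the literal key name.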

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…tener

The eval dispatcher's claude -p subprocess inherits the user's telegram-bridge
plugin SessionStart hook, which injects an "always-listen — run `tg listen`"
instruction. When the eval agent obeys it, `tg listen` blocks on the Telegram
long-poll until the dispatch timeout, so a triggered-skill cell mis-records as a
non-trigger (and the run wall-clock explodes — observed cells hitting the full
120s/3600s timeout).

Fix: `_eval_subprocess_env(cwd)` sets TG_STATE_DIR to a config-less path inside
the throwaway eval cwd. Any `tg listen`/`tg send` the agent runs inherits this
env, finds no config.json, and exits immediately (`die`) instead of blocking —
neutralising the hang. The operator's real ~/.claude/telegram config is never
touched (per-subprocess override on a path removed with the temp cwd). Plugin
hooks run in a restricted env so the cosmetic injection may still appear, but it
is inert. MAP_INVOKED_BY guard preserved.
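The per-subprocess override might look like the sketch below. The function name echoes `_eval_subprocess_env(cwd)` from the commit message, but the directory name and the `MAP_INVOKED_BY` value are hypothetical, and the real helper may set more variables.

```python
import os
from pathlib import Path


def eval_subprocess_env(cwd: Path) -> dict[str, str]:
    """Copy the parent env and point TG_STATE_DIR at a config-less path."""
    env = dict(os.environ)  # a copy: the operator's own environment is untouched
    # Hypothetical directory name. It is never created, so no config.json exists
    # there, and it disappears along with the throwaway eval cwd.
    env["TG_STATE_DIR"] = str(cwd / ".tg-state-empty")
    env["MAP_INVOKED_BY"] = "skill-eval"  # placeholder value, guard preserved
    return env
```

The result is passed as `env=` to the subprocess call, so only the eval child (and anything it spawns) sees the redirected state directory.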

Verified end-to-end: previously-hanging prompts now finish in ~80-160s through
the real ClaudeSubprocessDispatcher with the skill still triggering.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…on cap to the spec (1024)

Applies the optimizer's winning map-plan description (the candidate selected on
held-out TEST pass-rate: 25%->50% vs baseline, no overfit) to the single-source
.jinja and re-renders byte-identical trees.

The winner is 622 chars, which the old self-imposed 250-char test rejected. That
250 was a transient Claude Code v2.1.86 listing cap — raised to 1536 in v2.1.105
and replaced by a usage-ranked listing budget in v2.1.129+. The Agent Skills
spec maximum for `description` is 1024 chars, and the model loads the FULL
description for triggering. So 250 was stale; the test now enforces the real
1024-char spec limit.

The proposer is constrained to <=1024 chars (prompt instruction + hard rejection
of an over-limit candidate) so the optimizer only ever proves a shippable
description. Refs: anthropics/claude-code #40121, #47627; code.claude.com/docs/en/skills.
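The hard rejection of over-limit candidates can be pictured as a small filter applied to the proposer's output. This is a sketch under assumptions: `accept_candidate` is a hypothetical name, and returning None for a rejected candidate is a guess at how the rejection is surfaced.

```python
from __future__ import annotations

MAX_DESCRIPTION_CHARS = 1024  # Agent Skills spec maximum cited in this PR


def accept_candidate(candidate: str | None) -> str | None:
    """Return the stripped candidate, or None if missing, empty, or too long."""
    if candidate is None:
        return None
    text = candidate.strip()
    if not text or len(text) > MAX_DESCRIPTION_CHARS:
        return None
    return text


print(accept_candidate("x" * 622) is not None)  # prints "True"
print(accept_candidate("x" * 1025) is None)  # prints "True"
```

Enforcing the limit in code, in addition to the prompt instruction, means an over-limit reply can never be evaluated or selected.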

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@azalio azalio merged commit 1b1d064 into main Jun 4, 2026
6 checks passed
@azalio azalio deleted the smoke-cobalt branch June 4, 2026 19:30