azalio · azalio · Jun 3, 2026 · Jun 3, 2026 · Jun 3, 2026 · Jun 3, 2026
diff --git a/.claude/hooks/safety-guardrails.py b/.claude/hooks/safety-guardrails.py
@@ -38,7 +38,12 @@
 
 # Dangerous bash command patterns
 _DEFAULT_DANGEROUS_COMMANDS = [
-    r"rm\s+-rf\s+/",  # rm -rf /
+    # Block `rm -rf /` (bare root), `rm -rf /etc`, `rm -rf /home/user`, etc.,
+    # but ALLOW deletion of subpaths UNDER a temp root (rm -rf /tmp/<dir>,
+    # /private/tmp/<dir>, /var/folders/<dir>, /var/tmp/<dir>) — legitimate
+    # scratch cleanup. The negative lookahead requires a trailing slash, so the
+    # temp root itself (`rm -rf /tmp`) stays blocked; only children are allowed.
+    r"rm\s+-rf\s+/(?!(?:tmp|private/tmp|var/folders|var/tmp)/)",  # rm -rf / (non-temp)
     r"rm\s+-rf\s+\*",  # rm -rf *
     r"rm\s+-rf\s+\.\.",  # rm -rf ..
     r"git\s+push.*--force.*main",

diff --git a/.claude/rules/learned/architecture-patterns.md b/.claude/rules/learned/architecture-patterns.md
@@ -161,3 +161,19 @@
   # CORRECT: templates_src is fence-free; copier injects exactly once:
   wrapped = f"# map:start\n{rendered}\n# map:end\n" if fenced else rendered
   ```
+
+- **Spike-First Gating: High-Risk Binding Decisions Require a Docs-Only Artifact Before Implementation** (2026-06-04): When a subtask's answer would bind downstream implementation (which channel carries a value, which API call is idempotent, what schema a subprocess emits), run it FIRST as a docs-only spike that writes an artifact naming the empirical answer + the binding strategy, and commits ZERO production code. Downstream subtasks reference the artifact by name and consume it, not assumptions. A wrong assumption that is not spiked propagates into every component built on it and forces a rewrite cascade. In this workflow a research-agent wrongly claimed skill-activation wasn't recoverable from `claude -p`; the ST-001 spike empirically corrected it before any dispatcher code existed. The spike artifact MUST contain a named "binding strategy" section, not just findings (Monitor hard-stopped once for a missing strategy section). [workflow: map-efficient]
+
+- **Producer-Owns-Parse: The Component That Owns the Subprocess Owns All Derived Fields; Consumers Read the Typed Result** (2026-06-04): When component A launches a subprocess (or owns a raw source) and component B consumes the result, ALL parsing/derivation (transcript reads, field extraction, signal combination) lives in A; B reads only the typed result struct and never re-implements parsing. Two payoffs: (1) a single parse site that a Mock producer can supply directly, so consumer tests need no subprocess/transcript fixture; (2) when the raw output schema changes, only A changes. Putting any parse in B re-couples the modules through the raw format. Extends "Contract-First Inter-Component JSON Schemas": the contract is A's typed struct, and the parse-to-struct boundary is A's responsibility exclusively. [workflow: map-efficient]
+  ```python
+  # WRONG — runner re-parses a transcript it does not own (couples to raw format)
+  result = dispatcher.dispatch(cell)            # raw proc output
+  skill = extract_skill_from_transcript(read_jsonl(result.session_id))
+
+  # CORRECT — dispatcher parses once into a typed field; runner just reads it
+  @dataclass
+  class DispatchResult:
+      triggered_skill: str | None   # parsed by dispatcher, NOT by runner
+      token_usage: TokenUsage | None
+  # tests inject MockDispatcher(triggered_skill="map-plan") — no subprocess needed
+  ```
diff --git a/.claude/rules/learned/implementation-patterns.md b/.claude/rules/learned/implementation-patterns.md
@@ -128,3 +128,42 @@ paths:
       dest.chmod(dest.stat().st_mode | 0o755)
   # test guard: assert os.access(installed_hook, os.X_OK)
   ```
+
+- **`claude -p` Output Has Two Channels: Envelope for Tokens, Transcript JSONL for Skill Name** (2026-06-04): When shelling `claude -p --output-format json` as a subprocess, two distinct output channels carry different information — do not confuse them. The JSON result envelope (stdout) carries `.result` (response text), `.usage` (input/output/cache tokens), and `.session_id`. The name of the skill/slash-surface that actually fired is NOT in the envelope — it is only in Claude Code's native transcript JSONL (located by session_id) as a `tool_use` block with `name=="Skill"` and `input.skill`. Deriving this from the framework's own scratch/digest schema rather than the native transcript yields a wrong claim. Verify empirically by reading the real transcript after a spike call; never infer from internal schema files. [workflow: map-efficient]
+  ```python
+  env = json.loads(proc.stdout)        # .result, .usage, .session_id
+  tokens = env["usage"]                # CORRECT — tokens are in the envelope
+  # env.get("skill")  -> None          # WRONG — fired-skill is NOT in the envelope
+  for line in transcript_jsonl(env["session_id"]).read_text().splitlines():
+      m = json.loads(line)
+      if m.get("type") == "tool_use" and m.get("name") == "Skill":
+          triggered = m["input"]["skill"]; break
+  ```
+
+- **Scoped Config-Flag Mutation: Seed a Throwaway Temp Copy; Never Modify the Production Source of Truth** (2026-06-04): When a tool/test needs a shipped config flag to behave differently from its production default (e.g. stripping `disable-model-invocation: true` so an eval can auto-select skills), mutate the flag ONLY in a throwaway temp dir seeded with a copy of the production config, discarded after the subprocess exits. Never patch the source repo or `templates_src`. A blanket production flip is a footgun: it silently changes behavior for every other user of the flag and may be committed accidentally. Scope of mutation must match scope of need: one subprocess call → one throwaway dir, always cleaned up in `finally`. [workflow: map-efficient]
+  ```python
+  tmp = Path(tempfile.mkdtemp())
+  shutil.copytree(REPO / ".claude", tmp / ".claude")     # seed from production
+  strip_flag(tmp / ".claude" / "skills")                 # mutate throwaway ONLY
+  try:
+      subprocess.run(["claude", "-p", prompt, "--output-format", "json"], cwd=tmp)
+  finally:
+      shutil.rmtree(tmp)                                  # production never touched
+  ```
+
+- **Clock-Free Core with Caller-Supplied Path: Inject Timestamps at the CLI Boundary, Not Inside the Worker** (2026-06-04): When a worker writes durable output (a timestamped JSONL, a run artifact), do NOT call `datetime.now()` inside the worker. Have the CLI/outermost caller generate the timestamped path and pass it as an explicit `out_path: Path` the worker treats as opaque. Benefits: (1) tests pass `tmp_path / "results.jsonl"` with zero clock monkeypatching; (2) the worker is deterministic given the same inputs+path; (3) resume keys on the path the CLI owns. Refines "Long-Running Operations Need Durable State by Default" by fixing WHERE path/timestamp generation lives — at the boundary, not the core. [workflow: map-efficient]
+  ```python
+  # CORRECT: worker takes out_path; CLI owns the timestamp
+  def run_eval(*, entries, dispatcher, runs, out_path: Path, resume=False) -> list: ...
+  # CLI: out = default_run_path(root, skill, datetime.now(tz).strftime("%Y%m%dT%H%M%SZ"))
+  # Test: run_eval(..., out_path=tmp_path / "r.jsonl")   # no time mocking
+  ```
+
+- **Concurrent Durable Append: threading.Lock for Line Integrity + Stable cell_id Resume Key** (2026-06-04): When parallel workers append JSONL lines to a shared durable file, two invariants must BOTH hold: (1) no interleaved partial lines — guard each `f.write(line + "\n")` with a threading.Lock; (2) resume is idempotent regardless of write order — key on a stable id present in every record (cell_id), never on line number/position. Nondeterministic write order is fine as long as resume dedups by id. Each worker subprocess also runs in its own temp cwd so concurrent subprocesses never share a working dir. Complements "Long-Running Operations Need Durable State" (process-restart durability) with within-process concurrency safety. [workflow: map-efficient]
+  ```python
+  with self._lock:                       # atomic per-line append
+      with out_path.open("a", encoding="utf-8") as f:
+          f.write(json.dumps(record) + "\n")
+  done = {json.loads(l)["cell_id"] for l in out_path.read_text().splitlines() if l.strip()}
+  pending = [c for c in cells if make_cell_id(...) not in done]   # order-independent resume
+  ```
diff --git a/.claude/rules/learned/testing-strategies.md b/.claude/rules/learned/testing-strategies.md
@@ -139,3 +139,12 @@ paths:
   # 3. git restore <file>        -> confirm GREEN
   # 4. commit file + test together
   ```
+
+- **Blueprint-Named Test Functions Are a Monitor Contract: Author Them in the Same Subtask as the Code** (2026-06-04): When a subtask blueprint's `test_strategy` names specific pytest function names (e.g. `test_vc3_resume_skips_present_cell_ids`), Monitor treats those names as a HARD completeness contract: a subtask whose logic is correct but whose blueprint-named functions do not yet exist gets `valid=false` (hard stop). The completeness unit is code + named-test-functions-together, not code alone — the blueprint author chose the names to specify observable behavior, so an absent name means the behavior is unverified. Never stub a named test with `pass`/`# TODO` and call the subtask done; the stub satisfies the import but not the contract. In this workflow ST-005's runner code was correct but Monitor hard-stopped until the four named VC tests were authored with real assertions. [workflow: map-efficient]
+
+- **Final Verification Must Check Shipped Docs Against Actual Behavior, Then Grep for the Same Drift Class** (2026-06-04): After code+tests are green, a dedicated final-verification pass must validate that user-facing docs (SKILL.md, README, CLI `--help`) match actual behavior: default values, accepted schema formats, flag names, output field names. Prose drift is invisible to pytest/ruff/mypy. When the first drift instance is found, immediately grep the WHOLE doc for the same class of claim (every `--flag default`, every schema example, every accepted file-format mention) before moving on — drift clusters because the doc was written once from a design doc, not from running code. Here the final-verifier caught a `--max-concurrency` default of 4 (actual 1); grepping the same file then surfaced a fictional YAML eval-set schema block + `.yaml` examples that the JSON-only loader could never parse. [workflow: map-efficient]
+  ```bash
+  # one drift found -> grep the whole doc for the drift class before marking done
+  mapify skill-eval --help | grep -i max-concurrency        # actual default
+  grep -nE 'default|yaml|schema|--[a-z-]+' docs/SKILL.md     # reconcile every claim
+  ```
diff --git a/.claude/skills/map-efficient/SKILL.md b/.claude/skills/map-efficient/SKILL.md
@@ -191,8 +191,10 @@ python3 .map/scripts/map_step_runner.py record_test_baseline "$BRANCH"
 
 Snapshots pre-existing failures so later subtasks distinguish
 "introduced regression" from "was broken pre-plan". Auto-detects
-Make/pytest/go test/cargo. Overrides + narrow-target guidance:
-[efficient-reference.md](efficient-reference.md#pre-flight-test-baseline).
+Make/pytest/go test/cargo. It captures the test run internally and prints a
+single compact JSON report at the end — read that JSON directly; do NOT pipe it
+through `head`/`tail` (per the repo bash guidelines). Overrides + narrow-target
+guidance: [efficient-reference.md](efficient-reference.md#pre-flight-test-baseline).
 
 ### Wave Computation (after INIT_STATE) - REQUIRED
 

diff --git a/.claude/skills/map-efficient/efficient-reference.md b/.claude/skills/map-efficient/efficient-reference.md
@@ -203,6 +203,11 @@ fix or defer.
 python3 .map/scripts/map_step_runner.py record_test_baseline "$BRANCH"
 ```
 
+It captures the test run internally and prints a single compact JSON report at
+the end — read that JSON directly. Do NOT pipe it through `head`/`tail` (per the
+repo bash guidelines); the output is one small object, not a stream, so
+truncating it only hides fields.
+
 Auto-detects from project markers:
 - `Makefile` with `test:` target → `make test`
 - `pyproject.toml` / `pytest.ini` → `pytest`

diff --git a/.claude/skills/map-skill-eval/SKILL.md b/.claude/skills/map-skill-eval/SKILL.md
@@ -0,0 +1,94 @@
+---
+name: map-skill-eval
+description: |
+  Evaluate a /map-* skill's trigger accuracy and cost. Use when asked to measure skill trigger accuracy, run an eval-set, or check token/duration cost via `mapify skill-eval`. Do NOT use to plan or implement; use map-plan or map-efficient.
+effort: medium
+disable-model-invocation: true
+argument-hint: "[skill] [--eval-set PATH]"
+---
+# /map-skill-eval — Skill Trigger Accuracy & Cost Evaluation
+
+Purpose: measure whether a `/map-*` skill fires on the right prompts and what it costs in tokens and time. Do not plan or implement from this skill.
+
+Requires the `claude` CLI (installed and on `$PATH`). The skill is skipped at install time on hosts without `claude`.
+
+## Invocation
+
+```bash
+mapify skill-eval run <skill> --eval-set PATH [--dry-run] [--resume] [--max-concurrency N]
+```
+
+- `<skill>` — the skill name to evaluate (e.g. `map-plan`).
+- `--eval-set PATH` — path to a JSON eval-set file defining prompt cases and expected assertions.
+- `--dry-run` — validate the eval-set and print the planned run count without spending any quota.
+- `--resume` — continue an interrupted run from the last durable checkpoint.
+- `--max-concurrency N` — max parallel `claude -p` workers (default: 1).
+
+## What It Does
+
+1. **Prompts × runs matrix** — for each case in the eval-set, invokes `claude -p` in an isolated temporary working directory seeded with `.claude/` (skills, settings). Runs are independent; no shared state leaks between cases.
+2. **Transcript-parse trigger detection** — parses each `claude -p` transcript to determine whether the target skill fired (trigger) or did not fire (not_trigger).
+3. **Deterministic assertions** — each eval case may specify one or more assertion types:
+   - `contains` / `not_contains` — substring presence in the response.
+   - `regex` — pattern match against the response.
+   - `valid_json` — response parses as JSON.
+   - `trigger` / `not_trigger` — skill fired / did not fire.
+4. **Durable resumable run log** — results are appended to `.map/eval-runs/<skill>/<timestamp>.jsonl` as each case completes, so a partial run is recoverable via `--resume`.
+5. **Summary report** — after all cases complete, prints pass-rate (passed/total) plus per-case token usage, duration, and cache-hit stats.
+
+## Eval-Set Format
+
+A JSON object with an `entries` array. Each entry has a `prompt`, optional
+`should_trigger` / `should_not_trigger` skill names (the runner turns these into
+`trigger` / `not_trigger` assertions), and an optional `assertions` array.
+Assertion types: `contains`, `not_contains`, `regex`, `valid_json`, `trigger`,
+`not_trigger`.
+
+```json
+{
+  "entries": [
+    {
+      "prompt": "Decompose this feature into subtasks",
+      "should_trigger": "map-plan",
+      "assertions": [
+        { "type": "contains", "value": "subtask" }
+      ]
+    },
+    {
+      "prompt": "Run quality gates",
+      "should_not_trigger": "map-plan",
+      "assertions": []
+    }
+  ]
+}
+```
+
+## --dry-run
+
+`--dry-run` validates the eval-set schema and prints the planned case count with estimated quota usage. No `claude -p` calls are made; no `.jsonl` is written.
+
+## Examples
+
+```bash
+# Validate eval-set without spending quota
+mapify skill-eval run map-plan --eval-set .map/evals/map-plan.json --dry-run
+
+# Run full eval with up to 8 parallel workers
+mapify skill-eval run map-plan --eval-set .map/evals/map-plan.json --max-concurrency 8
+
+# Resume an interrupted run
+mapify skill-eval run map-plan --eval-set .map/evals/map-plan.json --resume
+```
+
+## Troubleshooting
+
+- **`claude` not found** — `map-skill-eval` requires the `claude` CLI on `$PATH`. Install it and re-run `mapify init` to activate the skill.
+- **Eval-set validation error on `--dry-run`** — check that each case has a non-empty `id`, a `prompt`, and at least one `assertions` entry with a valid `type`.
+- **Run log not found for `--resume`** — `--resume` looks for the latest `.map/eval-runs/<skill>/<timestamp>.jsonl`. If no prior run exists, omit `--resume` to start fresh.
+- **All cases report `not_trigger` unexpectedly** — verify the skill name matches exactly (e.g. `map-plan`, not `map_plan`) and that `.claude/` was seeded correctly in the temp cwd.
+
+## Related Commands
+
+- `/map-plan` — plan and decompose tasks.
+- `/map-efficient` — full MAP workflow execution.
+- `/map-check` — run quality gates and verify MAP workflow completion.
diff --git a/.claude/skills/skill-rules.json b/.claude/skills/skill-rules.json
@@ -239,6 +239,18 @@
         ]
       }
     },
+    "map-skill-eval": {
+      "type": "manual",
+      "skillClass": "task",
+      "enforcement": "manual",
+      "priority": "medium",
+      "description": "Evaluate a /map-* skill's trigger accuracy + cost via mapify skill-eval (claude -p matrix, deterministic assertions, durable resumable runs).",
+      "requires-cmd": ["claude"],
+      "promptTriggers": {
+        "keywords": ["map-skill-eval","skill-eval","skill eval","evaluate skill","trigger accuracy","skill triggering"],
+        "intentPatterns": ["map-skill-eval","(eval|evaluate|measure|test).*(skill).*(trigger|fire|cost)","does .* skill trigger"]
+      }
+    },
     "map-task": {
       "type": "manual",
       "skillClass": "task",

diff --git a/CLAUDE.md b/CLAUDE.md
@@ -65,6 +65,7 @@ Validation:
 - "Not in the CI gate" is NOT a valid reason to skip. The error is real if any tool reported it.
 - "Static-analysis noise" is NOT a category. Either the type system is correct and the code is wrong, or the annotation needs fixing — pick one and fix it.
 - Only legitimate skip: the user explicitly approves deferral in the current conversation. Document the deferral in writing.
+- **Any error encountered while operating the MAP Framework itself must be fixed immediately, in the same change.** This covers the framework's own runtime — a hook that crashes or false-positives, a `.map/scripts/` runner or gate that errors or mis-reports, a `mapify` CLI traceback, a render/validator/blueprint failure, a broken `Task`/agent dispatch. When you hit one mid-task: STOP, find the root cause, and fix it before continuing the original work. Do NOT work around it, do NOT defer it as "unrelated", do NOT note-and-move-on past a broken tool. If the fix is genuinely out of scope or risky, stop and ask the user — never silently continue past a malfunctioning framework component. (Errors raised by an external plugin/hook NOT shipped by this repo are out of scope here; say so and route them to the user.)
 
 ## Bash Command Guidelines