Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
7 changes: 6 additions & 1 deletion .claude/hooks/safety-guardrails.py
Original file line number Diff line number Diff line change
Expand Up @@ -38,7 +38,12 @@

# Dangerous bash command patterns
_DEFAULT_DANGEROUS_COMMANDS = [
r"rm\s+-rf\s+/", # rm -rf /
# Block `rm -rf /` (bare root), `rm -rf /etc`, `rm -rf /home/user`, etc.,
# but ALLOW deletion of subpaths UNDER a temp root (rm -rf /tmp/<dir>,
# /private/tmp/<dir>, /var/folders/<dir>, /var/tmp/<dir>) — legitimate
# scratch cleanup. The negative lookahead requires a trailing slash, so the
# temp root itself (`rm -rf /tmp`) stays blocked; only children are allowed.
r"rm\s+-rf\s+/(?!(?:tmp|private/tmp|var/folders|var/tmp)/)", # rm -rf / (non-temp)
r"rm\s+-rf\s+\*", # rm -rf *
r"rm\s+-rf\s+\.\.", # rm -rf ..
r"git\s+push.*--force.*main",
Expand Down
16 changes: 16 additions & 0 deletions .claude/rules/learned/architecture-patterns.md
Original file line number Diff line number Diff line change
Expand Up @@ -161,3 +161,19 @@
# CORRECT: templates_src is fence-free; copier injects exactly once:
wrapped = f"# map:start\n{rendered}\n# map:end\n" if fenced else rendered
```

- **Spike-First Gating: High-Risk Binding Decisions Require a Docs-Only Artifact Before Implementation** (2026-06-04): When a subtask's answer would bind downstream implementation (which channel carries a value, which API call is idempotent, what schema a subprocess emits), run it FIRST as a docs-only spike that writes an artifact naming the empirical answer + the binding strategy, and commits ZERO production code. Downstream subtasks reference the artifact by name and consume it, not assumptions. A wrong assumption that is not spiked propagates into every component built on it and forces a rewrite cascade. In this workflow a research-agent wrongly claimed skill-activation wasn't recoverable from `claude -p`; the ST-001 spike empirically corrected it before any dispatcher code existed. The spike artifact MUST contain a named "binding strategy" section, not just findings (Monitor hard-stopped once for a missing strategy section). [workflow: map-efficient]

- **Producer-Owns-Parse: The Component That Owns the Subprocess Owns All Derived Fields; Consumers Read the Typed Result** (2026-06-04): When component A launches a subprocess (or owns a raw source) and component B consumes the result, ALL parsing/derivation (transcript reads, field extraction, signal combination) lives in A; B reads only the typed result struct and never re-implements parsing. Two payoffs: (1) a single parse site that a Mock producer can supply directly, so consumer tests need no subprocess/transcript fixture; (2) when the raw output schema changes, only A changes. Putting any parse in B re-couples the modules through the raw format. Extends "Contract-First Inter-Component JSON Schemas": the contract is A's typed struct, and the parse-to-struct boundary is A's responsibility exclusively. [workflow: map-efficient]
```python
# WRONG — runner re-parses a transcript it does not own (couples to raw format)
result = dispatcher.dispatch(cell) # raw proc output
skill = extract_skill_from_transcript(read_jsonl(result.session_id))

# CORRECT — dispatcher parses once into a typed field; runner just reads it
@dataclass
class DispatchResult:
triggered_skill: str | None # parsed by dispatcher, NOT by runner
token_usage: TokenUsage | None
# tests inject MockDispatcher(triggered_skill="map-plan") — no subprocess needed
```
39 changes: 39 additions & 0 deletions .claude/rules/learned/implementation-patterns.md
Original file line number Diff line number Diff line change
Expand Up @@ -128,3 +128,42 @@ paths:
dest.chmod(dest.stat().st_mode | 0o755)
# test guard: assert os.access(installed_hook, os.X_OK)
```

- **`claude -p` Output Has Two Channels: Envelope for Tokens, Transcript JSONL for Skill Name** (2026-06-04): When shelling `claude -p --output-format json` as a subprocess, two distinct output channels carry different information — do not confuse them. The JSON result envelope (stdout) carries `.result` (response text), `.usage` (input/output/cache tokens), and `.session_id`. The name of the skill/slash-surface that actually fired is NOT in the envelope — it is only in Claude Code's native transcript JSONL (located by session_id) as a `tool_use` block with `name=="Skill"` and `input.skill`. Deriving this from the framework's own scratch/digest schema rather than the native transcript yields a wrong claim. Verify empirically by reading the real transcript after a spike call; never infer from internal schema files. [workflow: map-efficient]
```python
env = json.loads(proc.stdout) # .result, .usage, .session_id
tokens = env["usage"] # CORRECT — tokens are in the envelope
# env.get("skill") -> None # WRONG — fired-skill is NOT in the envelope
for line in transcript_jsonl(env["session_id"]).read_text().splitlines():
m = json.loads(line)
if m.get("type") == "tool_use" and m.get("name") == "Skill":
triggered = m["input"]["skill"]; break
```

- **Scoped Config-Flag Mutation: Seed a Throwaway Temp Copy; Never Modify the Production Source of Truth** (2026-06-04): When a tool/test needs a shipped config flag to behave differently from its production default (e.g. stripping `disable-model-invocation: true` so an eval can auto-select skills), mutate the flag ONLY in a throwaway temp dir seeded with a copy of the production config, discarded after the subprocess exits. Never patch the source repo or `templates_src`. A blanket production flip is a footgun: it silently changes behavior for every other user of the flag and may be committed accidentally. Scope of mutation must match scope of need: one subprocess call → one throwaway dir, always cleaned up in `finally`. [workflow: map-efficient]
```python
tmp = Path(tempfile.mkdtemp())
shutil.copytree(REPO / ".claude", tmp / ".claude") # seed from production
strip_flag(tmp / ".claude" / "skills") # mutate throwaway ONLY
try:
subprocess.run(["claude", "-p", prompt, "--output-format", "json"], cwd=tmp)
finally:
shutil.rmtree(tmp) # production never touched
```

- **Clock-Free Core with Caller-Supplied Path: Inject Timestamps at the CLI Boundary, Not Inside the Worker** (2026-06-04): When a worker writes durable output (a timestamped JSONL, a run artifact), do NOT call `datetime.now()` inside the worker. Have the CLI/outermost caller generate the timestamped path and pass it as an explicit `out_path: Path` the worker treats as opaque. Benefits: (1) tests pass `tmp_path / "results.jsonl"` with zero clock monkeypatching; (2) the worker is deterministic given the same inputs+path; (3) resume keys on the path the CLI owns. Refines "Long-Running Operations Need Durable State by Default" by fixing WHERE path/timestamp generation lives — at the boundary, not the core. [workflow: map-efficient]
```python
# CORRECT: worker takes out_path; CLI owns the timestamp
def run_eval(*, entries, dispatcher, runs, out_path: Path, resume=False) -> list: ...
# CLI: out = default_run_path(root, skill, datetime.now(tz).strftime("%Y%m%dT%H%M%SZ"))
# Test: run_eval(..., out_path=tmp_path / "r.jsonl") # no time mocking
```

- **Concurrent Durable Append: threading.Lock for Line Integrity + Stable cell_id Resume Key** (2026-06-04): When parallel workers append JSONL lines to a shared durable file, two invariants must BOTH hold: (1) no interleaved partial lines — guard each `f.write(line + "\n")` with a threading.Lock; (2) resume is idempotent regardless of write order — key on a stable id present in every record (cell_id), never on line number/position. Nondeterministic write order is fine as long as resume dedups by id. Each worker subprocess also runs in its own temp cwd so concurrent subprocesses never share a working dir. Complements "Long-Running Operations Need Durable State" (process-restart durability) with within-process concurrency safety. [workflow: map-efficient]
```python
with self._lock: # atomic per-line append
with out_path.open("a", encoding="utf-8") as f:
f.write(json.dumps(record) + "\n")
done = {json.loads(l)["cell_id"] for l in out_path.read_text().splitlines() if l.strip()}
pending = [c for c in cells if make_cell_id(...) not in done] # order-independent resume
```
9 changes: 9 additions & 0 deletions .claude/rules/learned/testing-strategies.md
Original file line number Diff line number Diff line change
Expand Up @@ -139,3 +139,12 @@ paths:
# 3. git restore <file> -> confirm GREEN
# 4. commit file + test together
```

- **Blueprint-Named Test Functions Are a Monitor Contract: Author Them in the Same Subtask as the Code** (2026-06-04): When a subtask blueprint's `test_strategy` names specific pytest function names (e.g. `test_vc3_resume_skips_present_cell_ids`), Monitor treats those names as a HARD completeness contract: a subtask whose logic is correct but whose blueprint-named functions do not yet exist gets `valid=false` (hard stop). The completeness unit is code + named-test-functions-together, not code alone — the blueprint author chose the names to specify observable behavior, so an absent name means the behavior is unverified. Never stub a named test with `pass`/`# TODO` and call the subtask done; the stub satisfies the import but not the contract. In this workflow ST-005's runner code was correct but Monitor hard-stopped until the four named VC tests were authored with real assertions. [workflow: map-efficient]

- **Final Verification Must Check Shipped Docs Against Actual Behavior, Then Grep for the Same Drift Class** (2026-06-04): After code+tests are green, a dedicated final-verification pass must validate that user-facing docs (SKILL.md, README, CLI `--help`) match actual behavior: default values, accepted schema formats, flag names, output field names. Prose drift is invisible to pytest/ruff/mypy. When the first drift instance is found, immediately grep the WHOLE doc for the same class of claim (every `--flag default`, every schema example, every accepted file-format mention) before moving on — drift clusters because the doc was written once from a design doc, not from running code. Here the final-verifier caught a `--max-concurrency` default of 4 (actual 1); grepping the same file then surfaced a fictional YAML eval-set schema block + `.yaml` examples that the JSON-only loader could never parse. [workflow: map-efficient]
```bash
# one drift found -> grep the whole doc for the drift class before marking done
mapify skill-eval --help | grep -i max-concurrency # actual default
grep -nE 'default|yaml|schema|--[a-z-]+' docs/SKILL.md # reconcile every claim
```
6 changes: 4 additions & 2 deletions .claude/skills/map-efficient/SKILL.md
Original file line number Diff line number Diff line change
Expand Up @@ -191,8 +191,10 @@ python3 .map/scripts/map_step_runner.py record_test_baseline "$BRANCH"

Snapshots pre-existing failures so later subtasks distinguish
"introduced regression" from "was broken pre-plan". Auto-detects
Make/pytest/go test/cargo. Overrides + narrow-target guidance:
[efficient-reference.md](efficient-reference.md#pre-flight-test-baseline).
Make/pytest/go test/cargo. It captures the test run internally and prints a
single compact JSON report at the end — read that JSON directly; do NOT pipe it
through `head`/`tail` (per the repo bash guidelines). Overrides + narrow-target
guidance: [efficient-reference.md](efficient-reference.md#pre-flight-test-baseline).

### Wave Computation (after INIT_STATE) - REQUIRED

Expand Down
5 changes: 5 additions & 0 deletions .claude/skills/map-efficient/efficient-reference.md
Original file line number Diff line number Diff line change
Expand Up @@ -203,6 +203,11 @@ fix or defer.
python3 .map/scripts/map_step_runner.py record_test_baseline "$BRANCH"
```

It captures the test run internally and prints a single compact JSON report at
the end — read that JSON directly. Do NOT pipe it through `head`/`tail` (per the
repo bash guidelines); the output is one small object, not a stream, so
truncating it only hides fields.

Auto-detects from project markers:
- `Makefile` with `test:` target → `make test`
- `pyproject.toml` / `pytest.ini` → `pytest`
Expand Down
94 changes: 94 additions & 0 deletions .claude/skills/map-skill-eval/SKILL.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,94 @@
---
name: map-skill-eval
description: |
Evaluate a /map-* skill's trigger accuracy and cost. Use when asked to measure skill trigger accuracy, run an eval-set, or check token/duration cost via `mapify skill-eval`. Do NOT use to plan or implement; use map-plan or map-efficient.
effort: medium
disable-model-invocation: true
argument-hint: "[skill] [--eval-set PATH]"
---
# /map-skill-eval — Skill Trigger Accuracy & Cost Evaluation

Purpose: measure whether a `/map-*` skill fires on the right prompts and what it costs in tokens and time. Do not plan or implement from this skill.

Requires the `claude` CLI (installed and on `$PATH`). The skill is skipped at install time on hosts without `claude`.

## Invocation

```bash
mapify skill-eval run <skill> --eval-set PATH [--dry-run] [--resume] [--max-concurrency N]
```

- `<skill>` — the skill name to evaluate (e.g. `map-plan`).
- `--eval-set PATH` — path to a JSON eval-set file defining prompt cases and expected assertions.
- `--dry-run` — validate the eval-set and print the planned run count without spending any quota.
- `--resume` — continue an interrupted run from the last durable checkpoint.
- `--max-concurrency N` — max parallel `claude -p` workers (default: 1).

## What It Does

1. **Prompts × runs matrix** — for each case in the eval-set, invokes `claude -p` in an isolated temporary working directory seeded with `.claude/` (skills, settings). Runs are independent; no shared state leaks between cases.
2. **Transcript-parse trigger detection** — parses each `claude -p` transcript to determine whether the target skill fired (trigger) or did not fire (not_trigger).
3. **Deterministic assertions** — each eval case may specify one or more assertion types:
- `contains` / `not_contains` — substring presence in the response.
- `regex` — pattern match against the response.
- `valid_json` — response parses as JSON.
- `trigger` / `not_trigger` — skill fired / did not fire.
4. **Durable resumable run log** — results are appended to `.map/eval-runs/<skill>/<timestamp>.jsonl` as each case completes, so a partial run is recoverable via `--resume`.
5. **Summary report** — after all cases complete, prints pass-rate (passed/total) plus per-case token usage, duration, and cache-hit stats.

## Eval-Set Format

A JSON object with an `entries` array. Each entry has a `prompt`, optional
`should_trigger` / `should_not_trigger` skill names (the runner turns these into
`trigger` / `not_trigger` assertions), and an optional `assertions` array.
Assertion types: `contains`, `not_contains`, `regex`, `valid_json`, `trigger`,
`not_trigger`.

```json
{
"entries": [
{
"prompt": "Decompose this feature into subtasks",
"should_trigger": "map-plan",
"assertions": [
{ "type": "contains", "value": "subtask" }
]
},
{
"prompt": "Run quality gates",
"should_not_trigger": "map-plan",
"assertions": []
}
]
}
```

## --dry-run

`--dry-run` validates the eval-set schema and prints the planned case count with estimated quota usage. No `claude -p` calls are made; no `.jsonl` is written.

## Examples

```bash
# Validate eval-set without spending quota
mapify skill-eval run map-plan --eval-set .map/evals/map-plan.json --dry-run

# Run full eval with up to 8 parallel workers
mapify skill-eval run map-plan --eval-set .map/evals/map-plan.json --max-concurrency 8

# Resume an interrupted run
mapify skill-eval run map-plan --eval-set .map/evals/map-plan.json --resume
```

## Troubleshooting

- **`claude` not found** — `map-skill-eval` requires the `claude` CLI on `$PATH`. Install it and re-run `mapify init` to activate the skill.
- **Eval-set validation error on `--dry-run`** — check that each case has a non-empty `id`, a `prompt`, and at least one `assertions` entry with a valid `type`.
- **Run log not found for `--resume`** — `--resume` looks for the latest `.map/eval-runs/<skill>/<timestamp>.jsonl`. If no prior run exists, omit `--resume` to start fresh.
- **All cases report `not_trigger` unexpectedly** — verify the skill name matches exactly (e.g. `map-plan`, not `map_plan`) and that `.claude/` was seeded correctly in the temp cwd.

## Related Commands

- `/map-plan` — plan and decompose tasks.
- `/map-efficient` — full MAP workflow execution.
- `/map-check` — run quality gates and verify MAP workflow completion.
12 changes: 12 additions & 0 deletions .claude/skills/skill-rules.json
Original file line number Diff line number Diff line change
Expand Up @@ -239,6 +239,18 @@
]
}
},
"map-skill-eval": {
"type": "manual",
"skillClass": "task",
"enforcement": "manual",
"priority": "medium",
"description": "Evaluate a /map-* skill's trigger accuracy + cost via mapify skill-eval (claude -p matrix, deterministic assertions, durable resumable runs).",
"requires-cmd": ["claude"],
"promptTriggers": {
"keywords": ["map-skill-eval","skill-eval","skill eval","evaluate skill","trigger accuracy","skill triggering"],
"intentPatterns": ["map-skill-eval","(eval|evaluate|measure|test).*(skill).*(trigger|fire|cost)","does .* skill trigger"]
}
},
"map-task": {
"type": "manual",
"skillClass": "task",
Expand Down
1 change: 1 addition & 0 deletions CLAUDE.md
Original file line number Diff line number Diff line change
Expand Up @@ -65,6 +65,7 @@ Validation:
- "Not in the CI gate" is NOT a valid reason to skip. The error is real if any tool reported it.
- "Static-analysis noise" is NOT a category. Either the type system is correct and the code is wrong, or the annotation needs fixing — pick one and fix it.
- Only legitimate skip: the user explicitly approves deferral in the current conversation. Document the deferral in writing.
- **Any error encountered while operating the MAP Framework itself must be fixed immediately, in the same change.** This covers the framework's own runtime — a hook that crashes or false-positives, a `.map/scripts/` runner or gate that errors or mis-reports, a `mapify` CLI traceback, a render/validator/blueprint failure, a broken `Task`/agent dispatch. When you hit one mid-task: STOP, find the root cause, and fix it before continuing the original work. Do NOT work around it, do NOT defer it as "unrelated", do NOT note-and-move-on past a broken tool. If the fix is genuinely out of scope or risky, stop and ask the user — never silently continue past a malfunctioning framework component. (Errors raised by an external plugin/hook NOT shipped by this repo are out of scope here; say so and route them to the user.)

## Bash Command Guidelines

Expand Down
Loading
Loading