diff --git a/README.md b/README.md
index 4a14326..6d2e28a 100644
--- a/README.md
+++ b/README.md
@@ -77,7 +77,7 @@ Full install instructions: [`docs/how-to/install.md`](docs/how-to/install.md). F
 
 ## Benchmarks
 
-We use a canonical prompt — an AI-driven roguelike POC — to spot regressions as the system evolves. See [`benchmarks/`](benchmarks/) for the prompt, expected output shape, and a `run.sh` to re-run it.
+Canonical prompts for spotting regressions as the system evolves are kept under [`benchmarks/`](benchmarks/). See [`benchmarks/README.md`](benchmarks/README.md) for the layout convention.
 
 ## Contributing
 
diff --git a/benchmarks/README.md b/benchmarks/README.md
index a5416ca..8510048 100644
--- a/benchmarks/README.md
+++ b/benchmarks/README.md
@@ -2,23 +2,26 @@
 
 Canonical prompts we run against the decision-record planning pipeline to catch regressions as the system evolves.
 
-| Benchmark | Prompt | Effort | Purpose |
-|---|---|---|---|
-| [roguelike-ai-poc](roguelike-ai-poc/) | AI-driven roguelike where the agent plays the game | `poc` | Exercises all five pipeline phases on a small, well-bounded problem. The original dogfood case. |
+_(No public benchmarks are committed yet. Add new ones as `benchmarks/<name>/` with a `prompt.md`, a `reference/` artifact snapshot, and a `run.sh` runner, following the "Benchmark layout" section below.)_
 
-## How to run a benchmark
+## Benchmark layout
+
+Each benchmark lives in its own directory:
+
+```
+benchmarks/<name>/
+├── prompt.md      # the exact idea, effort level, and what "good output" looks like
+├── reference/     # a baseline artifact snapshot from a canonical run
+└── run.sh         # one-shot runner that fires the CLI against a fresh tmp dir
+```
+
+## How to run
 
 ```bash
 cd benchmarks/<name>
 ./run.sh
 ```
 
-Each benchmark has:
-
-- `prompt.md` — the exact idea, effort level, and what "good output" looks like
-- `reference/` — a baseline artifact snapshot from a canonical run
-- `run.sh` — one-shot runner that fires the CLI against a fresh tmp dir
-
 ## What we look for when comparing runs
 
 Each benchmark's `prompt.md` defines its own success criteria. Generally:
diff --git a/benchmarks/roguelike-ai-poc/prompt.md b/benchmarks/roguelike-ai-poc/prompt.md
deleted file mode 100644
index 745bdb9..0000000
--- a/benchmarks/roguelike-ai-poc/prompt.md
+++ /dev/null
@@ -1,63 +0,0 @@
-# Benchmark: roguelike-ai-poc
-
-This is the canonical benchmark for the decision-record planning pipeline. We re-run it as the system evolves to spot regressions in plan quality, gate behavior, agent prompts, and rendering.
-
-## The prompt
-
-**Idea (free-form):**
-
-> A minimal roguelike where the player primes an AI agent with a strategy, then the agent autonomously navigates a single ASCII-rendered room over a tick system until it wins the objective or dies. Goal: prove the agent-as-player concept with the smallest viable surface area.
-
-**Effort level:** `poc`
-
-## Invocation
-
-```bash
-decision-record \
-  --title "AI-driven roguelike POC" \
-  --description "$(cat <<'EOF'
-A minimal roguelike where the player primes an AI agent with a strategy, then the agent autonomously navigates a single ASCII-rendered room over a tick system until it wins the objective or dies. Goal: prove the agent-as-player concept with the smallest viable surface area.
-EOF
-)" \
-  --effort poc \
-  --cwd ./tmp-roguelike-bench \
-  --yes
-```
-
-Or the one-shot wrapper: `./run.sh` (creates a fresh tmp dir, runs the CLI, prints where the artifacts landed).
-
-## What "good output" looks like
-
-A run is healthy if the produced plan:
-
-- **Pipeline reaches `handed-off`** — every gate passes, sign-offs recorded, project finalized.
-- **3-5 significant decisions** are proposed and accepted — language, world representation, agent action contract, tick-loop control. (Not 1; not 12.)
-- **5-8 vertical-slice tasks** — bootstrap → world → renderer → agent client → action handlers → game loop → CLI entry. Every leaf ≤ 16h (poc cap). Every task references at least one accepted DR.
-- **The seed library is consulted** for at least the language decision (`dr_seed_search` + `dr_seed_load` on `language-choice`).
-- **Graph validates clean** — no cycles, no orphan deps, no missing decision refs.
-- **Artifacts emitted** — `dr/project.json`, `dr/decisions/*.json`, `dr/tasks/*.json`, rendered `.md` siblings, `dr/index.html`. `.dr/events.jsonl` contains a coherent audit trail.
-
-## Reference snapshot
-
-`./reference/` holds the artifacts from the canonical run produced by hand-driving the MCP tools (2026-05-16, the dogfood test that originally produced this benchmark). Treat it as a "this is what good looks like" baseline, not a strict equality target — different agent runs will pick slightly different positions, phrasing, and task decomposition, and that's fine.
-
-When comparing a new run against `./reference/`:
-
-- **Same final phase, gate decisions, event mix** → no regression.
-- **More/fewer decisions or tasks** → check whether the new run is denser/sparser appropriately or whether the agent over- or under-decomposed.
-- **Different selected positions** → fine if defensible; concerning if the argument is weaker.
-- **Missing seed usage** → bug or prompt drift; the agent should reach for `language-choice` here.
-- **Tasks without decision refs** → regression. Every task must link to a DR.
-- **Validation failures** → regression. The graph must validate.
-
-## What this benchmark exercises
-
-| Surface | Coverage |
-|---|---|
-| Phase machine | All five transitions: intake → scoping → deciding → decomposing → handing-off → handed-off |
-| Seed library | At least one `dr_seed_load` (language-choice) |
-| Decision lifecycle | propose → update with position + argument → accept (no review under poc preset) |
-| Task graph | Multi-node dependency chain with decision_refs |
-| Gates | `min_tasks=3`, `max_task_estimate_hours=16`, `require_human_signoff_phases=['handing-off']` |
-| Render | Markdown per record + static HTML index |
-| Handoff | Filesystem path (Linear path is exercised by separate live test) |
diff --git a/benchmarks/roguelike-ai-poc/reference/decisions/0001-choose-the-implementation-language.json b/benchmarks/roguelike-ai-poc/reference/decisions/0001-choose-the-implementation-language.json
deleted file mode 100644
index f07d744..0000000
--- a/benchmarks/roguelike-ai-poc/reference/decisions/0001-choose-the-implementation-language.json
+++ /dev/null
@@ -1,115 +0,0 @@
-{
-  "id": "0001-choose-the-implementation-language",
-  "number": 1,
-  "slug": "choose-the-implementation-language",
-  "title": "Choose the implementation language",
-  "status": "accepted",
-  "template_variant": "architecture",
-  "created_at": "2026-05-17T04:13:38.681Z",
-  "updated_at": "2026-05-17T04:13:38.685Z",
-  "summary": "Decide the primary implementation language for the project.",
-  "issue": "Every other foundational decision (runtime, package manager, framework choices, testing tools) flows from the language choice. Picking this early and explicitly avoids drift.",
-  "assumptions": [
-    "Team has existing language strengths to lean on.",
-    "Project lifespan is long enough that hiring and onboarding matter.",
-    "Ecosystem maturity matters for the project's domain."
-  ],
-  "constraints": [
-    "Team's current expertise.",
-    "Target runtime environments (browser, server, native, embedded).",
-    "Performance and memory budgets.",
-    "Licensing or compliance restrictions on language ecosystems."
-  ],
-  "positions": [
-    {
-      "title": "TypeScript",
-      "description": "Strongly typed JavaScript. Best for full-stack web work, ubiquitous tooling.",
-      "pros": [
-        "Ubiquitous in web",
-        "Strong types catch errors early",
-        "Massive ecosystem",
-        "Frontend/backend code sharing"
-      ],
-      "cons": [
-        "Build step overhead",
-        "Type system can be over-engineered",
-        "Slower than native languages for hot paths"
-      ],
-      "links": []
-    },
-    {
-      "title": "Python",
-      "description": "Dynamic, batteries-included. Best for data work, scripting, ML, fast prototypes.",
-      "pros": [
-        "Excellent ML/data ecosystem",
-        "Fast to write",
-        "Readable",
-        "Huge stdlib"
-      ],
-      "cons": [
-        "Slow runtime without C extensions",
-        "GIL limits concurrency",
-        "Dynamic typing → runtime errors"
-      ],
-      "links": []
-    },
-    {
-      "title": "Go",
-      "description": "Statically typed, compiled, built for concurrent services.",
-      "pros": [
-        "Simple language",
-        "Single binary deployment",
-        "Strong concurrency primitives",
-        "Fast compile times"
-      ],
-      "cons": [
-        "Generics still maturing",
-        "Verbose error handling",
-        "Less rich third-party ecosystem than JS/Python"
-      ],
-      "links": []
-    },
-    {
-      "title": "Rust",
-      "description": "Memory-safe systems language. Best for performance-critical or systems work.",
-      "pros": [
-        "No GC, predictable performance",
-        "Memory safety",
-        "Excellent tooling (cargo)",
-        "Strong types"
-      ],
-      "cons": [
-        "Steep learning curve",
-        "Slower to ship initial features",
-        "Compile times can be long"
-      ],
-      "links": []
-    }
-  ],
-  "opinions": [],
-  "argument": "Python is fastest to write for a single-script game-loop POC. The OpenAI SDK + a tiny terminal renderer fit naturally; no build step or transpile loop slows iteration. Team is comfortable with Python and the project never needs to leave a single repo.",
-  "selected_position": "Python",
-  "implications": [
-    "Use the official openai Python SDK for agent calls.",
-    "Single-file or small-module layout; no package manager beyond pip/uv.",
-    "Pin to Python 3.11+ for ergonomic match-statement parsing of agent actions."
-  ],
-  "depends_on": [],
-  "related_decisions": [],
-  "related_artifacts": [],
-  "review": [],
-  "sign_off": {
-    "by": "human",
-    "actor": "kj",
-    "at": "2026-05-17T04:13:38.685Z",
-    "notes": "poc preset, no review required"
-  },
-  "seed_origin": "language-choice",
-  "tags": [
-    "foundation",
-    "poc",
-    "foundation",
-    "architecture",
-    "stack"
-  ]
-}
diff --git a/benchmarks/roguelike-ai-poc/reference/decisions/0001-choose-the-implementation-language.md b/benchmarks/roguelike-ai-poc/reference/decisions/0001-choose-the-implementation-language.md
deleted file mode 100644
index 8a3a4b3..0000000
--- a/benchmarks/roguelike-ai-poc/reference/decisions/0001-choose-the-implementation-language.md
+++ /dev/null
@@ -1,120 +0,0 @@
-# 0001-choose-the-implementation-language — Choose the implementation language
-
-| Field | Value |
-| --- | --- |
-| Status | `accepted` |
-| Template | `architecture` |
-| Updated | 2026-05-17T04:13:38.685Z |
-| Selected | **Python** |
-| Depends on | _(none)_ |
-
-## Summary
-
-Decide the primary implementation language for the project.
-
-## Issue
-
-Every other foundational decision (runtime, package manager, framework choices, testing tools) flows from the language choice. Picking this early and explicitly avoids drift.
-
-## Assumptions
-
-- Team has existing language strengths to lean on.
-- Project lifespan is long enough that hiring and onboarding matter.
-- Ecosystem maturity matters for the project's domain.
-
-## Constraints
-
-- Team's current expertise.
-- Target runtime environments (browser, server, native, embedded).
-- Performance and memory budgets.
-- Licensing or compliance restrictions on language ecosystems.
-
-## Positions
-
-### TypeScript
-
-Strongly typed JavaScript. Best for full-stack web work, ubiquitous tooling.
-
-**Pros**
-
-- Ubiquitous in web
-- Strong types catch errors early
-- Massive ecosystem
-- Frontend/backend code sharing
-
-**Cons**
-
-- Build step overhead
-- Type system can be over-engineered
-- Slower than native languages for hot paths
-
-### Python ✅
-
-Dynamic, batteries-included. Best for data work, scripting, ML, fast prototypes.
-
-**Pros**
-
-- Excellent ML/data ecosystem
-- Fast to write
-- Readable
-- Huge stdlib
-
-**Cons**
-
-- Slow runtime without C extensions
-- GIL limits concurrency
-- Dynamic typing → runtime errors
-
-### Go
-
-Statically typed, compiled, built for concurrent services.
-
-**Pros**
-
-- Simple language
-- Single binary deployment
-- Strong concurrency primitives
-- Fast compile times
-
-**Cons**
-
-- Generics still maturing
-- Verbose error handling
-- Less rich third-party ecosystem than JS/Python
-
-### Rust
-
-Memory-safe systems language. Best for performance-critical or systems work.
-
-**Pros**
-
-- No GC, predictable performance
-- Memory safety
-- Excellent tooling (cargo)
-- Strong types
-
-**Cons**
-
-- Steep learning curve
-- Slower to ship initial features
-- Compile times can be long
-
-## Argument
-
-Python is fastest to write for a single-script game-loop POC. The OpenAI SDK + a tiny terminal renderer fit naturally; no build step or transpile loop slows iteration. Team is comfortable with Python and the project never needs to leave a single repo.
-
-## Implications
-
-- Use the official openai Python SDK for agent calls.
-- Single-file or small-module layout; no package manager beyond pip/uv.
-- Pin to Python 3.11+ for ergonomic match-statement parsing of agent actions.
-
-## Sign-off
-
-- **By:** kj (human)
-- **At:** 2026-05-17T04:13:38.685Z
-- **Notes:** poc preset, no review required
-
----
-
-_Instantiated from seed: `language-choice`_
diff --git a/benchmarks/roguelike-ai-poc/reference/decisions/0002-define-the-world-representation-and-renderer.json b/benchmarks/roguelike-ai-poc/reference/decisions/0002-define-the-world-representation-and-renderer.json
deleted file mode 100644
index 7afe41a..0000000
--- a/benchmarks/roguelike-ai-poc/reference/decisions/0002-define-the-world-representation-and-renderer.json
+++ /dev/null
@@ -1,85 +0,0 @@
-{
-  "id": "0002-define-the-world-representation-and-renderer",
-  "number": 2,
-  "slug": "define-the-world-representation-and-renderer",
-  "title": "Define the world representation and renderer",
-  "status": "accepted",
-  "template_variant": "data-model",
-  "created_at": "2026-05-17T04:13:38.686Z",
-  "updated_at": "2026-05-17T04:13:38.688Z",
-  "summary": "How the room is stored in memory and rendered to the terminal each tick.",
-  "issue": "The world is small (one 10×10 room) but the representation must support: easy frame rendering, fast collision/hazard checks, and a stable serialization that the agent can read on each tick. Pick a model now so the action handlers and renderer can converge.",
-  "assumptions": [
-    "10×10 fixed grid",
-    "Single player entity",
-    "Static tiles set at startup",
-    "Frame fits in a single terminal redraw"
-  ],
-  "constraints": [
-    "Frame must be readable both by humans and the LLM",
-    "No external graphics libraries"
-  ],
-  "positions": [
-    {
-      "title": "Nested list of chars",
-      "description": "world: list[list[str]] indexed by [y][x]. Player position stored separately.",
-      "pros": [
-        "Simplest possible",
-        "Trivial to mutate",
-        "Renders by row-join"
-      ],
-      "cons": [
-        "No type safety on tile semantics",
-        "Have to scan grid for entity positions"
-      ],
-      "links": []
-    },
-    {
-      "title": "Tile-grid + entity dict",
-      "description": "static_tiles: list[list[str]] for walls/floor/hazard/exit; entities: dict[id, {pos, hp, glyph}] overlaid at render time.",
-      "pros": [
-        "Separates static map from dynamic state",
-        "Easy to add entities later if needed",
-        "Clean serialization to JSON"
-      ],
-      "cons": [
-        "Two structures to keep consistent",
-        "Slightly more code"
-      ],
-      "links": []
-    },
-    {
-      "title": "Single 2D numpy array + glyph table",
-      "description": "Each cell is an int; render by mapping ints to glyphs.",
-      "pros": [
-        "Compact",
-        "Fast",
-        "Numpy is familiar"
-      ],
-      "cons": [
-        "Numpy is overkill for 10×10",
-        "Adds a dep we do not otherwise need",
-        "Less Pythonic for tiny data"
-      ],
-      "links": []
-    }
-  ],
-  "opinions": [],
-  "argument": "Static map + entity overlay is the simplest model that survives the day-2 question can we add a second entity? without a rewrite. It serializes naturally to JSON for the LLM payload and keeps render code in one row-join.",
-  "selected_position": "Tile-grid + entity dict",
-  "implications": [
-    "Tile glyphs: # wall, . floor, X hazard, > exit; entities overlay (@ for player).",
-    "Each tick the renderer composes static_tiles + entity glyphs at their positions.",
-    "JSON state sent to the agent: { frame: [<row strings>], hp, tick, exit_pos, player_pos }."
-  ],
-  "depends_on": [],
-  "related_decisions": [],
-  "related_artifacts": [],
-  "review": [],
-  "sign_off": {
-    "by": "human",
-    "actor": "kj",
-    "at": "2026-05-17T04:13:38.688Z"
-  },
-  "tags": []
-}
diff --git a/benchmarks/roguelike-ai-poc/reference/decisions/0002-define-the-world-representation-and-renderer.md b/benchmarks/roguelike-ai-poc/reference/decisions/0002-define-the-world-representation-and-renderer.md
deleted file mode 100644
index dfbf675..0000000
--- a/benchmarks/roguelike-ai-poc/reference/decisions/0002-define-the-world-representation-and-renderer.md
+++ /dev/null
@@ -1,92 +0,0 @@
-# 0002-define-the-world-representation-and-renderer — Define the world representation and renderer
-
-| Field | Value |
-| --- | --- |
-| Status | `accepted` |
-| Template | `data-model` |
-| Updated | 2026-05-17T04:13:38.688Z |
-| Selected | **Tile-grid + entity dict** |
-| Depends on | _(none)_ |
-
-## Summary
-
-How the room is stored in memory and rendered to the terminal each tick.
-
-## Issue
-
-The world is small (one 10×10 room) but the representation must support: easy frame rendering, fast collision/hazard checks, and a stable serialization that the agent can read on each tick. Pick a model now so the action handlers and renderer can converge.
-
-## Assumptions
-
-- 10×10 fixed grid
-- Single player entity
-- Static tiles set at startup
-- Frame fits in a single terminal redraw
-
-## Constraints
-
-- Frame must be readable both by humans and the LLM
-- No external graphics libraries
-
-## Positions
-
-### Nested list of chars
-
-world: list[list[str]] indexed by [y][x]. Player position stored separately.
-
-**Pros**
-
-- Simplest possible
-- Trivial to mutate
-- Renders by row-join
-
-**Cons**
-
-- No type safety on tile semantics
-- Have to scan grid for entity positions
-
-### Tile-grid + entity dict ✅
-
-static_tiles: list[list[str]] for walls/floor/hazard/exit; entities: dict[id, {pos, hp, glyph}] overlaid at render time.
-
-**Pros**
-
-- Separates static map from dynamic state
-- Easy to add entities later if needed
-- Clean serialization to JSON
-
-**Cons**
-
-- Two structures to keep consistent
-- Slightly more code
-
-### Single 2D numpy array + glyph table
-
-Each cell is an int; render by mapping ints to glyphs.
-
-**Pros**
-
-- Compact
-- Fast
-- Numpy is familiar
-
-**Cons**
-
-- Numpy is overkill for 10×10
-- Adds a dep we do not otherwise need
-- Less Pythonic for tiny data
-
-## Argument
-
-Static map + entity overlay is the simplest model that survives the day-2 question can we add a second entity? without a rewrite. It serializes naturally to JSON for the LLM payload and keeps render code in one row-join.
-
-## Implications
-
-- Tile glyphs: # wall, . floor, X hazard, > exit; entities overlay (@ for player).
-- Each tick the renderer composes static_tiles + entity glyphs at their positions.
-- JSON state sent to the agent: { frame: [<row strings>], hp, tick, exit_pos, player_pos }.
-
-## Sign-off
-
-- **By:** kj (human)
-- **At:** 2026-05-17T04:13:38.688Z
diff --git a/benchmarks/roguelike-ai-poc/reference/decisions/0003-define-the-agent-action-contract.json b/benchmarks/roguelike-ai-poc/reference/decisions/0003-define-the-agent-action-contract.json
deleted file mode 100644
index 0e98040..0000000
--- a/benchmarks/roguelike-ai-poc/reference/decisions/0003-define-the-agent-action-contract.json
+++ /dev/null
@@ -1,83 +0,0 @@
-{
-  "id": "0003-define-the-agent-action-contract",
-  "number": 3,
-  "slug": "define-the-agent-action-contract",
-  "title": "Define the agent action contract",
-  "status": "accepted",
-  "template_variant": "architecture",
-  "created_at": "2026-05-17T04:13:38.689Z",
-  "updated_at": "2026-05-17T04:13:38.690Z",
-  "summary": "How the LLM receives the world state per tick and how it returns the chosen action.",
-  "issue": "The agent must produce a structured, validated action every tick. We need the protocol pinned so the game loop never has to guess what the agent meant.",
-  "assumptions": [
-    "OpenAI-compatible API is the LLM transport",
-    "Strategy prompt is supplied once at startup",
-    "Per-tick latency budget ~2-5s is acceptable"
-  ],
-  "constraints": [
-    "Action set is small (move N/S/E/W + noop)",
-    "Agent must not stall the game with malformed output",
-    "Must be debuggable from logs"
-  ],
-  "positions": [
-    {
-      "title": "Plain-text response parsing",
-      "description": "Agent returns N/S/E/W/noop as plain text; we parse first token.",
-      "pros": [
-        "Lowest token cost",
-        "Works with any model"
-      ],
-      "cons": [
-        "Brittle to extra punctuation/prose",
-        "No reasoning surface",
-        "Hard to audit why"
-      ],
-      "links": []
-    },
-    {
-      "title": "Tool-call (function calling) with one tool: do_action(direction)",
-      "description": "Define a single OpenAI tool; agent invokes it once per tick with a strict enum direction.",
-      "pros": [
-        "Schema-validated",
-        "Free reasoning text alongside the call",
-        "Easy to extend with new actions later"
-      ],
-      "cons": [
-        "Slightly more tokens per call",
-        "Requires a model that supports function calling"
-      ],
-      "links": []
-    },
-    {
-      "title": "JSON-only response with output_config",
-      "description": "Force agent to emit {\"action\":\"N\",\"reason\":\"…\"} via structured outputs.",
-      "pros": [
-        "Schema-validated",
-        "Reasoning captured in same payload"
-      ],
-      "cons": [
-        "Some providers do not honor strict mode",
-        "Slightly more setup than tool-call"
-      ],
-      "links": []
-    }
-  ],
-  "opinions": [],
-  "argument": "Tool-calling is the cleanest contract: the model gets free-form reasoning in `content` AND a strict-enum action in `tool_calls`. We can log both, and extending to new actions later is just adding enum values. Plain-text parsing trades 100 tokens of savings for a constant brittleness tax.",
-  "selected_position": "Tool-call (function calling) with one tool: do_action(direction)",
-  "implications": [
-    "Define tool `do_action` with input_schema requiring `direction` in {N,S,E,W,noop}.",
-    "Use tool_choice=\"required\" each tick to force a call.",
-    "Log the assistant message text (the reasoning) alongside the chosen direction for replay/debug."
-  ],
-  "depends_on": [],
-  "related_decisions": [],
-  "related_artifacts": [],
-  "review": [],
-  "sign_off": {
-    "by": "human",
-    "actor": "kj",
-    "at": "2026-05-17T04:13:38.690Z"
-  },
-  "tags": []
-}
diff --git a/benchmarks/roguelike-ai-poc/reference/decisions/0003-define-the-agent-action-contract.md b/benchmarks/roguelike-ai-poc/reference/decisions/0003-define-the-agent-action-contract.md
deleted file mode 100644
index 1bd6e3a..0000000
--- a/benchmarks/roguelike-ai-poc/reference/decisions/0003-define-the-agent-action-contract.md
+++ /dev/null
@@ -1,90 +0,0 @@
-# 0003-define-the-agent-action-contract — Define the agent action contract
-
-| Field | Value |
-| --- | --- |
-| Status | `accepted` |
-| Template | `architecture` |
-| Updated | 2026-05-17T04:13:38.690Z |
-| Selected | **Tool-call (function calling) with one tool: do_action(direction)** |
-| Depends on | _(none)_ |
-
-## Summary
-
-How the LLM receives the world state per tick and how it returns the chosen action.
-
-## Issue
-
-The agent must produce a structured, validated action every tick. We need the protocol pinned so the game loop never has to guess what the agent meant.
-
-## Assumptions
-
-- OpenAI-compatible API is the LLM transport
-- Strategy prompt is supplied once at startup
-- Per-tick latency budget ~2-5s is acceptable
-
-## Constraints
-
-- Action set is small (move N/S/E/W + noop)
-- Agent must not stall the game with malformed output
-- Must be debuggable from logs
-
-## Positions
-
-### Plain-text response parsing
-
-Agent returns N/S/E/W/noop as plain text; we parse first token.
-
-**Pros**
-
-- Lowest token cost
-- Works with any model
-
-**Cons**
-
-- Brittle to extra punctuation/prose
-- No reasoning surface
-- Hard to audit why
-
-### Tool-call (function calling) with one tool: do_action(direction) ✅
-
-Define a single OpenAI tool; agent invokes it once per tick with a strict enum direction.
-
-**Pros**
-
-- Schema-validated
-- Free reasoning text alongside the call
-- Easy to extend with new actions later
-
-**Cons**
-
-- Slightly more tokens per call
-- Requires a model that supports function calling
-
-### JSON-only response with output_config
-
-Force agent to emit {"action":"N","reason":"…"} via structured outputs.
-
-**Pros**
-
-- Schema-validated
-- Reasoning captured in same payload
-
-**Cons**
-
-- Some providers do not honor strict mode
-- Slightly more setup than tool-call
-
-## Argument
-
-Tool-calling is the cleanest contract: the model gets free-form reasoning in `content` AND a strict-enum action in `tool_calls`. We can log both, and extending to new actions later is just adding enum values. Plain-text parsing trades 100 tokens of savings for a constant brittleness tax.
-
-## Implications
-
-- Define tool `do_action` with input_schema requiring `direction` in {N,S,E,W,noop}.
-- Use tool_choice="required" each tick to force a call.
-- Log the assistant message text (the reasoning) alongside the chosen direction for replay/debug.
-
-## Sign-off
-
-- **By:** kj (human)
-- **At:** 2026-05-17T04:13:38.690Z
diff --git a/benchmarks/roguelike-ai-poc/reference/decisions/0004-define-the-tick-loop-and-termination-conditions.json b/benchmarks/roguelike-ai-poc/reference/decisions/0004-define-the-tick-loop-and-termination-conditions.json
deleted file mode 100644
index 4f6becd..0000000
--- a/benchmarks/roguelike-ai-poc/reference/decisions/0004-define-the-tick-loop-and-termination-conditions.json
+++ /dev/null
@@ -1,68 +0,0 @@
-{
-  "id": "0004-define-the-tick-loop-and-termination-conditions",
-  "number": 4,
-  "slug": "define-the-tick-loop-and-termination-conditions",
-  "title": "Define the tick loop and termination conditions",
-  "status": "accepted",
-  "template_variant": "architecture",
-  "created_at": "2026-05-17T04:13:38.691Z",
-  "updated_at": "2026-05-17T04:13:38.692Z",
-  "summary": "How the game advances tick by tick, when it stops, and how the user observes it.",
-  "issue": "With an LLM in the loop, each tick is slow (~2-5s). We need a predictable loop with hard stops so the POC always terminates and is always watchable.",
-  "assumptions": [
-    "One-player synchronous game",
-    "User runs the script in a terminal and watches frames",
-    "LLM calls happen on the same thread"
-  ],
-  "constraints": [
-    "Must terminate on win, death, or step limit",
-    "Frame must visibly update each tick",
-    "Must not deadlock on a stuck agent"
-  ],
-  "positions": [
-    {
-      "title": "Synchronous loop with step cap",
-      "description": "while not terminal: render → ask agent → apply → check win/death. Hard cap at N steps (e.g., 50).",
-      "pros": [
-        "Simplest mental model",
-        "Easy to log",
-        "Predictable termination"
-      ],
-      "cons": [
-        "UI freezes during LLM call (acceptable for POC)"
-      ],
-      "links": []
-    },
-    {
-      "title": "Async loop with timeout per tick",
-      "description": "Wrap each agent call in a 10s timeout; on timeout, treat as noop.",
-      "pros": [
-        "Robust to slow API",
-        "Game keeps moving"
-      ],
-      "cons": [
-        "More complex",
-        "Asyncio inside a CLI script is heavier than warranted"
-      ],
-      "links": []
-    }
-  ],
-  "opinions": [],
-  "argument": "For a single-window terminal demo, synchronous is fine. Adding asyncio doubles the code size for no demo-visible benefit. The step cap protects against an agent that wanders forever and ensures every run terminates.",
-  "selected_position": "Synchronous loop with step cap",
-  "implications": [
-    "Step cap = 50; on cap, exit with status \"timeout\" and final HP.",
-    "Use time.sleep(0.05) after each render so the user can see the frames advance.",
-    "Loop logs each tick to stdout: frame, action, reasoning, hp, tick#."
-  ],
-  "depends_on": [],
-  "related_decisions": [],
-  "related_artifacts": [],
-  "review": [],
-  "sign_off": {
-    "by": "human",
-    "actor": "kj",
-    "at": "2026-05-17T04:13:38.692Z"
-  },
-  "tags": []
-}
diff --git a/benchmarks/roguelike-ai-poc/reference/decisions/0004-define-the-tick-loop-and-termination-conditions.md b/benchmarks/roguelike-ai-poc/reference/decisions/0004-define-the-tick-loop-and-termination-conditions.md
deleted file mode 100644
index 0d83a25..0000000
--- a/benchmarks/roguelike-ai-poc/reference/decisions/0004-define-the-tick-loop-and-termination-conditions.md
+++ /dev/null
@@ -1,74 +0,0 @@
-# 0004-define-the-tick-loop-and-termination-conditions — Define the tick loop and termination conditions
-
-| Field | Value |
-| --- | --- |
-| Status | `accepted` |
-| Template | `architecture` |
-| Updated | 2026-05-17T04:13:38.692Z |
-| Selected | **Synchronous loop with step cap** |
-| Depends on | _(none)_ |
-
-## Summary
-
-How the game advances tick by tick, when it stops, and how the user observes it.
-
-## Issue
-
-With an LLM in the loop, each tick is slow (~2-5s). We need a predictable loop with hard stops so the POC always terminates and is always watchable.
-
-## Assumptions
-
-- One-player synchronous game
-- User runs the script in a terminal and watches frames
-- LLM calls happen on the same thread
-
-## Constraints
-
-- Must terminate on win, death, or step limit
-- Frame must visibly update each tick
-- Must not deadlock on a stuck agent
-
-## Positions
-
-### Synchronous loop with step cap ✅
-
-while not terminal: render → ask agent → apply → check win/death. Hard cap at N steps (e.g., 50).
-
-**Pros**
-
-- Simplest mental model
-- Easy to log
-- Predictable termination
-
-**Cons**
-
-- UI freezes during LLM call (acceptable for POC)
-
-### Async loop with timeout per tick
-
-Wrap each agent call in a 10s timeout; on timeout, treat as noop.
-
-**Pros**
-
-- Robust to slow API
-- Game keeps moving
-
-**Cons**
-
-- More complex
-- Asyncio inside a CLI script is heavier than warranted
-
-## Argument
-
-For a single-window terminal demo, synchronous is fine. Adding asyncio doubles the code size for no demo-visible benefit. The step cap protects against an agent that wanders forever and ensures every run terminates.
-
-## Implications
-
-- Step cap = 50; on cap, exit with status "timeout" and final HP.
-- Use time.sleep(0.05) after each render so the user can see the frames advance.
-- Loop logs each tick to stdout: frame, action, reasoning, hp, tick#.
-
-## Sign-off
-
-- **By:** kj (human)
-- **At:** 2026-05-17T04:13:38.692Z
diff --git a/benchmarks/roguelike-ai-poc/reference/events.jsonl b/benchmarks/roguelike-ai-poc/reference/events.jsonl
deleted file mode 100644
index 42ab62f..0000000
--- a/benchmarks/roguelike-ai-poc/reference/events.jsonl
+++ /dev/null
@@ -1,33 +0,0 @@
-{"at":"2026-05-17T04:12:02.030Z","actor":"agent","kind":"project_initialized","entity_kind":"project","entity_id":"ai-driven-roguelike-poc","payload":{"effort_level":"poc"}}
-{"at":"2026-05-17T04:12:40.988Z","actor":"agent","kind":"phase_advanced","entity_kind":"phase","entity_id":"scoping","payload":{"from":"intake","to":"scoping"}}
-{"at":"2026-05-17T04:12:40.991Z","actor":"agent","kind":"scope_updated","entity_kind":"project","entity_id":"ai-driven-roguelike-poc","payload":{"scope":{"in_scope":["A 10×10 ASCII-rendered single room with walls (#), floor (.), player (@), exit (>), and a hazard tile (X)","Tick-based game loop: each tick prints the frame, then queries the agent for one action","A small action vocabulary: move N/S/E/W and noop","Player has HP; stepping on hazard removes HP; reaching exit = win, HP=0 = death","Strategy prompt provided once at startup, fed to the agent as system prompt for every tick","LLM agent receives current frame + HP + tick number, returns a single action"],"out_of_scope":["Multiple rooms, dungeon generation, procedural levels","Combat with enemies, NPCs, monsters","Inventory, items, equipment","Save/load, persistence","Visual UI beyond ASCII to terminal","Multiplayer, networking","Self-improving agent loops or RL training"],"success_criteria":["A user can run a single command, supply a strategy prompt, and watch the agent play until win or death","Win and death paths both observed in manual playtests","Different strategy prompts produce visibly different agent behavior","End-to-end run completes in under 60 seconds wall time on a typical OpenAI API call"],"nice_to_have":["Configurable room layout from a text file","Replay log written to disk for post-hoc inspection","A few preset strategy prompts to demo (cautious, greedy, exploratory)"]}}}
-{"at":"2026-05-17T04:12:40.991Z","actor":"agent","kind":"phase_advanced","entity_kind":"phase","entity_id":"deciding","payload":{"from":"scoping","to":"deciding"}}
-{"at":"2026-05-17T04:13:38.681Z","actor":"agent","kind":"seed_loaded","entity_kind":"decision","entity_id":"0001-choose-the-implementation-language","payload":{"seed_name":"language-choice"}}
-{"at":"2026-05-17T04:13:38.684Z","actor":"agent","kind":"decision_updated","entity_kind":"decision","entity_id":"0001-choose-the-implementation-language","payload":{"changed":["argument","selected_position","implications"]}}
-{"at":"2026-05-17T04:13:38.685Z","actor":"human","actor_name":"kj","kind":"decision_accepted","entity_kind":"decision","entity_id":"0001-choose-the-implementation-language"}
-{"at":"2026-05-17T04:13:38.686Z","actor":"agent","kind":"decision_proposed","entity_kind":"decision","entity_id":"0002-define-the-world-representation-and-renderer","payload":{"template_variant":"data-model"}}
-{"at":"2026-05-17T04:13:38.687Z","actor":"agent","kind":"decision_updated","entity_kind":"decision","entity_id":"0002-define-the-world-representation-and-renderer","payload":{"changed":["argument","selected_position","implications"]}}
-{"at":"2026-05-17T04:13:38.688Z","actor":"human","actor_name":"kj","kind":"decision_accepted","entity_kind":"decision","entity_id":"0002-define-the-world-representation-and-renderer"}
-{"at":"2026-05-17T04:13:38.689Z","actor":"agent","kind":"decision_proposed","entity_kind":"decision","entity_id":"0003-define-the-agent-action-contract","payload":{"template_variant":"architecture"}}
-{"at":"2026-05-17T04:13:38.689Z","actor":"agent","kind":"decision_updated","entity_kind":"decision","entity_id":"0003-define-the-agent-action-contract","payload":{"changed":["argument","selected_position","implications"]}}
-{"at":"2026-05-17T04:13:38.690Z","actor":"human","actor_name":"kj","kind":"decision_accepted","entity_kind":"decision","entity_id":"0003-define-the-agent-action-contract"}
-{"at":"2026-05-17T04:13:38.691Z","actor":"agent","kind":"decision_proposed","entity_kind":"decision","entity_id":"0004-define-the-tick-loop-and-termination-conditions","payload":{"template_variant":"architecture"}}
-{"at":"2026-05-17T04:13:38.692Z","actor":"agent","kind":"decision_updated","entity_kind":"decision","entity_id":"0004-define-the-tick-loop-and-termination-conditions","payload":{"changed":["argument","selected_position","implications"]}}
-{"at":"2026-05-17T04:13:38.692Z","actor":"human","actor_name":"kj","kind":"decision_accepted","entity_kind":"decision","entity_id":"0004-define-the-tick-loop-and-termination-conditions"}
-{"at":"2026-05-17T04:13:38.694Z","actor":"agent","kind":"phase_advanced","entity_kind":"phase","entity_id":"decomposing","payload":{"from":"deciding","to":"decomposing"}}
-{"at":"2026-05-17T04:14:22.524Z","actor":"agent","kind":"task_proposed","entity_kind":"task","entity_id":"T0001-bootstrap-repository","payload":{"decision_refs":["0001-choose-the-implementation-language"],"depends_on":[]}}
-{"at":"2026-05-17T04:14:22.526Z","actor":"agent","kind":"task_proposed","entity_kind":"task","entity_id":"T0002-implement-world-module-tile-grid-entity-dict","payload":{"decision_refs":["0002-define-the-world-representation-and-renderer"],"depends_on":["T0001-bootstrap-repository"]}}
-{"at":"2026-05-17T04:14:22.527Z","actor":"agent","kind":"task_proposed","entity_kind":"task","entity_id":"T0003-implement-frame-renderer","payload":{"decision_refs":["0002-define-the-world-representation-and-renderer"],"depends_on":["T0002-implement-world-module-tile-grid-entity-dict"]}}
-{"at":"2026-05-17T04:14:22.528Z","actor":"agent","kind":"task_proposed","entity_kind":"task","entity_id":"T0004-implement-openai-agent-client","payload":{"decision_refs":["0003-define-the-agent-action-contract"],"depends_on":["T0001-bootstrap-repository"]}}
-{"at":"2026-05-17T04:14:22.529Z","actor":"agent","kind":"task_proposed","entity_kind":"task","entity_id":"T0005-implement-action-handlers-and-termination-checks","payload":{"decision_refs":["0002-define-the-world-representation-and-renderer"],"depends_on":["T0002-implement-world-module-tile-grid-entity-dict"]}}
-{"at":"2026-05-17T04:14:22.530Z","actor":"agent","kind":"task_proposed","entity_kind":"task","entity_id":"T0006-implement-the-tick-based-game-loop","payload":{"decision_refs":["0004-define-the-tick-loop-and-termination-conditions","0002-define-the-world-representation-and-renderer"],"depends_on":["T0003-implement-frame-renderer","T0004-implement-openai-agent-client","T0005-implement-action-handlers-and-termination-checks"]}}
-{"at":"2026-05-17T04:14:22.532Z","actor":"agent","kind":"task_proposed","entity_kind":"task","entity_id":"T0007-implement-cli-entry-script","payload":{"decision_refs":["0001-choose-the-implementation-language","0004-define-the-tick-loop-and-termination-conditions"],"depends_on":["T0006-implement-the-tick-based-game-loop"]}}
-{"at":"2026-05-17T04:14:22.534Z","actor":"agent","kind":"graph_validated","payload":{"valid":true,"task_count":7,"error_count":0,"warning_count":0}}
-{"at":"2026-05-17T04:14:30.972Z","actor":"agent","kind":"graph_validated","payload":{"valid":true,"task_count":7,"error_count":0,"warning_count":0}}
-{"at":"2026-05-17T04:14:37.477Z","actor":"agent","kind":"graph_validated","payload":{"valid":true,"task_count":7,"error_count":0,"warning_count":0}}
-{"at":"2026-05-17T04:14:44.523Z","actor":"human","actor_name":"kj","kind":"phase_advanced","entity_kind":"phase","entity_id":"handing-off","payload":{"from":"decomposing","to":"handing-off","notes":"All decisions accepted, graph validates clean."}}
-{"at":"2026-05-17T04:14:44.523Z","actor":"human","actor_name":"kj","kind":"sign_off_recorded","entity_kind":"phase","entity_id":"handing-off"}
-{"at":"2026-05-17T04:14:44.538Z","actor":"agent","kind":"render_run","payload":{"decisions":4,"tasks":7}}
-{"at":"2026-05-17T04:14:44.540Z","actor":"human","actor_name":"kj","kind":"export_started","entity_kind":"project","entity_id":"ai-driven-roguelike-poc","payload":{"target":"filesystem"}}
-{"at":"2026-05-17T04:14:44.540Z","actor":"human","actor_name":"kj","kind":"export_completed","entity_kind":"project","entity_id":"ai-driven-roguelike-poc","payload":{"target":"filesystem","issue_count":7,"document_count":4}}
-{"at":"2026-05-17T04:14:44.544Z","actor":"agent","kind":"render_run","payload":{"decisions":4,"tasks":7}}
diff --git a/benchmarks/roguelike-ai-poc/reference/index.html b/benchmarks/roguelike-ai-poc/reference/index.html
deleted file mode 100644
index 75276fc..0000000
--- a/benchmarks/roguelike-ai-poc/reference/index.html
+++ /dev/null
@@ -1,231 +0,0 @@
-<!doctype html>
-<html lang="en">
-<head>
-<meta charset="utf-8">
-<meta name="viewport" content="width=device-width,initial-scale=1">
-<title>AI-driven roguelike POC — Decision Record</title>
-<style>:root {
-  --bg: #fafafa;
-  --fg: #1a1a1a;
-  --muted: #6b7280;
-  --border: #e5e7eb;
-  --accent: #4f46e5;
-  --status-rfc: #fbbf24;
-  --status-proposed: #60a5fa;
-  --status-accepted: #34d399;
-  --status-rejected: #f87171;
-  --status-deprecated: #9ca3af;
-  --status-superseded: #c084fc;
-  --task-open: #9ca3af;
-  --task-ready: #60a5fa;
-  --task-in_progress: #fbbf24;
-  --task-done: #34d399;
-  --task-blocked: #f87171;
-  --task-deferred: #c084fc;
-}
-* { box-sizing: border-box; }
-body { font-family: -apple-system, BlinkMacSystemFont, "Segoe UI", Roboto, sans-serif; line-height: 1.5; margin: 0; padding: 2rem; background: var(--bg); color: var(--fg); }
-h1, h2, h3 { margin-top: 1.5rem; }
-.container { max-width: 1100px; margin: 0 auto; }
-.header { border-bottom: 1px solid var(--border); padding-bottom: 1rem; margin-bottom: 1.5rem; }
-.meta { display: flex; flex-wrap: wrap; gap: 0.5rem 1rem; color: var(--muted); font-size: 0.9rem; }
-.meta b { color: var(--fg); }
-.pill { display: inline-block; padding: 0.15rem 0.6rem; border-radius: 999px; font-size: 0.75rem; font-weight: 600; color: white; }
-.pill-rfc { background: var(--status-rfc); }
-.pill-proposed { background: var(--status-proposed); }
-.pill-accepted { background: var(--status-accepted); }
-.pill-rejected { background: var(--status-rejected); }
-.pill-deprecated { background: var(--status-deprecated); }
-.pill-superseded { background: var(--status-superseded); }
-.pill-task-open { background: var(--task-open); }
-.pill-task-ready { background: var(--task-ready); }
-.pill-task-in_progress { background: var(--task-in_progress); }
-.pill-task-done { background: var(--task-done); }
-.pill-task-blocked { background: var(--task-blocked); }
-.pill-task-deferred { background: var(--task-deferred); }
-table { width: 100%; border-collapse: collapse; margin-top: 1rem; background: white; box-shadow: 0 1px 3px rgba(0,0,0,0.05); }
-th, td { text-align: left; padding: 0.5rem 0.75rem; border-bottom: 1px solid var(--border); font-size: 0.9rem; vertical-align: top; }
-th { background: #f3f4f6; font-weight: 600; font-size: 0.8rem; text-transform: uppercase; letter-spacing: 0.04em; color: var(--muted); }
-tr:last-child td { border-bottom: none; }
-.scope { background: white; border: 1px solid var(--border); border-radius: 8px; padding: 1rem; margin-top: 1rem; }
-.scope-list { display: grid; grid-template-columns: repeat(auto-fit, minmax(220px, 1fr)); gap: 1rem; }
-.scope-list section { background: #f9fafb; padding: 0.75rem; border-radius: 6px; }
-.scope-list h4 { margin: 0 0 0.5rem; font-size: 0.85rem; color: var(--muted); text-transform: uppercase; letter-spacing: 0.04em; }
-.scope-list ul { margin: 0; padding-left: 1.25rem; }
-.scope-list li { margin: 0.15rem 0; font-size: 0.9rem; }
-.empty { color: var(--muted); font-style: italic; }
-a { color: var(--accent); text-decoration: none; }
-a:hover { text-decoration: underline; }
-.dep-list { color: var(--muted); font-size: 0.8rem; }
-.code { font-family: ui-monospace, "SF Mono", monospace; font-size: 0.85em; background: #f3f4f6; padding: 0.1rem 0.4rem; border-radius: 4px; }
-.handoff { background: #eef2ff; border: 1px solid #c7d2fe; border-radius: 8px; padding: 1rem; margin-top: 1rem; }
-.handoff h3 { margin-top: 0; color: var(--accent); }
-.footer { margin-top: 3rem; padding-top: 1rem; border-top: 1px solid var(--border); color: var(--muted); font-size: 0.8rem; }</style>
-</head>
-<body>
-<div class="container">
-
-  <header class="header">
-    <div class="meta"><span class="code">ai-driven-roguelike-poc</span></div>
-    <h1>AI-driven roguelike POC</h1>
-    <div class="meta">
-      <span><b>Phase:</b> <span class="code">handed-off</span></span>
-      <span><b>Effort:</b> <span class="code">poc</span></span>
-      <span><b>Updated:</b> 2026-05-17T04:14:44.540Z</span>
-      <span><b>Decisions:</b> 4 (4 accepted)</span>
-      <span><b>Tasks:</b> 7 (0 done)</span>
-    </div>
-  </header>
-
-  <p>A minimal roguelike where the player primes an AI agent with a strategy, then the agent autonomously navigates a single ASCII-rendered room over a tick system until it wins the objective or dies. Goal: prove the agent-as-player concept with the smallest viable surface area.</p>
-
-  <div class="scope">
-    <h3>Scope</h3>
-    <div class="scope-list">
-      <section>
-        <h4>In scope</h4>
-        <ul><li>A 10×10 ASCII-rendered single room with walls (#), floor (.), player (@), exit (&gt;), and a hazard tile (X)</li><li>Tick-based game loop: each tick prints the frame, then queries the agent for one action</li><li>A small action vocabulary: move N/S/E/W and noop</li><li>Player has HP; stepping on hazard removes HP; reaching exit = win, HP=0 = death</li><li>Strategy prompt provided once at startup, fed to the agent as system prompt for every tick</li><li>LLM agent receives current frame + HP + tick number, returns a single action</li></ul>
-      </section><section>
-        <h4>Success criteria</h4>
-        <ul><li>A user can run a single command, supply a strategy prompt, and watch the agent play until win or death</li><li>Win and death paths both observed in manual playtests</li><li>Different strategy prompts produce visibly different agent behavior</li><li>End-to-end run completes in under 60 seconds wall time on a typical OpenAI API call</li></ul>
-      </section><section>
-        <h4>Out of scope</h4>
-        <ul><li>Multiple rooms, dungeon generation, procedural levels</li><li>Combat with enemies, NPCs, monsters</li><li>Inventory, items, equipment</li><li>Save/load, persistence</li><li>Visual UI beyond ASCII to terminal</li><li>Multiplayer, networking</li><li>Self-improving agent loops or RL training</li></ul>
-      </section><section>
-        <h4>Nice to have</h4>
-        <ul><li>Configurable room layout from a text file</li><li>Replay log written to disk for post-hoc inspection</li><li>A few preset strategy prompts to demo (cautious, greedy, exploratory)</li></ul>
-      </section>
-    </div>
-  </div>
-  <div class="handoff">
-    <h3>Handed off</h3>
-    <div class="meta">
-      <span><b>Target:</b> <span class="code">filesystem</span></span>
-      <span><b>At:</b> 2026-05-17T04:14:44.540Z</span>
-      
-      
-    </div>
-  </div>
-
-  <h2>Decisions</h2>
-  <table>
-    <thead>
-      <tr>
-        <th>ID</th>
-        <th>Title</th>
-        <th>Status</th>
-        <th>Selected</th>
-        <th>Depends on</th>
-      </tr>
-    </thead>
-    <tbody>
-      <tr>
-        <td><a href="decisions/0001-choose-the-implementation-language.md"><span class="code">0001-choose-the-implementation-language</span></a></td>
-        <td>Choose the implementation language <span class="dep-list">[architecture]</span></td>
-        <td><span class="pill pill-accepted">accepted</span></td>
-        <td>Python</td>
-        <td><span class="empty">—</span></td>
-      </tr><tr>
-        <td><a href="decisions/0002-define-the-world-representation-and-renderer.md"><span class="code">0002-define-the-world-representation-and-renderer</span></a></td>
-        <td>Define the world representation and renderer <span class="dep-list">[data-model]</span></td>
-        <td><span class="pill pill-accepted">accepted</span></td>
-        <td>Tile-grid + entity dict</td>
-        <td><span class="empty">—</span></td>
-      </tr><tr>
-        <td><a href="decisions/0003-define-the-agent-action-contract.md"><span class="code">0003-define-the-agent-action-contract</span></a></td>
-        <td>Define the agent action contract <span class="dep-list">[architecture]</span></td>
-        <td><span class="pill pill-accepted">accepted</span></td>
-        <td>Tool-call (function calling) with one tool: do_action(direction)</td>
-        <td><span class="empty">—</span></td>
-      </tr><tr>
-        <td><a href="decisions/0004-define-the-tick-loop-and-termination-conditions.md"><span class="code">0004-define-the-tick-loop-and-termination-conditions</span></a></td>
-        <td>Define the tick loop and termination conditions <span class="dep-list">[architecture]</span></td>
-        <td><span class="pill pill-accepted">accepted</span></td>
-        <td>Synchronous loop with step cap</td>
-        <td><span class="empty">—</span></td>
-      </tr>
-    </tbody>
-  </table>
-
-  <h2>Task graph</h2>
-  <table>
-    <thead>
-      <tr>
-        <th>ID</th>
-        <th>Title</th>
-        <th>Status</th>
-        <th>Pri</th>
-        <th>Estimate</th>
-        <th>Depends on</th>
-        <th>Decision refs</th>
-      </tr>
-    </thead>
-    <tbody>
-      <tr>
-        <td><a href="tasks/T0001-bootstrap-repository.md"><span class="code">T0001-bootstrap-repository</span></a></td>
-        <td>Bootstrap repository</td>
-        <td><span class="pill pill-task-ready">ready</span></td>
-        <td><span class="code">p0</span></td>
-        <td>1h</td>
-        <td><span class="empty">—</span></td>
-        <td><a href="decisions/0001-choose-the-implementation-language.md" title="Choose the implementation language"><span class="code">0001-choose-the-implementation-language</span></a></td>
-      </tr><tr>
-        <td><a href="tasks/T0002-implement-world-module-tile-grid-entity-dict.md"><span class="code">T0002-implement-world-module-tile-grid-entity-dict</span></a></td>
-        <td>Implement world module (tile grid + entity dict)</td>
-        <td><span class="pill pill-task-open">open</span></td>
-        <td><span class="code">p0</span></td>
-        <td>2h</td>
-        <td><span class="code">T0001-bootstrap-repository</span></td>
-        <td><a href="decisions/0002-define-the-world-representation-and-renderer.md" title="Define the world representation and renderer"><span class="code">0002-define-the-world-representation-and-renderer</span></a></td>
-      </tr><tr>
-        <td><a href="tasks/T0003-implement-frame-renderer.md"><span class="code">T0003-implement-frame-renderer</span></a></td>
-        <td>Implement frame renderer</td>
-        <td><span class="pill pill-task-open">open</span></td>
-        <td><span class="code">p0</span></td>
-        <td>1h</td>
-        <td><span class="code">T0002-implement-world-module-tile-grid-entity-dict</span></td>
-        <td><a href="decisions/0002-define-the-world-representation-and-renderer.md" title="Define the world representation and renderer"><span class="code">0002-define-the-world-representation-and-renderer</span></a></td>
-      </tr><tr>
-        <td><a href="tasks/T0004-implement-openai-agent-client.md"><span class="code">T0004-implement-openai-agent-client</span></a></td>
-        <td>Implement OpenAI agent client</td>
-        <td><span class="pill pill-task-open">open</span></td>
-        <td><span class="code">p0</span></td>
-        <td>2h</td>
-        <td><span class="code">T0001-bootstrap-repository</span></td>
-        <td><a href="decisions/0003-define-the-agent-action-contract.md" title="Define the agent action contract"><span class="code">0003-define-the-agent-action-contract</span></a></td>
-      </tr><tr>
-        <td><a href="tasks/T0005-implement-action-handlers-and-termination-checks.md"><span class="code">T0005-implement-action-handlers-and-termination-checks</span></a></td>
-        <td>Implement action handlers and termination checks</td>
-        <td><span class="pill pill-task-open">open</span></td>
-        <td><span class="code">p0</span></td>
-        <td>1h</td>
-        <td><span class="code">T0002-implement-world-module-tile-grid-entity-dict</span></td>
-        <td><a href="decisions/0002-define-the-world-representation-and-renderer.md" title="Define the world representation and renderer"><span class="code">0002-define-the-world-representation-and-renderer</span></a></td>
-      </tr><tr>
-        <td><a href="tasks/T0006-implement-the-tick-based-game-loop.md"><span class="code">T0006-implement-the-tick-based-game-loop</span></a></td>
-        <td>Implement the tick-based game loop</td>
-        <td><span class="pill pill-task-open">open</span></td>
-        <td><span class="code">p0</span></td>
-        <td>2h</td>
-        <td><span class="code">T0003-implement-frame-renderer</span> <span class="code">T0004-implement-openai-agent-client</span> <span class="code">T0005-implement-action-handlers-and-termination-checks</span></td>
-        <td><a href="decisions/0004-define-the-tick-loop-and-termination-conditions.md" title="Define the tick loop and termination conditions"><span class="code">0004-define-the-tick-loop-and-termination-conditions</span></a> <a href="decisions/0002-define-the-world-representation-and-renderer.md" title="Define the world representation and renderer"><span class="code">0002-define-the-world-representation-and-renderer</span></a></td>
-      </tr><tr>
-        <td><a href="tasks/T0007-implement-cli-entry-script.md"><span class="code">T0007-implement-cli-entry-script</span></a></td>
-        <td>Implement CLI entry script</td>
-        <td><span class="pill pill-task-open">open</span></td>
-        <td><span class="code">p0</span></td>
-        <td>1h</td>
-        <td><span class="code">T0006-implement-the-tick-based-game-loop</span></td>
-        <td><a href="decisions/0001-choose-the-implementation-language.md" title="Choose the implementation language"><span class="code">0001-choose-the-implementation-language</span></a> <a href="decisions/0004-define-the-tick-loop-and-termination-conditions.md" title="Define the tick loop and termination conditions"><span class="code">0004-define-the-tick-loop-and-termination-conditions</span></a></td>
-      </tr>
-    </tbody>
-  </table>
-
-  <footer class="footer">
-    Generated by <a href="https://github.com/protoLabsAI/decision-record">decision-record</a> ·
-    Last render: 2026-05-17T04:14:44.544Z
-  </footer>
-
-</div>
-</body>
-</html>
\ No newline at end of file
diff --git a/benchmarks/roguelike-ai-poc/reference/project.json b/benchmarks/roguelike-ai-poc/reference/project.json
deleted file mode 100644
index 3b4c9fb..0000000
--- a/benchmarks/roguelike-ai-poc/reference/project.json
+++ /dev/null
@@ -1,64 +0,0 @@
-{
-  "id": "ai-driven-roguelike-poc",
-  "title": "AI-driven roguelike POC",
-  "description": "A minimal roguelike where the player primes an AI agent with a strategy, then the agent autonomously navigates a single ASCII-rendered room over a tick system until it wins the objective or dies. Goal: prove the agent-as-player concept with the smallest viable surface area.",
-  "created_at": "2026-05-17T04:12:02.030Z",
-  "updated_at": "2026-05-17T04:14:44.540Z",
-  "effort_level": "poc",
-  "status": "handed-off",
-  "scope": {
-    "in_scope": [
-      "A 10×10 ASCII-rendered single room with walls (#), floor (.), player (@), exit (>), and a hazard tile (X)",
-      "Tick-based game loop: each tick prints the frame, then queries the agent for one action",
-      "A small action vocabulary: move N/S/E/W and noop",
-      "Player has HP; stepping on hazard removes HP; reaching exit = win, HP=0 = death",
-      "Strategy prompt provided once at startup, fed to the agent as system prompt for every tick",
-      "LLM agent receives current frame + HP + tick number, returns a single action"
-    ],
-    "out_of_scope": [
-      "Multiple rooms, dungeon generation, procedural levels",
-      "Combat with enemies, NPCs, monsters",
-      "Inventory, items, equipment",
-      "Save/load, persistence",
-      "Visual UI beyond ASCII to terminal",
-      "Multiplayer, networking",
-      "Self-improving agent loops or RL training"
-    ],
-    "success_criteria": [
-      "A user can run a single command, supply a strategy prompt, and watch the agent play until win or death",
-      "Win and death paths both observed in manual playtests",
-      "Different strategy prompts produce visibly different agent behavior",
-      "End-to-end run completes in under 60 seconds wall time on a typical OpenAI API call"
-    ],
-    "nice_to_have": [
-      "Configurable room layout from a text file",
-      "Replay log written to disk for post-hoc inspection",
-      "A few preset strategy prompts to demo (cautious, greedy, exploratory)"
-    ]
-  },
-  "sign_offs": [
-    {
-      "phase": "handing-off",
-      "by": "human",
-      "actor": "kj",
-      "at": "2026-05-17T04:14:44.523Z",
-      "notes": "All decisions accepted, graph validates clean."
-    },
-    {
-      "phase": "handing-off",
-      "by": "human",
-      "actor": "kj",
-      "at": "2026-05-17T04:14:44.540Z"
-    }
-  ],
-  "handoff": {
-    "target": "filesystem",
-    "exported_at": "2026-05-17T04:14:44.540Z",
-    "issue_count": 7,
-    "document_count": 4
-  },
-  "gate_config": {
-    "preset": "poc"
-  },
-  "tags": []
-}
diff --git a/benchmarks/roguelike-ai-poc/reference/project.md b/benchmarks/roguelike-ai-poc/reference/project.md
deleted file mode 100644
index 538b476..0000000
--- a/benchmarks/roguelike-ai-poc/reference/project.md
+++ /dev/null
@@ -1,64 +0,0 @@
-# AI-driven roguelike POC
-
-| Field | Value |
-| --- | --- |
-| ID | `ai-driven-roguelike-poc` |
-| Status | `handed-off` |
-| Effort level | `poc` |
-| Created | 2026-05-17T04:12:02.030Z |
-| Updated | 2026-05-17T04:14:44.540Z |
-| Decisions | 4 |
-| Tasks | 7 |
-
-## Description
-
-A minimal roguelike where the player primes an AI agent with a strategy, then the agent autonomously navigates a single ASCII-rendered room over a tick system until it wins the objective or dies. Goal: prove the agent-as-player concept with the smallest viable surface area.
-
-## Scope
-
-**In scope**
-
-- A 10×10 ASCII-rendered single room with walls (#), floor (.), player (@), exit (>), and a hazard tile (X)
-- Tick-based game loop: each tick prints the frame, then queries the agent for one action
-- A small action vocabulary: move N/S/E/W and noop
-- Player has HP; stepping on hazard removes HP; reaching exit = win, HP=0 = death
-- Strategy prompt provided once at startup, fed to the agent as system prompt for every tick
-- LLM agent receives current frame + HP + tick number, returns a single action
-
-**Success criteria**
-
-- A user can run a single command, supply a strategy prompt, and watch the agent play until win or death
-- Win and death paths both observed in manual playtests
-- Different strategy prompts produce visibly different agent behavior
-- End-to-end run completes in under 60 seconds wall time on a typical OpenAI API call
-
-**Out of scope**
-
-- Multiple rooms, dungeon generation, procedural levels
-- Combat with enemies, NPCs, monsters
-- Inventory, items, equipment
-- Save/load, persistence
-- Visual UI beyond ASCII to terminal
-- Multiplayer, networking
-- Self-improving agent loops or RL training
-
-**Nice to have**
-
-- Configurable room layout from a text file
-- Replay log written to disk for post-hoc inspection
-- A few preset strategy prompts to demo (cautious, greedy, exploratory)
-
-## Sign-offs
-
-- **handing-off** by kj (human) at 2026-05-17T04:14:44.523Z — All decisions accepted, graph validates clean.
-
-- **handing-off** by kj (human) at 2026-05-17T04:14:44.540Z
-
-## Handoff
-
-| Field | Value |
-| --- | --- |
-| Target | `filesystem` |
-| Exported at | 2026-05-17T04:14:44.540Z |
-| Target ID | — |
-| Target URL | — |
diff --git a/benchmarks/roguelike-ai-poc/reference/tasks/T0001-bootstrap-repository.json b/benchmarks/roguelike-ai-poc/reference/tasks/T0001-bootstrap-repository.json
deleted file mode 100644
index c433a10..0000000
--- a/benchmarks/roguelike-ai-poc/reference/tasks/T0001-bootstrap-repository.json
+++ /dev/null
@@ -1,30 +0,0 @@
-{
-  "id": "T0001-bootstrap-repository",
-  "number": 1,
-  "slug": "bootstrap-repository",
-  "title": "Bootstrap repository",
-  "description": "Initialize the Python project layout: pyproject.toml or requirements.txt with openai pin, a src/ module path, a README stub, and a .gitignore. Verify a `python -c \"import openai\"` succeeds in a fresh venv.",
-  "status": "ready",
-  "estimate": {
-    "unit": "hours",
-    "value": 1,
-    "confidence": "high"
-  },
-  "acceptance_criteria": [
-    "pyproject.toml or requirements.txt committed",
-    "openai SDK installable in a venv",
-    "README explains 30-second quickstart",
-    "python -c \"from src import __init__\" runs"
-  ],
-  "depends_on": [],
-  "decision_refs": [
-    "0001-choose-the-implementation-language"
-  ],
-  "priority": "p0",
-  "labels": [
-    "foundation"
-  ],
-  "assignee_hint": "agent",
-  "created_at": "2026-05-17T04:14:22.524Z",
-  "updated_at": "2026-05-17T04:14:22.524Z"
-}
diff --git a/benchmarks/roguelike-ai-poc/reference/tasks/T0001-bootstrap-repository.md b/benchmarks/roguelike-ai-poc/reference/tasks/T0001-bootstrap-repository.md
deleted file mode 100644
index 09effaa..0000000
--- a/benchmarks/roguelike-ai-poc/reference/tasks/T0001-bootstrap-repository.md
+++ /dev/null
@@ -1,23 +0,0 @@
-# T0001-bootstrap-repository — Bootstrap repository
-
-| Field | Value |
-| --- | --- |
-| Status | `ready` |
-| Priority | `p0` |
-| Estimate | 1 hours (high confidence) |
-| Depends on | _(none)_ |
-| Decision refs | `0001-choose-the-implementation-language` — Choose the implementation language |
-| Assignee hint | agent |
-| Labels | `foundation` |
-| Updated | 2026-05-17T04:14:22.524Z |
-
-## Description
-
-Initialize the Python project layout: pyproject.toml or requirements.txt with openai pin, a src/ module path, a README stub, and a .gitignore. Verify a `python -c "import openai"` succeeds in a fresh venv.
-
-## Acceptance criteria
-
-- [ ] pyproject.toml or requirements.txt committed
-- [ ] openai SDK installable in a venv
-- [ ] README explains 30-second quickstart
-- [ ] python -c "from src import __init__" runs
diff --git a/benchmarks/roguelike-ai-poc/reference/tasks/T0002-implement-world-module-tile-grid-entity-dict.json b/benchmarks/roguelike-ai-poc/reference/tasks/T0002-implement-world-module-tile-grid-entity-dict.json
deleted file mode 100644
index c7a6c75..0000000
--- a/benchmarks/roguelike-ai-poc/reference/tasks/T0002-implement-world-module-tile-grid-entity-dict.json
+++ /dev/null
@@ -1,32 +0,0 @@
-{
-  "id": "T0002-implement-world-module-tile-grid-entity-dict",
-  "number": 2,
-  "slug": "implement-world-module-tile-grid-entity-dict",
-  "title": "Implement world module (tile grid + entity dict)",
-  "description": "Build src/world.py: World dataclass with static_tiles: list[list[str]] and entities: dict[str, dict]. Provide constructors for a default 10×10 room (walls border, one hazard, one exit). Pure data and helpers; no rendering, no game logic.",
-  "status": "open",
-  "estimate": {
-    "unit": "hours",
-    "value": 2,
-    "confidence": "med"
-  },
-  "acceptance_criteria": [
-    "World.default_room() returns a valid 10x10 with #, ., X, > tiles",
-    "entities dict contains a player at a known spawn",
-    "is_walkable(x,y) returns False for walls, True for floor and hazard",
-    "unit test: default room is fully walkable from spawn to exit"
-  ],
-  "depends_on": [
-    "T0001-bootstrap-repository"
-  ],
-  "decision_refs": [
-    "0002-define-the-world-representation-and-renderer"
-  ],
-  "priority": "p0",
-  "labels": [
-    "core"
-  ],
-  "assignee_hint": "agent",
-  "created_at": "2026-05-17T04:14:22.526Z",
-  "updated_at": "2026-05-17T04:14:22.526Z"
-}
diff --git a/benchmarks/roguelike-ai-poc/reference/tasks/T0002-implement-world-module-tile-grid-entity-dict.md b/benchmarks/roguelike-ai-poc/reference/tasks/T0002-implement-world-module-tile-grid-entity-dict.md
deleted file mode 100644
index ff06ca3..0000000
--- a/benchmarks/roguelike-ai-poc/reference/tasks/T0002-implement-world-module-tile-grid-entity-dict.md
+++ /dev/null
@@ -1,23 +0,0 @@
-# T0002-implement-world-module-tile-grid-entity-dict — Implement world module (tile grid + entity dict)
-
-| Field | Value |
-| --- | --- |
-| Status | `open` |
-| Priority | `p0` |
-| Estimate | 2 hours (med confidence) |
-| Depends on | `T0001-bootstrap-repository` |
-| Decision refs | `0002-define-the-world-representation-and-renderer` — Define the world representation and renderer |
-| Assignee hint | agent |
-| Labels | `core` |
-| Updated | 2026-05-17T04:14:22.526Z |
-
-## Description
-
-Build src/world.py: World dataclass with static_tiles: list[list[str]] and entities: dict[str, dict]. Provide constructors for a default 10×10 room (walls border, one hazard, one exit). Pure data and helpers; no rendering, no game logic.
-
-## Acceptance criteria
-
-- [ ] World.default_room() returns a valid 10x10 with #, ., X, > tiles
-- [ ] entities dict contains a player at a known spawn
-- [ ] is_walkable(x,y) returns False for walls, True for floor and hazard
-- [ ] unit test: default room is fully walkable from spawn to exit
diff --git a/benchmarks/roguelike-ai-poc/reference/tasks/T0003-implement-frame-renderer.json b/benchmarks/roguelike-ai-poc/reference/tasks/T0003-implement-frame-renderer.json
deleted file mode 100644
index 0caf6b1..0000000
--- a/benchmarks/roguelike-ai-poc/reference/tasks/T0003-implement-frame-renderer.json
+++ /dev/null
@@ -1,32 +0,0 @@
-{
-  "id": "T0003-implement-frame-renderer",
-  "number": 3,
-  "slug": "implement-frame-renderer",
-  "title": "Implement frame renderer",
-  "description": "Build src/render.py: render_frame(world) -> list[str]. Compose static_tiles + entity glyphs (entity overrides tile). Provide a small HUD line below the frame showing tick number, HP, and last action. Return as list of strings so the game loop can join + print or send to LLM.",
-  "status": "open",
-  "estimate": {
-    "unit": "hours",
-    "value": 1,
-    "confidence": "high"
-  },
-  "acceptance_criteria": [
-    "render_frame returns 10 strings of length 10",
-    "player @ is visible at its current position",
-    "HUD line includes tick, hp, last_action",
-    "manual visual check: frame looks like a roguelike room"
-  ],
-  "depends_on": [
-    "T0002-implement-world-module-tile-grid-entity-dict"
-  ],
-  "decision_refs": [
-    "0002-define-the-world-representation-and-renderer"
-  ],
-  "priority": "p0",
-  "labels": [
-    "core"
-  ],
-  "assignee_hint": "agent",
-  "created_at": "2026-05-17T04:14:22.527Z",
-  "updated_at": "2026-05-17T04:14:22.527Z"
-}
diff --git a/benchmarks/roguelike-ai-poc/reference/tasks/T0003-implement-frame-renderer.md b/benchmarks/roguelike-ai-poc/reference/tasks/T0003-implement-frame-renderer.md
deleted file mode 100644
index 8bfc535..0000000
--- a/benchmarks/roguelike-ai-poc/reference/tasks/T0003-implement-frame-renderer.md
+++ /dev/null
@@ -1,23 +0,0 @@
-# T0003-implement-frame-renderer — Implement frame renderer
-
-| Field | Value |
-| --- | --- |
-| Status | `open` |
-| Priority | `p0` |
-| Estimate | 1 hours (high confidence) |
-| Depends on | `T0002-implement-world-module-tile-grid-entity-dict` |
-| Decision refs | `0002-define-the-world-representation-and-renderer` — Define the world representation and renderer |
-| Assignee hint | agent |
-| Labels | `core` |
-| Updated | 2026-05-17T04:14:22.527Z |
-
-## Description
-
-Build src/render.py: render_frame(world) -> list[str]. Compose static_tiles + entity glyphs (entity overrides tile). Provide a small HUD line below the frame showing tick number, HP, and last action. Return as list of strings so the game loop can join + print or send to LLM.
-
-## Acceptance criteria
-
-- [ ] render_frame returns 10 strings of length 10
-- [ ] player @ is visible at its current position
-- [ ] HUD line includes tick, hp, last_action
-- [ ] manual visual check: frame looks like a roguelike room
diff --git a/benchmarks/roguelike-ai-poc/reference/tasks/T0004-implement-openai-agent-client.json b/benchmarks/roguelike-ai-poc/reference/tasks/T0004-implement-openai-agent-client.json
deleted file mode 100644
index cdc8821..0000000
--- a/benchmarks/roguelike-ai-poc/reference/tasks/T0004-implement-openai-agent-client.json
+++ /dev/null
@@ -1,34 +0,0 @@
-{
-  "id": "T0004-implement-openai-agent-client",
-  "number": 4,
-  "slug": "implement-openai-agent-client",
-  "title": "Implement OpenAI agent client",
-  "description": "Build src/agent.py: AgentClient class with constructor(strategy_prompt, model, api_key). Single method choose_action(world_state_json, tick, hp) → (direction, reasoning). Uses tool-calling with one tool do_action(direction in {N,S,E,W,noop}); tool_choice=\"required\". Returns the chosen direction and the assistant message content as reasoning.",
-  "status": "open",
-  "estimate": {
-    "unit": "hours",
-    "value": 2,
-    "confidence": "med"
-  },
-  "acceptance_criteria": [
-    "AgentClient instantiates without making a call",
-    "choose_action returns a valid direction enum",
-    "reasoning is captured as a string (may be empty)",
-    "malformed responses raise a clear error (does not silently noop)",
-    "strategy_prompt is in the system role on every call"
-  ],
-  "depends_on": [
-    "T0001-bootstrap-repository"
-  ],
-  "decision_refs": [
-    "0003-define-the-agent-action-contract"
-  ],
-  "priority": "p0",
-  "labels": [
-    "llm",
-    "core"
-  ],
-  "assignee_hint": "agent",
-  "created_at": "2026-05-17T04:14:22.528Z",
-  "updated_at": "2026-05-17T04:14:22.528Z"
-}
diff --git a/benchmarks/roguelike-ai-poc/reference/tasks/T0004-implement-openai-agent-client.md b/benchmarks/roguelike-ai-poc/reference/tasks/T0004-implement-openai-agent-client.md
deleted file mode 100644
index 0244119..0000000
--- a/benchmarks/roguelike-ai-poc/reference/tasks/T0004-implement-openai-agent-client.md
+++ /dev/null
@@ -1,24 +0,0 @@
-# T0004-implement-openai-agent-client — Implement OpenAI agent client
-
-| Field | Value |
-| --- | --- |
-| Status | `open` |
-| Priority | `p0` |
-| Estimate | 2 hours (med confidence) |
-| Depends on | `T0001-bootstrap-repository` |
-| Decision refs | `0003-define-the-agent-action-contract` — Define the agent action contract |
-| Assignee hint | agent |
-| Labels | `llm`, `core` |
-| Updated | 2026-05-17T04:14:22.528Z |
-
-## Description
-
-Build src/agent.py: AgentClient class with constructor(strategy_prompt, model, api_key). Single method choose_action(world_state_json, tick, hp) → (direction, reasoning). Uses tool-calling with one tool do_action(direction in {N,S,E,W,noop}); tool_choice="required". Returns the chosen direction and the assistant message content as reasoning.
-
-## Acceptance criteria
-
-- [ ] AgentClient instantiates without making a call
-- [ ] choose_action returns a valid direction enum
-- [ ] reasoning is captured as a string (may be empty)
-- [ ] malformed responses raise a clear error (does not silently noop)
-- [ ] strategy_prompt is in the system role on every call
diff --git a/benchmarks/roguelike-ai-poc/reference/tasks/T0005-implement-action-handlers-and-termination-checks.json b/benchmarks/roguelike-ai-poc/reference/tasks/T0005-implement-action-handlers-and-termination-checks.json
deleted file mode 100644
index 20ad30f..0000000
--- a/benchmarks/roguelike-ai-poc/reference/tasks/T0005-implement-action-handlers-and-termination-checks.json
+++ /dev/null
@@ -1,33 +0,0 @@
-{
-  "id": "T0005-implement-action-handlers-and-termination-checks",
-  "number": 5,
-  "slug": "implement-action-handlers-and-termination-checks",
-  "title": "Implement action handlers and termination checks",
-  "description": "Build src/actions.py: apply_action(world, direction) -> ActionResult. Moves the player one cell if walkable; otherwise noop. Compute side effects: HP-1 when stepping onto hazard, win flag when player_pos == exit_pos, dead flag when HP <= 0. Return ActionResult dataclass with new_world, hp_delta, terminal, terminal_reason.",
-  "status": "open",
-  "estimate": {
-    "unit": "hours",
-    "value": 1,
-    "confidence": "high"
-  },
-  "acceptance_criteria": [
-    "Moving into a wall is a noop with no HP change",
-    "Moving onto hazard triggers hp_delta = -1",
-    "Moving onto exit triggers terminal=\"win\"",
-    "HP reaching 0 triggers terminal=\"death\"",
-    "Unit tests for each transition"
-  ],
-  "depends_on": [
-    "T0002-implement-world-module-tile-grid-entity-dict"
-  ],
-  "decision_refs": [
-    "0002-define-the-world-representation-and-renderer"
-  ],
-  "priority": "p0",
-  "labels": [
-    "core"
-  ],
-  "assignee_hint": "agent",
-  "created_at": "2026-05-17T04:14:22.529Z",
-  "updated_at": "2026-05-17T04:14:22.529Z"
-}
diff --git a/benchmarks/roguelike-ai-poc/reference/tasks/T0005-implement-action-handlers-and-termination-checks.md b/benchmarks/roguelike-ai-poc/reference/tasks/T0005-implement-action-handlers-and-termination-checks.md
deleted file mode 100644
index 5ad2496..0000000
--- a/benchmarks/roguelike-ai-poc/reference/tasks/T0005-implement-action-handlers-and-termination-checks.md
+++ /dev/null
@@ -1,24 +0,0 @@
-# T0005-implement-action-handlers-and-termination-checks — Implement action handlers and termination checks
-
-| Field | Value |
-| --- | --- |
-| Status | `open` |
-| Priority | `p0` |
-| Estimate | 1 hours (high confidence) |
-| Depends on | `T0002-implement-world-module-tile-grid-entity-dict` |
-| Decision refs | `0002-define-the-world-representation-and-renderer` — Define the world representation and renderer |
-| Assignee hint | agent |
-| Labels | `core` |
-| Updated | 2026-05-17T04:14:22.529Z |
-
-## Description
-
-Build src/actions.py: apply_action(world, direction) -> ActionResult. Moves the player one cell if walkable; otherwise noop. Compute side effects: HP-1 when stepping onto hazard, win flag when player_pos == exit_pos, dead flag when HP <= 0. Return ActionResult dataclass with new_world, hp_delta, terminal, terminal_reason.
-
-## Acceptance criteria
-
-- [ ] Moving into a wall is a noop with no HP change
-- [ ] Moving onto hazard triggers hp_delta = -1
-- [ ] Moving onto exit triggers terminal="win"
-- [ ] HP reaching 0 triggers terminal="death"
-- [ ] Unit tests for each transition
diff --git a/benchmarks/roguelike-ai-poc/reference/tasks/T0006-implement-the-tick-based-game-loop.json b/benchmarks/roguelike-ai-poc/reference/tasks/T0006-implement-the-tick-based-game-loop.json
deleted file mode 100644
index 129cd6b..0000000
--- a/benchmarks/roguelike-ai-poc/reference/tasks/T0006-implement-the-tick-based-game-loop.json
+++ /dev/null
@@ -1,35 +0,0 @@
-{
-  "id": "T0006-implement-the-tick-based-game-loop",
-  "number": 6,
-  "slug": "implement-the-tick-based-game-loop",
-  "title": "Implement the tick-based game loop",
-  "description": "Build src/loop.py: run_game(world, agent_client, max_steps=50). Each iteration: render frame, call agent_client.choose_action, apply action, check terminal, sleep 0.05s, repeat. Logs each tick: tick#, frame, action, reasoning excerpt, hp. Exits on terminal or step cap; returns final state + reason.",
-  "status": "open",
-  "estimate": {
-    "unit": "hours",
-    "value": 2,
-    "confidence": "med"
-  },
-  "acceptance_criteria": [
-    "Loop terminates on win, death, or step cap (≤50)",
-    "Each tick prints the frame and HUD to stdout",
-    "Final summary line shows reason and step count",
-    "No exceptions leak from agent timeouts/errors (logged and treated as noop)"
-  ],
-  "depends_on": [
-    "T0003-implement-frame-renderer",
-    "T0004-implement-openai-agent-client",
-    "T0005-implement-action-handlers-and-termination-checks"
-  ],
-  "decision_refs": [
-    "0004-define-the-tick-loop-and-termination-conditions",
-    "0002-define-the-world-representation-and-renderer"
-  ],
-  "priority": "p0",
-  "labels": [
-    "core"
-  ],
-  "assignee_hint": "agent",
-  "created_at": "2026-05-17T04:14:22.530Z",
-  "updated_at": "2026-05-17T04:14:22.530Z"
-}
diff --git a/benchmarks/roguelike-ai-poc/reference/tasks/T0006-implement-the-tick-based-game-loop.md b/benchmarks/roguelike-ai-poc/reference/tasks/T0006-implement-the-tick-based-game-loop.md
deleted file mode 100644
index 3338646..0000000
--- a/benchmarks/roguelike-ai-poc/reference/tasks/T0006-implement-the-tick-based-game-loop.md
+++ /dev/null
@@ -1,23 +0,0 @@
-# T0006-implement-the-tick-based-game-loop — Implement the tick-based game loop
-
-| Field | Value |
-| --- | --- |
-| Status | `open` |
-| Priority | `p0` |
-| Estimate | 2 hours (med confidence) |
-| Depends on | `T0003-implement-frame-renderer`, `T0004-implement-openai-agent-client`, `T0005-implement-action-handlers-and-termination-checks` |
-| Decision refs | `0004-define-the-tick-loop-and-termination-conditions` — Define the tick loop and termination conditions; `0002-define-the-world-representation-and-renderer` — Define the world representation and renderer |
-| Assignee hint | agent |
-| Labels | `core` |
-| Updated | 2026-05-17T04:14:22.530Z |
-
-## Description
-
-Build src/loop.py: run_game(world, agent_client, max_steps=50). Each iteration: render frame, call agent_client.choose_action, apply action, check terminal, sleep 0.05s, repeat. Logs each tick: tick#, frame, action, reasoning excerpt, hp. Exits on terminal or step cap; returns final state + reason.
-
-## Acceptance criteria
-
-- [ ] Loop terminates on win, death, or step cap (≤50)
-- [ ] Each tick prints the frame and HUD to stdout
-- [ ] Final summary line shows reason and step count
-- [ ] No exceptions leak from agent timeouts/errors (logged and treated as noop)
diff --git a/benchmarks/roguelike-ai-poc/reference/tasks/T0007-implement-cli-entry-script.json b/benchmarks/roguelike-ai-poc/reference/tasks/T0007-implement-cli-entry-script.json
deleted file mode 100644
index 030f430..0000000
--- a/benchmarks/roguelike-ai-poc/reference/tasks/T0007-implement-cli-entry-script.json
+++ /dev/null
@@ -1,33 +0,0 @@
-{
-  "id": "T0007-implement-cli-entry-script",
-  "number": 7,
-  "slug": "implement-cli-entry-script",
-  "title": "Implement CLI entry script",
-  "description": "Build src/__main__.py: argparse for --strategy (or read from stdin), --model (default gpt-4o), --max-steps (default 50). Construct AgentClient, build default room, call run_game. Print the final outcome. Document the env vars (OPENAI_API_KEY) and a sample invocation in README.",
-  "status": "open",
-  "estimate": {
-    "unit": "hours",
-    "value": 1,
-    "confidence": "high"
-  },
-  "acceptance_criteria": [
-    "python -m src --strategy \"cautious explorer\" runs end-to-end",
-    "README has a complete example invocation",
-    "--help prints usage",
-    "Exit code 0 on win/timeout, 1 on death (so scripts can chain)"
-  ],
-  "depends_on": [
-    "T0006-implement-the-tick-based-game-loop"
-  ],
-  "decision_refs": [
-    "0001-choose-the-implementation-language",
-    "0004-define-the-tick-loop-and-termination-conditions"
-  ],
-  "priority": "p0",
-  "labels": [
-    "cli"
-  ],
-  "assignee_hint": "agent",
-  "created_at": "2026-05-17T04:14:22.532Z",
-  "updated_at": "2026-05-17T04:14:22.532Z"
-}
diff --git a/benchmarks/roguelike-ai-poc/reference/tasks/T0007-implement-cli-entry-script.md b/benchmarks/roguelike-ai-poc/reference/tasks/T0007-implement-cli-entry-script.md
deleted file mode 100644
index ba9f268..0000000
--- a/benchmarks/roguelike-ai-poc/reference/tasks/T0007-implement-cli-entry-script.md
+++ /dev/null
@@ -1,23 +0,0 @@
-# T0007-implement-cli-entry-script — Implement CLI entry script
-
-| Field | Value |
-| --- | --- |
-| Status | `open` |
-| Priority | `p0` |
-| Estimate | 1 hours (high confidence) |
-| Depends on | `T0006-implement-the-tick-based-game-loop` |
-| Decision refs | `0001-choose-the-implementation-language` — Choose the implementation language; `0004-define-the-tick-loop-and-termination-conditions` — Define the tick loop and termination conditions |
-| Assignee hint | agent |
-| Labels | `cli` |
-| Updated | 2026-05-17T04:14:22.532Z |
-
-## Description
-
-Build src/__main__.py: argparse for --strategy (or read from stdin), --model (default gpt-4o), --max-steps (default 50). Construct AgentClient, build default room, call run_game. Print the final outcome. Document the env vars (OPENAI_API_KEY) and a sample invocation in README.
-
-## Acceptance criteria
-
-- [ ] python -m src --strategy "cautious explorer" runs end-to-end
-- [ ] README has a complete example invocation
-- [ ] --help prints usage
-- [ ] Exit code 0 on win/timeout, 1 on death (so scripts can chain)
diff --git a/benchmarks/roguelike-ai-poc/run.sh b/benchmarks/roguelike-ai-poc/run.sh
deleted file mode 100755
index 67915d1..0000000
--- a/benchmarks/roguelike-ai-poc/run.sh
+++ /dev/null
@@ -1,35 +0,0 @@
-#!/usr/bin/env bash
-# Run the roguelike-ai-poc benchmark prompt against a fresh tmp dir.
-# Requires OPENAI_API_KEY in the environment.
-# Usage:
-#   ./run.sh                            # run with defaults
-#   OUT=./my-output ./run.sh            # specify output dir
-#   MODEL=gpt-4o-mini ./run.sh          # override model
-
-set -euo pipefail
-
-if [[ -z "${OPENAI_API_KEY:-}" ]]; then
-  echo "OPENAI_API_KEY not set — refusing to run." >&2
-  exit 2
-fi
-
-HERE="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
-REPO_ROOT="$(cd "$HERE/../.." && pwd)"
-OUT="${OUT:-$(mktemp -d -t dr-bench-roguelike-XXXX)}"
-
-DESCRIPTION="A minimal roguelike where the player primes an AI agent with a strategy, then the agent autonomously navigates a single ASCII-rendered room over a tick system until it wins the objective or dies. Goal: prove the agent-as-player concept with the smallest viable surface area."
-
-cd "$REPO_ROOT/server"
-[[ -f dist/cli.js ]] || npm run build >&2
-
-node dist/cli.js \
-  --title "AI-driven roguelike POC" \
-  --description "$DESCRIPTION" \
-  --effort poc \
-  --cwd "$OUT" \
-  --yes \
-  ${MODEL:+--model "$MODEL"}
-
-echo ""
-echo "── Benchmark artifacts at: $OUT"
-echo "Compare with: $HERE/reference/"
diff --git a/docs/README.md b/docs/README.md
index 2063fb4..8de24b6 100644
--- a/docs/README.md
+++ b/docs/README.md
@@ -22,7 +22,7 @@ The decision-record docs follow the [Diátaxis](https://diataxis.fr) framework 
 ## Index
 
 ### Tutorials
-- [Your first plan](tutorials/your-first-plan.md) — run the roguelike benchmark prompt end-to-end
+- [Your first plan](tutorials/your-first-plan.md) — run the pipeline end-to-end on a small idea
 
 ### How-to guides
 - [Install the plugin or CLI](how-to/install.md)
diff --git a/docs/reference/data-model.md b/docs/reference/data-model.md
index 420fc42..1235b96 100644
--- a/docs/reference/data-model.md
+++ b/docs/reference/data-model.md
@@ -146,7 +146,7 @@ One JSON line per pipeline action. Append-only audit log.
 | Entity | Format | Example |
 |---|---|---|
 | Decision | `<4-digit>-<slug>` | `0003-define-the-agent-action-contract` |
-| Task | `T<4-digit>-<slug>` | `T0006-implement-the-tick-based-game-loop` |
-| Project | kebab-slug | `ai-driven-roguelike-poc` |
+| Task | `T<4-digit>-<slug>` | `T0006-implement-the-rate-limiter` |
+| Project | kebab-slug | `contact-list-deduper` |
 
 Slugs are 2–64 chars, lower-case alphanumerics + dashes, no leading/trailing dash.
diff --git a/docs/tutorials/your-first-plan.md b/docs/tutorials/your-first-plan.md
index 7f60435..afe9bd8 100644
--- a/docs/tutorials/your-first-plan.md
+++ b/docs/tutorials/your-first-plan.md
@@ -2,7 +2,7 @@
 
 By the end of this tutorial you will have used decision-record to turn a one-line idea into a complete, scoped, decision-backed, task-decomposed MVP plan — and you will have looked at every artifact the system produces. This takes about 15 minutes.
 
-We will use the **roguelike-ai-poc** benchmark idea — a small but real planning problem — so you can see the system handle something other than `hello world`.
+We'll use a small, neutral example idea — a CLI tool that deduplicates contact lists — so you can see the system handle something real without much setup.
 
 ## Before you start
 
@@ -24,7 +24,7 @@ You do **not** need the Claude Code plugin installed for this tutorial. We will
 
 ## Step 1: Pick a working directory
 
-The system writes artifacts into a target project directory. We will create a fresh one:
+The system writes artifacts into a target project directory. Create a fresh one:
 
 ```bash
 mkdir -p ~/dev/my-first-plan
@@ -40,7 +40,7 @@ From the `decision-record/server/` directory:
 export OPENAI_API_KEY=sk-…   # if you haven't already
 
 node dist/cli.js \
-  --idea "a CLI tool that converts QuickBooks CSV exports into a normalized double-entry ledger" \
+  --idea "a CLI tool that reads CSVs of contacts and merges fuzzy duplicates" \
   --effort poc \
   --cwd ~/dev/my-first-plan
 ```
@@ -56,13 +56,13 @@ The CLI will print colored progress to stderr as each phase runs. You will see s
   Target: /Users/you/dev/my-first-plan
   Model: gpt-4o
 ━━━ Phase: Intake ━━━
-✓ Initialized 'a-cli-tool-that-converts-quickbooks-csv-export…' at effort_level=poc
+✓ Initialized 'a-cli-tool-that-reads-csvs-of-contacts…' at effort_level=poc
 ✓ Advanced: intake → scoping
 ━━━ Phase: Scoping ━━━
   Running scoping agent…
 ✓ Scoping agent finished (3 tool calls).
 ────────────────────────────────────────────────────────────
-Scope set. in_scope: read QuickBooks CSV, parse rows…
+Scope set. in_scope: read CSV, normalize fields, detect duplicates…
 …
 ────────────────────────────────────────────────────────────
 ✓ Advanced: scoping → deciding
@@ -121,7 +121,7 @@ Pick one. For example:
 cat ~/dev/my-first-plan/dr/decisions/0001-*.md
 ```
 
-You will see the full record: issue, positions considered, the selected position, the argument for why it won, the implications, and five lens reviews from the skeptic.
+You will see the full record: issue, positions considered, the selected position, the argument for why it won, the implications, and (under `mvp`/`full` presets) lens reviews from the skeptic.
 
 ```bash
 cat ~/dev/my-first-plan/dr/decisions/0001-*.json | jq .

ID	Title	Status	Selected	Depends on
0001-choose-the-implementation-language	Choose the implementation language [architecture]	accepted	Python	—
0002-define-the-world-representation-and-renderer	Define the world representation and renderer [data-model]	accepted	Tile-grid + entity dict	—
0003-define-the-agent-action-contract	Define the agent action contract [architecture]	accepted	Tool-call (function calling) with one tool: do_action(direction)	—
0004-define-the-tick-loop-and-termination-conditions	Define the tick loop and termination conditions [architecture]	accepted	Synchronous loop with step cap	—
ID	Title	Status	Pri	Estimate	Depends on	Decision refs
T0001-bootstrap-repository	Bootstrap repository	ready	p0	1h	—	0001-choose-the-implementation-language
T0002-implement-world-module-tile-grid-entity-dict	Implement world module (tile grid + entity dict)	open	p0	2h	T0001-bootstrap-repository	0002-define-the-world-representation-and-renderer
T0003-implement-frame-renderer	Implement frame renderer	open	p0	1h	T0002-implement-world-module-tile-grid-entity-dict	0002-define-the-world-representation-and-renderer
T0004-implement-openai-agent-client	Implement OpenAI agent client	open	p0	2h	T0001-bootstrap-repository	0003-define-the-agent-action-contract
T0005-implement-action-handlers-and-termination-checks	Implement action handlers and termination checks	open	p0	1h	T0002-implement-world-module-tile-grid-entity-dict	0002-define-the-world-representation-and-renderer
T0006-implement-the-tick-based-game-loop	Implement the tick-based game loop	open	p0	2h	T0003-implement-frame-renderer T0004-implement-openai-agent-client T0005-implement-action-handlers-and-termination-checks	0004-define-the-tick-loop-and-termination-conditions 0002-define-the-world-representation-and-renderer
T0007-implement-cli-entry-script	Implement CLI entry script	open	p0	1h	T0006-implement-the-tick-based-game-loop	0001-choose-the-implementation-language 0004-define-the-tick-loop-and-termination-conditions
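
The `Depends on` column of the task table above defines a small dependency graph. As a rough illustration of how that column could be turned into an execution order, the sketch below groups the seven tasks into "waves" in which every task's dependencies sit in an earlier wave. It is not part of the decision-record pipeline: the `DEPENDS_ON` mapping is transcribed by hand from the table (with slugs dropped for brevity), and `execution_waves` is a hypothetical helper name. It relies only on Python's standard `graphlib` module (Python 3.9+).

```python
from graphlib import TopologicalSorter

# Task number -> tasks it depends on, copied by hand from the "Depends on"
# column of the task table. Slugs are omitted for brevity (assumption: the
# T<4-digit> prefix alone identifies a task).
DEPENDS_ON = {
    "T0001": [],
    "T0002": ["T0001"],
    "T0003": ["T0002"],
    "T0004": ["T0001"],
    "T0005": ["T0002"],
    "T0006": ["T0003", "T0004", "T0005"],
    "T0007": ["T0006"],
}


def execution_waves(depends_on):
    """Group tasks into waves; each task's dependencies are in earlier waves.

    Hypothetical helper for illustration only.
    """
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # tasks whose dependencies are done
        waves.append(ready)
        sorter.done(*ready)  # mark the wave finished to unlock successors
    return waves


if __name__ == "__main__":
    for number, wave in enumerate(execution_waves(DEPENDS_ON), start=1):
        print(f"wave {number}: {', '.join(wave)}")
```

Under these assumptions the first wave contains only `T0001`, which is consistent with it being the only row whose status is `ready` rather than `open`; `T0002` and `T0004` could then proceed in parallel.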