protoLabsAI · mabry1985 · May 17, 2026 · May 17, 2026
diff --git a/.github/workflows/test.yml b/.github/workflows/test.yml
@@ -0,0 +1,41 @@
+name: test
+
+on:
+  push:
+    branches: [main]
+  pull_request:
+    branches: [main]
+
+jobs:
+  test:
+    runs-on: ubuntu-latest
+    defaults:
+      run:
+        working-directory: server
+    strategy:
+      matrix:
+        node-version: [20, 22]
+    steps:
+      - uses: actions/checkout@v4
+
+      - name: Set up Node ${{ matrix.node-version }}
+        uses: actions/setup-node@v4
+        with:
+          node-version: ${{ matrix.node-version }}
+          cache: npm
+          cache-dependency-path: server/package-lock.json
+
+      - name: Install dependencies
+        run: npm ci
+
+      - name: Type check
+        run: npm run typecheck
+
+      - name: Build
+        run: npm run build
+
+      - name: Unit tests
+        run: npm run test:unit
+
+      - name: Flow tests
+        run: npm run test:flow
diff --git a/CITATION.cff b/CITATION.cff
@@ -33,5 +33,5 @@ references:
     repository-code: 'https://github.com/joelparkerhenderson/decision-record/'
     abstract: >-
       The canonical concept, template, and teamwork model for decision
-      records — preserved in this fork at docs/upstream-canon.md and
-      templates/canonical.md.
+      records — preserved in this fork at docs/explanation/why-decision-records.md
+      and templates/canonical.md.
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
@@ -24,7 +24,7 @@ This repo is the planning system itself. We deliberately stop at the handoff —
 
 ## Attribution
 
-The conceptual core derives from Joel Parker Henderson's [canonical decision-record repo](https://github.com/joelparkerhenderson/decision-record). Preserve attribution to upstream in any rework of `docs/upstream-canon.md` or `templates/canonical.md`.
+The conceptual core derives from Joel Parker Henderson's [canonical decision-record repo](https://github.com/joelparkerhenderson/decision-record). Preserve attribution to upstream in any rework of `docs/explanation/why-decision-records.md` or `templates/canonical.md`.
 
 ## License
 

diff --git a/LICENSE b/LICENSE
@@ -22,8 +22,8 @@ SOFTWARE.
 
 ---
 
-The preserved canonical material in `docs/upstream-canon.md` and the
-canonical decision record template at `templates/canonical.md` derive from
+The preserved canonical material in `docs/explanation/why-decision-records.md`
+and the canonical decision record template at `templates/canonical.md` derive from
 the upstream work of Joel Parker Henderson:
 <https://github.com/joelparkerhenderson/decision-record>. That material
 should be attributed to its original author; see CITATION.cff.
diff --git a/README.md b/README.md
@@ -4,7 +4,7 @@
 
 This repository is a Claude Code plugin + bundled MCP server. It runs inside a fresh or template repo, partners with a human and an AI agent, and produces an executable MVP plan: a scoped manifest, a set of accepted decision records, and a dependency-aware task graph. Output goes to Linear (primary) or stays as filesystem artifacts (fallback).
 
-This project is a derivative of [Joel Parker Henderson's canonical decision-record repo](https://github.com/joelparkerhenderson/decision-record). The canonical explanation of what a DR is and why it matters is preserved at [`docs/upstream-canon.md`](docs/upstream-canon.md). What this fork adds is **enforcement**: workflows, tools, and a state machine that make DRs a non-skippable part of planning with an agentic system.
+This project is a derivative of [Joel Parker Henderson's canonical decision-record repo](https://github.com/joelparkerhenderson/decision-record). The canonical explanation of what a DR is and why it matters is preserved at [`docs/explanation/why-decision-records.md`](docs/explanation/why-decision-records.md). What this fork adds is **enforcement**: workflows, tools, and a state machine that make DRs a non-skippable part of planning with an agentic system.
 
 ## What you get
 
@@ -17,7 +17,16 @@ This project is a derivative of [Joel Parker Henderson's canonical decision-reco
 
 ## Status
 
-Active development — first usable cut is in. The pipeline is functional end-to-end (intake → scope → decisions → tasks → handoff to filesystem or Linear). See [`docs/quickstart.md`](docs/quickstart.md) for the five-minute walkthrough, [`docs/usage.md`](docs/usage.md) for the full interaction model, and [`docs/architecture.md`](docs/architecture.md) for the data model.
+Active development — first usable cut is in. The pipeline is functional end-to-end (intake → scope → decisions → tasks → handoff to filesystem or Linear). A standalone CLI (`decision-record`) ships alongside the Claude Code plugin and MCP server.
+
+## Documentation
+
+Docs follow the [Diátaxis](https://diataxis.fr) framework — start at [`docs/README.md`](docs/README.md) to orient.
+
+- **Brand new?** → [`docs/tutorials/your-first-plan.md`](docs/tutorials/your-first-plan.md) is a 15-minute end-to-end walkthrough.
+- **How do I do X?** → [`docs/how-to/`](docs/how-to/) (install, run the CLI, configure providers, hand off to Linear, calibrate gates).
+- **What's the exact spec?** → [`docs/reference/`](docs/reference/) (CLI flags, MCP tools, data model, gates).
+- **Why is it built this way?** → [`docs/explanation/`](docs/explanation/) (design rationale, the five phases, why decision records).
 
 ## How it's structured
 
@@ -58,18 +67,26 @@ npm install
 npm run build
 ```
 
-Then either link as a Claude Code plugin (symlink the repo into `~/.claude/plugins/decision-record/`) or run the MCP server standalone via `node /path/to/decision-record/server/dist/index.js`. Full instructions: [`docs/quickstart.md`](docs/quickstart.md).
+Then either:
+- Use the **standalone CLI**: `export OPENAI_API_KEY=… && node dist/cli.js --idea "your idea here"`
+- Use the **Claude Code plugin**: symlink the repo into `~/.claude/plugins/decision-record/` and run `/plan` inside Claude Code.
+
+Full install instructions: [`docs/how-to/install.md`](docs/how-to/install.md). First-run walkthrough: [`docs/tutorials/your-first-plan.md`](docs/tutorials/your-first-plan.md).
 
 (A published marketplace release is on the roadmap.)
 
+## Benchmarks
+
+We use a canonical prompt — an AI-driven roguelike POC — to spot regressions as the system evolves. See [`benchmarks/`](benchmarks/) for the prompt, expected output shape, and a `run.sh` to re-run it.
+
 ## Contributing
 
 See [CONTRIBUTING.md](CONTRIBUTING.md). Issues and pull requests welcome.
 
 ## Acknowledgments
 
-The conceptual core — what a decision record is, the canonical template structure, the teamwork model around DRs — is the work of [Joel Parker Henderson](https://joelparkerhenderson.com). See [`docs/upstream-canon.md`](docs/upstream-canon.md) for the preserved canonical material, and [CITATION.cff](CITATION.cff) for citation metadata.
+The conceptual core — what a decision record is, the canonical template structure, the teamwork model around DRs — is the work of [Joel Parker Henderson](https://joelparkerhenderson.com). See [`docs/explanation/why-decision-records.md`](docs/explanation/why-decision-records.md) for the preserved canonical material, and [CITATION.cff](CITATION.cff) for citation metadata.
 
 ## License
 
-[MIT](LICENSE) — for the code, schemas, and tooling in this repository. The preserved canonical content in `docs/upstream-canon.md` and the canonical template at `templates/canonical.md` derive from upstream and should be attributed to Joel Parker Henderson per CITATION.cff.
+[MIT](LICENSE) — for the code, schemas, and tooling in this repository. The preserved canonical content in `docs/explanation/why-decision-records.md` and the canonical template at `templates/canonical.md` derive from upstream and should be attributed to Joel Parker Henderson per CITATION.cff.
diff --git a/benchmarks/README.md b/benchmarks/README.md
@@ -0,0 +1,32 @@
+# Benchmarks
+
+Canonical prompts we run against the decision-record planning pipeline to catch regressions as the system evolves.
+
+| Benchmark | Prompt | Effort | Purpose |
+|---|---|---|---|
+| [roguelike-ai-poc](roguelike-ai-poc/) | AI-driven roguelike where the agent plays the game | `poc` | Exercises all five pipeline phases on a small, well-bounded problem. The original dogfood case. |
+
+## How to run a benchmark
+
+```bash
+cd benchmarks/<name>
+./run.sh
+```
+
+Each benchmark has:
+
+- `prompt.md` — the exact idea, effort level, and what "good output" looks like
+- `reference/` — a baseline artifact snapshot from a canonical run
+- `run.sh` — one-shot runner that fires the CLI against a fresh tmp dir
+
+## What we look for when comparing runs
+
+Each benchmark's `prompt.md` defines its own success criteria. Generally:
+
+- Pipeline reaches `handed-off`
+- Decision count and shape match expectations for the effort tier
+- Tasks are vertical slices, every leaf task references a decision, and the graph validates
+- Render artifacts are emitted (Markdown + HTML)
+- Event log is coherent
+
+These benchmarks are **not unit tests** — they're regression observability. Different runs will produce slightly different plans and that's by design. Treat the reference as "shape we expect," not "bytes we require."
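
Note on "shape we expect, not bytes we require": one way to compare a new run against the reference is to reduce each plan to a few coarse features and report drift only when they differ materially. The sketch below is illustrative and not part of this PR. The function names and the `count_tolerance` default are assumptions, and only the `status` field is taken from the decision JSON in this diff.

```python
import json
from pathlib import Path


def load_records(directory: Path) -> list[dict]:
    """Load every *.json file in a directory, sorted by file name."""
    return [
        json.loads(path.read_text(encoding="utf-8"))
        for path in sorted(directory.glob("*.json"))
    ]


def shape_of(decisions: list[dict], tasks: list[dict]) -> dict:
    """Reduce a plan to the coarse features worth comparing across runs."""
    return {
        "decision_count": len(decisions),
        "task_count": len(tasks),
        "decision_statuses": sorted({d.get("status", "unknown") for d in decisions}),
    }


def shape_drift(reference: dict, candidate: dict, count_tolerance: int = 2) -> list[str]:
    """Return readable notes where the candidate departs from the reference.

    count_tolerance is a hypothetical knob: counts may differ by this much
    before the difference is reported.
    """
    notes = []
    for key in ("decision_count", "task_count"):
        if abs(reference[key] - candidate[key]) > count_tolerance:
            notes.append(f"{key}: reference={reference[key]}, candidate={candidate[key]}")
    if reference["decision_statuses"] != candidate["decision_statuses"]:
        notes.append(
            f"decision_statuses: reference={reference['decision_statuses']}, "
            f"candidate={candidate['decision_statuses']}"
        )
    return notes
```

A caller might build each shape with `shape_of(load_records(plan / "decisions"), load_records(plan / "tasks"))`, assuming the run directory mirrors the `decisions/` and `tasks/` layout described in `prompt.md`. An empty list from `shape_drift` means the run stayed within the expected shape.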
diff --git a/benchmarks/roguelike-ai-poc/prompt.md b/benchmarks/roguelike-ai-poc/prompt.md
@@ -0,0 +1,63 @@
+# Benchmark: roguelike-ai-poc
+
+This is the canonical benchmark for the decision-record planning pipeline. We re-run it as the system evolves to spot regressions in plan quality, gate behavior, agent prompts, and rendering.
+
+## The prompt
+
+**Idea (free-form):**
+
+> A minimal roguelike where the player primes an AI agent with a strategy, then the agent autonomously navigates a single ASCII-rendered room over a tick system until it wins the objective or dies. Goal: prove the agent-as-player concept with the smallest viable surface area.
+
+**Effort level:** `poc`
+
+## Invocation
+
+```bash
+decision-record \
+  --title "AI-driven roguelike POC" \
+  --description "$(cat <<'EOF'
+A minimal roguelike where the player primes an AI agent with a strategy, then the agent autonomously navigates a single ASCII-rendered room over a tick system until it wins the objective or dies. Goal: prove the agent-as-player concept with the smallest viable surface area.
+EOF
+)" \
+  --effort poc \
+  --cwd ./tmp-roguelike-bench \
+  --yes
+```
+
+Or the one-shot wrapper: `./run.sh` (creates a fresh tmp dir, runs the CLI, prints where the artifacts landed).
+
+## What "good output" looks like
+
+A run is healthy if all of the following hold:
+
+- **Pipeline reaches `handed-off`** — every gate passes, sign-offs recorded, project finalized.
+- **3-5 significant decisions** are proposed and accepted — language, world representation, agent action contract, tick-loop control. (Not 1; not 12.)
+- **5-8 vertical-slice tasks** — bootstrap → world → renderer → agent client → action handlers → game loop → CLI entry. Every leaf ≤ 16h (poc cap). Every task references at least one accepted DR.
+- **The seed library is consulted** for at least the language decision (`dr_seed_search` + `dr_seed_load` on `language-choice`).
+- **Graph validates clean** — no cycles, no orphan deps, no missing decision refs.
+- **Artifacts emitted** — `dr/project.json`, `dr/decisions/*.json`, `dr/tasks/*.json`, rendered `.md` siblings, `dr/index.html`. `.dr/events.jsonl` contains a coherent audit trail.
+
+## Reference snapshot
+
+`./reference/` holds the artifacts from the canonical run produced by hand-driving the MCP tools (2026-05-16, the dogfood test that originally produced this benchmark). Treat it as a "this is what good looks like" baseline, not a strict equality target — different agent runs will pick slightly different positions, phrasing, and task decomposition, and that's fine.
+
+When comparing a new run against `./reference/`:
+
+- **Same final phase, gate decisions, event mix** → no regression.
+- **More/fewer decisions or tasks** → check whether the new run is appropriately denser or sparser, or whether the agent over- or under-decomposed.
+- **Different selected positions** → fine if defensible; concerning if the argument is weaker.
+- **Missing seed usage** → bug or prompt drift; the agent should reach for `language-choice` here.
+- **Tasks without decision refs** → regression. Every task must link to a DR.
+- **Validation failures** → regression. The graph must validate.
+
+## What this benchmark exercises
+
+| Surface | Coverage |
+|---|---|
+| Phase machine | All five transitions: intake → scoping → deciding → decomposing → handing-off → handed-off |
+| Seed library | At least one `dr_seed_load` (language-choice) |
+| Decision lifecycle | propose → update with position + argument → accept (no review under poc preset) |
+| Task graph | Multi-node dependency chain with decision_refs |
+| Gates | `min_tasks=3`, `max_task_estimate_hours=16`, `require_human_signoff_phases=['handing-off']` |
+| Render | Markdown per record + static HTML index |
+| Handoff | Filesystem path (Linear path is exercised by a separate live test) |
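
Note on "Graph validates clean": the three conditions named in `prompt.md` (no cycles, no orphan deps, no missing decision refs), plus the 16h poc cap, can be checked in a few lines. This is a minimal sketch and not the validator shipped in the server. The task schema is not shown in this diff, so the field names `id`, `depends_on`, and `estimate_hours` are assumptions; `decision_refs` is the name used in the coverage table. For simplicity it treats every task as a leaf.

```python
def validate_task_graph(tasks, accepted_decision_ids, max_hours=16):
    """Return a list of problems; an empty list means the graph validates.

    Assumed (hypothetical) task fields: id, depends_on, decision_refs,
    estimate_hours.
    """
    problems = []
    by_id = {task["id"]: task for task in tasks}

    for task in tasks:
        # Orphan deps: a dependency that names a task that does not exist.
        for dep in task.get("depends_on", []):
            if dep not in by_id:
                problems.append(f"{task['id']}: orphan dependency {dep}")
        # Every task must reference at least one accepted decision.
        refs = task.get("decision_refs", [])
        if not any(ref in accepted_decision_ids for ref in refs):
            problems.append(f"{task['id']}: no accepted decision ref")
        # Estimate cap (16h under the poc preset).
        hours = task.get("estimate_hours", 0)
        if hours > max_hours:
            problems.append(f"{task['id']}: estimate {hours}h exceeds {max_hours}h")

    # Cycle detection with a three-colour depth-first search.
    WHITE, GREY, BLACK = 0, 1, 2
    colour = {task_id: WHITE for task_id in by_id}

    def has_cycle_from(task_id):
        colour[task_id] = GREY
        for dep in by_id[task_id].get("depends_on", []):
            if dep not in by_id:
                continue  # already reported as an orphan
            if colour[dep] == GREY:
                return True  # back edge: dep is still on the current path
            if colour[dep] == WHITE and has_cycle_from(dep):
                return True
        colour[task_id] = BLACK
        return False

    for task_id in by_id:
        if colour[task_id] == WHITE and has_cycle_from(task_id):
            problems.append("dependency cycle detected")
            break

    return problems
```

Returning a list of problems rather than raising on the first one suits regression observability: a single run can then report every way the graph drifted.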
diff --git a/benchmarks/roguelike-ai-poc/reference/decisions/0001-choose-the-implementation-language.json b/benchmarks/roguelike-ai-poc/reference/decisions/0001-choose-the-implementation-language.json
@@ -0,0 +1,115 @@
+{
+  "id": "0001-choose-the-implementation-language",
+  "number": 1,
+  "slug": "choose-the-implementation-language",
+  "title": "Choose the implementation language",
+  "status": "accepted",
+  "template_variant": "architecture",
+  "created_at": "2026-05-17T04:13:38.681Z",
+  "updated_at": "2026-05-17T04:13:38.685Z",
+  "summary": "Decide the primary implementation language for the project.",
+  "issue": "Every other foundational decision (runtime, package manager, framework choices, testing tools) flows from the language choice. Picking this early and explicitly avoids drift.",
+  "assumptions": [
+    "Team has existing language strengths to lean on.",
+    "Project lifespan is long enough that hiring and onboarding matter.",
+    "Ecosystem maturity matters for the project's domain."
+  ],
+  "constraints": [
+    "Team's current expertise.",
+    "Target runtime environments (browser, server, native, embedded).",
+    "Performance and memory budgets.",
+    "Licensing or compliance restrictions on language ecosystems."
+  ],
+  "positions": [
+    {
+      "title": "TypeScript",
+      "description": "Strongly typed JavaScript. Best for full-stack web work, ubiquitous tooling.",
+      "pros": [
+        "Ubiquitous in web",
+        "Strong types catch errors early",
+        "Massive ecosystem",
+        "Frontend/backend code sharing"
+      ],
+      "cons": [
+        "Build step overhead",
+        "Type system can be over-engineered",
+        "Slower than native languages for hot paths"
+      ],
+      "links": []
+    },
+    {
+      "title": "Python",
+      "description": "Dynamic, batteries-included. Best for data work, scripting, ML, fast prototypes.",
+      "pros": [
+        "Excellent ML/data ecosystem",
+        "Fast to write",
+        "Readable",
+        "Huge stdlib"
+      ],
+      "cons": [
+        "Slow runtime without C extensions",
+        "GIL limits concurrency",
+        "Dynamic typing → runtime errors"
+      ],
+      "links": []
+    },
+    {
+      "title": "Go",
+      "description": "Statically typed, compiled, built for concurrent services.",
+      "pros": [
+        "Simple language",
+        "Single binary deployment",
+        "Strong concurrency primitives",
+        "Fast compile times"
+      ],
+      "cons": [
+        "Generics still maturing",
+        "Verbose error handling",
+        "Less rich third-party ecosystem than JS/Python"
+      ],
+      "links": []
+    },
+    {
+      "title": "Rust",
+      "description": "Memory-safe systems language. Best for performance-critical or systems work.",
+      "pros": [
+        "No GC, predictable performance",
+        "Memory safety",
+        "Excellent tooling (cargo)",
+        "Strong types"
+      ],
+      "cons": [
+        "Steep learning curve",
+        "Slower to ship initial features",
+        "Compile times can be long"
+      ],
+      "links": []
+    }
+  ],
+  "opinions": [],
+  "argument": "Python is fastest to write for a single-script game-loop POC. The OpenAI SDK + a tiny terminal renderer fit naturally; no build step or transpile loop slows iteration. Team is comfortable with Python and the project never needs to leave a single repo.",
+  "selected_position": "Python",
+  "implications": [
+    "Use the official openai Python SDK for agent calls.",
+    "Single-file or small-module layout; no package manager beyond pip/uv.",
+    "Pin to Python 3.11+ for ergonomic match-statement parsing of agent actions."
+  ],
+  "depends_on": [],
+  "related_decisions": [],
+  "related_artifacts": [],
+  "review": [],
+  "sign_off": {
+    "by": "human",
+    "actor": "kj",
+    "at": "2026-05-17T04:13:38.685Z",
+    "notes": "poc preset, no review required"
+  },
+  "seed_origin": "language-choice",
+  "tags": [
+    "foundation",
+    "poc",
+    "foundation",
+    "architecture",
+    "stack"
+  ]
+}
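
Note on the snapshot above: the `tags` array lists `foundation` twice. A small consistency check over fields that appear in this record (`status`, `selected_position`, `positions[].title`, `sign_off`, `tags`) would surface that kind of issue when a reference snapshot is regenerated. The sketch below is a suggestion only; the function name and the specific rules are assumptions, not behavior of the existing tooling.

```python
from collections import Counter


def check_decision(record: dict) -> list[str]:
    """Return consistency findings for one decision record (empty means clean)."""
    findings = []
    titles = [position["title"] for position in record.get("positions", [])]

    if record.get("status") == "accepted":
        # An accepted record should select one of its own listed positions.
        if record.get("selected_position") not in titles:
            findings.append("selected_position does not match any position title")
        # An accepted record should carry a sign-off.
        if not record.get("sign_off"):
            findings.append("accepted record has no sign_off")

    # Report tags that occur more than once.
    duplicates = sorted(
        tag for tag, count in Counter(record.get("tags", [])).items() if count > 1
    )
    if duplicates:
        findings.append("duplicate tags: " + ", ".join(duplicates))

    return findings
```

Run against the record in this file, loaded with `json.load`, the only finding would be the duplicated `foundation` tag: the selected position `Python` matches a listed position, and a `sign_off` is present.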