Merged
2 changes: 1 addition & 1 deletion README.md
@@ -77,7 +77,7 @@ Full install instructions: [`docs/how-to/install.md`](docs/how-to/install.md). F

## Benchmarks

We use a canonical prompt — an AI-driven roguelike POC — to spot regressions as the system evolves. See [`benchmarks/`](benchmarks/) for the prompt, expected output shape, and a `run.sh` to re-run it.
Canonical prompts for regression-spotting as the system evolves live under [`benchmarks/`](benchmarks/). See that directory for the layout convention.

## Contributing

23 changes: 13 additions & 10 deletions benchmarks/README.md
@@ -2,23 +2,26 @@

Canonical prompts we run against the decision-record planning pipeline to catch regressions as the system evolves.

| Benchmark | Prompt | Effort | Purpose |
|---|---|---|---|
| [roguelike-ai-poc](roguelike-ai-poc/) | AI-driven roguelike where the agent plays the game | `poc` | Exercises all five pipeline phases on a small, well-bounded problem. The original dogfood case. |
_(No public benchmarks committed yet. Add new ones as `benchmarks/<name>/` with a `prompt.md`, a `reference/` artifact snapshot, and a `run.sh` runner. See the structure described below.)_

## How to run a benchmark
## Benchmark layout

Each benchmark lives in its own directory:

```
benchmarks/<name>/
├── prompt.md # the exact idea, effort level, and what "good output" looks like
├── reference/ # a baseline artifact snapshot from a canonical run
└── run.sh # one-shot runner that fires the CLI against a fresh tmp dir
```
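
The layout above could be scaffolded with a small helper. This is a minimal sketch, not part of the repository: the function name `scaffold_benchmark`, the section headings written into `prompt.md`, and the placeholder body of `run.sh` are all assumptions made for illustration.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not in the repo): create <root>/<name>/ with the three
# entries the layout describes: prompt.md, reference/, and run.sh.
scaffold_benchmark() {
  local root="$1" name="$2"
  local dir="$root/$name"

  # reference/ holds the baseline artifact snapshot from a canonical run.
  mkdir -p "$dir/reference" || return 1

  # prompt.md: the idea, effort level, and what "good output" looks like.
  # The headings below are illustrative placeholders only.
  if [ ! -e "$dir/prompt.md" ]; then
    printf '# %s\n\n## Idea\n\n## Effort\n\n## What good output looks like\n' \
      "$name" > "$dir/prompt.md"
  fi

  # run.sh: placeholder runner that fails loudly until it is filled in.
  if [ ! -e "$dir/run.sh" ]; then
    cat > "$dir/run.sh" <<'EOF'
#!/usr/bin/env bash
echo "run.sh is a placeholder: add the CLI invocation here" >&2
exit 1
EOF
    chmod +x "$dir/run.sh"
  fi
}
```

Calling `scaffold_benchmark benchmarks my-benchmark` from the repository root would create `benchmarks/my-benchmark/`. Existing `prompt.md` and `run.sh` files are left untouched.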

## How to run

```bash
cd benchmarks/<name>
./run.sh
```
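
What a `run.sh` does internally is not shown in this change, beyond the description "fires the CLI against a fresh tmp dir". The sketch below illustrates one way that could look. The function name `run_benchmark`, passing the prompt on stdin, and the `output.log` file name are assumptions; the actual CLI and its arguments are not specified here, so the command is supplied by the caller.

```bash
#!/usr/bin/env bash
# Sketch only: runs a caller-supplied command inside a fresh temporary
# directory, with the benchmark prompt on stdin. How the real CLI accepts
# its prompt is an assumption, not something stated in the README.
run_benchmark() {
  local prompt_file workdir

  # Resolve the prompt to an absolute path before changing directory.
  prompt_file="$(cd "$(dirname "$1")" && pwd)/$(basename "$1")" || return 1
  shift

  # Fresh working directory so earlier runs cannot leak into this one.
  workdir="$(mktemp -d)" || return 1

  # Run the remaining arguments as the command; capture stdout and stderr.
  ( cd "$workdir" && "$@" < "$prompt_file" > output.log 2>&1 ) || return 1

  # Print where the artifacts landed so they can be compared with reference/.
  printf '%s\n' "$workdir"
}
```

For example, `run_benchmark prompt.md some-cli` (where `some-cli` stands in for the real command) prints the temporary directory that holds that run's output.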

Each benchmark has:

- `prompt.md` — the exact idea, effort level, and what "good output" looks like
- `reference/` — a baseline artifact snapshot from a canonical run
- `run.sh` — one-shot runner that fires the CLI against a fresh tmp dir

## What we look for when comparing runs

Each benchmark's `prompt.md` defines its own success criteria. Generally:
63 changes: 0 additions & 63 deletions benchmarks/roguelike-ai-poc/prompt.md

This file was deleted.
