perf(bench): collapse bench-sweep.yml matrix on budget axis

Phase 2 of the work in #39: now that `run_final_eval --budgets B1,B2,B3,...` reuses `compute_scored_state` across budgets, the workflow can drop the `budget` matrix axis and let each cell sweep all budgets in one heavy pass.

## Current state

```yaml
matrix:
  method: [ppr, ego, bm25, aider]
  budget: [-1, 0, 8000, 16000, 32000, 64000, 128000]   # this axis
  depth: [-1, 0, 1, 2, 3, 4]
  test_set: [contextbench_verified, polybench500, swebench_verified]
```

After excludes ≈ 162 cells × ~500 instances. Each cell pays the heavy phase 7×.

## Proposed state

```yaml
matrix:
  method: [ppr, ego, bm25, aider]
  depth: [-1, 0, 1, 2, 3, 4]
  test_set: [contextbench_verified, polybench500, swebench_verified]
```

≈ 24 cells. Each cell invokes:

```bash
python -m benchmarks.run_final_eval \
    --baseline diffctx \
    --winner /tmp/winner.json \
    --manifests-dir /tmp/manifest_one \
    --budgets -1,0,8000,16000,32000,64000,128000 \
    --workers 10 --out "${CELL_DIR}"
```

(bm25 / aider: omit -1; aider: omit `--budgets` and keep one budget per cell.)

## Dependent changes

1. **`benchmarks/aggregate_sweep.py`** — currently expects `cell-<method>-b<budget>-L<depth>-<test_set>/<test_set>.checkpoint.jsonl`. Needs to also accept `cell-<method>-L<depth>-<test_set>/<test_set>_budget_sweep/[L<depth>/]b<budget>.checkpoint.jsonl` and emit one row per (cell, budget) pair.
2. **`benchmarks/cell_metrics.py`** — needs to handle multi-budget cell output (one summary per budget rather than one per cell).
3. **`.github/workflows/bench-sweep.yml`** — matrix collapse + per-method budget list + artifact naming change (`cell-<method>-L<depth>-<test_set>`).

## Acceptance

- Full sweep wall-time < 90 min on Hetzner CCX63 (vs current ~6 h).
- Byte-identical per-(method, budget, depth, test_set) JSONL rows in `grand_summary.json` between before/after, on the 5-instance smoke manifest.
- Existing partial-sweep artifacts on `bench-results/sweep` branch remain readable (`_parse_artifact` legacy fallback).

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perf(bench): collapse bench-sweep.yml matrix on budget axis #52

Current state

Proposed state

Dependent changes

Acceptance

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

perf(bench): collapse bench-sweep.yml matrix on budget axis #52

Description

Current state

Proposed state

Dependent changes

Acceptance

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions