Phase 2 of the work in #39: now that run_final_eval --budgets B1,B2,B3,... reuses compute_scored_state across budgets, the workflow can drop the budget matrix axis and let each cell sweep all budgets in one heavy pass.
Current state
matrix:
  method: [ppr, ego, bm25, aider]
  budget: [-1, 0, 8000, 16000, 32000, 64000, 128000]  # this axis
  depth: [-1, 0, 1, 2, 3, 4]
  test_set: [contextbench_verified, polybench500, swebench_verified]
After excludes ≈ 162 cells × ~500 instances. Because budget is a matrix axis, each (method, depth, test_set) combination pays the heavy phase up to 7×, once per budget cell.
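For a rough sense of scale, the sketch below counts the full cross product of the matrix with and without the budget axis. The exclude rules are not listed in this issue, so these are upper bounds before excludes, not the ≈ figures quoted here; `cross_product_size` is a hypothetical helper.

```python
from math import prod

# Axis values copied from the matrix above.
current = {
    "method": ["ppr", "ego", "bm25", "aider"],
    "budget": [-1, 0, 8000, 16000, 32000, 64000, 128000],
    "depth": [-1, 0, 1, 2, 3, 4],
    "test_set": ["contextbench_verified", "polybench500", "swebench_verified"],
}


def cross_product_size(matrix):
    """Number of cells in the full cross product, before any excludes."""
    return prod(len(values) for values in matrix.values())


# Same matrix with the budget axis dropped.
without_budget = {k: v for k, v in current.items() if k != "budget"}

full_before = cross_product_size(current)         # 4 * 7 * 6 * 3
full_after = cross_product_size(without_budget)   # 4 * 6 * 3
```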
Proposed state
matrix:
  method: [ppr, ego, bm25, aider]
  depth: [-1, 0, 1, 2, 3, 4]
  test_set: [contextbench_verified, polybench500, swebench_verified]
≈ 24 cells. Each cell invokes:
python -m benchmarks.run_final_eval \
    --baseline diffctx \
    --winner /tmp/winner.json \
    --manifests-dir /tmp/manifest_one \
    --budgets -1,0,8000,16000,32000,64000,128000 \
    --workers 10 --out "${CELL_DIR}"
(bm25 and aider: drop -1 from the budget list. aider additionally omits --budgets and keeps one budget per cell.)
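The per-method rule above could be sketched as follows. `budgets_for` and `budgets_flag` are hypothetical helper names, not existing functions in the repo, and how aider receives its single per-cell budget is left out because the issue does not specify it.

```python
ALL_BUDGETS = [-1, 0, 8000, 16000, 32000, 64000, 128000]


def budgets_for(method):
    """Budgets a method should cover; bm25 and aider skip -1."""
    if method in ("bm25", "aider"):
        return [b for b in ALL_BUDGETS if b != -1]
    return list(ALL_BUDGETS)


def budgets_flag(method):
    """CLI arguments for --budgets, or [] for aider (one budget per cell)."""
    if method == "aider":
        return []
    return ["--budgets", ",".join(str(b) for b in budgets_for(method))]
```

The workflow could compute the same list in a matrix `include` entry instead; the Python form is only meant to pin down the rule.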
Dependent changes
- benchmarks/aggregate_sweep.py: currently expects cell-<method>-b<budget>-L<depth>-<test_set>/<test_set>.checkpoint.jsonl. Needs to also accept cell-<method>-L<depth>-<test_set>/<test_set>_budget_sweep/[L<depth>/]b<budget>.checkpoint.jsonl and emit one row per (cell, budget) pair.
- benchmarks/cell_metrics.py: needs to handle multi-budget cell output (one summary per budget rather than one per cell).
- .github/workflows/bench-sweep.yml: matrix collapse + per-method budget list + artifact naming change (cell-<method>-L<depth>-<test_set>).
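One way aggregate_sweep.py might accept both layouts is a pair of regexes that normalise either path to the same (method, budget, depth, test_set) key. This is a sketch under assumptions: `parse_checkpoint_path` is a hypothetical name, method names are assumed to contain no hyphens, and the existing `_parse_artifact` may be structured differently.

```python
import re

# Legacy layout: one budget per cell, budget encoded in the cell directory name.
LEGACY = re.compile(
    r"^cell-(?P<method>[^-/]+)-b(?P<budget>-?\d+)-L(?P<depth>-?\d+)-(?P<test_set>[^/]+)"
    r"/(?P=test_set)\.checkpoint\.jsonl$"
)

# Collapsed layout: budget encoded in the file name, optional L<depth>/ subdirectory.
SWEEP = re.compile(
    r"^cell-(?P<method>[^-/]+)-L(?P<depth>-?\d+)-(?P<test_set>[^/]+)"
    r"/(?P=test_set)_budget_sweep/(?:L-?\d+/)?b(?P<budget>-?\d+)\.checkpoint\.jsonl$"
)


def parse_checkpoint_path(rel_path):
    """Return (method, budget, depth, test_set), or None if neither layout matches."""
    for pattern in (LEGACY, SWEEP):
        m = pattern.match(rel_path)
        if m:
            return (m["method"], int(m["budget"]), int(m["depth"]), m["test_set"])
    return None
```

Because both layouts map to the same key, the aggregator can emit one row per (cell, budget) pair without caring which layout produced the file.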
Acceptance
- Full sweep wall-time < 90 min on Hetzner CCX63 (vs current ~6 h).
- Byte-identical per-(method, budget, depth, test_set) JSONL rows in grand_summary.json between before/after, on the 5-instance smoke manifest.
- Existing partial-sweep artifacts on bench-results/sweep branch remain readable (_parse_artifact legacy fallback).
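The byte-identical criterion could be checked with a small script along these lines. It assumes each row is a single-line JSON object carrying `method`, `budget`, `depth`, and `test_set` fields; the actual row schema is not given in this issue, and `index_rows` / `diff_rows` are hypothetical names.

```python
import json

KEY_FIELDS = ("method", "budget", "depth", "test_set")


def index_rows(lines):
    """Map (method, budget, depth, test_set) -> raw row bytes."""
    rows = {}
    for raw in lines:
        raw = raw.rstrip(b"\n")
        if not raw:
            continue  # skip blank lines
        record = json.loads(raw)
        rows[tuple(record[field] for field in KEY_FIELDS)] = raw
    return rows


def diff_rows(before_lines, after_lines):
    """Sorted keys whose rows are missing on one side or differ byte-for-byte."""
    before = index_rows(before_lines)
    after = index_rows(after_lines)
    return sorted(
        key for key in before.keys() | after.keys()
        if before.get(key) != after.get(key)
    )
```

Feeding it the lines of the two files opened in binary mode and asserting that the result is empty gives a pass/fail check; comparing raw bytes rather than parsed objects also catches key-order and float-formatting drift.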