perf(bench): collapse bench-sweep.yml matrix on budget axis #52

@nikolay-e

Phase 2 of the work in #39: now that run_final_eval --budgets B1,B2,B3,... reuses compute_scored_state across budgets, the workflow can drop the budget matrix axis and let each cell sweep all budgets in one heavy pass.

Current state

matrix:
  method: [ppr, ego, bm25, aider]
  budget: [-1, 0, 8000, 16000, 32000, 64000, 128000]   # this axis
  depth: [-1, 0, 1, 2, 3, 4]
  test_set: [contextbench_verified, polybench500, swebench_verified]

After excludes, this is ≈ 162 cells × ~500 instances. Each (method, depth, test_set) combination pays the heavy phase up to 7×, once per budget value.

Proposed state

matrix:
  method: [ppr, ego, bm25, aider]
  depth: [-1, 0, 1, 2, 3, 4]
  test_set: [contextbench_verified, polybench500, swebench_verified]

≈ 24 cells. Each cell invokes:

python -m benchmarks.run_final_eval \
    --baseline diffctx \
    --winner /tmp/winner.json \
    --manifests-dir /tmp/manifest_one \
    --budgets -1,0,8000,16000,32000,64000,128000 \
    --workers 10 --out "${CELL_DIR}"

Per-method exceptions: bm25 and aider omit the -1 budget; aider additionally omits --budgets and keeps one budget per cell.
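The per-method budget rules above could be captured in a small helper that the workflow (or a wrapper script) calls to build the `--budgets` argument. This is a minimal sketch, not code from the repository: `budgets_for` and `budget_args` are hypothetical names, and how aider receives its single per-cell budget is left out because the issue does not specify it.

```python
from __future__ import annotations

# Budget values copied from the proposed --budgets flag.
ALL_BUDGETS = [-1, 0, 8000, 16000, 32000, 64000, 128000]


def budgets_for(method: str) -> list[int]:
    """Budgets a method covers; bm25 and aider skip -1."""
    if method in ("bm25", "aider"):
        return [b for b in ALL_BUDGETS if b != -1]
    return list(ALL_BUDGETS)


def budget_args(method: str) -> list[str] | None:
    """Arguments for a collapsed cell, or None when the method does not sweep.

    aider keeps one budget per cell and does not use --budgets, so the
    caller must handle it separately (assumption: outside this helper).
    """
    if method == "aider":
        return None
    return ["--budgets", ",".join(str(b) for b in budgets_for(method))]
```

For example, `budget_args("bm25")` yields `["--budgets", "0,8000,16000,32000,64000,128000"]`. If the CLI is built on `argparse` (an assumption), a value that starts with `-1,` may be parsed as an option rather than a value, in which case the `--budgets=-1,0,...` form avoids the ambiguity.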

Dependent changes

  1. benchmarks/aggregate_sweep.py — currently expects cell-<method>-b<budget>-L<depth>-<test_set>/<test_set>.checkpoint.jsonl. Needs to also accept cell-<method>-L<depth>-<test_set>/<test_set>_budget_sweep/[L<depth>/]b<budget>.checkpoint.jsonl and emit one row per (cell, budget) pair.
  2. benchmarks/cell_metrics.py — needs to handle multi-budget cell output (one summary per budget rather than one per cell).
  3. .github/workflows/bench-sweep.yml — matrix collapse + per-method budget list + artifact naming change (cell-<method>-L<depth>-<test_set>).
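To illustrate item 1, here is one way the aggregator could recognize both checkpoint layouts and map each to a (method, budget, depth, test_set) key. It is a sketch under assumptions: `CellKey` and `parse_checkpoint_path` are hypothetical names (not the existing `_parse_artifact`), negative values are assumed to be written with a leading minus (for example `b-1`, `L-1`), and method names are assumed to be alphanumeric.

```python
from __future__ import annotations

import re
from typing import NamedTuple


class CellKey(NamedTuple):
    method: str
    budget: int
    depth: int
    test_set: str


# Legacy: cell-<method>-b<budget>-L<depth>-<test_set>/<test_set>.checkpoint.jsonl
LEGACY_RE = re.compile(
    r"^cell-(?P<method>[A-Za-z0-9]+)-b(?P<budget>-?\d+)-L(?P<depth>-?\d+)"
    r"-(?P<test_set>\w+)/(?P=test_set)\.checkpoint\.jsonl$"
)

# Collapsed: cell-<method>-L<depth>-<test_set>/<test_set>_budget_sweep/
#            [L<depth>/]b<budget>.checkpoint.jsonl
SWEEP_RE = re.compile(
    r"^cell-(?P<method>[A-Za-z0-9]+)-L(?P<depth>-?\d+)-(?P<test_set>\w+)"
    r"/(?P=test_set)_budget_sweep/(?:L(?P=depth)/)?"
    r"b(?P<budget>-?\d+)\.checkpoint\.jsonl$"
)


def parse_checkpoint_path(rel_path: str) -> CellKey | None:
    """Return the key for a checkpoint path, or None if neither layout matches."""
    for pattern in (SWEEP_RE, LEGACY_RE):
        m = pattern.match(rel_path)
        if m:
            return CellKey(
                m["method"], int(m["budget"]), int(m["depth"]), m["test_set"]
            )
    return None
```

Because every checkpoint file resolves to its own key, emitting one row per (cell, budget) pair reduces to iterating over the files found under each cell directory.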

Acceptance

  • Full sweep wall-time < 90 min on Hetzner CCX63 (vs current ~6 h).
  • Byte-identical per-(method, budget, depth, test_set) JSONL rows in grand_summary.json between before/after, on the 5-instance smoke manifest.
  • Existing partial-sweep artifacts on bench-results/sweep branch remain readable (_parse_artifact legacy fallback).
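The byte-identity criterion could be checked with a keyed comparison along these lines. This is a rough sketch: it assumes each row is a JSON object on its own line carrying `method`, `budget`, `depth` and `test_set` fields, which the issue does not confirm, and `index_rows` / `diff_rows` are hypothetical helpers.

```python
import json

# Assumed field names identifying a row; adjust to the real schema.
KEY_FIELDS = ("method", "budget", "depth", "test_set")


def index_rows(lines):
    """Map (method, budget, depth, test_set) to the raw bytes of its row."""
    rows = {}
    for raw in lines:
        if not raw.strip():
            continue  # skip blank lines
        obj = json.loads(raw)
        key = tuple(obj[field] for field in KEY_FIELDS)
        rows[key] = raw.rstrip(b"\r\n")  # compare content, not line endings
    return rows


def diff_rows(before_lines, after_lines):
    """Return sorted keys whose raw rows differ or exist on only one side."""
    before = index_rows(before_lines)
    after = index_rows(after_lines)
    return sorted(
        key
        for key in before.keys() | after.keys()
        if before.get(key) != after.get(key)
    )
```

Both inputs are iterables of `bytes` lines, for example files opened with `open(path, "rb")`. An empty result means every row matches byte for byte regardless of row order.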

Labels: enhancement (New feature or request)