feat(auto-routing): auto-sync decider benchmark models #4078
Code Review Summary
Status: No Issues Found | Recommendation: Merge
Executive Summary
Incremental (7 files): Replaces unsafe
Incremental Files Reviewed (7 files)
Carried Forward (unchanged since last review, 20 files)
Previous Review Summaries (3 snapshots, latest commit 9a3bf34)
Current summary above is authoritative. Previous snapshots are kept for context only.
Previous review (commit 9a3bf34)
Status: No Issues Found | Recommendation: Merge
Executive Summary
Incremental (2 files): Removes the 5-attempt minimum gate from decider candidate filtering, allowing models with fewer benchmark attempts to be included as auto-decider candidates. Test updated to cover the single-attempt case.
Incremental Files Reviewed (2 files)
Carried Forward (unchanged since last review, 20 files)
Previous review (commit 4b5212f)
Status: No Issues Found | Recommendation: Merge
Executive Summary
Incremental changes add configurable auto-decider cost bounds and fix D1 SQL variable limits with summary chunking: clean additions with comprehensive test coverage across contracts, API, sync, and UI layers.
Incremental Files Reviewed (17 changed files)
Carried Forward (unchanged since last review, 3 files)
Previous review (commit 5dd959a)
Status: No Issues Found | Recommendation: Merge
Executive Summary
Well-structured feature that adds automatic decider benchmark model syncing from Kilo Bench cost data with proper backward compatibility, exclusion management, and comprehensive test coverage across all layers.
Files Reviewed (20 files)
Reviewed by deepseek-v4-pro-20260423 · 437,576 tokens
Review guidance: REVIEW.md from base branch
Summary
Adds automatic decider benchmark candidate syncing from Kilo Bench cost data. The benchmark worker now runs a daily scheduled sync, persists synced auto-decider models and exclusions in D1, preserves per-model reasoning effort, and starts a decider benchmark when the effective model set changes.
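As a rough illustration of the "effective model set changes" check, the sketch below derives the effective set as synced models minus exclusions and compares an order-insensitive fingerprint of the previous and next sets. The type and function names (`AutoDeciderModel`, `effectiveModels`, `shouldStartDeciderRun`) are hypothetical and not taken from this PR, and treating a reasoning-effort change as a set change is an assumption.

```typescript
// Hypothetical shape; the worker's real types are not shown in this description.
interface AutoDeciderModel {
  modelId: string;
  reasoningEffort?: string; // preserved per model across syncs
}

// Effective set = synced models minus explicitly excluded model IDs.
function effectiveModels(
  synced: AutoDeciderModel[],
  excludedIds: ReadonlySet<string>,
): AutoDeciderModel[] {
  return synced.filter((m) => !excludedIds.has(m.modelId));
}

// Order-insensitive fingerprint, so a reordered list alone does not trigger a run.
// Assumption: a changed reasoning effort counts as a changed set.
function fingerprint(models: AutoDeciderModel[]): string {
  return models
    .map((m) => `${m.modelId}@${m.reasoningEffort ?? ""}`)
    .sort()
    .join("|");
}

// True when the effective set differs between the previous and the next sync.
function shouldStartDeciderRun(
  previous: AutoDeciderModel[],
  next: AutoDeciderModel[],
): boolean {
  return fingerprint(previous) !== fingerprint(next);
}
```

With this shape, a scheduled sync that produces the same models in a different order would not start a new decider run, while adding, removing, or excluding a model would.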
Adds configurable auto-decider cost bounds to the benchmark config, defaulting to $15-$25 for existing configs. The web candidate endpoint filters terminal-bench models using the saved bounds, and the admin UI exposes compact controls for changing the min/max average run cost.
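A minimal sketch of the bounds filter described above, assuming inclusive bounds and a per-model average run cost in USD. The field and function names (`minAvgRunCostUsd`, `maxAvgRunCostUsd`, `resolveCostBounds`, `filterAutoDeciderCandidates`) are illustrative, not the names used in the PR; only the $15 and $25 defaults come from the description.

```typescript
// Hypothetical config shape for the auto-decider cost bounds.
interface AutoDeciderCostBounds {
  minAvgRunCostUsd: number;
  maxAvgRunCostUsd: number;
}

// Defaults applied to existing configs that have no saved bounds ($15-$25).
const DEFAULT_AUTO_DECIDER_COST_BOUNDS: AutoDeciderCostBounds = {
  minAvgRunCostUsd: 15,
  maxAvgRunCostUsd: 25,
};

// Hypothetical per-model benchmark stat row.
interface BenchModelStat {
  modelId: string;
  avgRunCostUsd: number;
}

// Fill in missing bounds from the defaults and reject an inverted range.
function resolveCostBounds(
  saved?: Partial<AutoDeciderCostBounds>,
): AutoDeciderCostBounds {
  const bounds: AutoDeciderCostBounds = {
    minAvgRunCostUsd:
      saved?.minAvgRunCostUsd ?? DEFAULT_AUTO_DECIDER_COST_BOUNDS.minAvgRunCostUsd,
    maxAvgRunCostUsd:
      saved?.maxAvgRunCostUsd ?? DEFAULT_AUTO_DECIDER_COST_BOUNDS.maxAvgRunCostUsd,
  };
  if (bounds.minAvgRunCostUsd > bounds.maxAvgRunCostUsd) {
    throw new RangeError("min average run cost must not exceed max");
  }
  return bounds;
}

// Keep only models whose average run cost falls inside the bounds.
// Assumption: both ends of the range are inclusive.
function filterAutoDeciderCandidates(
  stats: BenchModelStat[],
  bounds: AutoDeciderCostBounds,
): BenchModelStat[] {
  return stats.filter(
    (s) =>
      s.avgRunCostUsd >= bounds.minAvgRunCostUsd &&
      s.avgRunCostUsd <= bounds.maxAvgRunCostUsd,
  );
}
```

Resolving defaults in one place keeps configs saved before this change working without a migration, which matches the backward-compatibility goal stated above.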
Also chunks carried summary inserts when starting a run, so larger carried decider result sets stay under D1's SQL bind-variable limits.
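The chunking idea can be sketched as follows: split the rows so that each multi-row insert binds no more than a fixed number of parameters. The helper name `chunkRowsForBindLimit` is hypothetical, and the default of 100 bound parameters per query is an assumption that should be checked against the current D1 limits; the PR's actual chunk size is not stated here.

```typescript
// Split rows into chunks so that rows * columnsPerRow never exceeds the
// bound-parameter limit of a single statement.
// Assumption: maxBoundParams defaults to 100; verify against the D1 limits.
function chunkRowsForBindLimit<T>(
  rows: T[],
  columnsPerRow: number,
  maxBoundParams = 100,
): T[][] {
  if (columnsPerRow <= 0 || columnsPerRow > maxBoundParams) {
    throw new RangeError("columnsPerRow must be between 1 and maxBoundParams");
  }
  const rowsPerChunk = Math.floor(maxBoundParams / columnsPerRow);
  const chunks: T[][] = [];
  for (let i = 0; i < rows.length; i += rowsPerChunk) {
    chunks.push(rows.slice(i, i + rowsPerChunk));
  }
  return chunks;
}
```

Each chunk would then become one multi-row `INSERT` statement, with all statements for a run submitted together so the carried summaries are written as a unit.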
Verification
Ran the local auto-routing stack in tmux with Next.js, auto-routing, and auto-routing-benchmark. Seeded local model_stats with two in-band terminal-bench models and one out-of-band control, seeded D1 prior decider summaries, saved benchmark config with $12-$24 auto bounds, and triggered the local scheduled handler. The sync added only the two in-band models, started and completed a decider run from carried summaries, and published a routing table with all 18 routes.
Visual Changes
N/A
Reviewer Notes
The local E2E uses carried prior summaries so the scheduled rerun exercises the start/complete/publish path without invoking real CLI benchmark containers or consuming benchmark credits.