fix(auto-routing): stabilize decider routing benchmarks #4046
Code Review Summary
Status: No Issues Found | Recommendation: Merge
Executive Summary
Previous suggestions on
Files Reviewed (2 incremental files)
Previous Review Summaries (7 snapshots, latest commit 7f0f81a)
Current summary above is authoritative. Previous snapshots are kept for context only.

Previous review (commit 7f0f81a)
Status: 2 Issues Found | Recommendation: Address before merge
Executive Summary
Incremental commit introduces shard-based decider fan-out with bounded live-container lanes (up to 100). Two prior suggestions on
Overview
Files Reviewed (11 incremental files)

Previous review (commit cbd3894)
Status: 2 Issues Found | Recommendation: Address before merge
Executive Summary
Incremental commit adds 68 golden decider cases for 10-case-per-taxonomy-pair coverage (112→180). No new issues. Two prior suggestions on
Overview
Files Reviewed (2 incremental files)

Previous review (commit 13dd726)
Status: 2 Issues Found | Recommendation: Address before merge
Executive Summary
Major architecture change replacing difficulty-tier heuristic routing with direct taxonomy-route key routing (18 taskType/subtaskType pairs). Prior suggestions on
Overview
Files Reviewed (35 files)

Previous review (commit 9d48db6)
Status: 2 Issues Found | Recommendation: Address before merge
Executive Summary
Incremental commit adds D1 batch-size safety for routing table candidate inserts (no new issues). Two prior suggestions on
Overview
Files Reviewed (11 files)

Previous review (commit 5612632)
Status: 2 Issues Found | Recommendation: Address before merge
Executive Summary
The new chunk-chaining logic has a container destroy edge case where a destroy failure on the terminal chunk blocks run finalization and causes unnecessary queue retries.
Overview
Files Reviewed (10 files)

Previous review (commit 4f80a25)
Status: 1 Issue Found | Recommendation: Address before merge
Executive Summary
New chunk-chaining logic has a dedup edge case where partially started next-chunk messages that hit the DLQ could leave a run stuck, never finalizing.
Overview
Files Reviewed (8 files)

Previous review (commit 6c13bd1)
Status: No Issues Found | Recommendation: Merge
Executive Summary
Well-targeted retry logic for Cloudflare Container capacity/startup failures in the auto-routing benchmark decider queue, with clean test coverage validating that capacity errors propagate for retry instead of being recorded as failed case rows.
Files Reviewed (4 files)

Reviewed by deepseek-v4-pro-20260423 · 189,933 tokens
Review guidance: REVIEW.md from base branch
Summary
[...] taskType/subtaskType) across the decision contract, benchmark summaries, D1 schema, routing table builder, admin view, and worker decision engine.
[...] avgCostUsd / accuracy, with higher accuracy as the tie-breaker.
Verification
[...] decider-2026-06-16T14-05-32-264Z completed 76/76 cases and published the routing table.
[...] decider-2026-06-17T10-23-23-333Z; initial D1 progress reached 3/180 case rows with 0 errors and 0 timeouts. Local Wrangler queue delivery is processing batches as 1/1, so this dev run is expected to be much slower than production fan-out.
Visual Changes
N/A
Reviewer Notes
This intentionally breaks the unreleased auto-routing benchmark contract: D1 columns and published routing tables now use route_key/routes instead of tier/tiers. The existing squashed D1 baseline migration was regenerated because the benchmark database will be truncated before rollout.

For a 10-model, 3-repetition prod decider run, shard planning creates 3 shard lanes per model/repetition: 10 × 3 × 3 = 90 live container identities, under the 100-container cap. Each shard then advances through its assigned chunks sequentially using the same container identity.

The main risk areas are benchmark finalization/publish semantics, route coverage for every classifier taxonomy pair, and downstream consumers expecting the old tier decision field.
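
For reviewers checking the shard arithmetic in these notes, here is a minimal sketch of how such a plan could be computed. It is not the code in this PR: the names `ShardLane`, `planShards`, and `CONTAINER_CAP`, the container-identity string format, and the round-robin assignment of chunks to lanes are all assumptions made for illustration. Only the 10 × 3 × 3 = 90 count, the 100-container cap, and the "one identity per lane, chunks processed in order" behaviour come from the description.

```typescript
// Hypothetical sketch; names and chunk-assignment strategy are assumptions.
interface ShardLane {
  containerId: string;    // stable identity reused for every chunk in this lane
  model: string;
  repetition: number;
  shardIndex: number;
  chunkIndexes: number[]; // chunks this lane works through, in order
}

const CONTAINER_CAP = 100; // live-container cap stated in the notes

function planShards(
  models: string[],
  repetitions: number,
  shardsPerPair: number,
  chunkCount: number,
): ShardLane[] {
  const laneCount = models.length * repetitions * shardsPerPair;
  if (laneCount > CONTAINER_CAP) {
    throw new Error(`plan needs ${laneCount} containers, cap is ${CONTAINER_CAP}`);
  }

  const lanes: ShardLane[] = [];
  for (const model of models) {
    for (let repetition = 0; repetition < repetitions; repetition++) {
      for (let shardIndex = 0; shardIndex < shardsPerPair; shardIndex++) {
        // Assumption: chunks are dealt round-robin across the lanes of a pair.
        const chunkIndexes: number[] = [];
        for (let c = shardIndex; c < chunkCount; c += shardsPerPair) {
          chunkIndexes.push(c);
        }
        lanes.push({
          containerId: `${model}:r${repetition}:s${shardIndex}`,
          model,
          repetition,
          shardIndex,
          chunkIndexes,
        });
      }
    }
  }
  return lanes;
}

// 10 models × 3 repetitions × 3 lanes, with 9 chunks per model/repetition.
const models = Array.from({ length: 10 }, (_, i) => `model-${i}`);
const lanes = planShards(models, 3, 3, 9);
console.log(lanes.length); // 90
```

Failing fast when the plan exceeds the cap keeps an oversized run (for example, 12 models at the same settings would need 108 lanes) from being enqueued and then stalling on container capacity.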