fix(auto-routing): stabilize decider routing benchmarks #4046

Merged
iscekic merged 8 commits into main from fix/auto-routing-container-retry on Jun 17, 2026

@iscekic iscekic commented Jun 16, 2026

Contributor

Summary

  • Bound decider benchmark fanout by sharding each model/repetition into as many stable chunk lanes as fit under the configured live-container budget.
  • Raise the benchmark runner container cap and queue consumer concurrency to 100, while rejecting impossible decider configs where model repetitions alone exceed the container budget.
  • Reuse one benchmark container instance per run/model/repetition/shard, explicitly destroy terminal shard containers best-effort, and make retries resume from persisted case rows.
  • Harden chunk chaining so partially persisted next chunks are re-enqueued instead of stranding failed/DLQ leftovers.
  • Replace difficulty-tier routing with classifier taxonomy routes (taskType/subtaskType) across the decision contract, benchmark summaries, D1 schema, routing table builder, admin view, and worker decision engine.
  • Rank above-threshold decider candidates by lowest avgCostUsd / accuracy, with higher accuracy as the tie-breaker.
  • Expand decider benchmark coverage from 76 to 180 golden cases so each taxonomy pair has exactly 10 cases.
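
The ranking rule in the summary (lowest avgCostUsd / accuracy among above-threshold candidates, with higher accuracy breaking ties) could be sketched roughly as follows. This is a minimal illustration, not the PR's code: the `Candidate` shape, the `rankCandidates` name, the `minAccuracy` parameter, and the inclusive threshold comparison are all assumptions.

```typescript
// Hypothetical candidate shape; only avgCostUsd and accuracy are named in the PR summary.
interface Candidate {
  model: string;
  avgCostUsd: number;
  accuracy: number; // fraction in [0, 1]
}

// Keep candidates at or above the threshold (inclusive comparison is an assumption),
// then sort by cost per unit of accuracy, ascending. Ties go to higher accuracy.
function rankCandidates(candidates: Candidate[], minAccuracy: number): Candidate[] {
  return candidates
    .filter((c) => c.accuracy >= minAccuracy && c.accuracy > 0) // accuracy > 0 avoids division by zero
    .sort((a, b) => {
      const ratioA = a.avgCostUsd / a.accuracy;
      const ratioB = b.avgCostUsd / b.accuracy;
      if (ratioA !== ratioB) return ratioA - ratioB;
      return b.accuracy - a.accuracy;
    });
}
```

Dividing cost by accuracy penalizes a cheap model that is often wrong, while the threshold filter keeps a very cheap but inaccurate model from winning on ratio alone.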

Verification

  • Local decider E2E against the tmux backend before the taxonomy-route contract change: decider-2026-06-16T14-05-32-264Z completed 76/76 cases and published the routing table.
  • Local decider E2E restarted after the shard/review fixes: decider-2026-06-17T10-23-23-333Z; initial D1 progress reached 3/180 case rows with 0 errors and 0 timeouts. Local Wrangler queue delivery is processing batches as 1/1, so this dev run is expected to be much slower than production fanout.

Visual Changes

N/A

Reviewer Notes

This intentionally breaks the unreleased auto-routing benchmark contract: D1 columns and published routing tables now use route_key/routes instead of tier/tiers. The existing squashed D1 baseline migration was regenerated because the benchmark database will be truncated before rollout.

For a 10-model, 3-repetition prod decider run, shard planning creates 3 shard lanes per model/repetition: 10 × 3 × 3 = 90 live container identities under the 100-container cap. Each shard then advances through its assigned chunks sequentially using the same container identity.
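
The arithmetic above can be reproduced with a small sketch. The PR's tests mention a computeDeciderShardCount function, but its signature is not shown on this page, so the helper below uses a hypothetical name and assumes the shard count is simply the largest whole number of lanes per model/repetition that fits under the budget. The real function may also cap the result by the number of chunks.

```typescript
// Hypothetical helper: shard lanes per model/repetition that fit under the container budget.
function planShardLanes(modelCount: number, repetitions: number, containerBudget: number): number {
  const baseLanes = modelCount * repetitions; // at least one lane per model/repetition
  if (baseLanes > containerBudget) {
    // Mirrors the "impossible config" rejection described in the summary.
    throw new Error(`model repetitions (${baseLanes}) exceed container budget (${containerBudget})`);
  }
  return Math.floor(containerBudget / baseLanes);
}

const shards = planShardLanes(10, 3, 100); // floor(100 / 30) = 3
const liveContainers = 10 * 3 * shards; // 90, under the 100-container cap
```

A fourth lane per model/repetition would need 120 containers, which is why the plan stops at 3.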

The main risk areas are benchmark finalization/publish semantics, route coverage for every classifier taxonomy pair, and downstream consumers expecting the old tier decision field.
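
One way to guard the route-coverage risk is a check that every taxonomy pair has the expected number of golden cases (10 per pair for the 180-case set). The sketch below is an assumption-laden illustration: the `GoldenCase` shape, the `findCoverageGaps` name, and the `taskType/subtaskType` key format are hypothetical and may not match the real route_key encoding.

```typescript
// Hypothetical case shape; the PR only names the taskType and subtaskType fields.
interface GoldenCase {
  taskType: string;
  subtaskType: string;
}

// Returns every pair whose case count differs from casesPerPair, with its actual count.
// Expected pairs with no cases are reported with a count of 0.
function findCoverageGaps(
  cases: GoldenCase[],
  expectedPairs: string[],
  casesPerPair: number,
): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const pair of expectedPairs) counts[pair] = 0;
  for (const c of cases) {
    const key = `${c.taskType}/${c.subtaskType}`; // assumed key format
    counts[key] = (counts[key] ?? 0) + 1;
  }
  const gaps: Record<string, number> = {};
  for (const key of Object.keys(counts)) {
    const n = counts[key] ?? 0;
    if (n !== casesPerPair) gaps[key] = n;
  }
  return gaps;
}
```

An empty result means every expected pair has exactly the required number of cases.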

@iscekic iscekic self-assigned this Jun 16, 2026

kilo-code-bot Bot commented Jun 16, 2026

Contributor

Code Review Summary

Status: No Issues Found | Recommendation: Merge

Executive Summary

Previous suggestions on run.ts (destroy failure blocking finalization, dedup check stranding partial chunks) have been resolved with this commit. New tests cover both edge cases.
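
The two resolved edge cases can be illustrated with a short sketch. It is not the code from run.ts: the `ContainerHandle` interface, both function names, and the row-count comparison are assumptions made for illustration.

```typescript
// Hypothetical container handle; the real container API is not shown on this page.
interface ContainerHandle {
  destroy(): Promise<void>;
}

// Best-effort destroy: a failure is logged and swallowed so run finalization can proceed.
// This is safe to ignore because, per the review, the container auto-sleeps after 2m.
async function destroyBestEffort(
  container: ContainerHandle,
  log: (msg: string) => void,
): Promise<boolean> {
  try {
    await container.destroy();
    return true;
  } catch (err) {
    log(`container destroy failed, continuing: ${String(err)}`);
    return false;
  }
}

// Dedup check for the next chunk: skip the enqueue only when every case row of that
// chunk is already persisted. A partially persisted chunk is enqueued again, and the
// retry is assumed to resume from the rows that already exist.
function shouldEnqueueNextChunk(persistedRows: number, expectedRows: number): boolean {
  return persistedRows < expectedRows;
}
```

Treating "some rows exist" as "chunk already handled" is what could strand a run when the partially processed message ended up in the DLQ. Comparing against the expected row count closes that gap.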

Files Reviewed (2 incremental files)
  • services/auto-routing-benchmark/src/run.ts — Fixed destroy catch + dedup threshold
  • services/auto-routing-benchmark/src/run-process-job.test.ts — Added tests for partial re-enqueue and destroy failure resilience
Previous Review Summaries (7 snapshots, latest commit 7f0f81a)

Current summary above is authoritative. Previous snapshots are kept for context only.

Previous review (commit 7f0f81a)

Status: 2 Issues Found | Recommendation: Address before merge

Executive Summary

Incremental commit introduces shard-based decider fan-out with bounded live-container lanes (up to 100). Two prior suggestions on run.ts remain unaddressed.

Overview

| Severity | Count |
| --- | --- |
| CRITICAL | 0 |
| WARNING | 0 |
| SUGGESTION | 2 |

Issue Details

SUGGESTION

| File | Line | Issue |
| --- | --- | --- |
| services/auto-routing-benchmark/src/run.ts | 706 | Destroy failure on terminal chunk prevents run finalization; the container auto-sleeps after 2m, so destroy should be best-effort |
| services/auto-routing-benchmark/src/run.ts | 731 | Dedup check in enqueueNextDeciderChunkIfNeeded may leave a run stuck if a partially-started next chunk message went to DLQ and its case results exist but were incomplete |

Files Reviewed (11 incremental files)
  • apps/web/src/app/admin/auto-routing/BenchmarksSection.test.ts — default value test
  • apps/web/src/app/admin/auto-routing/BenchmarksSection.tsx — UI maxConcurrency 16→100
  • packages/auto-routing-contracts/src/benchmark.ts — schema maxConcurrency 16→100
  • packages/auto-routing-contracts/src/contracts.test.ts — schema acceptance test
  • services/auto-routing-benchmark/README.md — docs for shard lanes, 180 cases
  • services/auto-routing-benchmark/src/admin.test.ts — shard fan-out test coverage
  • services/auto-routing-benchmark/src/admin.ts — BenchmarkRunConfigError → 400
  • services/auto-routing-benchmark/src/run-process-job.test.ts — shard container naming tests
  • services/auto-routing-benchmark/src/run.test.ts — computeDeciderShardCount tests
  • services/auto-routing-benchmark/src/run.ts — core shard logic, container budget validation
  • services/auto-routing-benchmark/wrangler.jsonc — max_instances & max_concurrency → 100

Previous review (commit cbd3894)

Status: 2 Issues Found | Recommendation: Address before merge

Executive Summary

Incremental commit adds 68 golden decider cases for 10-case-per-taxonomy-pair coverage (112→180). No new issues. Two prior suggestions on run.ts remain unaddressed.

Overview

| Severity | Count |
| --- | --- |
| CRITICAL | 0 |
| WARNING | 0 |
| SUGGESTION | 2 |

Issue Details

SUGGESTION

| File | Line | Issue |
| --- | --- | --- |
| services/auto-routing-benchmark/src/run.ts | 613 | Destroy failure on terminal chunk prevents run finalization; the container auto-sleeps after 2m, so destroy should be best-effort |
| services/auto-routing-benchmark/src/run.ts | 635 | Dedup check in enqueueNextDeciderChunkIfNeeded may leave a run stuck if a partially-started next chunk message went to DLQ and its case results exist but were incomplete |

Files Reviewed (2 incremental files)
  • services/auto-routing-benchmark/src/datasets/decider-cases.ts — 68 new cases
  • services/auto-routing-benchmark/src/datasets/decider-cases.test.ts — test expectation updates

Previous review (commit 13dd726)

Status: 2 Issues Found | Recommendation: Address before merge

Executive Summary

Major architecture change replacing difficulty-tier heuristic routing with direct taxonomy-route key routing (18 taskType/subtaskType pairs). Prior suggestions on run.ts remain unaddressed in this increment.

Overview

| Severity | Count |
| --- | --- |
| CRITICAL | 0 |
| WARNING | 0 |
| SUGGESTION | 2 |

Issue Details

SUGGESTION

| File | Line | Issue |
| --- | --- | --- |
| services/auto-routing-benchmark/src/run.ts | 613 | Destroy failure on terminal chunk prevents run finalization; the container auto-sleeps after 2m, so destroy should be best-effort |
| services/auto-routing-benchmark/src/run.ts | 635 | Dedup check in enqueueNextDeciderChunkIfNeeded may leave a run stuck if a partially-started next chunk message went to DLQ and its case results exist but were incomplete |

Files Reviewed (35 files)
  • apps/web/src/app/admin/auto-routing/BenchmarksSection.tsx
  • apps/web/src/app/api/openrouter/[...path]/route.test.ts
  • apps/web/src/lib/ai-gateway/auto-model/resolution.test.ts
  • apps/web/src/lib/ai-gateway/auto-routing-decision.test.ts
  • packages/auto-routing-contracts/src/benchmark.ts
  • packages/auto-routing-contracts/src/index.ts
  • packages/auto-routing-contracts/src/reasoning.ts (new)
  • packages/auto-routing-contracts/src/routing-table.test.ts
  • packages/auto-routing-contracts/src/routing-table.ts
  • packages/auto-routing-contracts/src/taxonomy.ts (new)
  • packages/auto-routing-contracts/src/tiers.test.ts (deleted)
  • packages/auto-routing-contracts/src/tiers.ts (deleted)
  • services/auto-routing-benchmark/README.md
  • services/auto-routing-benchmark/migrations/0000_absent_wallow.sql
  • services/auto-routing-benchmark/migrations/meta/0000_snapshot.json
  • services/auto-routing-benchmark/migrations/meta/_journal.json
  • services/auto-routing-benchmark/src/admin.test.ts
  • services/auto-routing-benchmark/src/datasets/decider-cases.test.ts
  • services/auto-routing-benchmark/src/datasets/decider-cases.ts
  • services/auto-routing-benchmark/src/db-replace-summaries.test.ts
  • services/auto-routing-benchmark/src/db-save-routing-table.test.ts
  • services/auto-routing-benchmark/src/db-schema.ts
  • services/auto-routing-benchmark/src/db.test.ts
  • services/auto-routing-benchmark/src/db.ts
  • services/auto-routing-benchmark/src/grading.ts
  • services/auto-routing-benchmark/src/routing-table-builder.test.ts
  • services/auto-routing-benchmark/src/routing-table-builder.ts
  • services/auto-routing-benchmark/src/run.test.ts
  • services/auto-routing-benchmark/src/run.ts — 2 issues
  • services/auto-routing-benchmark/src/winner.ts
  • services/auto-routing/src/decide.ts
  • services/auto-routing/src/decision-cache.ts
  • services/auto-routing/src/decision-engine.test.ts
  • services/auto-routing/src/decision-engine.ts
  • services/auto-routing/src/index.test.ts
  • services/auto-routing/src/routing-table.test.ts

Previous review (commit 9d48db6)

Status: 2 Issues Found | Recommendation: Address before merge

Executive Summary

Incremental commit adds D1 batch-size safety for routing table candidate inserts (no new issues). Two prior suggestions on run.ts — destroy failure on terminal chunk and dedup-driven run-stuck risk — remain unaddressed.

Overview

| Severity | Count |
| --- | --- |
| CRITICAL | 0 |
| WARNING | 0 |
| SUGGESTION | 2 |

Issue Details

SUGGESTION

| File | Line | Issue |
| --- | --- | --- |
| services/auto-routing-benchmark/src/run.ts | 607 | Destroy failure on terminal chunk prevents run finalization; the container auto-sleeps after 2m, so destroy should be best-effort |
| services/auto-routing-benchmark/src/run.ts | 629 | Dedup check in enqueueNextDeciderChunkIfNeeded may leave a run stuck if a partially-started next chunk message went to DLQ and its case results exist but were incomplete |

Files Reviewed (11 files)
  • services/auto-routing-benchmark/src/admin.test.ts
  • services/auto-routing-benchmark/src/bench-runner-container.ts
  • services/auto-routing-benchmark/src/cli-runner.test.ts
  • services/auto-routing-benchmark/src/cli-runner.ts
  • services/auto-routing-benchmark/src/db-save-routing-table.test.ts
  • services/auto-routing-benchmark/src/db.ts
  • services/auto-routing-benchmark/src/index.ts
  • services/auto-routing-benchmark/src/run-process-job.test.ts
  • services/auto-routing-benchmark/src/run.test.ts
  • services/auto-routing-benchmark/src/run.ts — 2 issues
  • services/auto-routing-benchmark/wrangler.jsonc

Previous review (commit 5612632)

Status: 2 Issues Found | Recommendation: Address before merge

Executive Summary

The new chunk-chaining logic has a container destroy edge case where a destroy failure on the terminal chunk blocks run finalization and causes unnecessary queue retries.

Overview

| Severity | Count |
| --- | --- |
| CRITICAL | 0 |
| WARNING | 0 |
| SUGGESTION | 2 |

Issue Details

SUGGESTION

| File | Line | Issue |
| --- | --- | --- |
| services/auto-routing-benchmark/src/run.ts | 607 | Destroy failure on terminal chunk prevents run finalization; the container auto-sleeps after 2m, so destroy should be best-effort |
| services/auto-routing-benchmark/src/run.ts | 629 | Dedup check in enqueueNextDeciderChunkIfNeeded may leave a run stuck if a partially-started next chunk message went to DLQ and its case results exist but were incomplete |

Files Reviewed (10 files)
  • services/auto-routing-benchmark/src/admin.test.ts
  • services/auto-routing-benchmark/src/bench-runner-container.ts
  • services/auto-routing-benchmark/src/cli-runner.test.ts
  • services/auto-routing-benchmark/src/cli-runner.ts
  • services/auto-routing-benchmark/src/db.ts
  • services/auto-routing-benchmark/src/index.ts
  • services/auto-routing-benchmark/src/run-process-job.test.ts
  • services/auto-routing-benchmark/src/run.test.ts
  • services/auto-routing-benchmark/src/run.ts — 2 issues
  • services/auto-routing-benchmark/wrangler.jsonc

Previous review (commit 4f80a25)

Status: 1 Issue Found | Recommendation: Address before merge

Executive Summary

New chunk-chaining logic has a dedup edge case where partially-started next-chunk messages that hit DLQ could leave a run stuck and never finalizing.

Overview

| Severity | Count |
| --- | --- |
| CRITICAL | 0 |
| WARNING | 0 |
| SUGGESTION | 1 |

Issue Details

SUGGESTION

| File | Line | Issue |
| --- | --- | --- |
| services/auto-routing-benchmark/src/run.ts | 625 | Dedup check may leave run stuck if partially-started next chunk goes to DLQ |

Files Reviewed (8 files)
  • services/auto-routing-benchmark/src/admin.test.ts
  • services/auto-routing-benchmark/src/bench-runner-container.ts
  • services/auto-routing-benchmark/src/cli-runner.ts
  • services/auto-routing-benchmark/src/db.ts
  • services/auto-routing-benchmark/src/run-process-job.test.ts
  • services/auto-routing-benchmark/src/run.test.ts
  • services/auto-routing-benchmark/src/run.ts — 1 issue
  • services/auto-routing-benchmark/wrangler.jsonc

Previous review (commit 6c13bd1)

Status: No Issues Found | Recommendation: Merge

Executive Summary

Well-targeted retry logic for Cloudflare Container capacity/startup failures in the auto-routing benchmark decider queue, with clean test coverage validating that capacity errors propagate for retry instead of recording as failed case rows.

Files Reviewed (4 files)
  • services/auto-routing-benchmark/src/cli-runner.ts
  • services/auto-routing-benchmark/src/run-process-job.test.ts
  • services/auto-routing-benchmark/src/run.ts
  • services/auto-routing-benchmark/wrangler.jsonc

Reviewed by deepseek-v4-pro-20260423 · 189,933 tokens

Review guidance: REVIEW.md from base branch main

iscekic changed the title from "fix(auto-routing): retry container capacity failures" to "fix(auto-routing): bound decider benchmark containers" on Jun 16, 2026
Comment thread on services/auto-routing-benchmark/src/run.ts (outdated)
Comment thread on services/auto-routing-benchmark/src/run.ts (outdated)
iscekic changed the title from "fix(auto-routing): bound decider benchmark containers" to "fix(auto-routing): stabilize decider routing benchmarks" on Jun 17, 2026
iscekic requested a review from RSO on June 17, 2026 10:23
iscekic enabled auto-merge (squash) on June 17, 2026 10:25
iscekic merged commit 9bf7967 into main on Jun 17, 2026
59 checks passed
iscekic deleted the fix/auto-routing-container-retry branch on June 17, 2026 10:29
2 participants