fix(auto-routing): stabilize decider routing benchmarks by iscekic · Pull Request #4046 · Kilo-Org/cloud

iscekic · 2026-06-16T12:05:35Z

Summary

Bound decider benchmark fanout by sharding each model/repetition into as many stable chunk lanes as fit under the configured live-container budget.
Raise the benchmark runner container cap and queue consumer concurrency to 100, while rejecting impossible decider configs where model repetitions alone exceed the container budget.
Reuse one benchmark container instance per run/model/repetition/shard, explicitly destroy terminal shard containers on a best-effort basis, and make retries resume from persisted case rows.
Harden chunk chaining so partially persisted next chunks are re-enqueued instead of stranding failed/DLQ leftovers.
Replace difficulty-tier routing with classifier taxonomy routes (taskType/subtaskType) across the decision contract, benchmark summaries, D1 schema, routing table builder, admin view, and worker decision engine.
Rank above-threshold decider candidates by lowest avgCostUsd / accuracy, with higher accuracy as the tie-breaker.
Expand decider benchmark coverage from 76 to 180 golden cases so each taxonomy pair has exactly 10 cases.
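
The ranking rule in the summary can be sketched roughly as follows. This is an illustration only, not the code in winner.ts: the `Candidate` shape, the `rankCandidates` name, the `minAccuracy` parameter, and the choice of an inclusive threshold are all assumptions; only `avgCostUsd` and accuracy come from the description.

```typescript
// Hypothetical candidate shape; only avgCostUsd and accuracy are named in the PR.
type Candidate = { model: string; avgCostUsd: number; accuracy: number };

// Keep candidates at or above the accuracy threshold (inclusive is an assumption),
// then order by cost per unit of accuracy, lowest first.
// Ties on that ratio go to the candidate with the higher accuracy.
function rankCandidates(candidates: Candidate[], minAccuracy: number): Candidate[] {
  return candidates
    .filter((c) => c.accuracy > 0 && c.accuracy >= minAccuracy)
    .sort((a, b) => {
      const ratioA = a.avgCostUsd / a.accuracy;
      const ratioB = b.avgCostUsd / b.accuracy;
      if (ratioA !== ratioB) return ratioA - ratioB;
      return b.accuracy - a.accuracy;
    });
}
```

Under this sketch, a cheaper but less accurate model can still outrank a more accurate one as long as it clears the threshold and its cost-to-accuracy ratio is lower.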

Verification

Local decider E2E against the tmux backend before the taxonomy-route contract change: decider-2026-06-16T14-05-32-264Z completed 76/76 cases and published the routing table.
Local decider E2E restarted after the shard/review fixes: decider-2026-06-17T10-23-23-333Z; initial D1 progress reached 3/180 case rows with 0 errors and 0 timeouts. Local Wrangler queue delivery is processing batches as 1/1, so this dev run is expected to be much slower than production fanout.

Visual Changes

N/A

Reviewer Notes

This intentionally breaks the unreleased auto-routing benchmark contract: D1 columns and published routing tables now use route_key/routes instead of tier/tiers. The existing squashed D1 baseline migration was regenerated because the benchmark database will be truncated before rollout.

For a 10-model, 3-repetition prod decider run, shard planning creates 3 shard lanes per model/repetition: 10 × 3 × 3 = 90 live container identities under the 100-container cap. Each shard then advances through its assigned chunks sequentially using the same container identity.
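
The arithmetic above can be reproduced with a small sketch. It assumes the shard count is the floor of the container budget divided by the number of model/repetition lanes, capped by the number of chunks, and that chunks are assigned to shards round-robin. The function names, the `RangeError` stand-in for the config error, and the round-robin assignment are hypothetical and may differ from what run.ts actually does.

```typescript
// Sketch of shard planning under a live-container budget (names are hypothetical).
function planShardCount(
  modelCount: number,
  repetitions: number,
  chunkCount: number,
  maxContainers: number,
): number {
  const lanes = modelCount * repetitions;
  // Model repetitions alone must fit under the budget; otherwise the config is impossible.
  if (lanes > maxContainers) {
    throw new RangeError(`${lanes} model/repetition lanes exceed the budget of ${maxContainers}`);
  }
  // As many shards per lane as fit, but never more shards than chunks, and at least one.
  return Math.max(1, Math.min(chunkCount, Math.floor(maxContainers / lanes)));
}

// Assumed round-robin assignment: shard k handles chunks k, k + shardCount, ...
function chunksForShard(shardIndex: number, shardCount: number, chunkCount: number): number[] {
  const assigned: number[] = [];
  for (let chunk = shardIndex; chunk < chunkCount; chunk += shardCount) {
    assigned.push(chunk);
  }
  return assigned;
}
```

With 10 models, 3 repetitions, and a budget of 100, `planShardCount` yields 3 whenever the run has at least 3 chunks, which matches the 90 container identities quoted above.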

The main risk areas are benchmark finalization/publish semantics, route coverage for every classifier taxonomy pair, and downstream consumers expecting the old tier decision field.

kilo-code-bot · 2026-06-16T12:07:22Z

Code Review Summary

Status: No Issues Found | Recommendation: Merge

Executive Summary

Previous suggestions on run.ts (destroy failure blocking finalization, dedup check stranding partial chunks) have been resolved with this commit. New tests cover both edge cases.

Files Reviewed (2 incremental files)

services/auto-routing-benchmark/src/run.ts — Fixed destroy catch + dedup threshold
services/auto-routing-benchmark/src/run-process-job.test.ts — Added tests for partial re-enqueue and destroy failure resilience
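
For readers following the two resolved suggestions, here is a minimal sketch of the behavior they describe. It is not the code from run.ts: the `ChunkProgress` type and both function names are hypothetical, and the threshold rule (skip re-enqueue only when every case row of the next chunk is persisted) is an assumption inferred from the review text.

```typescript
// Hypothetical view of how many case rows of the next chunk are already persisted.
type ChunkProgress = { expectedCases: number; persistedCases: number };

// Dedup threshold: skip enqueueing only when the next chunk is fully persisted.
// A partially persisted chunk (for example, one whose message went to the DLQ)
// is enqueued again so the run can still finalize.
function shouldEnqueueNextChunk(next: ChunkProgress): boolean {
  return next.persistedCases < next.expectedCases;
}

// Best-effort destroy: a failure is logged and reported, never rethrown,
// so it cannot block run finalization or trigger a queue retry.
async function destroyBestEffort(
  destroy: () => Promise<void>,
  log: (message: string) => void,
): Promise<boolean> {
  try {
    await destroy();
    return true;
  } catch (err) {
    log(`container destroy failed, continuing: ${String(err)}`);
    return false;
  }
}
```

Swallowing the destroy error is reasonable here because, per the review, the container auto-sleeps after 2 minutes anyway.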

Previous Review Summaries (7 snapshots, latest commit 7f0f81a)

Current summary above is authoritative. Previous snapshots are kept for context only.

Previous review (commit `7f0f81a`)

Status: 2 Issues Found | Recommendation: Address before merge

Executive Summary

Incremental commit introduces shard-based decider fan-out with bounded live-container lanes (up to 100). Two prior suggestions on run.ts remain unaddressed.

Overview

Severity	Count
CRITICAL	0
WARNING	0
SUGGESTION	2

Issue Details

SUGGESTION

File	Line	Issue
`services/auto-routing-benchmark/src/run.ts`	706	Destroy failure on terminal chunk prevents run finalization — container auto-sleeps after 2m, so destroy should be best-effort
`services/auto-routing-benchmark/src/run.ts`	731	Dedup check in `enqueueNextDeciderChunkIfNeeded` may leave a run stuck if a partially-started next chunk message went to DLQ and its case results exist but were incomplete

Files Reviewed (11 incremental files)

apps/web/src/app/admin/auto-routing/BenchmarksSection.test.ts — default value test
apps/web/src/app/admin/auto-routing/BenchmarksSection.tsx — UI maxConcurrency 16→100
packages/auto-routing-contracts/src/benchmark.ts — schema maxConcurrency 16→100
packages/auto-routing-contracts/src/contracts.test.ts — schema acceptance test
services/auto-routing-benchmark/README.md — docs for shard lanes, 180 cases
services/auto-routing-benchmark/src/admin.test.ts — shard fan-out test coverage
services/auto-routing-benchmark/src/admin.ts — BenchmarkRunConfigError → 400
services/auto-routing-benchmark/src/run-process-job.test.ts — shard container naming tests
services/auto-routing-benchmark/src/run.test.ts — computeDeciderShardCount tests
services/auto-routing-benchmark/src/run.ts — core shard logic, container budget validation
services/auto-routing-benchmark/wrangler.jsonc — max_instances & max_concurrency → 100

Previous review (commit `cbd3894`)

Status: 2 Issues Found | Recommendation: Address before merge

Executive Summary

Incremental commit adds 68 golden decider cases for 10-case-per-taxonomy-pair coverage (112→180). No new issues. Two prior suggestions on run.ts remain unaddressed.

Overview

Severity	Count
CRITICAL	0
WARNING	0
SUGGESTION	2

Issue Details

SUGGESTION

File	Line	Issue
`services/auto-routing-benchmark/src/run.ts`	613	Destroy failure on terminal chunk prevents run finalization — container auto-sleeps after 2m, so destroy should be best-effort
`services/auto-routing-benchmark/src/run.ts`	635	Dedup check in `enqueueNextDeciderChunkIfNeeded` may leave a run stuck if a partially-started next chunk message went to DLQ and its case results exist but were incomplete

Files Reviewed (2 incremental files)

services/auto-routing-benchmark/src/datasets/decider-cases.ts — 68 new cases
services/auto-routing-benchmark/src/datasets/decider-cases.test.ts — test expectation updates

Previous review (commit `13dd726`)

Status: 2 Issues Found | Recommendation: Address before merge

Executive Summary

Major architecture change replacing difficulty-tier heuristic routing with direct taxonomy-route key routing (18 taskType/subtaskType pairs). Prior suggestions on run.ts remain unaddressed in this increment.

Overview

Severity	Count
CRITICAL	0
WARNING	0
SUGGESTION	2

Issue Details

SUGGESTION

File	Line	Issue
`services/auto-routing-benchmark/src/run.ts`	613	Destroy failure on terminal chunk prevents run finalization — container auto-sleeps after 2m, so destroy should be best-effort
`services/auto-routing-benchmark/src/run.ts`	635	Dedup check in `enqueueNextDeciderChunkIfNeeded` may leave a run stuck if a partially-started next chunk message went to DLQ and its case results exist but were incomplete

Files Reviewed (35 files)

apps/web/src/app/admin/auto-routing/BenchmarksSection.tsx
apps/web/src/app/api/openrouter/[...path]/route.test.ts
apps/web/src/lib/ai-gateway/auto-model/resolution.test.ts
apps/web/src/lib/ai-gateway/auto-routing-decision.test.ts
packages/auto-routing-contracts/src/benchmark.ts
packages/auto-routing-contracts/src/index.ts
packages/auto-routing-contracts/src/reasoning.ts (new)
packages/auto-routing-contracts/src/routing-table.test.ts
packages/auto-routing-contracts/src/routing-table.ts
packages/auto-routing-contracts/src/taxonomy.ts (new)
packages/auto-routing-contracts/src/tiers.test.ts (deleted)
packages/auto-routing-contracts/src/tiers.ts (deleted)
services/auto-routing-benchmark/README.md
services/auto-routing-benchmark/migrations/0000_absent_wallow.sql
services/auto-routing-benchmark/migrations/meta/0000_snapshot.json
services/auto-routing-benchmark/migrations/meta/_journal.json
services/auto-routing-benchmark/src/admin.test.ts
services/auto-routing-benchmark/src/datasets/decider-cases.test.ts
services/auto-routing-benchmark/src/datasets/decider-cases.ts
services/auto-routing-benchmark/src/db-replace-summaries.test.ts
services/auto-routing-benchmark/src/db-save-routing-table.test.ts
services/auto-routing-benchmark/src/db-schema.ts
services/auto-routing-benchmark/src/db.test.ts
services/auto-routing-benchmark/src/db.ts
services/auto-routing-benchmark/src/grading.ts
services/auto-routing-benchmark/src/routing-table-builder.test.ts
services/auto-routing-benchmark/src/routing-table-builder.ts
services/auto-routing-benchmark/src/run.test.ts
services/auto-routing-benchmark/src/run.ts — 2 issues
services/auto-routing-benchmark/src/winner.ts
services/auto-routing/src/decide.ts
services/auto-routing/src/decision-cache.ts
services/auto-routing/src/decision-engine.test.ts
services/auto-routing/src/decision-engine.ts
services/auto-routing/src/index.test.ts
services/auto-routing/src/routing-table.test.ts

Previous review (commit `9d48db6`)

Status: 2 Issues Found | Recommendation: Address before merge

Executive Summary

Incremental commit adds D1 batch-size safety for routing table candidate inserts (no new issues). Two prior suggestions on run.ts — destroy failure on terminal chunk and dedup-driven run-stuck risk — remain unaddressed.

Overview

Severity	Count
CRITICAL	0
WARNING	0
SUGGESTION	2

Issue Details

SUGGESTION

File	Line	Issue
`services/auto-routing-benchmark/src/run.ts`	607	Destroy failure on terminal chunk prevents run finalization — container auto-sleeps after 2m, so destroy should be best-effort
`services/auto-routing-benchmark/src/run.ts`	629	Dedup check in `enqueueNextDeciderChunkIfNeeded` may leave a run stuck if a partially-started next chunk message went to DLQ and its case results exist but were incomplete

Files Reviewed (11 files)

services/auto-routing-benchmark/src/admin.test.ts
services/auto-routing-benchmark/src/bench-runner-container.ts
services/auto-routing-benchmark/src/cli-runner.test.ts
services/auto-routing-benchmark/src/cli-runner.ts
services/auto-routing-benchmark/src/db-save-routing-table.test.ts
services/auto-routing-benchmark/src/db.ts
services/auto-routing-benchmark/src/index.ts
services/auto-routing-benchmark/src/run-process-job.test.ts
services/auto-routing-benchmark/src/run.test.ts
services/auto-routing-benchmark/src/run.ts — 2 issues
services/auto-routing-benchmark/wrangler.jsonc

Previous review (commit `5612632`)

Status: 2 Issues Found | Recommendation: Address before merge

Executive Summary

The new chunk-chaining logic has a container destroy edge case where a destroy failure on the terminal chunk blocks run finalization and causes unnecessary queue retries.

Overview

Severity	Count
CRITICAL	0
WARNING	0
SUGGESTION	2

Issue Details

SUGGESTION

File	Line	Issue
`services/auto-routing-benchmark/src/run.ts`	607	Destroy failure on terminal chunk prevents run finalization — container auto-sleeps after 2m, so destroy should be best-effort
`services/auto-routing-benchmark/src/run.ts`	629	Dedup check in `enqueueNextDeciderChunkIfNeeded` may leave a run stuck if a partially-started next chunk message went to DLQ and its case results exist but were incomplete

Files Reviewed (10 files)

services/auto-routing-benchmark/src/admin.test.ts
services/auto-routing-benchmark/src/bench-runner-container.ts
services/auto-routing-benchmark/src/cli-runner.test.ts
services/auto-routing-benchmark/src/cli-runner.ts
services/auto-routing-benchmark/src/db.ts
services/auto-routing-benchmark/src/index.ts
services/auto-routing-benchmark/src/run-process-job.test.ts
services/auto-routing-benchmark/src/run.test.ts
services/auto-routing-benchmark/src/run.ts — 2 issues
services/auto-routing-benchmark/wrangler.jsonc

Previous review (commit `4f80a25`)

Status: 1 Issue Found | Recommendation: Address before merge

Executive Summary

New chunk-chaining logic has a dedup edge case where partially-started next-chunk messages that hit DLQ could leave a run stuck and never finalizing.

Overview

Severity	Count
CRITICAL	0
WARNING	0
SUGGESTION	1

Issue Details

SUGGESTION

File	Line	Issue
`services/auto-routing-benchmark/src/run.ts`	625	Dedup check may leave run stuck if partially-started next chunk goes to DLQ

Files Reviewed (8 files)

services/auto-routing-benchmark/src/admin.test.ts
services/auto-routing-benchmark/src/bench-runner-container.ts
services/auto-routing-benchmark/src/cli-runner.ts
services/auto-routing-benchmark/src/db.ts
services/auto-routing-benchmark/src/run-process-job.test.ts
services/auto-routing-benchmark/src/run.test.ts
services/auto-routing-benchmark/src/run.ts — 1 issue
services/auto-routing-benchmark/wrangler.jsonc

Previous review (commit `6c13bd1`)

Status: No Issues Found | Recommendation: Merge

Executive Summary

Well-targeted retry logic for Cloudflare Container capacity/startup failures in the auto-routing benchmark decider queue, with clean test coverage validating that capacity errors propagate for retry instead of recording as failed case rows.

Files Reviewed (4 files)

services/auto-routing-benchmark/src/cli-runner.ts
services/auto-routing-benchmark/src/run-process-job.test.ts
services/auto-routing-benchmark/src/run.ts
services/auto-routing-benchmark/wrangler.jsonc

Reviewed by deepseek-v4-pro-20260423 · 189,933 tokens

Review guidance: REVIEW.md from base branch main

fix(auto-routing): retry container capacity failures

6c13bd1

iscekic self-assigned this Jun 16, 2026

fix(auto-routing): chain decider benchmark chunks

4f80a25

iscekic changed the title from "fix(auto-routing): retry container capacity failures" to "fix(auto-routing): bound decider benchmark containers" Jun 16, 2026

kilo-code-bot Bot reviewed Jun 16, 2026

Comment thread services/auto-routing-benchmark/src/run.ts Outdated

fix(auto-routing): destroy completed benchmark containers

5612632

kilo-code-bot Bot reviewed Jun 16, 2026

Comment thread services/auto-routing-benchmark/src/run.ts Outdated

iscekic added 2 commits June 16, 2026 20:41

fix(auto-routing): chunk routing table candidate inserts

9d48db6

fix(auto-routing): route by taxonomy pair

13dd726

iscekic changed the title from "fix(auto-routing): bound decider benchmark containers" to "fix(auto-routing): stabilize decider routing benchmarks" Jun 17, 2026

iscekic added 3 commits June 17, 2026 11:45

test(auto-routing): expand decider taxonomy coverage

cbd3894

fix(auto-routing): shard benchmark containers

7f0f81a

fix(auto-routing): harden benchmark chunk retries

88195f5

iscekic requested a review from RSO June 17, 2026 10:23

iscekic enabled auto-merge (squash) June 17, 2026 10:25

RSO approved these changes Jun 17, 2026

iscekic merged commit 9bf7967 into main Jun 17, 2026
59 checks passed

iscekic deleted the fix/auto-routing-container-retry branch June 17, 2026 10:29
