[codex] Stage decode cost for pending prefills by jthomson04 · Pull Request #11024 · ai-dynamo/dynamo

jthomson04 · 2026-06-28T00:06:48Z

Summary

Prototype a staged decode-cost model for aggregated KV routing so uncached prompt blocks are not represented in both the prefill and decode terms while prefill is pending.

prefill_cost = active prefill tokens + incoming uncached tokens

decode_cost = unique blocks from decode-active requests
            + cached-prefix blocks from prefill-active requests
            + incoming cached-prefix blocks

The existing configured prefill weighting, overlap credit, and selector formula remain in place; this changes only the decode-block projection supplied to that formula.
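The two terms above can be sketched as a small projection over per-request block hashes. This is a minimal illustration, not the PR's code: the `Request` struct, its field names, and the reading of "active prefill tokens" as the uncached tokens of prefill-pending requests are all assumptions, and generated output blocks are left out.

```rust
use std::collections::HashSet;

/// Hypothetical view of one request on a worker (all names are illustrative).
struct Request {
    /// Hashes of complete prompt blocks already present in the worker's KV cache.
    cached_prefix: Vec<u64>,
    /// Hashes of prompt blocks that need (or needed) prefill.
    uncached_suffix: Vec<u64>,
    /// Prompt tokens still waiting for prefill (0 once prefill has completed).
    uncached_tokens: usize,
    /// True while prefill is pending for this request.
    prefill_pending: bool,
}

/// Prefill term: tokens of prefill-pending requests plus the incoming uncached tokens.
fn prefill_cost(active: &[Request], incoming: &Request) -> usize {
    let pending: usize = active
        .iter()
        .filter(|r| r.prefill_pending)
        .map(|r| r.uncached_tokens)
        .sum();
    pending + incoming.uncached_tokens
}

/// Staged decode term: unique blocks from decode-active requests, plus only the
/// cached prefix of prefill-pending requests and of the incoming request.
fn staged_decode_cost(active: &[Request], incoming: &Request) -> usize {
    let mut blocks: HashSet<u64> = HashSet::new();
    for r in active {
        blocks.extend(r.cached_prefix.iter().copied());
        if !r.prefill_pending {
            // Prefill is done, so the former uncached suffix now counts as decode load.
            blocks.extend(r.uncached_suffix.iter().copied());
        }
    }
    blocks.extend(incoming.cached_prefix.iter().copied());
    blocks.len()
}
```

Because the blocks are collected into a set, a prefix shared by several requests is counted once, which matches the "unique blocks" wording above.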

Implementation

Add a selector-only block tracker and prompt-membership trie with unique-block semantics.
At admission, book the full prompt in the existing active-block state but only the cached complete-block prefix in staged selector state.
Promote the uncached suffix when prefill completes, and release both trackers on free or expiry.
Include generated output blocks and fractional decay in staged state.
Gate staging on both prefill-token tracking and at least one complete prompt hash. If either signal is missing, fall back to the exact legacy projection.
Preserve existing full active blocks, capacity/admission state, queue behavior, published ActiveLoad, and replica event wire format.
Export dynamo_frontend_worker_selector_decode_blocks separately from the existing full active-block gauge.
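The admission, promotion, and release steps listed above can be sketched with a reference-counted block map. This is a simplified illustration under stated assumptions: `StagedTracker` and its method names are hypothetical, the prompt-membership trie is replaced by a plain `HashMap`, and output blocks, fractional decay, expiry, and the configuration gate are omitted.

```rust
use std::collections::HashMap;

/// Hypothetical selector-only tracker with unique-block semantics:
/// a block hash counts once no matter how many requests reference it.
#[derive(Default)]
struct StagedTracker {
    /// Reference count per block hash; a block is live while its count is > 0.
    refs: HashMap<u64, usize>,
    /// Per request id: (blocks currently booked, uncached suffix awaiting promotion).
    requests: HashMap<u64, (Vec<u64>, Vec<u64>)>,
}

impl StagedTracker {
    fn add(&mut self, blocks: &[u64]) {
        for &b in blocks {
            *self.refs.entry(b).or_insert(0) += 1;
        }
    }

    fn remove(&mut self, blocks: &[u64]) {
        for b in blocks {
            let gone = match self.refs.get_mut(b) {
                Some(count) => {
                    *count -= 1;
                    *count == 0
                }
                None => false,
            };
            if gone {
                self.refs.remove(b);
            }
        }
    }

    /// Admission: book only the cached complete-block prefix of the prompt.
    fn admit(&mut self, id: u64, prompt: &[u64], cached_len: usize) {
        let (cached, uncached) = prompt.split_at(cached_len.min(prompt.len()));
        self.add(cached);
        self.requests.insert(id, (cached.to_vec(), uncached.to_vec()));
    }

    /// Prefill completed: the uncached suffix becomes decode load.
    fn promote(&mut self, id: u64) {
        let suffix = match self.requests.get_mut(&id) {
            Some((booked, pending)) => {
                let suffix = std::mem::take(pending);
                booked.extend_from_slice(&suffix);
                suffix
            }
            None => return,
        };
        self.add(&suffix);
    }

    /// Free (or expiry): release whatever this request had booked.
    fn free(&mut self, id: u64) {
        if let Some((booked, _)) = self.requests.remove(&id) {
            self.remove(&booked);
        }
    }

    /// Unique live blocks, i.e. the staged decode-block projection.
    fn decode_blocks(&self) -> usize {
        self.refs.len()
    }
}
```

Admitting two requests that share a two-block cached prefix leaves `decode_blocks()` at 2 until one of them is promoted, at which point its suffix blocks are added. Freeing a request before promotion releases only its cached prefix, since the suffix was never booked in this tracker.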

Disaggregated routing is intentionally unchanged. The prefill path has no active-block hashes, and the decode override still disables prefill tracking and overlap credit, so both bypass staged state and retain full active-block decode cost. The bootstrap and synchronous paths continue through the same common decode override.

At equal prefill/decode weights, a fully cached request and a cold request now tie when otherwise unloaded: the prompt blocks move from the prefill term to the decode term. The online replay cache-affinity fixture therefore sets prefill_load_scale=2 explicitly. The separate deterministic offline replay simulator retains its existing legacy projection in this prototype.
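The tie can be checked with a small worked example. The selector shape used here, `prefill_load_scale * prefill_blocks + decode_blocks`, is an assumption for illustration only: the actual formula also includes overlap credit and configured weighting, and the prefill term is treated as already normalized to blocks. The function names are hypothetical.

```rust
/// Assumed selector shape (illustration only): weighted prefill term plus decode term.
fn score(prefill_load_scale: f64, prefill_blocks: f64, decode_blocks: f64) -> f64 {
    prefill_load_scale * prefill_blocks + decode_blocks
}

/// Scores for one incoming request on an otherwise idle worker.
/// Returns (legacy, staged).
fn project(prefill_load_scale: f64, prompt_blocks: f64, cached_blocks: f64) -> (f64, f64) {
    let uncached = prompt_blocks - cached_blocks;
    // Legacy: the full prompt is booked as decode load at admission,
    // so uncached blocks appear in both terms.
    let legacy = score(prefill_load_scale, uncached, prompt_blocks);
    // Staged: only the cached prefix is booked as decode load while prefill is pending.
    let staged = score(prefill_load_scale, uncached, cached_blocks);
    (legacy, staged)
}
```

For a 16-block prompt at scale 1, the staged score is 16 both for a fully cached request (`0 + 16`) and for a cold one (`16 + 0`), whereas the legacy score is 16 versus 32. At scale 2 the staged scores separate again, 16 for the cached request versus 32 for the cold one, which is consistent with the fixture setting `prefill_load_scale=2`.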

Validation

cargo test -p dynamo-kv-router: 627 passed
cargo test -p dynamo-mocker --lib: 468 passed, 1 pre-existing ignored
cargo test -p dynamo-llm kv_router::prefill_router::tests: 12 passed
cargo test -p dynamo-llm test_worker_load_metrics_pef: passed
cargo check -p dynamo-llm -p dynamo-mocker: passed
cargo clippy -p dynamo-kv-router -p dynamo-llm -p dynamo-mocker --lib -- -D warnings: passed
cargo fmt --all -- --check and git diff --check: passed

Lifecycle coverage includes zero/partial/full overlap, partial final prompt blocks, promotion, shared-prefix deduplication, output-block fractional decay, free, expiry, topology cleanup, replica synchronization, and both configuration fallbacks.

Prototype benchmark

An earlier Dynamo 1.1.1 implementation of the same policy was tested with AIPerf 0.8.0 on the canonical unique-prefix ISL256/OSL32 workload at N=8, concurrency 1024, 50,000 requests:

| route / policy | output tok/s | TTFT p50/p99 (ms) | ITL p50/p99 (ms) | mean GPU |
|---|---|---|---|---|
| clean custom KV | 19,940.9 | 213.4 / 3334.5 | 42.9 / 102.6 | 49.8% |
| explicit least-loaded | 29,932.8 | 292.9 / 3378.9 | 21.4 / 88.9 | 78.0% |
| prior zero-hit active-block fix | 29,893.5 | 292.3 / 3361.0 | 21.7 / 89.3 | 77.4% |
| staged decode cost | 28,334.9 | 301.4 / 5410.8 | 22.6 / 43.1 | 74.0% |

The staged run completed exactly 50,000 requests with zero failures or 503s and a server-reported mean ISL/OSL of 264.00016/32. Its output throughput was +42.09% relative to clean custom KV, but -5.34% relative to explicit least-loaded and -5.21% relative to the simpler zero-hit fix. The staged prefill and decode signals were almost perfectly anticorrelated (r=-0.999), suggesting phase segregation despite a tightly balanced combined score.
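The relative throughput figures follow directly from the output tok/s column. A minimal check, with a hypothetical helper name:

```rust
/// Percentage change of `candidate` relative to `baseline`.
fn relative_change_pct(candidate: f64, baseline: f64) -> f64 {
    (candidate / baseline - 1.0) * 100.0
}
```

Calling it with the staged throughput (28,334.9) against 19,940.9, 29,932.8, and 29,893.5 gives roughly +42.09, -5.34, and -5.21 respectively.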

This is therefore a draft policy prototype, not a claim that it should replace the simpler fix without further evaluation on current main and workloads with meaningful prefix reuse.

datadog-official · 2026-06-28T00:07:09Z

⚠️ Warnings

🚦 8 Pipeline jobs failed

Docs link check | lychee

Lint PR | Validate PR title and add label

PR | backend-status-check


Stage KV decode cost for pending prefills

312d342

jthomson04 temporarily deployed to external_collaborator June 28, 2026 00:06 — with GitHub Actions Inactive

pull-request-size Bot added the size/XL label Jun 28, 2026

github-actions Bot added the router Relates to routing, KV-aware routing, etc. label Jun 28, 2026

jthomson04 wants to merge 1 commit into main from jthomson04/kv-router-staged-decode-cost