[codex] Stage decode cost for pending prefills #11024
Draft
jthomson04 wants to merge 1 commit into
Summary
Prototype a staged decode-cost model for aggregated KV routing so uncached prompt blocks are not represented in both the prefill and decode terms while prefill is pending.
The existing configured prefill weighting, overlap credit, and selector formula remain in place; this changes only the decode-block projection supplied to that formula.
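The difference between the legacy and staged projections can be sketched as follows. This is a minimal illustration, not the PR's code: `PromptLoad`, its fields, and `score` are hypothetical names, and `score` is a simplified stand-in for the selector formula that omits overlap credit and every other term.

```rust
/// Hypothetical per-request view of prompt blocks on one worker.
#[derive(Debug, Clone, Copy)]
struct PromptLoad {
    prompt_blocks: u64,
    cached_blocks: u64,
    prefill_pending: bool,
}

impl PromptLoad {
    /// Uncached prompt blocks that still need prefill work.
    fn prefill_blocks(&self) -> u64 {
        if self.prefill_pending {
            self.prompt_blocks.saturating_sub(self.cached_blocks)
        } else {
            0
        }
    }

    /// Legacy projection: every prompt block counts toward decode immediately.
    fn legacy_decode_blocks(&self) -> u64 {
        self.prompt_blocks
    }

    /// Staged projection: while prefill is pending, only the cached blocks
    /// count toward decode; the rest are promoted once prefill completes.
    fn staged_decode_blocks(&self) -> u64 {
        if self.prefill_pending {
            self.cached_blocks.min(self.prompt_blocks)
        } else {
            self.prompt_blocks
        }
    }
}

/// Simplified stand-in for the selector formula (assumption: overlap credit omitted).
fn score(prefill_blocks: u64, decode_blocks: u64, prefill_load_scale: f64) -> f64 {
    prefill_load_scale * prefill_blocks as f64 + decode_blocks as f64
}

fn main() {
    let cold = PromptLoad { prompt_blocks: 16, cached_blocks: 0, prefill_pending: true };
    let cached = PromptLoad { prompt_blocks: 16, cached_blocks: 16, prefill_pending: true };
    for (name, r) in [("cold", cold), ("cached", cached)] {
        let legacy = score(r.prefill_blocks(), r.legacy_decode_blocks(), 1.0);
        let staged = score(r.prefill_blocks(), r.staged_decode_blocks(), 1.0);
        let staged_x2 = score(r.prefill_blocks(), r.staged_decode_blocks(), 2.0);
        println!("{name}: legacy={legacy} staged={staged} staged(scale=2)={staged_x2}");
    }
}
```

Under these assumptions, the legacy projection counts a cold 16-block prompt in both terms (score 32 against 16 for the cached one), while the staged projection scores both at 16 when the weights are equal. Raising the prefill scale to 2 restores a preference for the cached request (32 against 16).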
Implementation
ActiveLoad, and replica event wire format.
dynamo_frontend_worker_selector_decode_blocks separately from the existing full active-block gauge.
Disaggregated routing is intentionally unchanged. The prefill path has no active-block hashes, and the decode override still disables prefill tracking and overlap credit, so both bypass staged state and retain full active-block decode cost. The bootstrap and synchronous paths continue through the same common decode override.
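The bypass for disaggregated routing can be pictured as a simple dispatch on routing mode. The sketch below is only an illustration of that idea: `RoutingMode`, its variants, and `selector_decode_blocks` are hypothetical names, and the real router reaches this behavior through its existing override paths rather than through an enum like this one.

```rust
/// Hypothetical routing modes; names are illustrative, not taken from the codebase.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum RoutingMode {
    Aggregated,
    DisaggregatedPrefill,
    DisaggregatedDecode,
}

/// Picks the decode-block figure handed to the selector.
/// Only aggregated routing uses the staged value; both disaggregated
/// paths keep the full active-block count.
fn selector_decode_blocks(
    mode: RoutingMode,
    full_active_blocks: u64,
    staged_decode_blocks: u64,
) -> u64 {
    match mode {
        RoutingMode::Aggregated => staged_decode_blocks,
        RoutingMode::DisaggregatedPrefill | RoutingMode::DisaggregatedDecode => {
            full_active_blocks
        }
    }
}

fn main() {
    // Example worker: 40 active blocks, 24 of which belong to pending prefills.
    let (full, staged) = (40, 16);
    for mode in [
        RoutingMode::Aggregated,
        RoutingMode::DisaggregatedPrefill,
        RoutingMode::DisaggregatedDecode,
    ] {
        println!("{mode:?}: {}", selector_decode_blocks(mode, full, staged));
    }
}
```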
At equal prefill/decode weights, a fully cached request and a cold request now tie when otherwise unloaded: the prompt blocks move from the prefill term to the decode term. The online replay cache-affinity fixture therefore sets prefill_load_scale=2 explicitly. The separate deterministic offline replay simulator retains its existing legacy projection in this prototype.
Validation
cargo test -p dynamo-kv-router: 627 passed
cargo test -p dynamo-mocker --lib: 468 passed, 1 pre-existing ignored
cargo test -p dynamo-llm kv_router::prefill_router::tests: 12 passed
cargo test -p dynamo-llm test_worker_load_metrics_pef: passed
cargo check -p dynamo-llm -p dynamo-mocker: passed
cargo clippy -p dynamo-kv-router -p dynamo-llm -p dynamo-mocker --lib -- -D warnings: passed
cargo fmt --all -- --check and git diff --check: passed
Lifecycle coverage includes zero/partial/full overlap, partial final prompt blocks, promotion, shared-prefix deduplication, output-block fractional decay, free, expiry, topology cleanup, replica synchronization, and both configuration fallbacks.
Prototype benchmark
An earlier Dynamo 1.1.1 implementation of the same policy was tested with AIPerf 0.8.0 on the canonical unique-prefix ISL256/OSL32 workload at N=8, concurrency 1024, 50,000 requests:
The staged run completed exactly 50,000 requests with zero failures or 503s and server-reported mean ISL/OSL 264.00016/32. It was +42.09% over clean KV, but -5.34% versus least-loaded and -5.21% versus the simpler zero-hit fix. The staged prefill and decode signals were almost perfectly anticorrelated (r=-0.999), suggesting phase segregation despite a tightly balanced combined score.
This is therefore a draft policy prototype, not a claim that it should replace the simpler fix without further evaluation on current main and workloads with meaningful prefix reuse.