[codex] Stage decode cost for pending prefills#11024

Draft
jthomson04 wants to merge 1 commit into main from jthomson04/kv-router-staged-decode-cost

@jthomson04

Contributor

Summary

Prototype a staged decode-cost model for aggregated KV routing so uncached prompt blocks are not represented in both the prefill and decode terms while prefill is pending.

prefill_cost = active prefill tokens + incoming uncached tokens

decode_cost = unique blocks from decode-active requests
            + cached-prefix blocks from prefill-active requests
            + incoming cached-prefix blocks

The existing configured prefill weighting, overlap credit, and selector formula remain in place; this changes only the decode-block projection supplied to that formula.
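
The two projections above can be sketched as plain functions. This is only an illustration: the `WorkerLoad` and `Incoming` types and their field names are hypothetical, not taken from the router code, and the decode term is a simple sum that ignores the deduplication of blocks shared between the three groups.

```rust
/// Hypothetical per-worker view used only for this sketch.
pub struct WorkerLoad {
    /// Tokens still being prefilled on the worker.
    pub active_prefill_tokens: u64,
    /// Unique blocks held by requests that are already decoding.
    pub decode_active_blocks: u64,
    /// Cached-prefix blocks of requests whose prefill is still pending.
    pub prefill_active_cached_blocks: u64,
}

/// Hypothetical description of the request being routed.
pub struct Incoming {
    pub prompt_tokens: u64,
    /// Complete prompt blocks already cached on the candidate worker.
    pub cached_blocks: u64,
    pub block_size: u64,
}

impl Incoming {
    /// Tokens covered by the cached prefix, clamped to the prompt length.
    pub fn cached_tokens(&self) -> u64 {
        (self.cached_blocks * self.block_size).min(self.prompt_tokens)
    }

    /// Tokens that still need prefill on this worker.
    pub fn uncached_tokens(&self) -> u64 {
        self.prompt_tokens - self.cached_tokens()
    }
}

/// prefill_cost = active prefill tokens + incoming uncached tokens
pub fn prefill_cost(w: &WorkerLoad, r: &Incoming) -> u64 {
    w.active_prefill_tokens + r.uncached_tokens()
}

/// decode_cost = decode-active blocks + cached-prefix blocks of
/// prefill-active requests + incoming cached-prefix blocks.
/// Assumption: the three groups are disjoint, so a sum is enough here.
pub fn decode_cost(w: &WorkerLoad, r: &Incoming) -> u64 {
    w.decode_active_blocks + w.prefill_active_cached_blocks + r.cached_blocks
}
```

With a 256-token prompt, 64-token blocks, and three cached blocks, only the last 64 tokens count toward the prefill term, and only the three cached blocks count toward the decode term.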

Implementation

  • Add a selector-only block tracker and prompt-membership trie with unique-block semantics.
  • At admission, book the full prompt in the existing active-block state but only the cached complete-block prefix in staged selector state.
  • Promote the uncached suffix when prefill completes, and release both trackers on free or expiry.
  • Include generated output blocks and fractional decay in staged state.
  • Gate staging on both prefill-token tracking and at least one complete prompt hash. If either signal is missing, the selector falls back to the exact legacy projection.
  • Preserve existing full active blocks, capacity/admission state, queue behavior, published ActiveLoad, and replica event wire format.
  • Export dynamo_frontend_worker_selector_decode_blocks separately from the existing full active-block gauge.
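
The admission, promotion, and release steps above might look roughly like the following. This is a minimal sketch under stated assumptions: `StagedTracker`, `BlockHash`, `staging_enabled`, and the method names are hypothetical, a reference-counted map stands in for the prompt-membership trie, and generated output blocks and fractional decay are left out.

```rust
use std::collections::HashMap;

/// Hypothetical block hash type for this sketch.
pub type BlockHash = u64;

/// Staging applies only when prefill tokens are tracked and the prompt
/// has at least one complete block hash; otherwise use the legacy path.
pub fn staging_enabled(track_prefill_tokens: bool, complete_hashes: usize) -> bool {
    track_prefill_tokens && complete_hashes > 0
}

/// Blocks a request has booked so far, plus its not-yet-promoted suffix.
struct Staged {
    booked: Vec<BlockHash>,
    pending_suffix: Vec<BlockHash>,
}

#[derive(Default)]
pub struct StagedTracker {
    /// Reference counts give unique-block semantics for shared prefixes.
    refs: HashMap<BlockHash, usize>,
    requests: HashMap<String, Staged>,
}

impl StagedTracker {
    fn acquire(&mut self, blocks: &[BlockHash]) {
        for b in blocks {
            *self.refs.entry(*b).or_insert(0) += 1;
        }
    }

    fn release(&mut self, blocks: &[BlockHash]) {
        for b in blocks {
            if let Some(n) = self.refs.get_mut(b) {
                *n -= 1;
                if *n == 0 {
                    self.refs.remove(b);
                }
            }
        }
    }

    /// Admission: book only the cached complete-block prefix.
    pub fn admit(&mut self, id: &str, prompt: &[BlockHash], cached: usize) {
        self.free(id); // drop any stale booking for a reused id
        let (prefix, suffix) = prompt.split_at(cached.min(prompt.len()));
        self.acquire(prefix);
        let staged = Staged { booked: prefix.to_vec(), pending_suffix: suffix.to_vec() };
        self.requests.insert(id.to_string(), staged);
    }

    /// Prefill completion: promote the uncached suffix into staged state.
    pub fn promote(&mut self, id: &str) {
        let suffix = match self.requests.get_mut(id) {
            Some(s) => std::mem::take(&mut s.pending_suffix),
            None => return,
        };
        self.acquire(&suffix);
        if let Some(s) = self.requests.get_mut(id) {
            s.booked.extend(suffix);
        }
    }

    /// Free or expiry: release everything the request booked.
    pub fn free(&mut self, id: &str) {
        if let Some(s) = self.requests.remove(id) {
            self.release(&s.booked);
        }
    }

    /// Unique blocks currently visible to the selector.
    pub fn decode_blocks(&self) -> usize {
        self.refs.len()
    }
}
```

Two requests that share a cached prefix contribute that prefix once, and each request's uncached suffix becomes visible only after `promote`.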

Disaggregated routing is intentionally unchanged. The prefill path has no active-block hashes, and the decode override still disables prefill tracking and overlap credit, so both bypass staged state and retain full active-block decode cost. The bootstrap and synchronous paths continue through the same common decode override.

At equal prefill/decode weights, a fully cached request and a cold request now tie when otherwise unloaded: the prompt blocks move from the prefill term to the decode term. The online replay cache-affinity fixture therefore sets prefill_load_scale=2 explicitly. The separate deterministic offline replay simulator retains its existing legacy projection in this prototype.
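
The tie can be checked with a little arithmetic. The sketch below assumes a hypothetical linear score (prefill load in blocks, scaled by `prefill_load_scale`, plus decode blocks, lower is better) and omits overlap credit; the actual selector formula is not shown in this PR, so treat the function names and the formula as illustrative only.

```rust
/// Hypothetical linear score, lower is better. Assumption: prefill tokens
/// are converted to blocks and scaled, then added to decode blocks.
pub fn score(prefill_load_scale: f64, prefill_tokens: u64, decode_blocks: u64, block_size: u64) -> f64 {
    let prefill_blocks = prefill_tokens as f64 / block_size as f64;
    prefill_load_scale * prefill_blocks + decode_blocks as f64
}

/// Score of sending a prompt to an otherwise idle worker under the staged
/// projection: uncached blocks land in the prefill term, cached blocks in
/// the decode term.
pub fn idle_worker_score(scale: f64, prompt_blocks: u64, cached_blocks: u64, block_size: u64) -> f64 {
    let uncached_tokens = prompt_blocks.saturating_sub(cached_blocks) * block_size;
    score(scale, uncached_tokens, cached_blocks, block_size)
}
```

For a four-block prompt with 64-token blocks, a scale of 1 gives both the fully cached and the cold placement a score of 4. A scale of 2 leaves the cached placement at 4 and raises the cold one to 8, which restores the cache-affinity preference the fixture expects.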

Validation

  • cargo test -p dynamo-kv-router: 627 passed
  • cargo test -p dynamo-mocker --lib: 468 passed, 1 pre-existing ignored
  • cargo test -p dynamo-llm kv_router::prefill_router::tests: 12 passed
  • cargo test -p dynamo-llm test_worker_load_metrics_pef: passed
  • cargo check -p dynamo-llm -p dynamo-mocker: passed
  • cargo clippy -p dynamo-kv-router -p dynamo-llm -p dynamo-mocker --lib -- -D warnings: passed
  • cargo fmt --all -- --check and git diff --check: passed

Lifecycle coverage includes zero/partial/full overlap, partial final prompt blocks, promotion, shared-prefix deduplication, output-block fractional decay, free, expiry, topology cleanup, replica synchronization, and both configuration fallbacks.

Prototype benchmark

An earlier Dynamo 1.1.1 implementation of the same policy was tested with AIPerf 0.8.0 on the canonical unique-prefix ISL256/OSL32 workload at N=8, concurrency 1024, 50,000 requests:

| Route / policy | Output tok/s | TTFT p50 / p99 (ms) | ITL p50 / p99 (ms) | Mean GPU |
|---|---|---|---|---|
| clean custom KV | 19,940.9 | 213.4 / 3334.5 | 42.9 / 102.6 | 49.8% |
| explicit least-loaded | 29,932.8 | 292.9 / 3378.9 | 21.4 / 88.9 | 78.0% |
| prior zero-hit active-block fix | 29,893.5 | 292.3 / 3361.0 | 21.7 / 89.3 | 77.4% |
| staged decode cost | 28,334.9 | 301.4 / 5410.8 | 22.6 / 43.1 | 74.0% |

The staged run completed exactly 50,000 requests with zero failures or 503s and server-reported mean ISL/OSL 264.00016/32. Its output throughput was +42.09% over clean KV, but -5.34% versus least-loaded and -5.21% versus the simpler zero-hit fix. The staged prefill and decode signals were almost perfectly anticorrelated (r=-0.999), suggesting phase segregation despite a tightly balanced combined score.
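
For reference, the relative throughput figures and a correlation coefficient of this kind can be reproduced with two small helpers. This is a sketch only: the function names are hypothetical, and the per-worker signal series from the run are not included in this PR, so any series passed to `pearson` would be made-up sample data.

```rust
/// Percentage change of `new` relative to `baseline`.
pub fn relative_change_pct(new: f64, baseline: f64) -> f64 {
    (new / baseline - 1.0) * 100.0
}

/// Pearson correlation coefficient of two equally long series.
/// Returns None for mismatched lengths, fewer than two points,
/// or a series with zero variance.
pub fn pearson(x: &[f64], y: &[f64]) -> Option<f64> {
    if x.len() != y.len() || x.len() < 2 {
        return None;
    }
    let n = x.len() as f64;
    let mean_x = x.iter().sum::<f64>() / n;
    let mean_y = y.iter().sum::<f64>() / n;
    let (mut sxy, mut sxx, mut syy) = (0.0_f64, 0.0_f64, 0.0_f64);
    for (a, b) in x.iter().zip(y) {
        let (dx, dy) = (a - mean_x, b - mean_y);
        sxy += dx * dy;
        sxx += dx * dx;
        syy += dy * dy;
    }
    if sxx == 0.0 || syy == 0.0 {
        return None;
    }
    Some(sxy / (sxx * syy).sqrt())
}
```

Calling `relative_change_pct(28_334.9, 19_940.9)` yields roughly 42.09, matching the clean-KV comparison. A coefficient near -1 from `pearson` means one signal rises almost exactly as the other falls.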

This is therefore a draft policy prototype, not a claim that it should replace the simpler fix without further evaluation on current main and workloads with meaningful prefix reuse.

@jthomson04 jthomson04 temporarily deployed to external_collaborator June 28, 2026 00:06 — with GitHub Actions Inactive
@github-actions github-actions Bot added the router Relates to routing, KV-aware routing, etc. label Jun 28, 2026
@datadog-official

datadog-official Bot commented Jun 28, 2026


Pipelines

⚠️ Warnings

🚦 8 Pipeline jobs failed

Docs link check | lychee   View in Datadog   GitHub Actions

Lint PR | Validate PR title and add label   View in Datadog   GitHub Actions

PR | backend-status-check   View in Datadog   GitHub Actions

View all 8 failed jobs.

This comment will be updated automatically if new data arrives.
🔗 Commit SHA: 312d342


Labels

router Relates to routing, KV-aware routing, etc. size/XL
