perf(mocker): reduce scheduler and FPM publish overhead #11025

Draft
jthomson04 wants to merge 1 commit into main from jthomson04/mocker-cpu-overhead

Conversation

@jthomson04

Contributor

Summary

  • skip tokio_timerfd::Delay construction when scheduler work has already consumed a mocker's modeled deadline; future deadlines retain the precise timerfd path
  • reuse the single FPM publisher task's MessagePack buffer, borrow the worker ID during serialization, and encode event envelopes from borrowed payload/topic data
  • cache the formatted event-plane subject and parsed NATS subject instead of rebuilding and validating them for every FPM event
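
The first bullet can be illustrated with a minimal sketch. It uses only `std::time` rather than `tokio_timerfd`, and the names `Wait` and `plan_wait` are hypothetical, not taken from the patch. The idea shown is simply that a timer is only worth constructing when the deadline is still in the future:

```rust
use std::time::{Duration, Instant};

/// What the scheduler loop should do about a modeled deadline.
/// (Hypothetical type for illustration; not from the patch.)
#[derive(Debug, PartialEq, Eq)]
pub enum Wait {
    /// The deadline has already been reached: continue without building a timer.
    Skip,
    /// The deadline is still ahead: sleep this long on the precise timer path.
    Sleep(Duration),
}

/// Decide whether a timer is needed at all.
/// `checked_duration_since` returns `None` when `now` is past `deadline`,
/// and a zero duration when the two are equal; both cases skip the timer.
pub fn plan_wait(deadline: Instant, now: Instant) -> Wait {
    match deadline.checked_duration_since(now) {
        Some(remaining) if !remaining.is_zero() => Wait::Sleep(remaining),
        _ => Wait::Skip,
    }
}
```

In this sketch the caller would build the precise delay only for `Wait::Sleep`, so scheduler iterations that have already overrun their modeled deadline pay no timer setup cost.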

The FPM path still publishes every snapshot in channel order and preserves the existing per-rank counters and idle-heartbeat behavior. The owned event publication and public string-based NATS entry points remain available, and a regression test verifies that the borrowed envelope is byte-for-byte identical to the existing owned encoding.
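
A rough sketch of the borrowed-versus-owned idea follows. It is an assumption-laden illustration: the wire format here is a hand-rolled length-prefixed layout, not MessagePack, and `OwnedEnvelope`, `BorrowedEnvelope`, and `encode_into` are hypothetical names. It shows how a borrowed envelope can write into a reused buffer while producing the same bytes as an owned one:

```rust
/// Owned envelope: holds its own String and Vec, and allocates a new
/// output buffer on every encode. (Hypothetical type for illustration.)
pub struct OwnedEnvelope {
    pub topic: String,
    pub payload: Vec<u8>,
}

/// Borrowed envelope: only references data the caller already has.
pub struct BorrowedEnvelope<'a> {
    pub topic: &'a str,
    pub payload: &'a [u8],
}

/// Simplified stand-in wire format (NOT MessagePack):
/// u32 big-endian length, topic bytes, u32 big-endian length, payload bytes.
fn write_fields(topic: &str, payload: &[u8], out: &mut Vec<u8>) {
    out.extend_from_slice(&(topic.len() as u32).to_be_bytes());
    out.extend_from_slice(topic.as_bytes());
    out.extend_from_slice(&(payload.len() as u32).to_be_bytes());
    out.extend_from_slice(payload);
}

impl OwnedEnvelope {
    /// Allocates a fresh buffer for each event.
    pub fn encode(&self) -> Vec<u8> {
        let mut out = Vec::new();
        write_fields(&self.topic, &self.payload, &mut out);
        out
    }
}

impl BorrowedEnvelope<'_> {
    /// Reuses the caller's buffer; `clear` drops the contents but keeps
    /// the capacity, so steady-state encoding does not reallocate.
    pub fn encode_into(&self, out: &mut Vec<u8>) {
        out.clear();
        write_fields(self.topic, self.payload, out);
    }
}
```

A regression check in this style would encode the same topic and payload both ways and compare the resulting byte slices for equality.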

Benchmark

The comparison used an optimized build with debug symbols on the same host throughout:

  • one mocker process/worker pinned to CPUs 0-1
  • one KV-routing frontend plus aiperf sharing CPUs 2-23
  • Qwen/Qwen3-0.6B, 64-token blocks, --speedup-ratio 1000000
  • aiperf c256 with exact ISL/OSL 1024/1024, zero variance, ignore_eos, fixed seed, 10-second warmup, and a 45-second measured interval
  • fresh frontend/mocker processes per run; all reported runs had zero errors and cancellations

| build | clean throughput | incremental benefit | cumulative benefit |
| --- | --- | --- | --- |
| original main | 207.95 req/s | baseline | baseline |
| expired-deadline fast path | 258.47 req/s | +24.30% | +24.30% |
| FPM buffer + subject reuse | 273.65 req/s | +5.87% | +31.59% |
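
The cumulative column compounds the incremental gains rather than summing them, which is why 24.30% and 5.87% yield 31.59% and not 30.17%. The small sketch below recomputes the percentages from the rounded throughput figures in the table; the helper names are hypothetical, and the recomputed values agree with the reported ones only to within rounding of those published figures:

```rust
/// Percentage gain of `new` over `old`.
pub fn pct_gain(old: f64, new: f64) -> f64 {
    (new / old - 1.0) * 100.0
}

/// Compound a sequence of per-stage percentage gains into one cumulative gain.
pub fn compound(gains_pct: &[f64]) -> f64 {
    let factor: f64 = gains_pct.iter().map(|g| 1.0 + g / 100.0).product();
    (factor - 1.0) * 100.0
}

/// Recompute the table's three percentages from its rounded req/s values:
/// (timer stage, FPM stage, cumulative).
pub fn table_gains() -> (f64, f64, f64) {
    let (main, timer, fpm) = (207.95, 258.47, 273.65);
    (pct_gain(main, timer), pct_gain(timer, fpm), pct_gain(main, fpm))
}
```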

The FPM change also reduced mean latency by 5.62% and p99 latency by 5.44% relative to the timer-only build. A separate perf stat run measured 7.09% higher throughput and 5.85% fewer task-clock CPU seconds/request.

User+kernel profiles at 99 Hz contained about 11K samples with no lost samples and less than 1% unknown leaves. The timer change reduced the precise-sleep stack from 12.15% to 0.18%. The FPM change then reduced FPM publisher inclusive cycles from 14.50% to 9.78%, serialization from 3.64% to 0.71%, subject conversion from 0.65% to 0.04%, and NATS subject validation from 0.84% to 0.17%.
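
The drop in subject conversion and validation cost is consistent with formatting and validating the subject once and reusing it. A minimal sketch of that pattern follows, with loud caveats: `ValidatedSubject`, `Publisher`, and the validation rule are hypothetical simplifications, not the patch's types or the NATS client's actual checks.

```rust
/// Simplified stand-in for subject validation. Assumption: the real NATS
/// client applies its own rules; this only rejects empty subjects,
/// whitespace, and empty dot-separated tokens.
fn is_valid_subject(s: &str) -> bool {
    !s.is_empty()
        && !s.chars().any(char::is_whitespace)
        && s.split('.').all(|token| !token.is_empty())
}

/// A subject that was formatted and validated exactly once.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ValidatedSubject(String);

impl ValidatedSubject {
    pub fn new(namespace: &str, topic: &str) -> Option<Self> {
        let subject = format!("{namespace}.{topic}");
        is_valid_subject(&subject).then(|| ValidatedSubject(subject))
    }

    pub fn as_str(&self) -> &str {
        &self.0
    }
}

/// Hypothetical publisher that pays the format and validate cost at
/// construction time instead of on every event.
pub struct Publisher {
    subject: ValidatedSubject,
    published: usize,
}

impl Publisher {
    pub fn new(namespace: &str, topic: &str) -> Option<Self> {
        Some(Publisher { subject: ValidatedSubject::new(namespace, topic)?, published: 0 })
    }

    /// Per-event path: no formatting, no validation, just reuse.
    /// Returns the subject the payload would be sent on.
    pub fn publish(&mut self, _payload: &[u8]) -> &str {
        self.published += 1;
        self.subject.as_str()
    }

    pub fn published(&self) -> usize {
        self.published
    }
}
```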

Validation

  • cargo fmt --all --check
  • cargo clippy -p dynamo-runtime -p dynamo-llm --lib -- -D warnings
  • cargo test -p dynamo-llm --lib (1,389 passed, 5 ignored)
  • cargo test -p dynamo-mocker (468 passed, 1 ignored)
  • cargo test -p dynamo-runtime transports::event_plane::codec::tests -- --nocapture (3 passed)
  • focused FPM serialization/heartbeat tests (6 passed)
  • two clean 45-second aiperf repeats, one 60-second user+kernel profile, and one separate counter run for each optimization stage

The full dynamo-runtime suite was attempted, but the existing pipeline::network::egress::push_router::tests::transport_resolution_falls_back_when_selected_instance_disappears test hung. It reproduced when run alone and was stopped after 120 seconds; focused tests exercising this patch pass.

Signed-off-by: jthomson04 <jwillthomson19@gmail.com>
@jthomson04 jthomson04 temporarily deployed to external_collaborator June 28, 2026 02:57 — with GitHub Actions Inactive
@github-actions github-actions Bot added the perf label Jun 28, 2026
@datadog-official

datadog-official Bot commented Jun 28, 2026


Pipelines

⚠️ Warnings

🚦 3 Pipeline jobs failed

  • Docs link check | lychee
  • PR | dynamo-runtime / rust-gpu
  • PR | dynamo-status-check

This comment will be updated automatically if new data arrives.
Commit SHA: 6aa9491
