perf(mocker): reduce scheduler and FPM publish overhead#11025
Draft
jthomson04 wants to merge 1 commit into
Signed-off-by: jthomson04 <jwillthomson19@gmail.com>
Summary
- Avoid `tokio_timerfd::Delay` construction when scheduler work has already consumed a mocker's modeled deadline; future deadlines retain the precise timerfd path.

The FPM path still publishes every snapshot in channel order and preserves the existing per-rank counters and idle-heartbeat behavior. The owned event publication and public string-based NATS entry points remain available, and a regression test verifies that the borrowed envelope is byte-for-byte identical to the existing owned encoding.
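The timer fast path described above can be illustrated with a minimal, std-only sketch. It is not the code from this patch: `Wait` and `plan_wait` are hypothetical names, and the real change operates on `tokio_timerfd::Delay` rather than on a plain enum. The point is only the decision itself: compare the modeled deadline with the current time first, and arm a precise timer only when time actually remains.

```rust
use std::time::{Duration, Instant};

/// Hypothetical decision type: what to do about a modeled deadline.
#[derive(Debug, PartialEq, Eq)]
pub enum Wait {
    /// The deadline has already passed (or is exactly now), so no timer
    /// needs to be constructed at all.
    AlreadyElapsed,
    /// The deadline is still in the future; a precise timer should be armed
    /// for this remaining duration.
    Precise(Duration),
}

/// Hypothetical helper: decide whether a precise timer is needed.
/// `checked_duration_since` returns `None` when `deadline` is earlier
/// than `now`, which is the "already consumed" case.
pub fn plan_wait(deadline: Instant, now: Instant) -> Wait {
    match deadline.checked_duration_since(now) {
        Some(remaining) if !remaining.is_zero() => Wait::Precise(remaining),
        _ => Wait::AlreadyElapsed,
    }
}
```

A caller would construct the expensive timer only in the `Wait::Precise` arm and continue immediately in the `Wait::AlreadyElapsed` arm, which is the case the profile numbers in the Benchmark section attribute to the precise-sleep stack.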
Benchmark
The comparison used an optimized build with debug symbols on the same host throughout:

- 0-12-23
- `Qwen/Qwen3-0.6B`, 64-token blocks, `--speedup-ratio 1000000`
- 1024/1024, zero variance, `ignore_eos`, fixed seed, 10-second warmup, and a 45-second measured interval
- `main`

The FPM change also reduced mean latency by `5.62%` and p99 latency by `5.44%` relative to the timer-only build. A separate `perf stat` run measured `7.09%` higher throughput and `5.85%` fewer task-clock CPU seconds/request.

User+kernel profiles at 99 Hz contained about 11K samples with no lost samples and less than 1% unknown leaves. The timer change reduced the precise-sleep stack from `12.15%` to `0.18%`. The FPM change then reduced FPM publisher inclusive cycles from `14.50%` to `9.78%`, serialization from `3.64%` to `0.71%`, subject conversion from `0.65%` to `0.04%`, and NATS subject validation from `0.84%` to `0.17%`.

Validation
- `cargo fmt --all --check`
- `cargo clippy -p dynamo-runtime -p dynamo-llm --lib -- -D warnings`
- `cargo test -p dynamo-llm --lib` (1,389 passed, 5 ignored)
- `cargo test -p dynamo-mocker` (468 passed, 1 ignored)
- `cargo test -p dynamo-runtime transports::event_plane::codec::tests -- --nocapture` (3 passed)

The full `dynamo-runtime` suite was attempted, but the existing `pipeline::network::egress::push_router::tests::transport_resolution_falls_back_when_selected_instance_disappears` test hung. The hang reproduced when the test was run alone, and that run was stopped after 120 seconds; focused tests exercising this patch pass.
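For readers unfamiliar with the borrowed-versus-owned regression check mentioned in the Summary, the following is a rough, std-only sketch of the idea. Everything in it is an assumption for illustration: `OwnedEnvelope`, `BorrowedEnvelope`, the `topic` and `payload` fields, and the length-prefixed wire layout are hypothetical and do not reflect the actual event-plane codec or its serialization format.

```rust
/// Hypothetical owned envelope: allocates its own topic and payload.
pub struct OwnedEnvelope {
    pub topic: String,
    pub payload: Vec<u8>,
}

/// Hypothetical borrowed envelope: refers to data owned by the caller.
pub struct BorrowedEnvelope<'a> {
    pub topic: &'a str,
    pub payload: &'a [u8],
}

/// Write one field as a little-endian u32 length followed by the bytes.
/// This layout is an illustrative assumption, not the real wire format.
fn write_field(out: &mut Vec<u8>, bytes: &[u8]) {
    out.extend_from_slice(&(bytes.len() as u32).to_le_bytes());
    out.extend_from_slice(bytes);
}

impl OwnedEnvelope {
    /// Encode into a freshly allocated buffer.
    pub fn encode(&self) -> Vec<u8> {
        let mut out = Vec::new();
        write_field(&mut out, self.topic.as_bytes());
        write_field(&mut out, &self.payload);
        out
    }
}

impl BorrowedEnvelope<'_> {
    /// Encode into a caller-provided buffer so its allocation can be reused
    /// across publishes. The buffer is cleared first.
    pub fn encode_into(&self, out: &mut Vec<u8>) {
        out.clear();
        write_field(out, self.topic.as_bytes());
        write_field(out, self.payload);
    }
}
```

A regression test in this style encodes the same logical message through both paths and asserts that the resulting byte vectors are equal, so existing consumers cannot observe a difference between the borrowed and owned encoders.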