Skip to content

perf(runtime): remove TCP response registry mutex#11008

Open
jthomson04 wants to merge 2 commits into
mainfrom
jthomson04/dashmap-tcp-response-registry
Open

perf(runtime): remove TCP response registry mutex#11008
jthomson04 wants to merge 2 commits into
mainfrom
jthomson04/dashmap-tcp-response-registry

Conversation

@jthomson04

@jthomson04 jthomson04 commented Jun 26, 2026

Copy link
Copy Markdown
Contributor

Summary

  • Replace the hot TcpStreamServer response/request stream registry mutex with sharded DashMap registries.
  • Move the listener task handle out of the registry state.
  • Preserve endpoint-instance tombstone/cancellation behavior with concurrent registry bookkeeping.
  • Add a concurrent response registration/call-home regression test.

Why

I've been looking into benchmarks of how the Dynamo frontend scales with higher concurrencies. A customer raised an issue where the frontend was hitting a bottleneck despite frontend cpu being underutilized. To show this, they ran a benchmark with the frontend and one worker, measuring performance. They then scaled up to 8 workers with the same per-worker concurrency, and found that Dynamo was only able to acheive 4x the performance improvement compared to the expected 8x.

Benchmark Evidence

Focused 8 worker, concurrency 1024, ISL 256, OSL 32, 50k prompts, frontend CPU request/limit 32, KV routing enabled:

case output tok/s req/s frontend CPU avg/max worker CPU avg/max
before 21458.1 670.6 19.3% / 37.1% of 32 cores 3.10 / 7.20 cores per worker
after 22893.9 715.4 20.8% / 23.7% of 32 cores 4.04 / 5.42 cores per worker

Observed throughput improvement: +6.7% output tok/s.

CPU was not saturated in either run. Frontend average CPU stayed around 20% of its 32-core limit, and workers stayed well below their CPU allocation.

Registry timing also collapsed in the diagnostic run:

metric before N=8 avg/p99 after N=8 avg/p99
request-plane pre_send 14.07 / 49.10 ms 0.05 / 0.40 ms
response_stream_register 14.032 / 49.095 ms 0.011 / 0.050 ms
frontend registry access 14.019 / 49.092 ms 0.001 / 0.010 ms
TCP call-home lookup access 13.719 / 48.874 ms 0.003 / 0.010 ms

Validation

  • cargo fmt
  • cargo test -p dynamo-runtime tcp_stream_server -- --nocapture
  • cargo test -p dynamo-runtime pipeline::network::tcp::server::tests -- --nocapture
  • cargo check -p dynamo-runtime

@jthomson04 jthomson04 temporarily deployed to external_collaborator June 26, 2026 23:46 — with GitHub Actions Inactive
@datadog-official

datadog-official Bot commented Jun 26, 2026

Copy link
Copy Markdown

Pipelines

⚠️ Warnings

🚦 2 Pipeline jobs failed

Docs link check | lychee   View in Datadog   GitHub Actions

PR | dynamo-status-check   View in Datadog   GitHub Actions

This comment will be updated automatically if new data arrives.
🔗 Commit SHA: e3f40de | Docs | Give us feedback!

@jthomson04 jthomson04 marked this pull request as ready for review June 27, 2026 16:44
@jthomson04 jthomson04 requested a review from a team as a code owner June 27, 2026 16:44
@coderabbitai

coderabbitai Bot commented Jun 27, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

Walkthrough

Replaces TcpStreamServer's internal Arc<Mutex<State>> (backed by HashMap/HashSet) with a DashMap-based StreamRegistries. All registration, cancellation, tombstone management, and TCP connection-handling paths are updated to use the new registry. Tests are updated to assert against server.registries fields.

StreamRegistries Refactor

Layer / File(s) Summary
StreamRegistries data structure and methods
lib/runtime/src/pipeline/network/tcp/server.rs
Introduces StreamRegistries with DashMap fields replacing the old State/Mutex approach. Adds methods for tombstone pruning, insert/remove of request/response subjects, associate_instance, and cancel_registered_subject. Updates TcpStreamServer fields and constructor to store Arc<StreamRegistries> and a join handle.
TcpStreamServer public methods rewritten
lib/runtime/src/pipeline/network/tcp/server.rs
Rewrites associate_instance, cancel_recv_stream, cancel_send_stream, cancel_instance_streams, clear_instance_tombstone, and start to use StreamRegistries operations instead of mutex-locked State.
ResponseService::register RAII cleanup
lib/runtime/src/pipeline/network/tcp/server.rs
Updates request and response stream registration paths to store subjects in StreamRegistries and replace mutex-based RAII cleanup closures with cancel_registered_subject calls.
TCP connection handling propagates registries
lib/runtime/src/pipeline/network/tcp/server.rs
Threads Arc<StreamRegistries> through tcp_listener, handle_connection, process_stream, process_request_stream, and process_response_stream, replacing all mutex state accesses with registry operations.
Test updates
lib/runtime/src/pipeline/network/tcp/server.rs
Updates concurrent registration, RAII cleanup, tombstone expiry, lazy pruning, and scoping tests to assert against server.registries instead of locked state.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Description check ⚠️ Warning It provides a strong summary and validation, but it omits required template sections like reviewer start and related issues. Reformat the PR description to include Overview, Details, Where should reviewer start?, and the required Related Issues section with an issue link or no-issue checkbox.
✅ Passed checks (4 passed)
Check name Status Explanation
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Title check ✅ Passed The title clearly states the main performance change: replacing the TCP response registry mutex.

Comment @coderabbitai help to get the list of available commands.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@lib/runtime/src/pipeline/network/tcp/server.rs`:
- Around line 461-466: The cleanup hook in the tcp server currently spawns an
async task just to call the synchronous cancel_registered_subject, which can
delay deregistration and fail without a Tokio runtime. Update the cleanup path
in the with_cleanup closure to invoke
cleanup_registries.cancel_registered_subject(StreamType::Request,
&cleanup_subject) directly, and make the same inline change in the other cleanup
hook that uses the same pattern. Keep the logic synchronous and preserve the
existing cleanup_subject and cleanup_registries handling.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 452f2116-6e9f-44b3-a517-3d613efd6e23

📥 Commits

Reviewing files that changed from the base of the PR and between 333b156 and 0eb57e4.

📒 Files selected for processing (1)
  • lib/runtime/src/pipeline/network/tcp/server.rs

Comment thread lib/runtime/src/pipeline/network/tcp/server.rs Outdated

@devin-ai-integration devin-ai-integration Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no bugs or issues to report.

Open in Devin Review

@jthomson04 jthomson04 changed the title [codex] Remove TCP response registry mutex perf(runtime): remove TCP response registry mutex Jun 27, 2026
@github-actions github-actions Bot added the perf label Jun 27, 2026
@jthomson04 jthomson04 temporarily deployed to external_collaborator June 27, 2026 17:42 — with GitHub Actions Inactive
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant