fix(runtime): give event publishers unique discovery identities#11014
fix(runtime): give event publishers unique discovery identities#11014zhongdaor-nv wants to merge 2 commits into
Conversation
Signed-off-by: zhongdaor <zhongdaor@nvidia.com>
Signed-off-by: zhongdaor <zhongdaor@nvidia.com>
| COUNTER_WORKER_SCRIPT = os.path.join(os.path.dirname(__file__), "counter_worker.py") | ||
|
|
||
|
|
||
| @pytest.fixture(autouse=True) |
There was a problem hiding this comment.
🚩 Removed NATS event plane fixture may rely on default transport being NATS for etcd backends
The removed _pin_nats_event_plane_for_mocker fixture (test_router_e2e_with_mockers.py:63-79) explicitly set DYN_EVENT_PLANE=nats for non-file store backends to work around the ZMQ slow-joiner problem where mock engines publish events before ZMQ subscriptions are established. The fixture's own comment states: 'on the (now default) ZMQ event plane the router observes zero events and the routing assertions fail.' Removing it is safe only if the runtime's default_event_transport_kind() already returns NATS for distributed backends (etcd/kubernetes). If so, the fixture was redundant. If any test parameterization uses etcd but the default event plane is still ZMQ, these tests could become flaky. Worth confirming the default transport for etcd-backed tests.
Was this helpful? React with 👍 or 👎 to provide feedback.
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (6)
💤 Files with no reviewable changes (1)
WalkthroughThe PR splits discovery ownership from event-channel publisher identity, threads the new publisher id through event publisher registration and ZMQ subscriber matching, updates discovery logging, adds shutdown-safe unregistration, expands ZMQ integration coverage, and removes a router mocker environment override. ChangesDiscovery and event-plane identity flow
Router mocker harness
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes 🚥 Pre-merge checks | ✅ 3 | ❌ 2❌ Failed checks (2 warnings)
✅ Passed checks (3 passed)
Comment |
Summary
EventPublishera fork-safe, per-incarnation publisher ID and use the same ID in bothEventEnvelopeand event-channel discovery.DiscoveryInstance::EventChannelshape while allowing multiple publishers for the same process/component/topic.DiscoveryInstanceIdand ignore unrelated topics.The root cause was that each publisher bound its own ZMQ endpoint and owned its own sequence counter, but all publishers in one process registered with the process-level discovery ID. Their discovery keys therefore collided, subscribers only connected to one endpoint, and dropping any colliding publisher could remove the shared registration.
This also makes
(publisher_id, sequence)valid for broker deduplication across multiple publishers. The wire schema and discovery path structure are unchanged. The public RustDiscoverySpec::EventChannelvariant now requirespublisher_id.Validation
cargo fmt --all -- --checkgit diff --checkcargo test -p dynamo-runtime transports::event_plane:: --lib— 17 passed after rebasing onto latestmaincargo clippy -p dynamo-runtime --lib --tests -- -D warningscargo test -p dynamo-runtime --lib— 410 passed, 2 ignoredcargo check -p dynamo-llm -p dynamo-backend-commonmaturin develop --uvpytest -q -s 'tests/router/test_router_e2e_with_mockers.py::test_router_decisions[nats_core-tcp]'— 1 passed in 40.73s with the default ZMQ event plane, two workers, anddp_size=4pytest -q -s 'tests/router/test_router_e2e_with_mockers.py::test_router_decisions_disagg_round_robin_prefill_dp_rank[no_bootstrap-prefill_first]'— 1 passed in 80.00s with the default ZMQ event planepytest -q -s 'tests/router/test_router_e2e_with_mockers.py::test_router_decisions[jetstream-tcp]'— 1 passed in 40.64s, confirming durable coverage still opts into NATSpre-commit run --files tests/router/test_router_e2e_with_mockers.pySummary by CodeRabbit
New Features
Bug Fixes