
Fix silent event loss in DomainParticipant status channel#401

Merged
jhelovuo merged 1 commit into Atostek:master from alvgaona:fix/status-channel-overflow on Mar 8, 2026

Conversation

alvgaona (Contributor) commented Mar 7, 2026

Summary

The DomainParticipantStatusEvent channel silently drops events when its capacity is exceeded. The try_send implementation converts a Full error into Ok(()), so neither the sender nor the consumer can detect the loss. The only indication is a trace!-level log that is invisible at normal log levels.
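The swallowed-`Full` pattern described above can be sketched as follows. This is an illustrative reconstruction, not the actual RustDDS code: it uses `std::sync::mpsc::sync_channel` as a stand-in for the real status channel, and a hypothetical `StatusEvent` type in place of `DomainParticipantStatusEvent`.

```rust
use std::sync::mpsc::{sync_channel, SyncSender, TrySendError};

// Hypothetical stand-in for DomainParticipantStatusEvent.
#[derive(Debug)]
struct StatusEvent(u32);

// The problematic pattern: a Full error is swallowed and mapped to
// Ok(()), so the caller cannot tell whether the event was delivered.
fn send_swallowing_full(
    tx: &SyncSender<StatusEvent>,
    ev: StatusEvent,
) -> Result<(), TrySendError<StatusEvent>> {
    match tx.try_send(ev) {
        Ok(()) => Ok(()),
        Err(TrySendError::Full(_dropped)) => {
            // The original code logged this at trace! level only;
            // the PR upgrades the log to warn! so the loss is visible.
            eprintln!("warn: status channel full, event dropped");
            Ok(()) // silent loss: the event is gone, yet the call "succeeded"
        }
        Err(e) => Err(e),
    }
}

fn main() {
    let (tx, rx) = sync_channel(16);
    for i in 0..32 {
        // All 32 sends report Ok(()), but only 16 events are enqueued.
        send_swallowing_full(&tx, StatusEvent(i)).unwrap();
    }
    println!("delivered: {}", rx.try_iter().count());
}
```

Because the `Full` arm returns `Ok(())`, no amount of error handling at the call site can detect the drop; only the log line reveals it, which is why the log-level upgrade matters.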

This PR:

  • Increases the channel capacity from 16 to 2048
  • Upgrades the overflow log from trace! to warn!
  • Minor formatting fixes (comment alignment, import grouping)

How was this discovered

I discovered this bug while building a ROS 2 application that uses RustDDS. When using the status_listener() API to introspect the DDS graph, some nodes' services and topics would not show up at all. Which nodes were visible varied between runs. After ruling out consumer-side issues and confirming SEDP was delivering all data correctly, we traced the problem to the bounded status channel silently dropping events.

Why 16 is not enough

A single DDS participant can expose many endpoints. When SEDP discovers a remote participant, it generates WriterDetected and ReaderDetected events for every endpoint in a burst — faster than the consumer can drain them.

For example, a typical ROS 2 node creates ~9 writers and ~7 readers (~16 status events per node). Two nodes produce ~32 events, already exceeding the channel capacity. At 10 nodes (~160 events), 90% of events are silently lost. This is not specific to ROS 2 — any DDS application with multiple endpoints per participant will hit this with modest scale.
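The arithmetic above can be reproduced with a small experiment. This sketch again uses `std::sync::mpsc::sync_channel` as a stand-in for the real status channel; it models a discovery burst where no consumer drains the channel until the burst is over, and counts how many `try_send` calls fail with `Full`.

```rust
use std::sync::mpsc::{sync_channel, TrySendError};

// Simulate a discovery burst of `events` sends into a bounded channel
// of the given capacity, with no consumer draining during the burst.
// Returns the number of events that would be silently dropped.
fn dropped_during_burst(capacity: usize, events: usize) -> usize {
    let (tx, _rx) = sync_channel::<u32>(capacity);
    (0..events)
        .filter(|&i| matches!(tx.try_send(i as u32), Err(TrySendError::Full(_))))
        .count()
}

fn main() {
    // ~10 ROS 2 nodes => ~160 status events in one SEDP burst.
    println!("capacity 16:   {} of 160 dropped", dropped_during_burst(16, 160));
    println!("capacity 2048: {} of 160 dropped", dropped_during_burst(2048, 160));
}
```

With capacity 16, 144 of 160 events (90%) are dropped; with capacity 2048, none are.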

Why 2048

  • Covers realistic deployments. 2048 handles ~120 participants in a single SEDP burst, which covers the vast majority of DDS systems.
  • Negligible cost. The channel is a notification pipe — the actual endpoint data already lives in the DiscoveryDB. At ~200-500 bytes per event, peak buffer is ~1 MB. That's nothing compared to the SPDP/SEDP threads, network buffers, and DDSCache already maintained.
  • No performance impact. Send/receive are O(1) regardless of capacity. In steady state the channel is nearly empty — it only fills during initial discovery bursts.
  • No behavioral change. Events that previously went through still go through. The only difference is that events that were silently dropped now have room in the channel.

The DomainParticipantStatusEvent channel had a capacity of 16 and
silently dropped events when full, causing downstream consumers to miss
endpoint discoveries entirely with no indication of data loss.

A single participant can expose many endpoints (e.g. ~16 in a typical
ROS 2 node), so even two participants overwhelm a 16-slot channel
during the initial SEDP burst.

Changes:
- Increase status channel capacity from 16 to 2048
- Upgrade log level from trace! to warn! on channel overflow
alvgaona force-pushed the fix/status-channel-overflow branch from 008c582 to 5b70e40 on March 7, 2026 at 22:05
alvgaona (Contributor, Author) commented Mar 7, 2026

@jhelovuo would you mind reviewing this PR? It fixes an issue affecting interoperability with ROS 2 DDS.

@jhelovuo jhelovuo merged commit 724950c into Atostek:master Mar 8, 2026
7 checks passed
jhelovuo (Member) commented Mar 8, 2026

Nice debugging work. Thank you for the contribution.

@alvgaona alvgaona deleted the fix/status-channel-overflow branch March 8, 2026 16:04
