Enhance messaging reliability and error handling #148

Closed
vietddude wants to merge 1 commit into master from fix/nats-no-responders-available

Conversation

@vietddude
Collaborator

Description

This PR resolves the critical "nats: no responders available for request" error, and the process panics that follow it, during high-load concurrent MPC operations (e.g., triggering ECDSA and EdDSA resharing simultaneously for the same wallet).

Root Causes

Through testing with injected network latency, we identified three compounding bugs in the Session and Messaging layers:

  1. Race Condition in Broadcast Subscription:
    subscribeBroadcastAsync was wrapping the NATS Subscribe call in a go func(). This meant the subscription happened asynchronously, allowing the initialization flow (and the 200ms warmUpSession delay) to finish before the node had actually registered its listener with the NATS server.
  2. Insufficient Retry Window in Direct Messaging:
    SendToOther in point2point.go only retried 3 times with a fixed 50ms delay. This gave a maximum resilience window of just ~150ms. In a distributed cloud environment (AWS), network jitter and goroutine scheduling delays frequently exceed 150ms, causing the sender to exhaust retries and fail the session before the receiving peer's subscription was fully established.
  3. Fatal Process Panics on Sub-Round Failures:
    When a P2P message failed (e.g. failed to calculate Alice_end), the tss-lib internal state became corrupted (missing data in round3). However, because the session remained open, subsequent incoming messages would trigger round4.Start(), resulting in a fatal nil-pointer dereference inside tss-lib that crashed the entire node process.
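The first root cause can be illustrated without NATS at all. The following is a minimal sketch, not the project's code: fakeBroker, subscribeAsync, subscribeSync, and the gate channel are hypothetical names invented here, and the gate simply stands in for scheduling or network delay. It shows that a subscription started inside a goroutine may not exist yet when the caller regains control, while a synchronous call guarantees it does.

```go
package main

import (
	"fmt"
	"sync"
)

// fakeBroker is a hypothetical in-memory stand-in for a message broker.
// It only tracks which subjects currently have a listener.
type fakeBroker struct {
	mu        sync.Mutex
	listeners map[string]bool
}

func newFakeBroker() *fakeBroker {
	return &fakeBroker{listeners: make(map[string]bool)}
}

// Subscribe registers a listener for subject.
func (b *fakeBroker) Subscribe(subject string) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.listeners[subject] = true
}

// HasListener reports whether subject has a registered listener.
func (b *fakeBroker) HasListener(subject string) bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	return b.listeners[subject]
}

// subscribeAsync mirrors the problematic shape: registration runs in a
// goroutine, so the caller gets control back before it has happened.
// gate stands in for scheduling or network delay. The returned channel is
// closed once registration has completed.
func subscribeAsync(b *fakeBroker, subject string, gate <-chan struct{}) <-chan struct{} {
	done := make(chan struct{})
	go func() {
		defer close(done)
		<-gate // registration is held up until the gate opens
		b.Subscribe(subject)
	}()
	return done
}

// subscribeSync registers the listener before returning to the caller.
func subscribeSync(b *fakeBroker, subject string) {
	b.Subscribe(subject)
}

func main() {
	b := newFakeBroker()

	gate := make(chan struct{})
	done := subscribeAsync(b, "broadcast.async", gate)
	fmt.Println("async, right after return:", b.HasListener("broadcast.async"))
	close(gate)
	<-done
	fmt.Println("async, after goroutine ran:", b.HasListener("broadcast.async"))

	subscribeSync(b, "broadcast.sync")
	fmt.Println("sync, right after return:", b.HasListener("broadcast.sync"))
}
```

In the sketch, any peer that sends between the return of subscribeAsync and the goroutine's registration finds no listener, which is the window that a fixed warm-up delay cannot reliably cover.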

Changes Made

  1. Synchronous Broadcast Registration:
    Removed the go func() in subscribeBroadcastAsync(). The node now blocks until the broadcast NATS subscription is fully acknowledged by the broker, ensuring readiness before the 200ms warmUpSession timer even begins.
  2. Exponential Backoff for Direct Messages:
    Changed the SendToOther retry mechanism to use 10 attempts with exponential backoff (starting at 100ms) and a MaxDelay cap of 1 second. This extends the resilience window to ~6-8 seconds, enough to absorb AWS network blips and async scheduling delays without hanging the node indefinitely.
  3. Panic Recovery in Session Message Handler:
    Added a deferred recover() block inside receiveTssMessage(). If tss-lib panics due to corrupted round state, the panic is caught, logged with a stack trace, and routed to the session's ErrCh. This fails the individual session cleanly without crashing the entire MPC node.
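The panic-recovery change in item 3 follows a common Go pattern. The sketch below is an illustration under assumptions, not the actual receiveTssMessage() code: roundState, processRound, and handleMessage are hypothetical names, and the nil pointer merely imitates missing round data. It shows a deferred closure that recovers the panic, logs it with a stack trace, and forwards it as an error on a channel.

```go
package main

import (
	"fmt"
	"log"
	"runtime/debug"
)

// roundState is a hypothetical stand-in for protocol round data.
type roundState struct {
	share *int // nil imitates data that an earlier, failed round never set
}

// processRound dereferences the round data. With a nil share this causes a
// runtime panic (nil pointer dereference).
func processRound(s *roundState) int {
	return *s.share + 1
}

// handleMessage runs processRound and converts any panic into an error on
// errCh instead of letting it terminate the process.
func handleMessage(s *roundState, errCh chan<- error) (result int) {
	defer func() {
		if r := recover(); r != nil {
			// debug.Stack returns the stack trace of the current goroutine.
			log.Printf("recovered panic: %v\n%s", r, debug.Stack())
			// Non-blocking send so a full channel cannot stall the handler.
			select {
			case errCh <- fmt.Errorf("message handler panicked: %v", r):
			default:
			}
		}
	}()
	return processRound(s)
}

func main() {
	errCh := make(chan error, 1)
	handleMessage(&roundState{}, errCh) // corrupted state: share is nil
	fmt.Println("session error:", <-errCh)
	fmt.Println("process is still running")
}
```

Note that recover only has an effect when it is called directly inside a deferred function, which is why the sketch uses a deferred closure rather than a bare defer recover() statement.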

Verification

  • The build succeeds and all existing unit tests pass (go test ./pkg/mpc/...).
  • Created a stress-test script that triggers ECDSA and EdDSA resharing at the same time.
  • Verified under artificial distributed load (injecting 500ms asynchronous delays into direct topic subscriptions) that, with the new retry logic, sessions survive severe network latency without "no responders" errors or panics.
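The delayed-subscription scenario can be approximated with a simulated clock. This is a rough sketch under stated assumptions, not the stress test used for this PR: fakeClock, delayedPeer, and sendWithRetry are hypothetical names, the backoff multiplier of 2 is assumed (the description does not state it), and the sketch does not sleep after the final attempt. It compares 3 attempts at a fixed 50ms against 10 attempts starting at 100ms with a 1 second cap, for a peer whose subscription becomes active after 500ms.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

var errNoResponders = errors.New("no responders available")

// fakeClock is a hypothetical simulated clock, so the sketch runs instantly.
type fakeClock struct{ now time.Duration }

// Sleep advances simulated time instead of blocking.
func (c *fakeClock) Sleep(d time.Duration) { c.now += d }

// delayedPeer models a receiver whose subscription only becomes active
// readyAt after the start of the simulation.
type delayedPeer struct {
	clock   *fakeClock
	readyAt time.Duration
}

// Send fails with errNoResponders until the peer is ready.
func (p *delayedPeer) Send() error {
	if p.clock.now < p.readyAt {
		return errNoResponders
	}
	return nil
}

// sendWithRetry calls send up to attempts times. The wait after a failed try
// starts at initial and is multiplied by factor each time, capped at maxDelay.
// A factor of 1 gives a fixed delay. It returns the number of tries made.
func sendWithRetry(send func() error, sleep func(time.Duration),
	attempts int, initial, maxDelay time.Duration, factor int) (int, error) {
	delay := initial
	var err error
	for i := 1; i <= attempts; i++ {
		if err = send(); err == nil {
			return i, nil
		}
		if i == attempts {
			break // no point waiting after the final attempt
		}
		sleep(delay)
		delay *= time.Duration(factor)
		if delay > maxDelay {
			delay = maxDelay
		}
	}
	return attempts, err
}

func main() {
	ready := 500 * time.Millisecond

	// Old behavior as described: 3 attempts, fixed 50ms delay.
	c := &fakeClock{}
	p := &delayedPeer{clock: c, readyAt: ready}
	n, err := sendWithRetry(p.Send, c.Sleep, 3, 50*time.Millisecond, 50*time.Millisecond, 1)
	fmt.Println("fixed delay: attempts =", n, "elapsed =", c.now, "err =", err)

	// New behavior as described: 10 attempts, 100ms initial, 1s cap.
	c = &fakeClock{}
	p = &delayedPeer{clock: c, readyAt: ready}
	n, err = sendWithRetry(p.Send, c.Sleep, 10, 100*time.Millisecond, time.Second, 2)
	fmt.Println("backoff: attempts =", n, "elapsed =", c.now, "err =", err)
}
```

With these assumed parameters, the waits between the 10 attempts add up to 100 + 200 + 400 + 800 + 5 × 1000 = 6500ms, which is consistent with the ~6-8 second window quoted in the description.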

@vietddude vietddude requested a review from anhthii March 16, 2026 10:03
@anhthii
Contributor

anhthii commented Mar 19, 2026

@vietddude merged into this pr #149

@anhthii anhthii closed this Mar 19, 2026

2 participants