Enhance messaging reliability and error handling#148
Closed
Conversation
Contributor
@vietddude merged into this PR #149
Description
This PR resolves the critical `nats: no responders available for request` concurrency error and the subsequent process panics that occur during high-load concurrent MPC operations (e.g., triggering ECDSA and EdDSA resharing simultaneously for the same wallet).
Root Causes
Through testing with injected network latency, we identified three compounding bugs in the Session and Messaging layers:
- `subscribeBroadcastAsync` was wrapping the NATS `Subscribe` call in a `go func()`. This meant the subscription happened asynchronously, allowing the initialization flow (and the 200ms `warmUpSession` delay) to finish before the node had actually registered its listener with the NATS server.
- `SendToOther` in `point2point.go` only retried 3 times with a fixed 50ms delay, giving a maximum resilience window of just ~150ms. In a distributed cloud environment (AWS), network jitter and goroutine scheduling delays frequently exceed 150ms, causing the sender to exhaust its retries and fail the session before the receiving peer's subscription was fully established.
- When a P2P message failed (e.g. `failed to calculate Alice_end`), the `tss-lib` internal state became corrupted (missing data in `round3`). However, because the session remained open, subsequent incoming messages would trigger `round4.Start()`, resulting in a fatal nil-pointer dereference inside `tss-lib` that crashed the entire node process.
Changes Made
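The effect of removing the `go func()` around the subscribe call can be illustrated with a stand-in for the NATS subscription (hypothetical names; the real code uses the nats.go client, where `Subscribe` followed by `Flush` confirms the server has registered the listener):

```go
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

var subscribed atomic.Bool

// subscribe stands in for the NATS Subscribe call; the 50ms sleep models
// the network round-trip to the broker.
func subscribe() {
	time.Sleep(50 * time.Millisecond)
	subscribed.Store(true)
}

func initSessionBuggy() {
	go subscribe() // BUG: init returns before the listener is registered
}

func initSessionFixed() {
	subscribe() // blocks until the subscription is acknowledged
}

func main() {
	initSessionBuggy()
	// false: peers that send now hit "no responders available"
	fmt.Println("buggy: ready after init?", subscribed.Load())

	subscribed.Store(false)
	initSessionFixed()
	// true: the warm-up timer only starts once the listener exists
	fmt.Println("fixed: ready after init?", subscribed.Load())
}
```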
- Removed the `go func()` in `subscribeBroadcastAsync()`. The node now blocks until the broadcast NATS subscription is fully acknowledged by the broker, ensuring readiness before the 200ms `warmUpSession` timer even begins.
- Upgraded the `SendToOther` retry mechanism to 10 attempts with exponential backoff (starting at 100ms) and a `MaxDelay` cap of 1 second. This extends the resilience window to roughly 6-8 seconds, absorbing AWS network blips and async scheduling delays without hanging the node indefinitely.
- Added a `defer recover()` block inside `receiveTssMessage()`. If `tss-lib` panics due to corrupted round state, the panic is caught, logged with a stack trace, and routed gracefully to the session's `ErrCh`. This cleanly fails the individual session without crashing the entire MPC node.
Verification
Ran the existing test suite (`go test ./pkg/mpc/...`).
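The panic guard added to `receiveTssMessage()` can be sketched as follows (a simplified shape under assumed signatures; the real handler also decodes and dispatches the message):

```go
package main

import (
	"fmt"
	"runtime/debug"
)

// receiveTssMessage guards a message handler with defer/recover: if the
// handler (e.g. a tss-lib round transition) panics, the panic is converted
// into an error on the session's error channel instead of killing the process.
func receiveTssMessage(errCh chan<- error, handle func()) {
	defer func() {
		if r := recover(); r != nil {
			// Capture the stack trace for logging, then fail only this session.
			err := fmt.Errorf("tss-lib panic recovered: %v\n%s", r, debug.Stack())
			select {
			case errCh <- err:
			default: // don't block if nobody is draining the channel anymore
			}
		}
	}()
	handle()
}

func main() {
	errCh := make(chan error, 1)
	receiveTssMessage(errCh, func() {
		var p *int
		_ = *p // simulate the nil-pointer dereference in round4.Start()
	})
	// The process keeps running; only the session's ErrCh sees the failure.
	fmt.Println("session failed gracefully:", <-errCh != nil)
}
```

Note that `recover()` only intercepts panics on the same goroutine, which is why the guard must live inside the message-handling function itself rather than in the caller.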