Enhance messaging reliability and error handling#148
Closed
Conversation
Contributor
@vietddude merged into this PR #149
Description
This PR resolves the critical `nats: no responders available for request` concurrency error and the subsequent process panics that occur during high-load concurrent MPC operations (e.g., triggering ECDSA and EdDSA resharing simultaneously for the same wallet).
Root Causes
Through testing with injected network latency, we identified three compounding bugs in the Session and Messaging layers:
- `subscribeBroadcastAsync` was wrapping the NATS `Subscribe` call in a `go func()`. This meant the subscription happened asynchronously, allowing the initialization flow (and the 200ms `warmUpSession` delay) to finish before the node had actually registered its listener with the NATS server.
- `SendToOther` in `point2point.go` only retried 3 times with a fixed 50ms delay, giving a maximum resilience window of just ~150ms. In a distributed cloud environment (AWS), network jitter and goroutine scheduling delays frequently exceed 150ms, causing the sender to exhaust its retries and fail the session before the receiving peer's subscription was fully established.
- When a P2P message failed (e.g. `failed to calculate Alice_end`), the `tss-lib` internal state became corrupted (missing data in `round3`). However, because the session remained open, subsequent incoming messages would trigger `round4.Start()`, resulting in a fatal nil-pointer dereference inside `tss-lib` that crashed the entire node process.
Changes Made
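The effect of removing the `go func()` around the subscribe call can be illustrated with a stand-in for the NATS subscription (hypothetical names; the real code uses the nats.go client, where `Subscribe` followed by `Flush` confirms the server has registered the listener):

```go
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

var subscribed atomic.Bool

// subscribe stands in for the NATS Subscribe call; the 50ms sleep models
// the network round-trip to the broker.
func subscribe() {
	time.Sleep(50 * time.Millisecond)
	subscribed.Store(true)
}

func initSessionBuggy() {
	go subscribe() // BUG: init returns before the listener is registered
}

func initSessionFixed() {
	subscribe() // blocks until the subscription is acknowledged
}

func main() {
	initSessionBuggy()
	// false: peers that send now hit "no responders available"
	fmt.Println("buggy: ready after init?", subscribed.Load())

	subscribed.Store(false)
	initSessionFixed()
	// true: the warm-up timer only starts once the listener exists
	fmt.Println("fixed: ready after init?", subscribed.Load())
}
```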
- Removed the `go func()` in `subscribeBroadcastAsync()`. The node now blocks until the broadcast NATS subscription is fully acknowledged by the broker, ensuring readiness before the 200ms `warmUpSession` timer even begins.
- Upgraded the `SendToOther` retry mechanism to 10 attempts with exponential backoff (starting at 100ms) and a `MaxDelay` cap of 1 second. This extends the resilience window to roughly 6-8 seconds, absorbing AWS network blips and async scheduling delays without hanging the node indefinitely.
- Added a `defer recover()` block inside `receiveTssMessage()`. If `tss-lib` panics due to corrupted round state, the panic is caught, logged with a stack trace, and routed gracefully to the session's `ErrCh`. This cleanly fails the individual session without crashing the entire MPC node.
Verification
Ran the existing test suite (`go test ./pkg/mpc/...`).
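The panic guard added to `receiveTssMessage()` can be sketched as follows (a simplified shape under assumed signatures; the real handler also decodes and dispatches the message):

```go
package main

import (
	"fmt"
	"runtime/debug"
)

// receiveTssMessage guards a message handler with defer/recover: if the
// handler (e.g. a tss-lib round transition) panics, the panic is converted
// into an error on the session's error channel instead of killing the process.
func receiveTssMessage(errCh chan<- error, handle func()) {
	defer func() {
		if r := recover(); r != nil {
			// Capture the stack trace for logging, then fail only this session.
			err := fmt.Errorf("tss-lib panic recovered: %v\n%s", r, debug.Stack())
			select {
			case errCh <- err:
			default: // don't block if nobody is draining the channel anymore
			}
		}
	}()
	handle()
}

func main() {
	errCh := make(chan error, 1)
	receiveTssMessage(errCh, func() {
		var p *int
		_ = *p // simulate the nil-pointer dereference in round4.Start()
	})
	// The process keeps running; only the session's ErrCh sees the failure.
	fmt.Println("session failed gracefully:", <-errCh != nil)
}
```

Note that `recover()` only intercepts panics on the same goroutine, which is why the guard must live inside the message-handling function itself rather than in the caller.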