Barrier implementation: improve reliability #149
Merged
Conversation
Add custom dialer with TCP keepalive (30s), NATS ping interval (20s/3 outstanding), 16MB reconnect buffer, and custom inbox prefix to prevent AWS NAT Gateway idle timeout from silently killing connections.
Require 5 consecutive health check failures (~25s) before removing a peer from Consul, preventing transient NATS reconnections from cascading into unnecessary peer eviction and ECDH re-exchange.
…fecycle improvements
- Route MPC results to specific clients via clientID in topic subjects, replacing shared consumers with per-client ephemeral consumers
- Replace warmUpSession() sleep with WaitForPeersReady() barrier that verifies all peers have active subscriptions before starting the protocol
- Add graceful session shutdown with doneCh/sendErr to prevent goroutine leaks and blocking on error channels
- Add JetStream MaxAckPending for backpressure and InProgress() to prevent premature redelivery during long MPC operations
- Use atomic tryAddSession/removeSession for duplicate session detection
…bytes to 100MB
- Remove per-broker WithMaxAge overrides in favor of the new default
- Add WithMaxBytes broker option for configurable stream size limits
- Increase message queue max bytes from 10MB to 100MB
The warmUpSession sleep was replaced by the proper WaitForPeersReady barrier handshake, but the old code and config were never cleaned up.
# Fix: NATS "No Responders" & Node Unresponsiveness After Signing Failure

## Every Change Explained
### 1. `pkg/mpc/session.go` — The core fixes

**Added `doneCh` and `doneOnce` to the session struct.**
Why: Before this, there was no way to tell a running `Sign()` or `Reshare()` goroutine to stop. When an error occurred, the error handler goroutine exited, but `Sign()` was stuck forever in `select { case <-outCh: ... case <-endCh: ... }` — neither channel would ever receive again after a failure. `doneCh` gives us a broadcast signal: closing it wakes up every goroutine selecting on it simultaneously.

**Added a `sendErr()` method.**
Why: The old code did `s.ErrCh <- err` directly. `ErrCh` was unbuffered, so this blocks until someone reads. The error handler goroutine was the only reader, and it exited after the first error. So when `handleTssMessage` tried to send a second error (e.g., failing to reach peer B after already failing to reach peer C), it blocked forever. This blocked the `Sign()` goroutine (since `handleTssMessage` is called from Sign's select case), which meant the goroutine leaked permanently with its NATS subscriptions still active. `sendErr` uses a select so that if the session is stopped (`doneCh` closed), the send is abandoned instead of blocking.
**Replaced every `s.ErrCh <- ...` with `s.sendErr(...)`.**
Why: Every error producer in the session had the same blocking problem: `handleTssMessage`, `receiveP2PTssMessage`, `receiveBroadcastTssMessage`, `receiveTssMessage`, `subscribeFromPeersAsync`, `subscribeBroadcastAsync` — all of them. Any one of them could deadlock the session.
**Added `Stop()` and `Done()` methods.**
Why: `Stop()` is the trigger to terminate a session. `sync.Once` ensures closing `doneCh` twice doesn't panic. Multiple callers can safely call `Stop()` — the error handler, the `Sign` method on success, or `Close`.
**Added `WaitForPeersReady()` — the readiness barrier.**
Why: This is the fix for the "no responders" bug. The old code used a `time.Sleep(200ms)`, hoping all peers would finish subscribing in time. This is fundamentally unreliable: under load, network latency, or scheduling pressure, subscriptions can easily take longer than 200ms. The barrier works by verification, not assumption. Each node subscribes to a barrier topic (proving its direct subscriptions are also active — they're set up before the barrier). Then each node pings every peer's barrier topic using `nats.Request()`. If the peer hasn't subscribed yet, "no responders" is returned and we retry. Only when ALL peers respond do we proceed. This is deterministic — no timing assumptions.
**Fixed `Close()` with nil checks.**
Why: `subscribeBroadcastAsync()` runs in a goroutine. If `Close()` was called before that goroutine completed (e.g., due to an early error), `s.broadcastSub` was nil and `Unsubscribe()` would panic, crashing the node. The old `Close()` also returned early on the first unsubscribe error, leaking the remaining subscriptions. Now it logs errors but continues cleaning up all subscriptions.
**Added a `barrierSub` field.**
Why: The barrier subscription needs to be cleaned up in `Close()` like all the other subscriptions; otherwise it leaks.

### 2. `pkg/mpc/ecdsa_signing_session.go`

**Buffered `ErrCh`.**
Why: With buffer size 1, the first error send never blocks, regardless of whether the error handler has started reading yet. This prevents a subtle race where `party.Start()` errors before the error handler goroutine is scheduled.
**Initialized `doneCh`.**
Why: Without initialization, `doneCh` is nil. A `select` on a nil channel blocks forever, so `sendErr` and the `case <-s.doneCh` in `Sign()` would never work.
**Added `case <-s.doneCh` to `Sign()`.**
Why: This is the goroutine leak fix. Before, `Sign()` only selected on `outCh` and `endCh`. After an error killed the TSS protocol, neither channel would ever produce again. `Sign()` was stuck forever, holding NATS subscriptions open, consuming memory, and (critically) holding a reference to the session that could never be garbage collected. With `doneCh`, when the error handler calls `session.Stop()`, `doneCh` closes, `Sign()` returns, and the goroutine is freed.
**Changed `s.ErrCh <- err` to `s.sendErr(err)` in `party.Start` and verify.**
Why: Same deadlock prevention as in session.go. If `Sign()` already exited (session stopped), writing to `ErrCh` would block forever.
### 3. `pkg/mpc/eddsa_signing_session.go`

Exact same changes as ECDSA. Both session types had identical bugs.
### 4. `pkg/mpc/ecdsa_keygen_session.go` & `pkg/mpc/eddsa_keygen_session.go`

**Buffered `ErrCh` + `doneCh`.**
Why: Keygen sessions embed `session` and use the same `sendErr` and `Close` methods. Without `doneCh` initialized, any `sendErr` call in the keygen path would block on the nil-channel select. Without the buffered `ErrCh`, keygen errors could deadlock too (same pattern, just less likely, because keygen is less error-prone than signing).
**Added `WaitForPeersReady()` to the `KeyGenSession` interface.**
Why: The event consumer calls `WaitForPeersReady()` on keygen sessions before starting `GenerateKey()`. The method is implemented on the embedded `session` struct, but Go interfaces need it declared.
### 5. `pkg/mpc/ecdsa_resharing_session.go` & `pkg/mpc/eddsa_resharing_session.go`

**Same buffered `ErrCh` + `doneCh` + interface additions.**
Why: The reshare bug reported by treeforest was identical to the signing bug. The reshare protocol sends direct messages between peers, and if a peer hasn't subscribed yet (warmup race), "no responders" errors cascade. After the error, the `Reshare` goroutine leaks, and the node stops processing future reshare events. The `WaitForPeersReady()` and `Stop()` additions to the `ReshareSession` interface let the event consumer use the barrier and clean shutdown.
### 6. `pkg/eventconsumer/event_consumer.go`

**Replaced `warmUpSession()` with `WaitForPeersReady()` in signing.**
Why: The 200ms sleep was the root cause of the "no responders" errors. It assumed all peers would finish subscribing within 200ms. Under any load, network latency, or scheduling pressure, this fails. The barrier verifies instead of assuming.
**Same replacement in the keygen and reshare paths.**
Why: All three MPC operations (keygen, sign, reshare) had the same warmup race condition. The reshare concurrent ECDSA+EDDSA failure reported by treeforest was this exact bug.
**The error handler now calls `session.Stop()` and `done()`.**
Why: Before, the error handler just returned without stopping anything. `Sign()` kept running (stuck forever), and the context was never canceled. Now `session.Stop()` wakes up `Sign()` via `doneCh`, and `done()` cancels the context so any future select on `ctx.Done()` also exits. Clean teardown of the entire session on error.
### 7. `pkg/mpc/registry.go` — Health check resilience

**Added a consecutive failure counter.**
Why: The old health check deleted a peer's Consul key after just 2 failed NATS requests (~6 seconds). A single network blip would evict a peer, trigger ECDH re-exchange, and cascade into a "not ready" state across the cluster. With 5 consecutive failures required (~25 seconds of continuous unresponsiveness), transient issues are tolerated while genuinely dead peers are still detected.
**Increased retry attempts from 2 to 3 with a 1s delay.**
Why: Gives the NATS connection more time to recover from brief interruptions before counting a failure.
### 8. `cmd/mpcium/main.go` — NATS connection hardening

**TCP keepalive via a custom dialer.**
Why: AWS NAT Gateway/NLB silently drops TCP connections that sit idle for ~350 seconds. TCP keepalive probes count as wire activity, preventing the idle timeout from triggering. Without this, the NATS connection could be silently killed, and neither client nor server would know until the next message attempt.
**NATS application-level pings.**
Why: Even with TCP keepalive, the NATS client needs to detect dead connections at the application level. The default ping interval is 2 minutes. With 20s pings and at most 3 outstanding, a dead connection is detected within 60 seconds — well before the AWS 350s timeout.
**Reconnect buffer.**
Why: During a NATS reconnection, published messages are buffered client-side. The default buffer is small. If a signing operation produces messages during a brief reconnect, the larger 16MB buffer prevents message loss.
**`DisconnectErrHandler` replaces `DisconnectHandler`.**
Why: The old handler didn't capture the disconnect reason. The new one logs the error, which is critical for diagnosing whether disconnects are caused by network issues, server shutdown, or idle timeouts.