bitswap/bsnet: don't mark peer unresponsive on a single send failure#1164
bitswap/bsnet: don't mark peer unresponsive on a single send failure#1164D4ryl00 wants to merge 4 commits into
Conversation
…failure streamMessageSender.send() marked a peer unresponsive on every failed attempt to open a stream or send a message. send() is called from multiAttempt(), which already resets the stream, retries, and marks the peer unresponsive only after all retries are exhausted, so the per-attempt marking is both redundant and premature. It is harmful after a connection bounce (a disconnect quickly followed by a reconnect): the inbound disconnect notification is suppressed because the peer is already connected again, so the ConnectEventManager stays in the responsive state and the reconnect Connected event is a no-op. A single send failure on the stale stream then transitions the peer to unresponsive and emits PeerDisconnected, sidelining a peer whose connection is in fact live. There is no recovery path: returning to responsive requires an incoming message (OnMessage), which never arrives because we have stopped sending our wantlist. Fetches from that peer hang until it fully disconnects. Rely on multiAttempt() to mark the peer unresponsive after retries fail. Signed-off-by: D4ryl00 <d4ryl00@gmail.com>
Signed-off-by: D4ryl00 <d4ryl00@gmail.com>
… peer unresponsive Regression test for the reconnect hang: a recoverable send error (failure on the first attempt, success on retry) must not emit a PeerDisconnected, i.e. must not mark the still-connected peer unresponsive. Signed-off-by: D4ryl00 <d4ryl00@gmail.com>
boxo v0.34+ marks a peer unresponsive on a single bitswap send failure, which sidelines a peer that reconnected (its disconnect being suppressed) with no recovery path, hanging replication after a peer drops and reconnects. This caused TestReplicateAutomatically to hang intermittently under load. Point boxo at a fork branch carrying the one-line fix (based on v0.39.0, the version kubo v0.41 pins). Drop this replace once the upstream fix is released. Upstream issue: ipfs/boxo#1163 Upstream PR: ipfs/boxo#1164 Signed-off-by: D4ryl00 <d4ryl00@gmail.com>
| if !stillConnected { | ||
| t.Fatal("peer was marked unresponsive after a single recoverable send error") | ||
| } | ||
| } |
There was a problem hiding this comment.
Need to make sure there is a test that unresponsive peer is still disconnected eventually.
| stream, err := s.Connect(ctx) | ||
| if err != nil { | ||
| log.Infof("failed to open stream to %s: %s", s.to, err) | ||
| s.bsnet.connectEvtMgr.MarkUnresponsive(s.to) |
There was a problem hiding this comment.
Connect failed... do we not want to mark unresponsive here? In other words, should any retry logic be in connect?
There was a problem hiding this comment.
The retry already covers Connect: multiAttempt re-runs the whole send closure (which calls Connect on each attempt) up to MaxRetries with backoff, so no separate retry loop in Connect is needed. And connect failures are still marked unresponsive — just from multiAttempt once all retries are exhausted (line 218), rather than on the first attempt.
Marking here in send() was the bug: it fired before any retry, so a transient connect failure right after a reconnect (where the disconnect notification is suppressed) would permanently sideline a live peer. The persistent-failure → unresponsive path is covered by the new TestMessageSendErrorMarksUnresponsive.
…d unresponsive Signed-off-by: D4ryl00 <d4ryl00@gmail.com>
Fixes #1163
Problem
Since v0.34.0, a Bitswap block fetch can hang forever after a peer disconnects and quickly reconnects.
streamMessageSender.send()marks the peer unresponsive on every failed attempt to open a stream or send a message. Butsend()is called frommultiAttempt(), which already resets the stream, retries, and marks the peer unresponsive only after all retries are exhausted — so the per-attempt marking is redundant and premature.After a connection bounce, the inbound disconnect is suppressed (the peer is already reconnected, so the bsnet notifiee's
Disconnectedreturns early), leaving theConnectEventManagerinresponsiveand making the reconnectConnecteda no-op. A single send failure on the stale stream then marks the still-connected peerunresponsiveand emitsPeerDisconnected, with no recovery path (returning toresponsiveneeds an inboundOnMessagethat never comes because we stop sending our wantlist). The fetch hangs until the peer fully disconnects.Fix
Drop the two premature
MarkUnresponsivecalls insend()and rely onmultiAttempt()to mark the peer unresponsive once retries are exhausted.Testing
go test ./bitswap/network/...passes.