Skip to content

bitswap/bsnet: don't mark peer unresponsive on a single send failure#1164

Open
D4ryl00 wants to merge 4 commits into
ipfs:mainfrom
D4ryl00:fix/bitswap-reconnect-mark-unresponsive
Open

bitswap/bsnet: don't mark peer unresponsive on a single send failure#1164
D4ryl00 wants to merge 4 commits into
ipfs:mainfrom
D4ryl00:fix/bitswap-reconnect-mark-unresponsive

Conversation

@D4ryl00
Copy link
Copy Markdown

@D4ryl00 D4ryl00 commented Jun 2, 2026

Fixes #1163

Problem

Since v0.34.0, a Bitswap block fetch can hang forever after a peer disconnects and quickly reconnects. streamMessageSender.send() marks the peer unresponsive on every failed attempt to open a stream or send a message. But send() is called from multiAttempt(), which already resets the stream, retries, and marks the peer unresponsive only after all retries are exhausted — so the per-attempt marking is redundant and premature.

After a connection bounce, the inbound disconnect is suppressed (the peer is already reconnected, so the bsnet notifiee's Disconnected returns early), leaving the ConnectEventManager in responsive and making the reconnect Connected a no-op. A single send failure on the stale stream then marks the still-connected peer unresponsive and emits PeerDisconnected, with no recovery path (returning to responsive needs an inbound OnMessage that never comes because we stop sending our wantlist). The fetch hangs until the peer fully disconnects.

Fix

Drop the two premature MarkUnresponsive calls in send() and rely on multiAttempt() to mark the peer unresponsive once retries are exhausted.

Testing

…failure

streamMessageSender.send() marked a peer unresponsive on every failed
attempt to open a stream or send a message. send() is called from
multiAttempt(), which already resets the stream, retries, and marks the
peer unresponsive only after all retries are exhausted, so the per-attempt
marking is both redundant and premature.

It is harmful after a connection bounce (a disconnect quickly followed by
a reconnect): the inbound disconnect notification is suppressed because the
peer is already connected again, so the ConnectEventManager stays in the
responsive state and the reconnect Connected event is a no-op. A single
send failure on the stale stream then transitions the peer to unresponsive
and emits PeerDisconnected, sidelining a peer whose connection is in fact
live. There is no recovery path: returning to responsive requires an
incoming message (OnMessage), which never arrives because we have stopped
sending our wantlist. Fetches from that peer hang until it fully disconnects.

Rely on multiAttempt() to mark the peer unresponsive after retries fail.

Signed-off-by: D4ryl00 <d4ryl00@gmail.com>
@D4ryl00 D4ryl00 requested a review from a team as a code owner June 2, 2026 13:36
D4ryl00 added 2 commits June 2, 2026 15:42
Signed-off-by: D4ryl00 <d4ryl00@gmail.com>
… peer unresponsive

Regression test for the reconnect hang: a recoverable send error (failure
on the first attempt, success on retry) must not emit a PeerDisconnected,
i.e. must not mark the still-connected peer unresponsive.

Signed-off-by: D4ryl00 <d4ryl00@gmail.com>
D4ryl00 added a commit to D4ryl00/go-orbit-db that referenced this pull request Jun 2, 2026
boxo v0.34+ marks a peer unresponsive on a single bitswap send failure,
which sidelines a peer that reconnected (its disconnect being suppressed)
with no recovery path, hanging replication after a peer drops and
reconnects. This caused TestReplicateAutomatically to hang intermittently
under load.

Point boxo at a fork branch carrying the one-line fix (based on v0.39.0,
the version kubo v0.41 pins). Drop this replace once the upstream fix is
released.

Upstream issue: ipfs/boxo#1163
Upstream PR:    ipfs/boxo#1164

Signed-off-by: D4ryl00 <d4ryl00@gmail.com>
if !stillConnected {
t.Fatal("peer was marked unresponsive after a single recoverable send error")
}
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to make sure there is a test that unresponsive peer is still disconnected eventually.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lidel lidel self-assigned this Jun 2, 2026
@gammazero gammazero added the need/maintainers-input Needs input from the current maintainer(s) label Jun 2, 2026
stream, err := s.Connect(ctx)
if err != nil {
log.Infof("failed to open stream to %s: %s", s.to, err)
s.bsnet.connectEvtMgr.MarkUnresponsive(s.to)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Connect failed... do we not want to mark unresponsive here? In other words, should any retry logic be in connect?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The retry already covers Connect: multiAttempt re-runs the whole send closure (which calls Connect on each attempt) up to MaxRetries with backoff, so no separate retry loop in Connect is needed. And connect failures are still marked unresponsive — just from multiAttempt once all retries are exhausted (line 218), rather than on the first attempt.

Marking here in send() was the bug: it fired before any retry, so a transient connect failure right after a reconnect (where the disconnect notification is suppressed) would permanently sideline a live peer. The persistent-failure → unresponsive path is covered by the new TestMessageSendErrorMarksUnresponsive.

…d unresponsive

Signed-off-by: D4ryl00 <d4ryl00@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

need/maintainers-input Needs input from the current maintainer(s)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

bitswap: block fetch hangs forever after peer disconnect+reconnect (regression in v0.34.0)

3 participants