bitswap/bsnet: don't mark peer unresponsive on a single send failure by D4ryl00 · Pull Request #1164 · ipfs/boxo

D4ryl00 · 2026-06-02T13:36:19Z

Problem

Since v0.34.0, a Bitswap block fetch can hang forever after a peer disconnects and quickly reconnects. streamMessageSender.send() marks the peer unresponsive on every failed attempt to open a stream or send a message. But send() is called from multiAttempt(), which already resets the stream, retries, and marks the peer unresponsive only after all retries are exhausted — so the per-attempt marking is redundant and premature.

After a connection bounce, the inbound disconnect is suppressed (the peer is already reconnected, so the bsnet notifiee's Disconnected returns early), leaving the ConnectEventManager in responsive and making the reconnect Connected a no-op. A single send failure on the stale stream then marks the still-connected peer unresponsive and emits PeerDisconnected, with no recovery path (returning to responsive needs an inbound OnMessage that never comes because we stop sending our wantlist). The fetch hangs until the peer fully disconnects.

Fix

Drop the two premature MarkUnresponsive calls in send() and rely on multiAttempt() to mark the peer unresponsive once retries are exhausted.

Testing

A deterministic mocknet reproducer (see bitswap: block fetch hangs forever after peer disconnect+reconnect (regression in v0.34.0) #1163) goes from ~100% hang to 0% hang with this change.
go test ./bitswap/network/... passes.

…failure streamMessageSender.send() marked a peer unresponsive on every failed attempt to open a stream or send a message. send() is called from multiAttempt(), which already resets the stream, retries, and marks the peer unresponsive only after all retries are exhausted, so the per-attempt marking is both redundant and premature. It is harmful after a connection bounce (a disconnect quickly followed by a reconnect): the inbound disconnect notification is suppressed because the peer is already connected again, so the ConnectEventManager stays in the responsive state and the reconnect Connected event is a no-op. A single send failure on the stale stream then transitions the peer to unresponsive and emits PeerDisconnected, sidelining a peer whose connection is in fact live. There is no recovery path: returning to responsive requires an incoming message (OnMessage), which never arrives because we have stopped sending our wantlist. Fetches from that peer hang until it fully disconnects. Rely on multiAttempt() to mark the peer unresponsive after retries fail. Signed-off-by: D4ryl00 <d4ryl00@gmail.com>

Signed-off-by: D4ryl00 <d4ryl00@gmail.com>

… peer unresponsive Regression test for the reconnect hang: a recoverable send error (failure on the first attempt, success on retry) must not emit a PeerDisconnected, i.e. must not mark the still-connected peer unresponsive. Signed-off-by: D4ryl00 <d4ryl00@gmail.com>

boxo v0.34+ marks a peer unresponsive on a single bitswap send failure, which sidelines a peer that reconnected (its disconnect being suppressed) with no recovery path, hanging replication after a peer drops and reconnects. This caused TestReplicateAutomatically to hang intermittently under load. Point boxo at a fork branch carrying the one-line fix (based on v0.39.0, the version kubo v0.41 pins). Drop this replace once the upstream fix is released. Upstream issue: ipfs/boxo#1163 Upstream PR: ipfs/boxo#1164 Signed-off-by: D4ryl00 <d4ryl00@gmail.com>

gammazero · 2026-06-02T14:39:50Z

+	if !stillConnected {
+		t.Fatal("peer was marked unresponsive after a single recoverable send error")
+	}
+}


Need to make sure there is a test that unresponsive peer is still disconnected eventually.

gammazero · 2026-06-02T14:43:18Z

 	stream, err := s.Connect(ctx)
 	if err != nil {
 		log.Infof("failed to open stream to %s: %s", s.to, err)
-		s.bsnet.connectEvtMgr.MarkUnresponsive(s.to)


Connect failed... do we not want to mark unresponsive here? In other words, should any retry logic be in connect?

The retry already covers Connect: multiAttempt re-runs the whole send closure (which calls Connect on each attempt) up to MaxRetries with backoff, so no separate retry loop in Connect is needed. And connect failures are still marked unresponsive — just from multiAttempt once all retries are exhausted (line 218), rather than on the first attempt.

Marking here in send() was the bug: it fired before any retry, so a transient connect failure right after a reconnect (where the disconnect notification is suppressed) would permanently sideline a live peer. The persistent-failure → unresponsive path is covered by the new TestMessageSendErrorMarksUnresponsive.

…d unresponsive Signed-off-by: D4ryl00 <d4ryl00@gmail.com>

D4ryl00 requested a review from a team as a code owner June 2, 2026 13:36

D4ryl00 added 2 commits June 2, 2026 15:42

docs: changelog for bitswap unresponsive-on-reconnect fix

c0d42e7

Signed-off-by: D4ryl00 <d4ryl00@gmail.com>

gammazero reviewed Jun 2, 2026

View reviewed changes

lidel self-assigned this Jun 2, 2026

gammazero added the need/maintainers-input Needs input from the current maintainer(s) label Jun 2, 2026

gammazero reviewed Jun 2, 2026

View reviewed changes

bitswap/network/bsnet: test that a persistently failing peer is marke…

9447cbd

…d unresponsive Signed-off-by: D4ryl00 <d4ryl00@gmail.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

bitswap/bsnet: don't mark peer unresponsive on a single send failure#1164

bitswap/bsnet: don't mark peer unresponsive on a single send failure#1164
D4ryl00 wants to merge 4 commits into
ipfs:mainfrom
D4ryl00:fix/bitswap-reconnect-mark-unresponsive

D4ryl00 commented Jun 2, 2026

Uh oh!

gammazero Jun 2, 2026

Uh oh!

D4ryl00 Jun 2, 2026

Uh oh!

gammazero Jun 2, 2026

Uh oh!

D4ryl00 Jun 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

D4ryl00 commented Jun 2, 2026

Problem

Fix

Testing

Uh oh!

gammazero Jun 2, 2026

Choose a reason for hiding this comment

Uh oh!

D4ryl00 Jun 2, 2026

Choose a reason for hiding this comment

Uh oh!

gammazero Jun 2, 2026

Choose a reason for hiding this comment

Uh oh!

D4ryl00 Jun 2, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants