Research Review: Pushsync Silent Chunk Loss #5400

@gacevicljubisa

PR: #5390
Branch: fix/pushsync-out-of-depth-chunk-loss

Summary

Chunks can silently disappear from the network after a successful upload. Three bugs in the pushsync/pusher pipeline allow a chunk to be "delivered" (from the origin's perspective) while no node in the correct neighborhood actually retains it. The proposed fixes change protocol-level forwarding and receipt handling behavior.

Note: These issues are reproducible in smaller clusters such as the public testnet, where limited neighborhood sizes make out-of-depth forwarding and shallow receipts more likely.


Bug 1: Out-of-Depth Storing

Problem: When a node has no closer peer (ErrWantSelf) but the chunk is outside its area of responsibility (AOR), it stores the chunk anyway. The chunk lands in a low-proximity bin and unreserve() evicts it immediately, but the origin already holds a valid receipt, so the chunk is lost.

Fix: Before storing, check the chunk's proximity to the node. If proximity(chunk, self) < rad, return ErrOutOfDepthStoring instead of storing, so the origin retries with the next peer. (pkg/pushsync/pushsync.go:301-309)
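A minimal sketch of the guard described above, not the code from the PR. The byte-slice addresses, the simplified proximity function, the storeIfInDepth helper and the error text are assumptions made for illustration; only the name ErrOutOfDepthStoring and the comparison against the storage radius come from this issue.

```go
package main

import (
	"errors"
	"fmt"
	"math/bits"
)

// ErrOutOfDepthStoring reuses the error name from the issue; the message is illustrative.
var ErrOutOfDepthStoring = errors.New("chunk is outside of the storage radius")

// proximity is a simplified stand-in: the number of leading bits a and b share.
func proximity(a, b []byte) int {
	n := len(a)
	if len(b) < n {
		n = len(b)
	}
	for i := 0; i < n; i++ {
		if x := a[i] ^ b[i]; x != 0 {
			return i*8 + bits.LeadingZeros8(x)
		}
	}
	return n * 8
}

// storeIfInDepth (hypothetical helper) refuses to store a chunk whose
// proximity to this node is below the storage radius.
func storeIfInDepth(chunkAddr, self []byte, storageRadius int, store func() error) error {
	if proximity(chunkAddr, self) < storageRadius {
		return ErrOutOfDepthStoring // no receipt is produced; the origin can try another peer
	}
	return store()
}

func main() {
	self := []byte{0xA0}
	store := func() error { return nil }

	// 0xAF shares 4 leading bits with 0xA0, which is within a radius of 3.
	fmt.Println(storeIfInDepth([]byte{0xAF}, self, 3, store))
	// 0x20 shares 0 leading bits with 0xA0, which is outside a radius of 3.
	fmt.Println(storeIfInDepth([]byte{0x20}, self, 3, store))
}
```

The point of returning an error rather than storing is that the origin never receives a receipt for a chunk that is about to be evicted.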

Doubling concern (@martinconic): The check uses StorageRadius(), which already accounts for doubling (CommittedDepth - doublings). Using CommittedDepth instead would incorrectly reject chunks in sister neighborhoods. This appears correct, but edge cases during radius transitions still need research validation.
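The arithmetic behind the doubling concern can be illustrated with a small sketch. It assumes that a chunk in the sister neighborhood shares exactly CommittedDepth - 1 leading bits with the node, and the helper names inDepth and storageRadius are hypothetical; only the relation StorageRadius = CommittedDepth - doublings is taken from the comment above.

```go
package main

import "fmt"

// inDepth reports whether a chunk at the given proximity passes the guard for a radius.
func inDepth(proximity, radius int) bool { return proximity >= radius }

// storageRadius follows the relation quoted in the issue: CommittedDepth - doublings.
func storageRadius(committedDepth, doublings int) int { return committedDepth - doublings }

func main() {
	const committedDepth, doublings = 10, 1

	// Assumption: a sister-neighborhood chunk differs from the node in the
	// last bit of the committed depth, so it shares committedDepth-1 bits.
	sisterProximity := committedDepth - 1

	fmt.Println(inDepth(sisterProximity, storageRadius(committedDepth, doublings))) // true
	fmt.Println(inDepth(sisterProximity, committedDepth))                           // false
}
```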


Bug 2: Shallow Receipt Short-Circuits Parallel Pushes

Problem: On any shallow receipt, pushToClosest returns immediately with ErrShallowReceipt, discarding results from other inflight multiplex pushes (up to 32). Only 1 attempt of the error budget is used. Valid receipts from parallel pushes are thrown away.

Fix: Treat a shallow receipt like any other peer failure: cache it, consume one attempt from the error budget, and keep waiting for inflight pushes. Only return ErrShallowReceipt after the full budget is exhausted or when falling back to ErrWantSelf. (pkg/pushsync/pushsync.go:369-530)
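The intended control flow can be sketched as a result-collection loop. This is a simplified illustration, not the pushToClosest implementation: the result type, the collectReceipts function and ErrPushFailed are hypothetical, the loop does not launch new pushes, and the ErrWantSelf fallback is left out.

```go
package main

import (
	"errors"
	"fmt"
)

var (
	// ErrShallowReceipt reuses the name from the issue; the message is illustrative.
	ErrShallowReceipt = errors.New("shallow receipt")
	// ErrPushFailed is a hypothetical catch-all for other peer failures.
	ErrPushFailed = errors.New("push failed")
)

// result is a hypothetical outcome of one parallel push; a nil err means a valid receipt.
type result struct {
	peer string
	err  error
}

// collectReceipts drains results from inflight pushes. A shallow receipt is
// remembered and charged against the budget, but it does not end the loop.
func collectReceipts(results <-chan result, inflight, budget int) (string, error) {
	sawShallow := false
	for inflight > 0 && budget > 0 {
		r := <-results
		inflight--
		if r.err == nil {
			return r.peer, nil // a valid receipt from any parallel push wins
		}
		budget--
		if errors.Is(r.err, ErrShallowReceipt) {
			sawShallow = true
		}
	}
	if sawShallow {
		return "", ErrShallowReceipt // reported only after every attempt has failed
	}
	return "", ErrPushFailed
}

func main() {
	results := make(chan result, 3)
	results <- result{peer: "a", err: ErrShallowReceipt}
	results <- result{peer: "b", err: ErrPushFailed}
	results <- result{peer: "c"}

	peer, err := collectReceipts(results, 3, 32)
	fmt.Println(peer, err)
}
```

With the early return removed, the shallow receipt from peer "a" no longer hides the valid receipt from peer "c".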


Bug 3: False ChunkSynced After Retry Exhaustion

Problem: After 6 pusher retries, chunks are marked ChunkSynced even though no node in the correct neighborhood stored them. The upload appears complete but the chunk is lost.

Fix: Report ChunkCouldNotSync instead of ChunkSynced. (pkg/pusher/pusher.go:283-289)

Gap identified (@martinconic): upload.Report() tag switch (pkg/storer/internal/upload/uploadstore.go:609-617) doesn't handle ChunkCouldNotSync — chunk gets cleaned up from the upload store but tag counters are never updated. Uploads with undeliverable chunks will show permanently incomplete sync counts.

Possible approaches:

  • A. Add ChunkCouldNotSync case with a new Failed counter in TagItem + surface via API
  • B. Keep chunk in push queue for later retry (risk: infinite loops for undeliverable chunks)
  • C. Surface via metrics/API only, accept the tag gap
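Approach A could look roughly like the sketch below. The types are simplified stand-ins rather than the real upload store definitions: the Failed field is the new, hypothetical counter, and the Report method only mirrors the idea of the tag switch, not its actual signature.

```go
package main

import "fmt"

// ChunkState is a simplified stand-in for the chunk states reported by the pusher.
type ChunkState int

const (
	ChunkSent ChunkState = iota
	ChunkStored
	ChunkSynced
	ChunkCouldNotSync
)

// TagItem is a reduced stand-in for the upload tag; Failed is the hypothetical new counter.
type TagItem struct {
	Sent   uint64
	Stored uint64
	Synced uint64
	Failed uint64
}

// Report updates the tag counters for one chunk state.
func (t *TagItem) Report(state ChunkState) {
	switch state {
	case ChunkSent:
		t.Sent++
	case ChunkStored:
		t.Stored++
	case ChunkSynced:
		t.Synced++
	case ChunkCouldNotSync:
		// Without this case the chunk is cleaned up but never counted.
		t.Failed++
	}
}

func main() {
	var tag TagItem
	tag.Report(ChunkSynced)
	tag.Report(ChunkCouldNotSync)
	fmt.Printf("synced=%d failed=%d\n", tag.Synced, tag.Failed)
}
```

With a separate counter, a client could treat an upload as settled once Synced + Failed accounts for every chunk, instead of waiting on a sync count that never completes.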

Protocol Impact

| Scenario | Before | After |
| --- | --- | --- |
| Node outside AOR, no closer peer | Stores + sends receipt (evicted shortly after) | Returns error, origin retries |
| Shallow receipt during multiplex | Immediate abort, 1 of 32 attempts used | Full budget exhausted before giving up |
| Pusher exhausts 6 retries | Marked ChunkSynced (false positive) | Marked ChunkCouldNotSync (correct, but tag gap) |
