Skip to content

fix: use snapshot metadata for prev-log boundary checks#5

Draft
anyasabo wants to merge 1 commit into
mainfrom
fix/append-prevlog-snapshot-boundary
Draft

fix: use snapshot metadata for prev-log boundary checks#5
anyasabo wants to merge 1 commit into
mainfrom
fix/append-prevlog-snapshot-boundary

Conversation

@anyasabo
Copy link
Copy Markdown
Owner

@anyasabo anyasabo commented May 6, 2026

Problem

appendEntries prev-log validation rejects valid replay at the snapshot boundary because compacted entries are no longer present in LogStore.

Pre-fix behavior effectively did:

  • if PrevLogEntry == lastLogIndex: compare term,
  • else: GetLog(PrevLogEntry).

When PrevLogEntry equals a compacted snapshot boundary index, GetLog fails even though snapshot metadata should satisfy the check.

How this can happen in concrete terms

  1. Follower compacts logs up to snapshot index S.
  2. Leader sends append replay with PrevLogEntry = S, PrevLogTerm = snapshotTerm.
  3. Follower tries to load index S from log store (not present after compaction).
  4. Follower incorrectly rejects append despite valid boundary match.

This is a natural post-snapshot catch-up path (no adversarial input required).

Mitigations that help (but do not fix root cause)

  • Higher trailing-log retention can reduce frequency.
  • Snapshot cadence tuning can reduce boundary encounters.

Neither removes the correctness issue; boundary checks must use snapshot metadata.

Impact

Replication liveness and resilience risk:

  • lagging followers can stall at catch-up boundaries,
  • repeated rejection/retry loops,
  • reduced redundancy and slower recovery after partitions/restarts.

How we would notice in production

  • Warnings like failed to get previous log where previous-index is at/near snapshot boundary.
  • Followers repeatedly failing to advance after snapshot replay.
  • Repeated install-snapshot / append retry churn for the same peers.

Provenance / Preconditions

  • Longstanding logic from early append implementation (fdb8145f, 2016).
  • No special config required beyond normal snapshotting/compaction behavior.
  • Trigger condition: PrevLogEntry at or below compacted snapshot index.

What this PR changes

  • Uses snapshot metadata for prev-log validation:
    • below snapshot index: treat as compacted match for catch-up progress,
    • exactly at snapshot index: validate against snapshot term,
    • above snapshot index: use existing log-store checks.
  • Adds deterministic tests for matching and mismatched boundary-term behavior.

Reviewer reproduction (live in-process Raft state)

Reproduce pre-fix behavior

  1. Checkout parent commit:
    • git checkout b5e9b90^
  2. Bring this PR's two boundary tests into that tree:
    • git checkout fix/append-prevlog-snapshot-boundary -- raft_test.go
  3. Run:
    • go test -run "TestRaft_AppendEntries(AcceptsPrevLogAtSnapshotBoundary|RejectsPrevLogAtSnapshotBoundaryWithWrongTerm)$" -count=1 .
  4. Expected pre-fix: matching-boundary case fails (false reject).

Verify fixed behavior

  1. Checkout this branch (fix/append-prevlog-snapshot-boundary).
  2. Run:
    • go test -run "TestRaft_AppendEntries(AcceptsPrevLogAtSnapshotBoundary|RejectsPrevLogAtSnapshotBoundaryWithWrongTerm)$" -count=1 .
  3. Expected: matching case accepted, mismatched-term case rejected.

Test plan

  • go test -run "TestRaft_AppendEntries(AcceptsPrevLogAtSnapshotBoundary|RejectsPrevLogAtSnapshotBoundaryWithWrongTerm)$" -count=1 .
  • go test -run "TestRaft_AppendEntry$|TestRaft_AppendEntries(AcceptsPrevLogAtSnapshotBoundary|RejectsPrevLogAtSnapshotBoundaryWithWrongTerm)$" -count=1 .

appendEntries can incorrectly reject replay when PrevLogEntry equals or predates the last snapshot boundary, because compacted entries are no longer in LogStore. Use snapshot metadata for boundary comparisons and add deterministic tests for matching and mismatched boundary terms.

Co-authored-by: Cursor <cursoragent@cursor.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant