Skip to content

Persist raft currentTerm and votedFor to fix data divergence on simultaneous restart#202

Open
taurus-forever wants to merge 1 commit into
bakwc:masterfrom
taurus-forever:fix-persist-raft-term-vote
Open

Persist raft currentTerm and votedFor to fix data divergence on simultaneous restart#202
taurus-forever wants to merge 1 commit into
bakwc:masterfrom
taurus-forever:fix-persist-raft-term-vote

Conversation

@taurus-forever

Copy link
Copy Markdown

The raft paper (§5, "persistent state on all servers") requires currentTerm and votedFor to survive restarts. PySyncObj persisted only the log entries and raftCommitIndex, resetting currentTerm to 0 and votedFor to None on every start.

When ALL cluster members restart simultaneously (e.g. a deployment or host reboot of a Patroni cluster using the raft DCS), the term space collapses: every node comes back with currentTerm=0 while its journal still holds entries with high terms from the previous run. This breaks the term monotonicity assumption behind the vote "up-to-dateness" check and allows committed data to be silently overwritten:

  1. Pre-restart: log through idx 42, last term 5, all committed and applied.
  2. All 3 members restart simultaneously. Member A wins the first election at term 1 and immediately-issued writes (Patroni writes its DCS keys on startup) become entries idx 43-45 with term 1. Member C acks them: majority -> committed and applied on A and C.
  3. Member B, slightly slower to start, received nothing yet. Its election deadline fires and it campaigns at term 2 with last_log_term=5 - its stale pre-restart log looks MORE up-to-date than A/C's freshly committed term-1 entries, so A and C grant their votes.
  4. B becomes leader and its append_entries (prevLogIdx=42, prevLogTerm=5) make A and C truncate the committed-and-applied entries 43-45. In addition, raftLastApplied is never rewound, so B's replacement entries at the reused indexes 44-45 are silently skipped on A and C.

Result: the raft log converges and all future writes replicate fine, but the state machines differ permanently between members - A and C keep applied writes that B never applies, and miss B's entries at the reused indexes. For Patroni this manifests as different DCS data on different members while raft reports being in sync.

A genuinely fresh first start is not affected (no high terms on disk yet), and a single member that did not restart re-synchronizes the terms from memory - only a simultaneous (re)start of members with existing raft data is vulnerable.

Fix:

  • Journal gains setCurrentTerm(term, votedForNodeId) / getCurrentTerm() / getVotedForNodeId(). FileJournal stores both in the existing .meta file synchronously: raft requires term/vote to be durable before any message reflecting them is sent. Term changes only happen during elections, so the extra write is negligible. Existing .meta files without the new keys default to 0/None (backward compatible; the file stays readable by older versions since they ignore unknown keys).
  • SyncObj restores the state on startup. currentTerm is restored as max(persisted term, last log term) - currentTerm is always >= the term of the last log entry, so journals written by older versions (without a persisted term) are also protected after an upgrade.
  • All four term/vote mutation points (candidacy, vote grant, higher-term adoption in the request_vote and append_entries handlers) go through a helper that persists before the corresponding message is sent.
  • Defense in depth (should be unreachable with the fix): log CRITICAL when a truncation cuts below raftLastApplied (previously silent state-machine divergence), and clamp raftCommitIndex to the log end before persisting it, so a later restart cannot replay entries from a different leader epoch as already-committed.

Tests:

  • test_simultaneousRestartDataDivergence deterministically reproduces the scenario above (fails with "assert None == 'post'" on the slow member without the fix; with the fix the stale member loses the election and receives the data instead).
  • test_journalKeepsCurrentTermAndVote covers persistence and legacy-meta defaults.

Close: #201
Assisted-by: Claude:claude-5-fable

…taneous restart

The raft paper (§5, "persistent state on all servers") requires currentTerm
and votedFor to survive restarts. PySyncObj persisted only the log entries
and raftCommitIndex, resetting currentTerm to 0 and votedFor to None on
every start.

When ALL cluster members restart simultaneously (e.g. a deployment or host
reboot of a Patroni cluster using the raft DCS), the term space collapses:
every node comes back with currentTerm=0 while its journal still holds
entries with high terms from the previous run. This breaks the term
monotonicity assumption behind the vote "up-to-dateness" check and allows
committed data to be silently overwritten:

1. Pre-restart: log through idx 42, last term 5, all committed and applied.
2. All 3 members restart simultaneously. Member A wins the first election
   at term 1 and immediately-issued writes (Patroni writes its DCS keys on
   startup) become entries idx 43-45 with term 1. Member C acks them:
   majority -> committed and applied on A and C.
3. Member B, slightly slower to start, received nothing yet. Its election
   deadline fires and it campaigns at term 2 with last_log_term=5 - its
   stale pre-restart log looks MORE up-to-date than A/C's freshly
   committed term-1 entries, so A and C grant their votes.
4. B becomes leader and its append_entries (prevLogIdx=42, prevLogTerm=5)
   make A and C truncate the committed-and-applied entries 43-45. In
   addition, raftLastApplied is never rewound, so B's replacement entries
   at the reused indexes 44-45 are silently skipped on A and C.

Result: the raft log converges and all future writes replicate fine, but
the state machines differ permanently between members - A and C keep
applied writes that B never applies, and miss B's entries at the reused
indexes. For Patroni this manifests as different DCS data on different
members while raft reports being in sync.

A genuinely fresh first start is not affected (no high terms on disk yet),
and a single member that did not restart re-synchronizes the terms from
memory - only a simultaneous (re)start of members with existing raft data
is vulnerable.

Fix:
- Journal gains setCurrentTerm(term, votedForNodeId) / getCurrentTerm() /
  getVotedForNodeId(). FileJournal stores both in the existing .meta file
  synchronously: raft requires term/vote to be durable before any message
  reflecting them is sent. Term changes only happen during elections, so
  the extra write is negligible. Existing .meta files without the new keys
  default to 0/None (backward compatible; the file stays readable by older
  versions since they ignore unknown keys).
- SyncObj restores the state on startup. currentTerm is restored as
  max(persisted term, last log term) - currentTerm is always >= the term
  of the last log entry, so journals written by older versions (without a
  persisted term) are also protected after an upgrade.
- All four term/vote mutation points (candidacy, vote grant, higher-term
  adoption in the request_vote and append_entries handlers) go through a
  helper that persists before the corresponding message is sent.
- Defense in depth (should be unreachable with the fix): log CRITICAL when
  a truncation cuts below raftLastApplied (previously silent state-machine
  divergence), and clamp raftCommitIndex to the log end before persisting
  it, so a later restart cannot replay entries from a different leader
  epoch as already-committed.

Tests:
- test_simultaneousRestartDataDivergence deterministically reproduces the
  scenario above (fails with "assert None == 'post'" on the slow member
  without the fix; with the fix the stale member loses the election and
  receives the data instead).
- test_journalKeepsCurrentTermAndVote covers persistence and legacy-meta
  defaults.

Close: bakwc#201

Assisted-by: Claude:claude-5-fable
@taurus-forever

Copy link
Copy Markdown
Author

BTW, the similar fix has been discussed in #98 but didn't land into codebase.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Complete Raft cluster reboot causing instabilities on Patroni/PostgreSQL

1 participant