Persist raft currentTerm and votedFor to fix data divergence on simultaneous restart#202
Open
taurus-forever wants to merge 1 commit into
Open
Persist raft currentTerm and votedFor to fix data divergence on simultaneous restart#202taurus-forever wants to merge 1 commit into
taurus-forever wants to merge 1 commit into
Conversation
…taneous restart The raft paper (§5, "persistent state on all servers") requires currentTerm and votedFor to survive restarts. PySyncObj persisted only the log entries and raftCommitIndex, resetting currentTerm to 0 and votedFor to None on every start. When ALL cluster members restart simultaneously (e.g. a deployment or host reboot of a Patroni cluster using the raft DCS), the term space collapses: every node comes back with currentTerm=0 while its journal still holds entries with high terms from the previous run. This breaks the term monotonicity assumption behind the vote "up-to-dateness" check and allows committed data to be silently overwritten: 1. Pre-restart: log through idx 42, last term 5, all committed and applied. 2. All 3 members restart simultaneously. Member A wins the first election at term 1 and immediately-issued writes (Patroni writes its DCS keys on startup) become entries idx 43-45 with term 1. Member C acks them: majority -> committed and applied on A and C. 3. Member B, slightly slower to start, received nothing yet. Its election deadline fires and it campaigns at term 2 with last_log_term=5 - its stale pre-restart log looks MORE up-to-date than A/C's freshly committed term-1 entries, so A and C grant their votes. 4. B becomes leader and its append_entries (prevLogIdx=42, prevLogTerm=5) make A and C truncate the committed-and-applied entries 43-45. In addition, raftLastApplied is never rewound, so B's replacement entries at the reused indexes 44-45 are silently skipped on A and C. Result: the raft log converges and all future writes replicate fine, but the state machines differ permanently between members - A and C keep applied writes that B never applies, and miss B's entries at the reused indexes. For Patroni this manifests as different DCS data on different members while raft reports being in sync. A genuinely fresh first start is not affected (no high terms on disk yet), and a single member that did not restart re-synchronizes the terms from memory - only a simultaneous (re)start of members with existing raft data is vulnerable. Fix: - Journal gains setCurrentTerm(term, votedForNodeId) / getCurrentTerm() / getVotedForNodeId(). FileJournal stores both in the existing .meta file synchronously: raft requires term/vote to be durable before any message reflecting them is sent. Term changes only happen during elections, so the extra write is negligible. Existing .meta files without the new keys default to 0/None (backward compatible; the file stays readable by older versions since they ignore unknown keys). - SyncObj restores the state on startup. currentTerm is restored as max(persisted term, last log term) - currentTerm is always >= the term of the last log entry, so journals written by older versions (without a persisted term) are also protected after an upgrade. - All four term/vote mutation points (candidacy, vote grant, higher-term adoption in the request_vote and append_entries handlers) go through a helper that persists before the corresponding message is sent. - Defense in depth (should be unreachable with the fix): log CRITICAL when a truncation cuts below raftLastApplied (previously silent state-machine divergence), and clamp raftCommitIndex to the log end before persisting it, so a later restart cannot replay entries from a different leader epoch as already-committed. Tests: - test_simultaneousRestartDataDivergence deterministically reproduces the scenario above (fails with "assert None == 'post'" on the slow member without the fix; with the fix the stale member loses the election and receives the data instead). - test_journalKeepsCurrentTermAndVote covers persistence and legacy-meta defaults. Close: bakwc#201 Assisted-by: Claude:claude-5-fable
Author
|
BTW, the similar fix has been discussed in #98 but didn't land into codebase. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
The raft paper (§5, "persistent state on all servers") requires currentTerm and votedFor to survive restarts. PySyncObj persisted only the log entries and raftCommitIndex, resetting currentTerm to 0 and votedFor to None on every start.
When ALL cluster members restart simultaneously (e.g. a deployment or host reboot of a Patroni cluster using the raft DCS), the term space collapses: every node comes back with currentTerm=0 while its journal still holds entries with high terms from the previous run. This breaks the term monotonicity assumption behind the vote "up-to-dateness" check and allows committed data to be silently overwritten:
Result: the raft log converges and all future writes replicate fine, but the state machines differ permanently between members - A and C keep applied writes that B never applies, and miss B's entries at the reused indexes. For Patroni this manifests as different DCS data on different members while raft reports being in sync.
A genuinely fresh first start is not affected (no high terms on disk yet), and a single member that did not restart re-synchronizes the terms from memory - only a simultaneous (re)start of members with existing raft data is vulnerable.
Fix:
Tests:
Close: #201
Assisted-by: Claude:claude-5-fable