Persist raft currentTerm and votedFor to fix data divergence on simultaneous restart by taurus-forever · Pull Request #202 · bakwc/PySyncObj

taurus-forever · 2026-07-03T22:16:18Z

The raft paper (§5, "persistent state on all servers") requires currentTerm and votedFor to survive restarts. PySyncObj persisted only the log entries and raftCommitIndex, resetting currentTerm to 0 and votedFor to None on every start.

When ALL cluster members restart simultaneously (e.g. a deployment or host reboot of a Patroni cluster using the raft DCS), the term space collapses: every node comes back with currentTerm=0 while its journal still holds entries with high terms from the previous run. This breaks the term monotonicity assumption behind the vote "up-to-dateness" check and allows committed data to be silently overwritten:

Pre-restart: log through idx 42, last term 5, all committed and applied.
All 3 members restart simultaneously. Member A wins the first election at term 1 and immediately-issued writes (Patroni writes its DCS keys on startup) become entries idx 43-45 with term 1. Member C acks them: majority -> committed and applied on A and C.
Member B, slightly slower to start, received nothing yet. Its election deadline fires and it campaigns at term 2 with last_log_term=5 - its stale pre-restart log looks MORE up-to-date than A/C's freshly committed term-1 entries, so A and C grant their votes.
B becomes leader and its append_entries (prevLogIdx=42, prevLogTerm=5) make A and C truncate the committed-and-applied entries 43-45. In addition, raftLastApplied is never rewound, so B's replacement entries at the reused indexes 44-45 are silently skipped on A and C.

Result: the raft log converges and all future writes replicate fine, but the state machines differ permanently between members - A and C keep applied writes that B never applies, and miss B's entries at the reused indexes. For Patroni this manifests as different DCS data on different members while raft reports being in sync.

A genuinely fresh first start is not affected (no high terms on disk yet), and a single member that did not restart re-synchronizes the terms from memory - only a simultaneous (re)start of members with existing raft data is vulnerable.

Fix:

Journal gains setCurrentTerm(term, votedForNodeId) / getCurrentTerm() / getVotedForNodeId(). FileJournal stores both in the existing .meta file synchronously: raft requires term/vote to be durable before any message reflecting them is sent. Term changes only happen during elections, so the extra write is negligible. Existing .meta files without the new keys default to 0/None (backward compatible; the file stays readable by older versions since they ignore unknown keys).
SyncObj restores the state on startup. currentTerm is restored as max(persisted term, last log term) - currentTerm is always >= the term of the last log entry, so journals written by older versions (without a persisted term) are also protected after an upgrade.
All four term/vote mutation points (candidacy, vote grant, higher-term adoption in the request_vote and append_entries handlers) go through a helper that persists before the corresponding message is sent.
Defense in depth (should be unreachable with the fix): log CRITICAL when a truncation cuts below raftLastApplied (previously silent state-machine divergence), and clamp raftCommitIndex to the log end before persisting it, so a later restart cannot replay entries from a different leader epoch as already-committed.

Tests:

test_simultaneousRestartDataDivergence deterministically reproduces the scenario above (fails with "assert None == 'post'" on the slow member without the fix; with the fix the stale member loses the election and receives the data instead).
test_journalKeepsCurrentTermAndVote covers persistence and legacy-meta defaults.

Close: #201
Assisted-by: Claude:claude-5-fable

…taneous restart The raft paper (§5, "persistent state on all servers") requires currentTerm and votedFor to survive restarts. PySyncObj persisted only the log entries and raftCommitIndex, resetting currentTerm to 0 and votedFor to None on every start. When ALL cluster members restart simultaneously (e.g. a deployment or host reboot of a Patroni cluster using the raft DCS), the term space collapses: every node comes back with currentTerm=0 while its journal still holds entries with high terms from the previous run. This breaks the term monotonicity assumption behind the vote "up-to-dateness" check and allows committed data to be silently overwritten: 1. Pre-restart: log through idx 42, last term 5, all committed and applied. 2. All 3 members restart simultaneously. Member A wins the first election at term 1 and immediately-issued writes (Patroni writes its DCS keys on startup) become entries idx 43-45 with term 1. Member C acks them: majority -> committed and applied on A and C. 3. Member B, slightly slower to start, received nothing yet. Its election deadline fires and it campaigns at term 2 with last_log_term=5 - its stale pre-restart log looks MORE up-to-date than A/C's freshly committed term-1 entries, so A and C grant their votes. 4. B becomes leader and its append_entries (prevLogIdx=42, prevLogTerm=5) make A and C truncate the committed-and-applied entries 43-45. In addition, raftLastApplied is never rewound, so B's replacement entries at the reused indexes 44-45 are silently skipped on A and C. Result: the raft log converges and all future writes replicate fine, but the state machines differ permanently between members - A and C keep applied writes that B never applies, and miss B's entries at the reused indexes. For Patroni this manifests as different DCS data on different members while raft reports being in sync. A genuinely fresh first start is not affected (no high terms on disk yet), and a single member that did not restart re-synchronizes the terms from memory - only a simultaneous (re)start of members with existing raft data is vulnerable. Fix: - Journal gains setCurrentTerm(term, votedForNodeId) / getCurrentTerm() / getVotedForNodeId(). FileJournal stores both in the existing .meta file synchronously: raft requires term/vote to be durable before any message reflecting them is sent. Term changes only happen during elections, so the extra write is negligible. Existing .meta files without the new keys default to 0/None (backward compatible; the file stays readable by older versions since they ignore unknown keys). - SyncObj restores the state on startup. currentTerm is restored as max(persisted term, last log term) - currentTerm is always >= the term of the last log entry, so journals written by older versions (without a persisted term) are also protected after an upgrade. - All four term/vote mutation points (candidacy, vote grant, higher-term adoption in the request_vote and append_entries handlers) go through a helper that persists before the corresponding message is sent. - Defense in depth (should be unreachable with the fix): log CRITICAL when a truncation cuts below raftLastApplied (previously silent state-machine divergence), and clamp raftCommitIndex to the log end before persisting it, so a later restart cannot replay entries from a different leader epoch as already-committed. Tests: - test_simultaneousRestartDataDivergence deterministically reproduces the scenario above (fails with "assert None == 'post'" on the slow member without the fix; with the fix the stale member loses the election and receives the data instead). - test_journalKeepsCurrentTermAndVote covers persistence and legacy-meta defaults. Close: bakwc#201 Assisted-by: Claude:claude-5-fable

taurus-forever · 2026-07-04T19:46:33Z

BTW, the similar fix has been discussed in #98 but didn't land into codebase.

taurus-forever mentioned this pull request Jul 3, 2026

Complete Raft cluster reboot causing instabilities on Patroni/PostgreSQL #201

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Persist raft currentTerm and votedFor to fix data divergence on simultaneous restart#202

Persist raft currentTerm and votedFor to fix data divergence on simultaneous restart#202
taurus-forever wants to merge 1 commit into
bakwc:masterfrom
taurus-forever:fix-persist-raft-term-vote

taurus-forever commented Jul 3, 2026

Uh oh!

taurus-forever commented Jul 4, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

taurus-forever commented Jul 3, 2026

Uh oh!

taurus-forever commented Jul 4, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant