
Complete Raft cluster reboot causing instabilities on Patroni/PostgreSQL #201

Description

@taurus-forever

Hi,

The Charmed PostgreSQL operator uses PySyncObj with Patroni to manage the PostgreSQL cluster.

We have noticed several different instabilities after a complete cluster outage
(the entire cluster is stopped and then started from cold & dark).
We traced the issues to the Raft library:

  • Rarely, Patroni is unable to start/connect to Raft (when all three Raft members start simultaneously)
  • Rarely, the newly elected Primary's address, which is pushed into Raft/DCS on boot, is missing on some Raft members (even though all members are healthy, and commit_idx is growing and in sync on all Raft members).
    See the diff of DCS content between Raft members. E.g. the leader is missing on the left member (ID 4667 is present on the right-side member, while the larger ID 4694 is already present on the left side):
[Image: diff of DCS content between Raft members]

There are no clear steps to reproduce these issues, but they can be reproduced with a probability of 5% on a complete cluster restart.

While investigating the possible causes with Claude Fable 5, we spotted an interesting point:

  • The Raft spec requires currentTerm (latest term server has seen) to be initialized to zero on first boot only:
[Image: Raft spec excerpt on currentTerm]
  • The current Raft library implementation sets it to zero on every boot/construction, see pysyncobj/syncobj.py:158-159
    Please check the complete and detailed Claude investigation report below.

While experimenting with a possible solution, we arrived at Raft library changes that have been very stable during our active testing (see PR).
Those changes have been manually tested and re-checked on our CI/CD using patched PySyncObj 0.3.15, Patroni 3.3.8, PostgreSQL 16, Ubuntu 24.04.

It would be great to have our pull request merged; we are happy to adapt and further test any necessary changes!
Thank you in advance!


Claude Fable 5 Investigation report

PySyncObj: state-machine divergence after simultaneous cluster restart

Context

Patroni (raft DCS) users observe that when all 3 raft members are (re)started simultaneously,
the DCS data can end up different on different members, while the raft log is fully in
sync and all future writes replicate correctly. Task: find the exact case and fix it.

Root cause

Raft's persistent state currentTerm and votedFor are never persisted.
pysyncobj/syncobj.py:158-159 initializes them to 0/None on every construction; the
FileJournal meta (pysyncobj/journal.py:242-247) persists only raftCommitIndex. The log
entries (with their terms) are persisted.

Raft (§5 "persistent state on all servers") requires currentTerm and votedFor to survive
restarts. Without that, a full simultaneous restart collapses the term space: every node
comes back with currentTerm=0 while its journal's last entry carries a high term (e.g. term 5).
Term monotonicity across leader transitions — the invariant behind the vote "up-to-dateness"
check and the Log Matching property — is broken.
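
To make the missing piece concrete, here is a minimal sketch of durable currentTerm/votedFor storage. It is not the change from the PR and not part of PySyncObj: the TermStore class, its JSON file format, and the file path are hypothetical names chosen for illustration. It assumes the node calls save() before answering any vote or append_entries request that changes either value.

```python
import json
import os
import tempfile


class TermStore:
    """Hypothetical helper (not PySyncObj API): durable currentTerm / votedFor."""

    def __init__(self, path):
        self._path = path
        # Defaults apply only when no state file exists, i.e. on first boot.
        self.current_term = 0
        self.voted_for = None
        if os.path.exists(path):
            with open(path, "r", encoding="utf-8") as f:
                data = json.load(f)
            self.current_term = data["current_term"]
            self.voted_for = data["voted_for"]

    def save(self, current_term, voted_for):
        """Write to a temp file, fsync it, then atomically replace the old file."""
        directory = os.path.dirname(os.path.abspath(self._path))
        fd, tmp_path = tempfile.mkstemp(dir=directory)
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump({"current_term": current_term, "voted_for": voted_for}, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp_path, self._path)
        self.current_term = current_term
        self.voted_for = voted_for
```

With something along these lines, a node that restarts after reaching term 5 would come back with currentTerm=5 instead of 0, so the first post-restart election could not reuse a term lower than the one already present in the journal.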

Concrete divergence trace (verified against the code)

Pre-restart: nodes A, B, C; log through idx 42, last term 5; all committed/applied; journals on disk.

  1. All 3 restart simultaneously (deploy/reboot). Each restores its log (last = idx42, t5),
    commitIndex≈42 from meta, replays the journal to rebuild its state machine — but
    currentTerm=0, votedFor=None.
  2. A wins the first election at term 1 (syncobj.py:579-596), appends NO_OP idx43 (t1).
    B is a bit slow (connections still forming — normal during simultaneous startup).
  3. Patroni on A immediately writes DCS keys (member/leader/config) → entries idx44-45 (t1).
    C acks → majority (A+C), commitTerm == currentTerm (syncobj.py:615) → committed and
    applied on A and C.
  4. B, having received no post-restart append_entries yet, hits its election deadline and becomes
    candidate at term 2 with last_log_term=5, last_log_index=42.
  5. Vote check (syncobj.py:868-871): B's lastLogTerm=5 vs A/C's current log term 1 →
    B's stale pre-restart log looks more up-to-date → A and C grant the vote; A steps down.
  6. B (leader term 2, log ends idx42/t5) sends append_entries prevLogIdx=42, prevLogTerm=5,
    entries [43(t2)]. A and C pass the consistency check and truncate their applied entries
    43-45(t1) (syncobj.py:925-935).
  7. Result: A and C have applied writes W1 (the term-1 entries idx44-45) that B never applies. Additionally raftLastApplied
    (45) and raftCommitIndex are never rewound (syncobj.py:956 only raises commitIndex), so
    B's new entries reusing idx 44-45 are silently skipped on A and C
    (__applyLogEntries starts at lastApplied+1 = 46).
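
Step 5 of the trace can be checked with the standard Raft "up-to-date" comparison. The function below is a sketch of the rule as described in the Raft paper, not a copy of the PySyncObj code at syncobj.py:868-871; the function name and the term-6 scenario are assumptions made for illustration.

```python
def candidate_log_is_up_to_date(cand_last_term, cand_last_idx,
                                own_last_term, own_last_idx):
    """Raft vote rule: the higher last term wins; on a tie, the longer log wins."""
    if cand_last_term != own_last_term:
        return cand_last_term > own_last_term
    return cand_last_idx >= own_last_idx


# Trace above: B's log ends at (term 5, idx 42), A's log ends at (term 1, idx 45).
# B's stale log wins the comparison because 5 > 1.
granted_after_term_reset = candidate_log_is_up_to_date(5, 42, 1, 45)  # True

# Assumed scenario with persisted terms: A is elected at term 6 or later,
# so its new entries carry term 6 and B's log no longer looks newer.
granted_with_persisted_term = candidate_log_is_up_to_date(5, 42, 6, 45)  # False
```

The comparison itself is correct Raft behavior. It only gives the wrong answer here because the term reset lets newer entries carry a lower term than older ones.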

Raft log converges everywhere; all future writes are fine; the state machines (Patroni DCS data)
differ permanently (until each key happens to be overwritten, e.g. by TTL refresh).
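
The effect of the never-rewound lastApplied can be reproduced with a toy model. This is a simplified simulation and not PySyncObj code: the apply_committed function, the key names, and the values are all hypothetical, and the pre-restart state is left empty for brevity.

```python
def apply_committed(log, state, last_applied, commit_index):
    """Apply log entries in (last_applied, commit_index] to a key/value state."""
    for idx in range(last_applied + 1, commit_index + 1):
        entry = log.get(idx)
        if entry is not None:  # None models a NO_OP entry
            key, value = entry
            state[key] = value
    return max(last_applied, commit_index)


# The log every member holds after B's term-2 leadership rewrote idx 43-45.
final_log = {
    43: None,
    44: ("leader", "B"),
    45: ("members/B", "running"),
    46: ("config", "v2"),
}

# B applies everything after its pre-restart position (idx 42).
state_b = {}
apply_committed(final_log, state_b, last_applied=42, commit_index=46)

# A had already applied its own term-1 entries 44-45 before they were
# truncated, and its lastApplied stayed at 45, so it resumes at idx 46.
state_a = {"leader": "A", "members/A": "running"}
apply_committed(final_log, state_a, last_applied=45, commit_index=46)

logs_identical = True            # both members hold final_log
states_differ = state_a != state_b
```

In this model both members end with the same log and agree on the entry at idx 46, yet state_a still reports leader "A" while state_b reports leader "B", which matches the kind of diff shown in the issue.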

Why only "in some cases" reproducible

The window is "leader elected + Patroni writes committed before the last member
receives its first post-restart append_entries" — seconds wide at simultaneous startup, but not always hit.
