Complete Raft cluster reboot causing instabilities on Patroni/PostgreSQL

Hi,

The [Charmed PostgreSQL operator](https://documentation.ubuntu.com/charmed-postgresql/16/) uses PySyncObj with Patroni to manage PostgreSQL clusters.

We have noticed several instabilities after a complete cluster outage
(the entire cluster is stopped and then started from cold and dark).
We traced the issues to the Raft library:
 * Rarely, Patroni is unable to start/connect to Raft (when all three Raft members start simultaneously).
 * Rarely, the newly elected Primary address, which is pushed into Raft/DCS on boot, is missing on some Raft members (even though all members are healthy, and commit_idx is growing and in sync on all Raft members).

See the diff of the DCS content between Raft members below. E.g. `leader` is missing on the left member (id 4667 is present on the right-side member, while the larger id 4694 is already present on the left side):

<img width="2758" height="1546" alt="Image" src="https://github.com/user-attachments/assets/d3556dbb-f9f1-4d5f-af11-82eff4a0084e" />

There are no clear steps to reproduce these issues, but they can be reproduced with roughly 5% probability on a complete cluster restart.

While investigating the possible causes with Claude Fable 5, we spotted an interesting point:

* The Raft [spec](https://raft.github.io/raft.pdf) requires `currentTerm` (latest term server has seen) to be initialized to zero on first boot only:

<img width="452" height="133" alt="Image" src="https://github.com/user-attachments/assets/cd503223-8653-48f2-8bec-bc4b87dd3d1e" />

* The current Raft library implementation sets it to zero on every boot/construction, see pysyncobj/syncobj.py:158-159.

Please see the complete, detailed Claude investigation report below.

While experimenting with a possible solution, we arrived at Raft library changes that have been very stable during our active testing (see [PR](https://github.com/bakwc/PySyncObj/pull/202)).
Those changes have been manually tested and re-checked on our CI/CD using patched PySyncObj 0.3.15, Patroni 3.3.8, PostgreSQL 16, Ubuntu 24.04.

It would be great to have our [pull-request](https://github.com/bakwc/PySyncObj/pull/202) merged; we are happy to adapt and further test any necessary changes!
Thank you in advance!

----

### Claude Fable 5 Investigation report

PySyncObj: state-machine divergence after simultaneous cluster restart

#### Context

 Patroni (raft DCS) users observe that when all 3 raft members are (re)started simultaneously,
 the DCS data can end up different on different members, while the raft log is fully in
 sync and all future writes replicate correctly. Task: find the exact case and fix it.

#### Root cause

 Raft's persistent state currentTerm and votedFor are never persisted.
 pysyncobj/syncobj.py:158-159 initializes them to 0/None on every construction; the
 FileJournal meta (pysyncobj/journal.py:242-247) persists only raftCommitIndex. The log
 entries (with their terms) are persisted.

 Raft (§5 "persistent state on all servers") requires currentTerm and votedFor to survive
 restarts. Without that, a full simultaneous restart collapses the term space: every node
 comes back with currentTerm=0 while its journal's last entry carries a high term (e.g. term 5).
 Term monotonicity across leader transitions — the invariant behind the vote "up-to-dateness"
 check and the Log Matching property — is broken.
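
To illustrate what "survive restarts" means in practice, here is a minimal sketch of storing the two values durably. `TermStore`, its file path and its JSON layout are hypothetical names chosen for illustration; they are not part of PySyncObj and do not describe the changes in the linked PR. The only assumption is that the new values must reach disk before the node acts on them.

```python
import json
import os
import tempfile


class TermStore:
    """Hypothetical helper that keeps currentTerm and votedFor on disk."""

    def __init__(self, path):
        self._path = path
        # Zero/None only when no state file exists yet, i.e. on first boot.
        self.current_term = 0
        self.voted_for = None
        if os.path.exists(path):
            with open(path, "r", encoding="utf-8") as f:
                state = json.load(f)
            self.current_term = state["current_term"]
            self.voted_for = state["voted_for"]

    def update(self, current_term, voted_for):
        """Write the new values to disk, then update the in-memory copy."""
        if current_term < self.current_term:
            raise ValueError("currentTerm must never decrease")
        directory = os.path.dirname(os.path.abspath(self._path))
        fd, tmp_path = tempfile.mkstemp(dir=directory)
        try:
            with os.fdopen(fd, "w", encoding="utf-8") as f:
                json.dump({"current_term": current_term, "voted_for": voted_for}, f)
                f.flush()
                os.fsync(f.fileno())
            # Atomic rename: readers see either the old or the new file.
            os.replace(tmp_path, self._path)
        except BaseException:
            if os.path.exists(tmp_path):
                os.unlink(tmp_path)
            raise
        self.current_term = current_term
        self.voted_for = voted_for
```

With something along these lines, a node that had reached term 5 before the outage would come back with `current_term == 5` instead of 0, so the first post-restart election would start at term 6 or higher.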

#### Concrete divergence trace (verified against the code)

 Pre-restart: nodes A, B, C; log through idx 42, last term 5; all committed/applied; journals on disk.

 1. All 3 restart simultaneously (deploy/reboot). Each restores its log (last = idx42, t5),
 commitIndex≈42 from meta, replays the journal to rebuild its state machine — but
 currentTerm=0, votedFor=None.
 2. A wins the first election at term 1 (syncobj.py:579-596), appends NO_OP idx43 (t1).
 B is a bit slow (connections still forming — normal during simultaneous startup).
 3. Patroni on A immediately writes DCS keys (member/leader/config) → entries idx44-45 (t1).
 C acks → majority (A+C), commitTerm == currentTerm (syncobj.py:615) → committed and
 applied on A and C.
 4. B, having received no post-restart append_entries yet, hits its election deadline and becomes
 candidate at term 2 with last_log_term=5, last_log_index=42.
 5. Vote check (syncobj.py:868-871): B's lastLogTerm=5 vs A/C's current log term 1 →
 B's stale pre-restart log looks more up-to-date → A and C grant the vote; A steps down.
 6. B (leader term 2, log ends idx42/t5) sends append_entries prevLogIdx=42, prevLogTerm=5,
 entries [43(t2)]. A and C pass the consistency check and truncate their applied entries
 43-45(t1) (syncobj.py:925-935).
 7. Result: A and C have applied writes W1 that B never applies. Additionally raftLastApplied
 (45) and raftCommitIndex are never rewound (syncobj.py:956 only raises commitIndex), so
 B's new entries reusing idx 44-45 are silently skipped on A and C
 (__applyLogEntries starts at lastApplied+1 = 46).
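
The vote decision in step 5 can be sketched with the up-to-dateness rule from the Raft paper (§5.4.1). This is a simplified stand-alone function written for illustration, not the code at syncobj.py:868-871; the function name and argument names are assumptions.

```python
def candidate_log_is_up_to_date(cand_last_term, cand_last_index,
                                voter_last_term, voter_last_index):
    """Raft's rule: the later last term wins; equal terms compare log length."""
    if cand_last_term != voter_last_term:
        return cand_last_term > voter_last_term
    return cand_last_index >= voter_last_index


# Step 5 of the trace: B (idx 42, term 5) asks A (idx 45, term 1) for a vote.
# Terms restarted from 0, so B's stale log wins on last term alone.
print(candidate_log_is_up_to_date(5, 42, 1, 45))  # True

# If terms had survived the restart, A's new entries would carry term 6
# or higher, and the same request would be refused.
print(candidate_log_is_up_to_date(5, 42, 6, 45))  # False
```

The rule itself behaves as specified; it only gives the wrong answer because the terms it compares are no longer monotonic across the restart.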

 Raft log converges everywhere; all future writes are fine; the state machines (Patroni DCS data)
 differ permanently (until each key happens to be overwritten, e.g. by TTL refresh).
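
The effect described in step 7 can be reproduced with a toy model. `ToyNode` is a hypothetical stand-in written for this illustration and does not mirror PySyncObj's internals; it only assumes that appending truncates a conflicting suffix and that applying starts at `last_applied + 1` without ever rewinding. The keys and values are made up.

```python
class ToyNode:
    """Toy replica: a log, an apply cursor and a key/value state machine."""

    def __init__(self, last_applied):
        self.log = {}                  # index -> (term, key, value)
        self.last_applied = last_applied
        self.state = {}

    def append(self, index, term, key, value):
        # Drop any conflicting suffix, then store the new entry.
        for stale in [i for i in self.log if i >= index]:
            del self.log[stale]
        self.log[index] = (term, key, value)

    def apply_committed(self, commit_index):
        # The cursor only moves forward; it is never rewound on truncation.
        for index in range(self.last_applied + 1, commit_index + 1):
            _term, key, value = self.log[index]
            self.state[key] = value
            self.last_applied = index


a = ToyNode(last_applied=43)
b = ToyNode(last_applied=43)

# A, as term-1 leader, commits and applies two writes that B never sees.
a.append(44, 1, "leader", "A")
a.append(45, 1, "members/A", "running")
a.apply_committed(45)

# B becomes leader at term 2 and replicates entries that reuse idx 44-45.
for node in (a, b):
    node.append(44, 2, "members/B", "running")
    node.append(45, 2, "config", "v2")
    node.apply_committed(45)

print(a.log == b.log)      # True: the logs have converged
print(a.state == b.state)  # False: the state machines have not
```

In this model A keeps the truncated term-1 writes in its state and never applies the term-2 entries, while B holds only the term-2 writes, which matches the symptom of identical logs with different DCS content.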

#### Why only "in some cases" reproducible

The window is "leader elected + Patroni writes committed before the last member
receives its first post-restart append_entries" — seconds wide at simultaneous startup, but not always hit.
