Summary
patroni.dynamic.json hardcodes remove_data_directory_on_diverged_timelines: false. When a replica diverges timelines (e.g. after repeated failovers), patroni refuses to wipe and re-clone the data directory. Instead it loops forever trying to replay WAL that will never converge, accumulating WAL in data/logs indefinitely. There is no way to configure this via a Juju config option.
Observed behaviour
After 77 failovers/restores in STG, postgresql/58 was stuck at TL1 while the cluster was at TL77. The unit had accumulated 4.8 GB of WAL in data/logs from failed restore_command attempts over 24+ days. patronictl showed the unit in starting state with unknown lag - it never self-healed.
The charm's unit status showed blocked: failed to initialize stanza, check your S3 settings - a secondary symptom from the charm attempting stanza-init while postgres was not yet accepting connections. S3 credentials and the bucket were healthy throughout.
Expected behaviour
The charm should expose remove_data_directory_on_diverged_timelines as a Juju config option. Setting it to true allows patroni to automatically wipe and re-clone a diverged replica from the primary, self-healing without operator intervention.
Steps to reproduce
- Deploy a multi-unit charmed-postgresql cluster with pgbackrest S3 backup
- Trigger enough failovers/restores to advance the cluster timeline significantly
- Take a replica offline long enough for its WAL to diverge from the primary timeline
- Observe the replica loops in starting state with unknown lag indefinitely - patroni will not wipe and re-clone
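A replica stuck like this can be spotted by comparing member timelines. The sketch below is a minimal illustration, not charm code: it assumes the member list has already been parsed from patronictl's JSON output (for example patronictl list -f json) and that each entry carries Member, Role and, when known, TL keys. Those key names, the helper find_stuck_members, and the sample member names are assumptions for illustration.

```python
from typing import Any


def find_stuck_members(members: list[dict[str, Any]]) -> list[str]:
    """Return non-leader members whose timeline is missing or behind the leader's."""
    # Assumed key names ("Member", "Role", "TL"); adjust to the real output.
    leader_tl = next(
        (m.get("TL") for m in members if str(m.get("Role", "")).lower() == "leader"),
        None,
    )
    if leader_tl is None:
        return []  # no leader timeline visible, nothing to compare against

    stuck = []
    for m in members:
        if str(m.get("Role", "")).lower() == "leader":
            continue
        tl = m.get("TL")
        # A missing TL (member not yet reporting) or a lower TL is suspicious.
        if tl is None or tl < leader_tl:
            stuck.append(m["Member"])
    return stuck


if __name__ == "__main__":
    # Hypothetical sample mirroring the situation described in this report.
    sample = [
        {"Member": "postgresql-0", "Role": "Leader", "State": "running", "TL": 77},
        {"Member": "postgresql-1", "Role": "Replica", "State": "streaming", "TL": 77},
        {"Member": "postgresql-58", "Role": "Replica", "State": "starting", "TL": 1},
    ]
    print(find_stuck_members(sample))
```

This is only a heuristic: a replica can briefly report an older timeline right after a failover, so a real check would want to see the condition persist before alerting.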
Impact
- Replicas that diverge timelines require manual patronictl reinit <cluster> <unit> --force intervention to recover
- The blocked: failed to initialize stanza charm status is a misleading secondary symptom - the real problem is the replica never coming up, not S3
- We hit this in STG after 77 timeline advances; any long-running cluster that experiences failovers is at risk
Workaround
Manual intervention required:
snap run charmed-postgresql.patronictl \
-c /var/snap/charmed-postgresql/current/etc/patroni/patroni.yaml \
reinit postgresql <unit-name> --force
Environment
- Charm: charmed-postgresql (machine operator, not k8s)
- Juju controller: JAAS / Prodstack7
- PostgreSQL: Patroni-managed cluster, 3-unit
Suggested fix
Expose remove_data_directory_on_diverged_timelines as a Juju config option (default false to preserve existing behaviour). When set to true, include it in patroni.dynamic.json so patroni will automatically wipe and re-clone diverged replicas.
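As a rough illustration of the suggested fix, the sketch below shows how a boolean config value might be threaded into the dynamic configuration before it is written out. It is not the charm's actual rendering code: the function name build_dynamic_config, the base dictionary, and the assumption that the setting lives under a postgresql section of patroni.dynamic.json are all hypothetical and would need checking against the charm's templates.

```python
import copy
import json
from typing import Any

OPTION = "remove_data_directory_on_diverged_timelines"


def build_dynamic_config(
    base: dict[str, Any], remove_on_diverged: bool = False
) -> dict[str, Any]:
    """Return a copy of `base` with the option set from the (hypothetical) Juju config value."""
    config = copy.deepcopy(base)  # never mutate the caller's dictionary
    # Assumption: the option is nested under "postgresql" in the dynamic config.
    config.setdefault("postgresql", {})[OPTION] = bool(remove_on_diverged)
    return config


if __name__ == "__main__":
    # Hypothetical existing dynamic configuration.
    base = {"postgresql": {"use_pg_rewind": True}}
    # Operator opted in, e.g. by setting the new config option to true.
    print(json.dumps(build_dynamic_config(base, remove_on_diverged=True), indent=2))
```

Defaulting the parameter to False keeps the current behaviour unless an operator explicitly opts in, which matches the default proposed above.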
Tracked: https://warthogs.atlassian.net/browse/LNDENG-4481