
Charm does not expose remove_data_directory_on_diverged_timelines - diverged replicas loop forever #1728

Description

@jansdhillon

Summary

patroni.dynamic.json hardcodes remove_data_directory_on_diverged_timelines: false. When a replica's timeline diverges from the cluster's (e.g. after repeated failovers), Patroni refuses to wipe and re-clone the data directory. Instead, it loops forever trying to replay WAL that will never converge, accumulating WAL in data/logs indefinitely. The charm provides no Juju config option to change this setting.

Observed behaviour

After 77 failovers/restores in STG, postgresql/58 was stuck at TL1 while the cluster was at TL77. The unit had accumulated 4.8 GB of WAL in data/logs from failed restore_command attempts over 24+ days. patronictl showed the unit in the starting state with unknown lag; it never self-healed.

The charm's unit status showed blocked: failed to initialize stanza, check your S3 settings. This was a secondary symptom, caused by the charm attempting stanza-init while PostgreSQL was not yet accepting connections. S3 credentials and the bucket were healthy throughout.
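The timeline gap in the observation above (TL1 against TL77) is something a monitoring script could flag. The following is a minimal sketch, not part of the charm: it assumes member data has already been parsed from something like `patronictl list -f json`, and that each entry carries `Member`, `Role`, `State`, and `TL` keys. The function name and the member names other than the one for postgresql/58 are hypothetical. A stale timeline is only a hint that a replica may have diverged, since a healthy replica can briefly trail during a failover.

```python
from typing import Any


def find_stale_timeline_members(members: list[dict[str, Any]]) -> list[str]:
    """Return names of members whose timeline trails the highest one seen.

    The key names ("Member", "TL") are an assumption about the parsed
    patronictl output, not something confirmed by the issue.
    """
    timelines = [m["TL"] for m in members if isinstance(m.get("TL"), int)]
    if not timelines:
        return []
    cluster_tl = max(timelines)

    stale = []
    for m in members:
        tl = m.get("TL")
        # A missing timeline or one below the cluster's is treated as stale.
        if not isinstance(tl, int) or tl < cluster_tl:
            stale.append(m["Member"])
    return stale


# Illustrative data shaped after the observation in this issue.
members = [
    {"Member": "postgresql-57", "Role": "Leader", "State": "running", "TL": 77},
    {"Member": "postgresql-58", "Role": "Replica", "State": "starting", "TL": 1},
    {"Member": "postgresql-59", "Role": "Replica", "State": "running", "TL": 77},
]

print(find_stale_timeline_members(members))
```

A check like this would not fix the replica; it only surfaces the condition earlier than the 24+ days it went unnoticed here.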

Expected behaviour

The charm should expose remove_data_directory_on_diverged_timelines as a Juju config option. Setting it to true allows patroni to automatically wipe and re-clone a diverged replica from the primary, self-healing without operator intervention.

Steps to reproduce

  1. Deploy a multi-unit charmed-postgresql cluster with pgbackrest S3 backup
  2. Trigger enough failovers/restores to advance the cluster timeline significantly
  3. Take a replica offline long enough for its WAL to diverge from the primary timeline
  4. Observe that the replica loops in the starting state with unknown lag indefinitely; Patroni will not wipe and re-clone it

Impact

  • Replicas whose timelines diverge need a manual patronictl reinit <cluster> <unit> --force to recover
  • The blocked: failed to initialize stanza charm status is a misleading secondary symptom - the real problem is the replica never coming up, not S3
  • We hit this in STG after 77 timeline advances; any long-running cluster that experiences failovers is at risk

Workaround

Manual intervention is required. Reinitialize the diverged replica with patronictl:

snap run charmed-postgresql.patronictl \
  -c /var/snap/charmed-postgresql/current/etc/patroni/patroni.yaml \
  reinit postgresql <unit-name> --force

Environment

  • Charm: charmed-postgresql (machine operator, not k8s)
  • Juju controller: JAAS / Prodstack7
  • PostgreSQL: Patroni-managed cluster, 3-unit

Suggested fix

Expose remove_data_directory_on_diverged_timelines as a Juju config option (default false to preserve existing behaviour). When set to true, include it in patroni.dynamic.json so patroni will automatically wipe and re-clone diverged replicas.
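As a rough illustration of the suggested fix, the sketch below shows how a charm might carry the option from Juju config into the rendered dynamic configuration. It is not the charm's actual code: the function name, the use of the same string as the Juju option name, and the placement of the setting under a `postgresql` key in patroni.dynamic.json are all assumptions made for the example.

```python
import copy
from typing import Any

# Hypothetical Juju option name; assumed to match the Patroni setting name.
OPTION = "remove_data_directory_on_diverged_timelines"


def render_dynamic_config(
    charm_config: dict[str, Any], base: dict[str, Any]
) -> dict[str, Any]:
    """Return a copy of `base` with the option set from charm config.

    Defaults to False when the operator has not set the option, which
    preserves the current hardcoded behaviour.
    """
    rendered = copy.deepcopy(base)  # never mutate the caller's template
    enabled = bool(charm_config.get(OPTION, False))
    # Assumption: the setting lives in the "postgresql" section.
    rendered.setdefault("postgresql", {})[OPTION] = enabled
    return rendered


base = {"postgresql": {"use_pg_rewind": True}}

# Option unset: behaviour is unchanged (false).
print(render_dynamic_config({}, base))

# Operator opts in, e.g. via `juju config postgresql <option>=true`.
print(render_dynamic_config({OPTION: True}, base))
```

Writing the value explicitly in both cases, rather than omitting it when false, keeps the rendered file easy to audit and matches what the file contains today.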

Tracked: https://warthogs.atlassian.net/browse/LNDENG-4481
