Charm does not expose remove_data_directory_on_diverged_timelines - diverged replicas loop forever

## Summary

`patroni.dynamic.json` hardcodes `remove_data_directory_on_diverged_timelines: false`. When a replica's timeline diverges from the cluster's (e.g. after repeated failovers), Patroni refuses to wipe and re-clone the data directory. Instead it loops forever trying to replay WAL that will never converge, accumulating WAL in `data/logs` indefinitely. The charm provides no Juju config option to change this.
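To confirm the value a unit is actually running with, a rough sketch like the one below may help. It assumes the same `patronictl` invocation, config path, and cluster name (`postgresql`) used in the Workaround section, and that `patronictl show-config` prints the setting as a plain `key: value` YAML line. `diverged_flag` is a helper name made up for this example.

```bash
# Hypothetical helper: read YAML on stdin and print the value of
# remove_data_directory_on_diverged_timelines (prints nothing if the key is absent).
diverged_flag() {
  awk -F': *' '/remove_data_directory_on_diverged_timelines/ { print $2 }'
}

# Usage on a unit (path and cluster name taken from the Workaround section):
#   snap run charmed-postgresql.patronictl \
#     -c /var/snap/charmed-postgresql/current/etc/patroni/patroni.yaml \
#     show-config postgresql | diverged_flag
```

With the hardcoded setting described above, this would be expected to print `false`.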

## Observed behaviour

After 77 failovers/restores in STG, `postgresql/58` was stuck at TL1 while the cluster was at TL77. The unit had accumulated 4.8 GB of WAL in `data/logs` from failed `restore_command` attempts over 24+ days. `patronictl` showed the unit in `starting` state with `unknown` lag - it never self-healed.

The charm's unit status showed `blocked: failed to initialize stanza, check your S3 settings` - a secondary symptom from the charm attempting stanza-init while postgres was not yet accepting connections. S3 credentials and the bucket were healthy throughout.

## Expected behaviour

The charm should expose `remove_data_directory_on_diverged_timelines` as a Juju config option. Setting it to `true` would allow Patroni to automatically wipe a diverged replica's data directory and re-clone it from the primary, self-healing without operator intervention.

## Steps to reproduce

1. Deploy a multi-unit `charmed-postgresql` cluster with pgBackRest S3 backup
2. Trigger enough failovers/restores to advance the cluster timeline significantly
3. Take a replica offline long enough for its WAL to diverge from the primary timeline
4. Observe that the replica loops in `starting` state with `unknown` lag indefinitely - Patroni will not wipe and re-clone it

## Impact

- Replicas that diverge timelines require manual `patronictl reinit <cluster> <unit> --force` intervention to recover
- The `blocked: failed to initialize stanza` charm status is a misleading secondary symptom - the real problem is the replica never coming up, not S3
- We hit this in STG after 77 timeline advances; any long-running cluster that experiences failovers is at risk

## Workaround

Manual intervention required:

```bash
snap run charmed-postgresql.patronictl \
  -c /var/snap/charmed-postgresql/current/etc/patroni/patroni.yaml \
  reinit postgresql <unit-name> --force
```

## Environment

- Charm: `charmed-postgresql` (machine operator, not k8s)
- Juju controller: JAAS / Prodstack7
- PostgreSQL: Patroni-managed cluster, 3-unit

## Suggested fix

Expose `remove_data_directory_on_diverged_timelines` as a Juju config option (default `false` to preserve existing behaviour). When it is set to `true`, write that value to `patroni.dynamic.json` so that Patroni automatically wipes and re-clones diverged replicas.
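From the operator's side, opting in could look roughly like the sketch below. The option name `remove-data-directory-on-diverged-timelines` is hypothetical (the charm does not provide it today, and the final name and spelling would be up to the maintainers), and the application name `postgresql` is assumed from the unit name `postgresql/58`.

```bash
# HYPOTHETICAL option name: this config key does not exist in the charm yet.
OPTION="remove-data-directory-on-diverged-timelines"

# Opt an application in to automatic wipe-and-re-clone of diverged replicas.
enable_diverged_reclone() {
  local app="$1"
  juju config "$app" "${OPTION}=true"
}

# Usage (application name is an assumption):
#   enable_diverged_reclone postgresql
```

Keeping the default at `false` means existing deployments would see no change unless an operator runs something like this explicitly.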

Tracked: https://warthogs.atlassian.net/browse/LNDENG-4481
