Charm stays blocked with "failed to initialize stanza" permanently after S3 recovers (no self-healing)

## Summary

When pgbackrest fails to initialize the S3 backup stanza (e.g. due to a transient S3 auth issue, network blip, or credential rotation), the charm correctly sets its status to `blocked`. However, once the underlying issue resolves and the stanza is healthy again, **the charm never clears the blocked status on its own**. The unit stays `blocked` indefinitely, even though `pgbackrest info` returns `status: ok` and backups are completing successfully.

This has happened to us repeatedly across both STG and PROD environments. Each time it requires direct operator intervention.

## Observed behaviour

```
postgresql/37*   blocked   idle   ...   10.152.88.119   5432/tcp   failed to initialize stanza, check your S3 settings
```

Running the pgbackrest check directly on the unit shows the stanza is fine:

```bash
sudo -u _daemon_ env LD_LIBRARY_PATH=... /snap/charmed-postgresql/current/usr/bin/pgbackrest \
  --config=/var/snap/charmed-postgresql/common/etc/pgbackrest/pgbackrest.conf \
  --stanza=prod-landscape-saas-ps7.postgresql \
  info
# → status: ok
```

The `stanza-create` log is empty (no failed run after recovery). The charm simply never re-checks.
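The same check can be scripted rather than read by eye. Below is a minimal sketch, assuming the command above is run with `--output=json` and that the JSON is a list of stanza objects carrying `name` and `status.message` fields. The helper name `stanza_status` and the sample payload are illustrative, not taken from the charm.

```python
import json
from typing import Optional


def stanza_status(info_json: str, stanza: str) -> Optional[str]:
    """Return the status message for `stanza`, or None if it is not listed.

    Assumes `info_json` is a JSON list of stanza objects, each with a
    "name" and a "status" object containing a "message".
    """
    for entry in json.loads(info_json):
        if entry.get("name") == stanza:
            return entry.get("status", {}).get("message")
    return None


# Hypothetical sample shaped like the assumed `info --output=json` output.
sample = json.dumps(
    [
        {
            "name": "prod-landscape-saas-ps7.postgresql",
            "status": {"code": 0, "message": "ok"},
        }
    ]
)

print(stanza_status(sample, "prod-landscape-saas-ps7.postgresql"))
```

Returning `None` for a missing stanza keeps "not configured" distinguishable from "configured but unhealthy".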

## Expected behaviour

The `update-status` hook (or a dedicated periodic check) should re-run the stanza health check and clear the `blocked` status once the stanza is `ok`. The charm should be able to self-heal without operator intervention.

## Steps to reproduce

1. Deploy `charmed-postgresql` with a pgbackrest S3 backend
2. Temporarily break S3 access (rotate creds, revoke IAM permissions, simulate a network partition, etc.)
3. Observe that the charm transitions to `blocked` with the message `failed to initialize stanza`
4. Restore S3 access; the stanza becomes healthy (`pgbackrest info` returns `status: ok`, backups run successfully)
5. Observe that the charm status **remains** `blocked` indefinitely

## Impact

- Silent loss of backup monitoring: operators assume backups are broken when they are not
- Requires manual intervention every time a transient S3 issue occurs (not acceptable for production)
- We have hit this in both STG and PROD on back-to-back days; it appears to be the normal failure mode for any S3 disruption

## Environment

- Charm: `charmed-postgresql` (machine operator, not k8s)
- Juju controller: JAAS / Prodstack7
- Backend: Ceph RadosGW (S3-compatible)
- PostgreSQL: Patroni-managed cluster, 3-unit STG / 2-unit PROD

## Suggested fix

In the `update-status` hook: if the unit is currently `blocked` with a stanza-related message, re-run the stanza check. If `pgbackrest info` returns `status: ok`, clear the blocked status and set `active`. This is a low-risk read-only check that should run on every `update-status` interval.
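The decision logic could look roughly like the sketch below. This is not the charm's actual code: `UnitStatus` and `reconcile_status` are hypothetical names standing in for whatever status objects the charm uses, and `stanza_ok` is assumed to come from a read-only `pgbackrest info` check. Only the blocked message string is taken from the observed status line above.

```python
from dataclasses import dataclass

# Message shown on the blocked unit (copied from the observed status output).
STANZA_BLOCKED_MESSAGE = "failed to initialize stanza, check your S3 settings"


@dataclass(frozen=True)
class UnitStatus:
    """Hypothetical stand-in for the charm's unit status (name plus message)."""

    name: str
    message: str = ""


def reconcile_status(current: UnitStatus, stanza_ok: bool) -> UnitStatus:
    """Clear a stanza-related blocked status once the stanza reports ok.

    Any other status, including a blocked status with a different message,
    is returned unchanged so unrelated problems are not masked.
    """
    stanza_blocked = (
        current.name == "blocked" and current.message == STANZA_BLOCKED_MESSAGE
    )
    if stanza_blocked and stanza_ok:
        return UnitStatus("active")
    return current


blocked = UnitStatus("blocked", STANZA_BLOCKED_MESSAGE)
print(reconcile_status(blocked, stanza_ok=True).name)
print(reconcile_status(blocked, stanza_ok=False).name)
```

Matching on the exact message keeps the change narrow: the hook only ever clears the status that the stanza failure set.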

