Charm stays blocked: failed to initialize stanza permanently after S3 recovers - no self-healing #1724

Description

@jansdhillon

Summary

When pgbackrest fails to initialize the S3 backup stanza (e.g. due to a transient S3 auth issue, network blip, or credential rotation), the charm correctly sets its status to blocked. However, once the underlying issue resolves and the stanza is healthy again, the charm never clears the blocked status on its own. The unit stays blocked indefinitely, even though pgbackrest info returns status: ok and backups are completing successfully.

This has happened to us repeatedly across both STG and PROD environments. Each time it requires direct operator intervention.

Observed behaviour

postgresql/37*   blocked   idle   ...   10.152.88.119   5432/tcp   failed to initialize stanza, check your S3 settings

Running the pgbackrest check directly on the unit shows the stanza is fine:

sudo -u _daemon_ env LD_LIBRARY_PATH=... /snap/charmed-postgresql/current/usr/bin/pgbackrest \
  --config=/var/snap/charmed-postgresql/common/etc/pgbackrest/pgbackrest.conf \
  --stanza=prod-landscape-saas-ps7.postgresql \
  info
# → status: ok

The stanza-create log is empty (no failed run after recovery). The charm simply never re-checks.
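The manual check above could be turned into a programmatic one. The sketch below is illustrative and is not the charm's existing code: it assumes the command is run with pgbackrest's `--output=json` option and that the result is a JSON array of stanza objects, each carrying a `name` and a `status` object with `code` and `message` fields. That shape should be verified against the pgbackrest version shipped in the snap. The function name `stanza_is_ok` is hypothetical.

```python
import json


def stanza_is_ok(info_json: str, stanza: str) -> bool:
    """Return True only if the named stanza is reported as healthy.

    `info_json` is assumed to be the stdout of
    `pgbackrest --stanza=<stanza> --output=json info`.
    Any parse error or unexpected shape is treated as "not ok".
    """
    try:
        stanzas = json.loads(info_json)
    except json.JSONDecodeError:
        return False
    if not isinstance(stanzas, list):
        return False
    for entry in stanzas:
        if not isinstance(entry, dict) or entry.get("name") != stanza:
            continue
        status = entry.get("status")
        if not isinstance(status, dict):
            return False
        # Assumed convention: code 0 with message "ok" means healthy.
        return status.get("code") == 0 and status.get("message") == "ok"
    # Stanza not present in the output at all.
    return False
```

Treating every unexpected input as "not ok" keeps the check conservative: the blocked status would only be cleared on a positive, well-formed answer.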

Expected behaviour

The update-status hook (or a dedicated periodic check) should re-run the stanza health check and clear the blocked status once the stanza is ok. The charm should be able to self-heal without operator intervention.

Steps to reproduce

  1. Deploy charmed-postgresql with a pgbackrest S3 backend
  2. Temporarily break S3 access (rotate creds, revoke IAM permissions, simulate a network partition, etc.)
  3. Observe that the charm transitions to blocked: failed to initialize stanza
  4. Restore S3 access; the stanza becomes healthy (pgbackrest info returns status: ok, backups run successfully)
  5. Observe that the charm status remains blocked indefinitely

Impact

  • Misleading backup monitoring: the unit reports blocked while backups are healthy, so operators assume backups are broken when they are not
  • Requires manual intervention every time a transient S3 issue occurs (not acceptable for production)
  • We have hit this in both STG and PROD on back-to-back days; it appears to be the normal failure mode for any S3 disruption

Environment

  • Charm: charmed-postgresql (machine operator, not k8s)
  • Juju controller: JAAS / Prodstack7
  • Backend: Ceph RadosGW (S3-compatible)
  • PostgreSQL: Patroni-managed cluster, 3-unit STG / 2-unit PROD

Suggested fix

In the update-status hook: if the unit is currently blocked with a stanza-related message, re-run the stanza check. If pgbackrest info returns status: ok, clear the blocked status and set active. This is a low-risk read-only check that should run on every update-status interval.
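The decision described above could look roughly like the following. This is a minimal sketch of the logic only, not the charm's actual handler: the function name `status_after_recheck` and the constant `STANZA_BLOCKED_MESSAGE` are hypothetical, the message text is copied from the juju status line in this report, and `stanza_ok` stands for the result of whatever stanza health check the charm runs. Statuses are modelled as plain `(name, message)` tuples so the sketch does not depend on the ops library.

```python
from typing import Tuple

# Message copied from the observed juju status output in this report.
STANZA_BLOCKED_MESSAGE = "failed to initialize stanza, check your S3 settings"


def status_after_recheck(
    status_name: str, message: str, stanza_ok: bool
) -> Tuple[str, str]:
    """Decide the unit status after an update-status re-check.

    Only a blocked status carrying the stanza message is ever cleared,
    and only when the stanza check came back healthy. Every other
    status is returned unchanged.
    """
    is_stanza_block = (
        status_name == "blocked" and message == STANZA_BLOCKED_MESSAGE
    )
    if is_stanza_block and stanza_ok:
        return ("active", "")
    return (status_name, message)
```

Matching on the exact stanza message means unrelated blocked states (for example a missing relation) are left alone, so the re-check cannot mask a different problem.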
