Skip to content

livepeer-api: add AMQP-aware liveness probe (/healthz/amqp) to detect broken publish channels #2350

@seanhanca

Description

@seanhanca

Summary

We hit a production incident on 2026-05-02 where 2 of 4 prod-livepeer-api pods in fra had broken AMQP publish channels but the existing /healthz endpoint kept returning 200. Result: every request-upload and webhook publish that landed on those pods hung 60s waiting on a publisher confirm that never arrived → Cloudflare 524 → users saw "network error" and assets stuck in uploading phase indefinitely.

The pods recovered only after a manual kubectl delete pod.

Evidence

  • Stack trace from livepeer-api logs:
    Error: Timeout publishing message events.stream.idle to queue
      at o._channelPublish (/usr/local/src/store/queue.ts:336:19)
      at o.publish (/usr/local/src/store/queue.ts:239:5)
      at o.publishWebhook (/usr/local/src/store/queue.ts:226:5)
    
  • Affected pods (uptime 8-12h since last restart) had 80 / 72 timeouts in 10 min.
  • Healthy pods on the same deployment had 0.
  • AMQP cluster itself was healthy: nodes 11-50% memory, no alarms, normal publish rate cluster-wide.
  • Restarting the 2 affected pods immediately resolved the issue (0 timeouts after restart).

Root cause hypothesis

After a TCP reconnect, amqp-connection-manager can land the recovered channel in a half-broken state where the channel reports "open" but publisher confirms never come back. The application library has no way to distinguish this from a healthy idle channel, and existing /healthz only checks HTTP, not the AMQP publish path.

Proposed fix

Add GET /healthz/amqp that:

  1. Publishes a no-op message to a dedicated probe queue (e.g. healthz_probe_<podname>) with publisher confirms enabled.
  2. Returns 200 if the confirm comes back within 5s, 503 otherwise.
  3. Counts as "alive" only when AMQP publish actually works end-to-end.

Then update the livepeer-infra Helm chart's livenessProbe to call /healthz/amqp with failureThreshold: 3, periodSeconds: 30. Kubernetes will then auto-restart pods where the AMQP channel goes bad — no human intervention.

Workaround in place

A CronJob in livepeer/livepeer-infra that periodically scans pod logs for `Timeout publishing message` count and kubectl delete pod if it exceeds a threshold. This is a band-aid; the proper fix is this issue.

References

  • Incident debugging session: 2026-05-02, fra cluster, pods prod-livepeer-api-55cf8f7b85-2xt2r and -dxmxn
  • Source of timeout: packages/api/src/store/queue.ts:336 (_channelPublish)

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions