Summary
We hit a production incident on 2026-05-02 where 2 of 4 prod-livepeer-api pods in fra had broken AMQP publish channels but the existing /healthz endpoint kept returning 200. Result: every request-upload and webhook publish that landed on those pods hung 60s waiting on a publisher confirm that never arrived → Cloudflare 524 → users saw "network error" and assets stuck in uploading phase indefinitely.
The pods recovered only after a manual kubectl delete pod.
Evidence
- Stack trace from
livepeer-api logs:
Error: Timeout publishing message events.stream.idle to queue
at o._channelPublish (/usr/local/src/store/queue.ts:336:19)
at o.publish (/usr/local/src/store/queue.ts:239:5)
at o.publishWebhook (/usr/local/src/store/queue.ts:226:5)
- Affected pods (uptime 8-12h since last restart) had 80 / 72 timeouts in 10 min.
- Healthy pods on the same deployment had 0.
- AMQP cluster itself was healthy: nodes 11-50% memory, no alarms, normal publish rate cluster-wide.
- Restarting the 2 affected pods immediately resolved the issue (0 timeouts after restart).
Root cause hypothesis
After a TCP reconnect, amqp-connection-manager can land the recovered channel in a half-broken state where the channel reports "open" but publisher confirms never come back. The application library has no way to distinguish this from a healthy idle channel, and existing /healthz only checks HTTP, not the AMQP publish path.
Proposed fix
Add GET /healthz/amqp that:
- Publishes a no-op message to a dedicated probe queue (e.g.
healthz_probe_<podname>) with publisher confirms enabled.
- Returns 200 if the confirm comes back within 5s, 503 otherwise.
- Counts as "alive" only when AMQP publish actually works end-to-end.
Then update the livepeer-infra Helm chart's livenessProbe to call /healthz/amqp with failureThreshold: 3, periodSeconds: 30. Kubernetes will then auto-restart pods where the AMQP channel goes bad — no human intervention.
Workaround in place
A CronJob in livepeer/livepeer-infra that periodically scans pod logs for `Timeout publishing message` count and kubectl delete pod if it exceeds a threshold. This is a band-aid; the proper fix is this issue.
References
- Incident debugging session: 2026-05-02, fra cluster, pods
prod-livepeer-api-55cf8f7b85-2xt2r and -dxmxn
- Source of timeout:
packages/api/src/store/queue.ts:336 (_channelPublish)
Summary
We hit a production incident on 2026-05-02 where 2 of 4
prod-livepeer-apipods infrahad broken AMQP publish channels but the existing/healthzendpoint kept returning 200. Result: everyrequest-uploadand webhook publish that landed on those pods hung 60s waiting on a publisher confirm that never arrived → Cloudflare 524 → users saw "network error" and assets stuck inuploadingphase indefinitely.The pods recovered only after a manual
kubectl delete pod.Evidence
livepeer-apilogs:Root cause hypothesis
After a TCP reconnect,
amqp-connection-managercan land the recovered channel in a half-broken state where the channel reports "open" but publisher confirms never come back. The application library has no way to distinguish this from a healthy idle channel, and existing/healthzonly checks HTTP, not the AMQP publish path.Proposed fix
Add
GET /healthz/amqpthat:healthz_probe_<podname>) with publisher confirms enabled.Then update the
livepeer-infraHelm chart'slivenessProbeto call/healthz/amqpwithfailureThreshold: 3, periodSeconds: 30. Kubernetes will then auto-restart pods where the AMQP channel goes bad — no human intervention.Workaround in place
A CronJob in
livepeer/livepeer-infrathat periodically scans pod logs for `Timeout publishing message` count andkubectl delete podif it exceeds a threshold. This is a band-aid; the proper fix is this issue.References
prod-livepeer-api-55cf8f7b85-2xt2rand-dxmxnpackages/api/src/store/queue.ts:336(_channelPublish)