livepeer-api: add AMQP-aware liveness probe (/healthz/amqp) to detect broken publish channels

## Summary

We hit a production incident on 2026-05-02 where 2 of 4 `prod-livepeer-api` pods in `fra` had broken AMQP publish channels but the existing `/healthz` endpoint kept returning 200. Result: every `request-upload` and webhook publish that landed on those pods hung 60s waiting on a publisher confirm that never arrived → Cloudflare 524 → users saw "network error" and assets stuck in `uploading` phase indefinitely.

The pods recovered only after a manual `kubectl delete pod`.

## Evidence

- Stack trace from `livepeer-api` logs:
  ```
  Error: Timeout publishing message events.stream.idle to queue
    at o._channelPublish (/usr/local/src/store/queue.ts:336:19)
    at o.publish (/usr/local/src/store/queue.ts:239:5)
    at o.publishWebhook (/usr/local/src/store/queue.ts:226:5)
  ```
- Affected pods (uptime 8-12h since last restart) had **80 / 72 timeouts in 10 min**.
- Healthy pods on the same deployment had **0**.
- AMQP cluster itself was healthy: nodes 11-50% memory, no alarms, normal publish rate cluster-wide.
- Restarting the 2 affected pods immediately resolved the issue (0 timeouts after restart).

## Root cause hypothesis

After a TCP reconnect, `amqp-connection-manager` can land the recovered channel in a half-broken state where the channel reports "open" but publisher confirms never come back. The application library has no way to distinguish this from a healthy idle channel, and existing `/healthz` only checks HTTP, not the AMQP publish path.

## Proposed fix

Add `GET /healthz/amqp` that:
1. Publishes a no-op message to a dedicated probe queue (e.g. `healthz_probe_<podname>`) with publisher confirms enabled.
2. Returns 200 if the confirm comes back within 5s, 503 otherwise.
3. Counts as "alive" only when AMQP publish actually works end-to-end.

Then update the `livepeer-infra` Helm chart's `livenessProbe` to call `/healthz/amqp` with `failureThreshold: 3, periodSeconds: 30`. Kubernetes will then auto-restart pods where the AMQP channel goes bad — no human intervention.

## Workaround in place

A CronJob in `livepeer/livepeer-infra` that periodically scans pod logs for \`Timeout publishing message\` count and `kubectl delete pod` if it exceeds a threshold. This is a band-aid; the proper fix is this issue.

## References

- Incident debugging session: 2026-05-02, fra cluster, pods `prod-livepeer-api-55cf8f7b85-2xt2r` and `-dxmxn`
- Source of timeout: `packages/api/src/store/queue.ts:336` (`_channelPublish`)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

livepeer-api: add AMQP-aware liveness probe (/healthz/amqp) to detect broken publish channels #2350

Summary

Evidence

Root cause hypothesis

Proposed fix

Workaround in place

References

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

livepeer-api: add AMQP-aware liveness probe (/healthz/amqp) to detect broken publish channels #2350

Description

Summary

Evidence

Root cause hypothesis

Proposed fix

Workaround in place

References

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions