Add backend health probing for Envoy upstreams #6

Merged
lai3d merged 1 commit into main from claude/health-probe
May 18, 2026

@lai3d lai3d commented May 18, 2026

Summary

Adds an opt-in loop that periodically TCP-probes each active Envoy upstream for this VPS and exposes reachability + latency via Prometheus and the MCP server. Independent of Envoy's own health checks — gives the agent direct visibility into data-plane reachability.

What

New module sigma-agent/src/health_probe.rs:

  • probe_loop() fetches active envoy_nodes for this VPS, then for each node fetches active envoy_routes, then TCP-probes each route's backend_host:backend_port sequentially
  • probe_backend() is a pure-ish helper that wraps tokio::net::TcpStream::connect with tokio::time::timeout (clamped to 3s max)
  • Results land in Arc<RwLock<Vec<BackendProbeResult>>> and feed both Prometheus and MCP
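
A minimal sketch of the probe-with-clamped-timeout idea described above. It deliberately swaps in the blocking `std::net::TcpStream::connect_timeout` so it runs without extra crates; the PR itself wraps `tokio::net::TcpStream::connect` in `tokio::time::timeout`. The `ProbeOutcome` struct, the `clamp_timeout` helper, and the `probe_backend` signature shown here are assumptions for illustration, not the code in `health_probe.rs`.

```rust
use std::net::{SocketAddr, TcpStream};
use std::time::{Duration, Instant};

/// Ceiling on any single probe. The name mirrors the PR's
/// MAX_PROBE_TIMEOUT_SECS; the value (3s) is taken from the description.
const MAX_PROBE_TIMEOUT_SECS: u64 = 3;

/// Outcome of one probe (hypothetical shape, not the PR's BackendProbeResult).
#[derive(Debug)]
pub struct ProbeOutcome {
    pub reachable: bool,
    pub latency_ms: u128,
}

/// Clamp a caller-supplied timeout to the ceiling, so a large request
/// cannot extend the work done in one iteration.
pub fn clamp_timeout(requested: Duration) -> Duration {
    requested.min(Duration::from_secs(MAX_PROBE_TIMEOUT_SECS))
}

/// Blocking stand-in for the async probe: attempt a TCP connect and record
/// whether it succeeded and how long the attempt took.
pub fn probe_backend(addr: SocketAddr, requested: Duration) -> ProbeOutcome {
    let timeout = clamp_timeout(requested);
    let started = Instant::now();
    // Any connect error (refused, unreachable, timed out) counts as unreachable.
    let reachable = TcpStream::connect_timeout(&addr, timeout).is_ok();
    ProbeOutcome {
        reachable,
        latency_ms: started.elapsed().as_millis(),
    }
}
```

In this sketch a refused connection and a timed-out connection both report `reachable: false`; whether the real helper distinguishes the two is not stated in the PR.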

Wire-up:

  • --health-probe / AGENT_HEALTH_PROBE (default false)
  • --health-probe-interval / AGENT_HEALTH_PROBE_INTERVAL (default 30s)
  • Prometheus: sigma_backend_reachable{route, listen_port, backend} (1/0), sigma_backend_probe_latency_ms{route, listen_port, backend}
  • MCP: new tool query_backend_health with optional unreachable_only: bool filter
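
To make the two outputs concrete, here is a rough sketch of how a snapshot could be rendered as Prometheus gauges and filtered for the MCP tool. The metric and label names come from the list above; the `BackendProbeResult` field names, the `render_metrics` and `filter_results` functions, and the choice to emit a latency sample for every backend are assumptions, not the PR's actual code.

```rust
/// Hypothetical snapshot entry; field names are assumptions.
#[derive(Debug, Clone, PartialEq)]
pub struct BackendProbeResult {
    pub route: String,
    pub listen_port: u16,
    pub backend: String,
    pub reachable: bool,
    pub latency_ms: u64,
}

/// Render both gauges in the Prometheus text exposition format.
/// Label values are written as-is; a real exporter would also escape
/// backslashes, quotes, and newlines inside them.
pub fn render_metrics(results: &[BackendProbeResult]) -> String {
    let mut out = String::new();
    out.push_str("# TYPE sigma_backend_reachable gauge\n");
    for r in results {
        out.push_str(&format!(
            "sigma_backend_reachable{{route=\"{}\",listen_port=\"{}\",backend=\"{}\"}} {}\n",
            r.route,
            r.listen_port,
            r.backend,
            u8::from(r.reachable) // true -> 1, false -> 0
        ));
    }
    out.push_str("# TYPE sigma_backend_probe_latency_ms gauge\n");
    for r in results {
        out.push_str(&format!(
            "sigma_backend_probe_latency_ms{{route=\"{}\",listen_port=\"{}\",backend=\"{}\"}} {}\n",
            r.route, r.listen_port, r.backend, r.latency_ms
        ));
    }
    out
}

/// Sketch of the `unreachable_only` filter: when the flag is set, keep only
/// failed probes; otherwise return the whole snapshot.
pub fn filter_results(
    results: &[BackendProbeResult],
    unreachable_only: bool,
) -> Vec<BackendProbeResult> {
    results
        .iter()
        .filter(|r| !unreachable_only || !r.reachable)
        .cloned()
        .collect()
}
```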

Design

  • Sequential, not concurrent. Probing one backend at a time bounds CPU bursts and fits the agent's <1% steady-state CPU budget on a 1 vCPU VPS. Concurrent probing would briefly spike CPU as N tasks compete for the runtime.
  • 3s timeout ceiling. Enforced via MAX_PROBE_TIMEOUT_SECS even if a caller asks for more — keeps per-iteration work bounded.
  • Skips gracefully if registration failed. vps_id must be Some (agent registered with sigma-api). If not, logs a warning and the spawn is skipped — agent continues without probing.
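
The design points above can be sketched roughly as follows: one pass that probes targets strictly in order and then publishes the results in a single write, plus a gate that declines to start the loop without a `vps_id`. This uses `std::sync::RwLock` and an injected probe closure so the example stays self-contained. The `ProbeRecord` type, the `run_cycle` and `should_spawn` names, and the `&str` type for `vps_id` are hypothetical; the real loop is async and queries sigma-api for nodes and routes.

```rust
use std::sync::{Arc, RwLock};

/// Hypothetical, trimmed-down result record for this sketch.
#[derive(Debug, Clone, PartialEq)]
pub struct ProbeRecord {
    pub backend: String,
    pub reachable: bool,
}

/// Shared snapshot read by the metrics endpoint and the MCP tool.
pub type Snapshot = Arc<RwLock<Vec<ProbeRecord>>>;

/// One probe cycle: visit targets one at a time (no fan-out), collect the
/// results locally, then replace the snapshot in a single short write so
/// readers never observe a half-updated list.
pub fn run_cycle<F>(targets: &[String], mut probe: F, snapshot: &Snapshot)
where
    F: FnMut(&str) -> bool,
{
    let mut fresh = Vec::with_capacity(targets.len());
    for backend in targets {
        let reachable = probe(backend);
        fresh.push(ProbeRecord {
            backend: backend.clone(),
            reachable,
        });
    }
    *snapshot.write().unwrap() = fresh;
}

/// Gate for starting the loop: probing is opt-in and needs a vps_id from a
/// successful registration. Without one, warn and carry on without probing.
pub fn should_spawn(enabled: bool, vps_id: Option<&str>) -> bool {
    match (enabled, vps_id) {
        (true, Some(_)) => true,
        (true, None) => {
            eprintln!("health probe enabled but agent is not registered; skipping");
            false
        }
        (false, _) => false,
    }
}
```

One consequence of the sequential design, under the 3s ceiling: a cycle in which every one of N backends times out takes up to 3N seconds, so with enough unreachable routes a single pass can outlast the configured interval.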

Test plan

  • cargo check --no-default-features — clean
  • cargo test --no-default-features — 27/27 pass (2 new tokio tests for closed-port behavior and timeout clamping)
  • README.md updated with new ## Backend Health Probing section
  • Live test against a real Envoy fleet (separate step)

Part of the agent roadmap.

Adds an opt-in (`--health-probe`) loop that TCP-probes every active
`envoy_route` upstream for this VPS at a configurable interval (default 30s).
Each probe is a `tokio::net::TcpStream::connect` wrapped in a timeout that is
hard-clamped to 3s, so one stuck backend delays the cycle by at most 3s. Routes are
iterated **sequentially** in a single task to stay within the agent's
<1% CPU / <50MB RSS budget on 1 vCPU / 512MB VPS instances; fan-out would
defeat that.

Results are written to a shared `Arc<RwLock<Vec<BackendProbeResult>>>`
snapshot consumed by:
  - Prometheus `/metrics`: `sigma_backend_reachable{...}` (1/0) +
    `sigma_backend_probe_latency_ms{...}` gauges, labeled by route /
    listen_port / backend.
  - MCP tool `query_backend_health` with optional `unreachable_only` filter
    for incident-triage workflows.

This validates the data plane independently of Envoy's own active health
checks — useful when Envoy thinks an upstream is healthy but the agent
itself can't TCP-connect from the same host (firewall, route, DNS issue).

The probe loop is only spawned when VPS registration succeeded (vps_id is
needed to scope the `envoy_nodes`/`envoy_routes` queries); failed registration
logs a warning and skips the spawn.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@lai3d lai3d force-pushed the claude/health-probe branch from 330662d to 33318cd May 18, 2026 17:43
@lai3d lai3d merged commit 06b181d into main May 18, 2026
@lai3d lai3d deleted the claude/health-probe branch May 18, 2026 17:43