Add backend health probing for Envoy upstreams by lai3d · Pull Request #6 · lai3d/sigma

lai3d · 2026-05-18T17:33:13Z

Summary

Adds an opt-in loop that periodically TCP-probes each active Envoy upstream for this VPS and exposes reachability and latency via Prometheus and the MCP server. The loop runs independently of Envoy's own health checks, which gives the agent direct visibility into data-plane reachability.

What

New module sigma-agent/src/health_probe.rs:

- probe_loop() fetches active envoy_nodes for this VPS, then for each node fetches active envoy_routes, then TCP-probes each route's backend_host:backend_port sequentially
- probe_backend() is a pure-ish helper that wraps tokio::net::TcpStream::connect with tokio::time::timeout (clamped to 3s max)
- Results land in Arc<RwLock<Vec<BackendProbeResult>>> and feed both Prometheus and MCP
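
The clamped probe described above could look roughly like the sketch below. It is not the code in health_probe.rs: to stay dependency-free it swaps the PR's tokio::net::TcpStream::connect plus tokio::time::timeout for the blocking std::net::TcpStream::connect_timeout. The clamp_timeout helper and the BackendProbeResult fields shown here are assumptions made for illustration.

```rust
use std::net::{TcpStream, ToSocketAddrs};
use std::time::{Duration, Instant};

/// Ceiling named in the PR; callers cannot exceed it.
const MAX_PROBE_TIMEOUT_SECS: u64 = 3;

/// Hypothetical result shape; the real struct may carry more fields.
#[derive(Debug, Clone)]
pub struct BackendProbeResult {
    pub backend: String,
    pub reachable: bool,
    pub latency_ms: Option<u64>,
}

/// Hypothetical helper: never allow more than MAX_PROBE_TIMEOUT_SECS.
pub fn clamp_timeout(requested: Duration) -> Duration {
    requested.min(Duration::from_secs(MAX_PROBE_TIMEOUT_SECS))
}

/// Blocking stand-in for the async probe: one TCP connect attempt.
pub fn probe_backend(host: &str, port: u16, requested: Duration) -> BackendProbeResult {
    let timeout = clamp_timeout(requested);
    let started = Instant::now();

    // Name resolution is not covered by the connect timeout in this sketch,
    // and its duration is included in the reported latency.
    let addr = (host, port)
        .to_socket_addrs()
        .ok()
        .and_then(|mut addrs| addrs.next());

    let reachable = match addr {
        Some(addr) => TcpStream::connect_timeout(&addr, timeout).is_ok(),
        None => false,
    };

    BackendProbeResult {
        backend: format!("{host}:{port}"),
        reachable,
        // Latency is only reported for successful connects (an assumption).
        latency_ms: reachable.then(|| started.elapsed().as_millis() as u64),
    }
}
```

In this sketch a request for a 10s timeout is silently reduced to 3s, which mirrors the timeout-clamping behavior the test plan mentions.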

Wire-up:

- --health-probe / AGENT_HEALTH_PROBE (default false)
- --health-probe-interval / AGENT_HEALTH_PROBE_INTERVAL (default 30s)
- Prometheus: sigma_backend_reachable{route, listen_port, backend} (1/0), sigma_backend_probe_latency_ms{route, listen_port, backend}
- MCP: new tool query_backend_health with optional unreachable_only: bool filter
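
To make the two consumers concrete, here is a rough sketch of how a snapshot of results might be rendered in the Prometheus text exposition format and filtered for the unreachable_only option. The metric and label names come from the list above. The struct fields, the render_metrics and filter_results names, and the choice to omit the latency sample for unreachable backends are assumptions, not the PR's actual code.

```rust
use std::fmt::Write;

/// Hypothetical result shape carrying the three labels named in the PR.
pub struct BackendProbeResult {
    pub route: String,
    pub listen_port: u16,
    pub backend: String,
    pub reachable: bool,
    pub latency_ms: Option<u64>,
}

/// Escape backslash, double quote, and newline in a label value.
fn escape_label(value: &str) -> String {
    value
        .replace('\\', "\\\\")
        .replace('"', "\\\"")
        .replace('\n', "\\n")
}

fn label_set(r: &BackendProbeResult) -> String {
    format!(
        "route=\"{}\",listen_port=\"{}\",backend=\"{}\"",
        escape_label(&r.route),
        r.listen_port,
        escape_label(&r.backend)
    )
}

/// Render both gauges, keeping the samples of each metric grouped together.
pub fn render_metrics(results: &[BackendProbeResult]) -> String {
    let mut out = String::from("# TYPE sigma_backend_reachable gauge\n");
    for r in results {
        let _ = writeln!(
            out,
            "sigma_backend_reachable{{{}}} {}",
            label_set(r),
            u8::from(r.reachable)
        );
    }
    out.push_str("# TYPE sigma_backend_probe_latency_ms gauge\n");
    for r in results {
        // Assumption: no latency sample when the connect did not succeed.
        if let Some(ms) = r.latency_ms {
            let _ = writeln!(out, "sigma_backend_probe_latency_ms{{{}}} {}", label_set(r), ms);
        }
    }
    out
}

/// Mirror of the unreachable_only option: keep everything, or only failures.
pub fn filter_results(
    results: &[BackendProbeResult],
    unreachable_only: bool,
) -> Vec<&BackendProbeResult> {
    results
        .iter()
        .filter(|r| !unreachable_only || !r.reachable)
        .collect()
}
```

With unreachable_only set to true, only the backends whose last probe failed are returned, which is the view an incident-triage query would want.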

Design

- Sequential, not concurrent. Probing one backend at a time bounds CPU bursts, which fits the agent's <1% steady-state CPU budget on a 1 vCPU VPS. Concurrent probing would briefly spike CPU as N tasks compete for the runtime.
- 3s timeout ceiling. Enforced via MAX_PROBE_TIMEOUT_SECS even if a caller asks for more, which keeps per-iteration work bounded.
- Skips gracefully if registration failed. The probe loop needs vps_id to be Some (that is, the agent registered with sigma-api). If it is None, the agent logs a warning, skips the spawn, and continues without probing.
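
The design points above can be sketched as a single sequential pass plus a spawn guard. This is an illustration using only the standard library, not the PR's async code: Route, run_cycle, should_spawn_probe, the injected probe closure, and the choice to publish each cycle by swapping the whole Vec are all assumptions.

```rust
use std::sync::{Arc, RwLock};

/// Hypothetical, trimmed-down result and route types.
#[derive(Debug, Clone, PartialEq)]
pub struct BackendProbeResult {
    pub backend: String,
    pub reachable: bool,
}

pub struct Route {
    pub backend_host: String,
    pub backend_port: u16,
}

/// Shared snapshot type named in the PR description.
pub type Snapshot = Arc<RwLock<Vec<BackendProbeResult>>>;

/// One probe cycle: visit routes strictly one at a time, then publish.
/// The probe itself is injected so the loop can be exercised without a network.
pub fn run_cycle<F>(routes: &[Route], mut probe: F, snapshot: &Snapshot)
where
    F: FnMut(&str, u16) -> bool,
{
    let mut fresh = Vec::with_capacity(routes.len());
    for route in routes {
        // No fan-out: the next probe starts only after this one returns.
        let reachable = probe(&route.backend_host, route.backend_port);
        fresh.push(BackendProbeResult {
            backend: format!("{}:{}", route.backend_host, route.backend_port),
            reachable,
        });
    }
    // Assumption: readers see either the previous cycle or the new one.
    *snapshot.write().expect("snapshot lock poisoned") = fresh;
}

/// Spawn guard: probing needs both the opt-in flag and a registered VPS id.
pub fn should_spawn_probe(enabled: bool, vps_id: Option<&str>) -> bool {
    match (enabled, vps_id) {
        (false, _) => false,
        (true, None) => {
            eprintln!("health probe requested but VPS registration failed; skipping");
            false
        }
        (true, Some(_)) => true,
    }
}
```

Because the probes run one after another under a 3s ceiling, a cycle over N routes spends at most about N × 3s in connect attempts, so the worst-case cycle length grows linearly with the number of routes.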

Test plan

- cargo check --no-default-features: clean
- cargo test --no-default-features: 27/27 pass (2 new tokio tests for closed-port behavior and timeout clamping)
- README.md updated with new ## Backend Health Probing section
- Live test against a real Envoy fleet (separate step)

Part of the agent roadmap.

Adds an opt-in (`--health-probe`) loop that TCP-probes every active `envoy_route` upstream for this VPS at a configurable interval (default 30s). Each probe is a `tokio::net::TcpStream::connect` wrapped in a 3s timeout that is hard-clamped, so one stuck backend can never stall the cycle. Routes are iterated **sequentially** in a single task to stay within the agent's <1% CPU / <50MB RSS budget on 1 vCPU / 512MB VPS instances; fan-out would defeat that.

Results are written to a shared `Arc<RwLock<Vec<BackendProbeResult>>>` snapshot consumed by:

- Prometheus `/metrics`: `sigma_backend_reachable{...}` (1/0) + `sigma_backend_probe_latency_ms{...}` gauges, labeled by route / listen_port / backend.
- MCP tool `query_backend_health` with optional `unreachable_only` filter for incident-triage workflows.

This validates the data plane independently of Envoy's own active health checks, which is useful when Envoy thinks an upstream is healthy but the agent itself can't TCP-connect from the same host (firewall, route, DNS issue).

The probe loop is only spawned when VPS registration succeeded (vps_id is needed to scope the `envoy_nodes`/`envoy_routes` queries); failed registration logs a warning and skips the spawn.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

lai3d force-pushed the claude/health-probe branch from 330662d to 33318cd Compare May 18, 2026 17:43

lai3d merged commit 06b181d into main May 18, 2026

lai3d deleted the claude/health-probe branch May 18, 2026 17:43
