Add backend health probing for Envoy upstreams #6
Merged
Adds an opt-in (`--health-probe`) loop that TCP-probes every active
`envoy_route` upstream for this VPS at a configurable interval (default 30s).
Each probe is a `tokio::net::TcpStream::connect` wrapped in a 3s timeout —
hard-clamped, so one stuck backend can never stall the cycle. Routes are
iterated **sequentially** in a single task to stay within the agent's
<1% CPU / <50MB RSS budget on 1 vCPU / 512MB VPS instances; fan-out would
defeat that.
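A minimal sketch of the clamped probe described above, not the code in this PR. To stay dependency-free it swaps in the blocking `std::net::TcpStream::connect_timeout` for the `tokio::net::TcpStream::connect` + `tokio::time::timeout` pair the agent uses. The `ProbeOutcome` struct and `clamp_timeout` helper are hypothetical names chosen for illustration.

```rust
use std::net::{TcpStream, ToSocketAddrs};
use std::time::{Duration, Instant};

/// Hard upper bound on one probe, mirroring the 3s clamp described above.
const MAX_PROBE_TIMEOUT_SECS: u64 = 3;

/// Outcome of a single probe (hypothetical shape, for illustration only).
#[derive(Debug, Clone, PartialEq)]
pub struct ProbeOutcome {
    pub reachable: bool,
    pub latency_ms: Option<u64>,
}

/// Clamp a caller-supplied timeout to the hard maximum.
pub fn clamp_timeout(requested: Duration) -> Duration {
    requested.min(Duration::from_secs(MAX_PROBE_TIMEOUT_SECS))
}

/// Blocking stand-in for the async probe: connect with a clamped timeout.
/// Note: name resolution happens before the connect and is not covered
/// by the timeout in this sketch.
pub fn probe_backend(host: &str, port: u16, requested: Duration) -> ProbeOutcome {
    let timeout = clamp_timeout(requested);
    let unreachable = ProbeOutcome { reachable: false, latency_ms: None };

    // A resolution failure is reported as unreachable rather than as an error.
    let addr = match (host, port).to_socket_addrs().ok().and_then(|mut a| a.next()) {
        Some(addr) => addr,
        None => return unreachable,
    };

    let start = Instant::now();
    match TcpStream::connect_timeout(&addr, timeout) {
        Ok(_stream) => ProbeOutcome {
            reachable: true,
            latency_ms: Some(start.elapsed().as_millis() as u64),
        },
        // Refused, timed out, or invalid (zero) timeout: all count as unreachable.
        Err(_) => unreachable,
    }
}
```

Calling `probe_backend` for each route in a plain `for` loop gives the sequential behavior: worst-case cycle time is bounded by the number of routes times the clamped timeout.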
Results are written to a shared `Arc<RwLock<Vec<BackendProbeResult>>>`
snapshot consumed by:
- Prometheus `/metrics`: `sigma_backend_reachable{...}` (1/0) +
`sigma_backend_probe_latency_ms{...}` gauges, labeled by route /
listen_port / backend.
- MCP tool `query_backend_health` with optional `unreachable_only` filter
for incident-triage workflows.
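The shared-snapshot pattern can be sketched as below. The metric names, label set, and `unreachable_only` filter come from this description. The `BackendProbeResult` field names and the `publish` / `render_metrics` helpers are assumptions made for illustration and may not match the actual module.

```rust
use std::sync::{Arc, RwLock};

/// One probe result; field names are assumed, not taken from the PR.
#[derive(Debug, Clone, PartialEq)]
pub struct BackendProbeResult {
    pub route: String,
    pub listen_port: u16,
    pub backend: String,
    pub reachable: bool,
    pub latency_ms: Option<u64>,
}

pub type Snapshot = Arc<RwLock<Vec<BackendProbeResult>>>;

/// Replace the whole snapshot after a probe cycle finishes.
pub fn publish(snapshot: &Snapshot, results: Vec<BackendProbeResult>) {
    *snapshot.write().unwrap() = results;
}

/// Reader side for the MCP tool: optionally keep only unreachable backends.
pub fn query_backend_health(snapshot: &Snapshot, unreachable_only: bool) -> Vec<BackendProbeResult> {
    snapshot
        .read()
        .unwrap()
        .iter()
        .filter(|r| !unreachable_only || !r.reachable)
        .cloned()
        .collect()
}

/// Reader side for `/metrics`: render gauges in Prometheus text format.
/// Label values are not escaped in this sketch.
pub fn render_metrics(snapshot: &Snapshot) -> String {
    let mut out = String::new();
    for r in snapshot.read().unwrap().iter() {
        let labels = format!(
            "route=\"{}\",listen_port=\"{}\",backend=\"{}\"",
            r.route, r.listen_port, r.backend
        );
        out.push_str(&format!("sigma_backend_reachable{{{}}} {}\n", labels, r.reachable as u8));
        // Latency is only emitted when the connect succeeded.
        if let Some(ms) = r.latency_ms {
            out.push_str(&format!("sigma_backend_probe_latency_ms{{{}}} {}\n", labels, ms));
        }
    }
    out
}
```

Swapping the whole `Vec` under one short write lock keeps readers from ever seeing a half-updated cycle.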
This validates the data plane independently of Envoy's own active health
checks — useful when Envoy thinks an upstream is healthy but the agent
itself can't TCP-connect from the same host (firewall, route, DNS issue).
The probe loop is only spawned when VPS registration succeeded (vps_id is
needed to scope the `envoy_nodes`/`envoy_routes` queries); failed registration
logs a warning and skips the spawn.
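The spawn gating can be illustrated with a small decision helper. This is a sketch only: the `SpawnDecision` enum, the `decide_spawn` function, and the `i64` type for `vps_id` are all hypothetical, and the real agent presumably spawns a tokio task where this returns `Spawn`.

```rust
/// What startup should do with the probe loop (hypothetical type).
#[derive(Debug, PartialEq)]
pub enum SpawnDecision {
    /// Flag is set and registration succeeded: spawn, scoped to this VPS.
    Spawn { vps_id: i64 },
    /// `--health-probe` was not set (the default).
    SkipDisabled,
    /// Flag is set but registration failed; caller should log a warning.
    SkipUnregistered,
}

/// Decide whether to spawn the loop. `vps_id` is `None` when registration failed.
pub fn decide_spawn(health_probe: bool, vps_id: Option<i64>) -> SpawnDecision {
    match (health_probe, vps_id) {
        (false, _) => SpawnDecision::SkipDisabled,
        (true, Some(vps_id)) => SpawnDecision::Spawn { vps_id },
        (true, None) => SpawnDecision::SkipUnregistered,
    }
}
```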
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Adds an opt-in loop that periodically TCP-probes each active Envoy upstream for this VPS and exposes reachability + latency via Prometheus and the MCP server. Independent of Envoy's own health checks — gives the agent direct visibility into data-plane reachability.
What
New module `sigma-agent/src/health_probe.rs`:
- `probe_loop()` fetches active `envoy_nodes` for this VPS, then for each node fetches active `envoy_routes`, then TCP-probes each route's `backend_host:backend_port` sequentially
- `probe_backend()` is a pure-ish helper that wraps `tokio::net::TcpStream::connect` with `tokio::time::timeout` (clamped to 3s max)
- Results are written to a shared `Arc<RwLock<Vec<BackendProbeResult>>>` and feed both Prometheus and MCP
Wire-up:
- `--health-probe` / `AGENT_HEALTH_PROBE` (default false)
- `--health-probe-interval` / `AGENT_HEALTH_PROBE_INTERVAL` (default 30s)
- Prometheus: `sigma_backend_reachable{route, listen_port, backend}` (1/0), `sigma_backend_probe_latency_ms{route, listen_port, backend}`
- MCP tool: `query_backend_health` with optional `unreachable_only: bool` filter
Design
- The probe timeout is clamped to `MAX_PROBE_TIMEOUT_SECS` even if a caller asks for more, which keeps per-iteration work bounded.
- `vps_id` must be `Some` (agent registered with sigma-api). If not, the agent logs a warning, skips the spawn, and continues without probing.
Test plan
- `cargo check --no-default-features`: clean
- `cargo test --no-default-features`: 27/27 pass (2 new tokio tests for closed-port behavior and timeout clamping)
- `## Backend Health Probing` section
Part of the agent roadmap.