feat: add deep health and readiness checks (AURA-RM-002)#117
feat: add deep health and readiness checks (AURA-RM-002)#117brandon-shelton-mezmo wants to merge 6 commits into
Conversation
Establish spec-kit directory structure with constitution, spec templates (product, architecture, quality), and 11 roadmap items (AURA-RM-001 through AURA-RM-011) covering production readiness gaps from metrics to incident response. Define the 7-phase development lifecycle with multi-agent review gates. Ref: LOG-00000 Signed-off-by: brandon.shelton <brandon.shelton@mezmo.com>
AURA-RM-008: Structured Error Taxonomy - ErrorCategory enum with 13 runtime error categories - AuraError struct separating internal messages (logs) from sanitized client-facing messages (API responses) - ErrorDetail::classified() and validation() constructors - From impls for DetectedToolError and StreamTermination - All 6 error construction sites updated with taxonomy codes - Existing error_type values frozen (Article I compliance) - 13 unit tests AURA-RM-001: Prometheus Metrics Endpoint - /metrics endpoint with Prometheus text exposition format - aura_http_request_duration_seconds histogram (25ms-300s) - aura_llm_tokens_total counter (by type/provider/agent) - aura_mcp_tool_duration_seconds histogram (10ms-60s) - aura_errors_total counter (by taxonomy category) - aura_http_requests_in_flight gauge - aura_mcp_server_connected gauge - AURA_METRICS_ENABLED kill switch - /health and /metrics exempt from shutdown_guard - Cardinality guards (100 tool cap, 64-char name limit) - 9 integration tests, 2 unit tests - Starter Grafana dashboard (examples/grafana/) Ref: LOG-00000 Signed-off-by: brandon.shelton <brandon.shelton@mezmo.com>
The metrics_test.rs used "model": "Test Assistant" (agent name) but model lookup matches against alias when set. The test config has alias = "test-assistant", so all three chat-dependent tests got 404. Ref: LOG-00000 Signed-off-by: brandon-shelton-mezmo <brandon.shelton@mezmo.com>
Add /health/live (liveness) and /health/ready (readiness) endpoints with per-subsystem connectivity verification for LLM providers, MCP servers, and vector stores. - Lightweight probes: HTTP model-listing for LLM, GET for MCP, REST health for Qdrant, AWS credential resolution for Bedrock - Cached results with configurable TTL (HEALTH_CHECK_CACHE_TTL_SECS) - Per-probe timeout (HEALTH_CHECK_TIMEOUT_SECS) - Prometheus metrics: aura_health_ready gauge, per-subsystem gauges, check duration histogram - Sanitized error messages (no internal IPs/hostnames) - Shutdown guard exemption for /health/* paths - Existing /health endpoint unchanged (backward compatible) - Full spec-driven development: product, architecture, quality specs with multi-agent review (43 findings remediated) Ref: LOG-00000 Signed-off-by: brandon-shelton-mezmo <brandon.shelton@mezmo.com>
a1c1e9f to
bc0d38d
Compare
Document the new /health/live and /health/ready endpoints from AURA-RM-002: - Liveness and readiness probe endpoints with status codes - Configuration variables for cache TTL and probe timeout - Response format with per-subsystem status breakdown - Subsystem probe methods for LLM, MCP, and vector stores - Sanitized error message categories - Health check Prometheus metrics - Kubernetes probe configuration example Ref: #117
|
Promptless prepared a documentation update related to this change. Triggered by PR #117 Added Review: Add observability docs |
af629ea to
afbd618
Compare
| let duration_ms = state | ||
| .tool_start_times | ||
| .remove(&tool_result.id) | ||
| .map(|start| start.elapsed().as_millis() as u64) |
There was a problem hiding this comment.
Unneeded cast to u64, we cast to f64 below.
| let tool_name_for_metrics = state | ||
| .tool_call_map | ||
| .get(&tool_result.id) | ||
| .map(|(name, _)| name.clone()) | ||
| .unwrap_or_else(|| "unknown".to_string()); |
There was a problem hiding this comment.
We can avoid cloning here to save an allocation:
let tool_name_for_metrics = state
.tool_call_map
.get(&tool_result.id)
.map(|(name, _)| &**name)
.unwrap_or_else(|| "unknown");
| const MAX_TOOL_NAME_LEN: usize = 64; | ||
|
|
||
| /// Record MCP tool call duration with cardinality guards on the tool label. | ||
| pub fn record_tool_duration(server: &str, tool: &str, status: &str, duration_secs: f64) { |
There was a problem hiding this comment.
Why not take in an std::time::Duration and convert internally to what you want to display? That way we can leverage the type system to avoid someone misreading the parameter name and send you seconds or milliseconds or nanoseconds.
|
|
||
| /// Record MCP tool call duration with cardinality guards on the tool label. | ||
| pub fn record_tool_duration(server: &str, tool: &str, status: &str, duration_secs: f64) { | ||
| let tool_label = if tool.len() > MAX_TOOL_NAME_LEN { |
There was a problem hiding this comment.
.len() on a string returns the amount of bytes, which is different from the character length (i.e. the width on screen) you're testing here. And even that isn't 100% if we consider graphemes.
It probably doesn't matter in this code base, where we only have tools with ASCII names, but ideally we would do something like tool.chars().count() to check the UTF-8 character count.
But wouldn't it be better to truncate?
| let tool_label = if tool.len() > MAX_TOOL_NAME_LEN { | ||
| "_other" | ||
| } else { | ||
| let mut known = KNOWN_TOOLS.lock().unwrap_or_else(|e| e.into_inner()); |
There was a problem hiding this comment.
This requires some more thought.
When a lock is poisoned it means that somewhere in the codebase a lock was acquired and panicked before the lock was properly dropped.
Which means that this moment we cannot make any assumptions of the contents of the lock.
-
While the
stdlibrary does it's very best to ensure that even in the case of a panic the internal state of the HashMap remains consistent, it's worth thinking about whether we want to rely on that mechanism, and we probably should document our thinking. -
Equally if there are multiple code-paths accessing
KNOWN_TOOLS, that do multiple operations while holding a single lock, we don't know whether these changes have been atomically committed. If we do a remove and an add within the same lock, it might be that the remove was called, but the add was not.
While the HashMap is still in a valid state from Rust's point of view, it now violates the invariants we have placed on top of it.
For 1, this is something we don't really control. The only way I can see corruption there is when the thread would forcefully get shot down, which... shouldn't happen.
For 2, If we are able to validate that the usages of KNOWN_TOOLS, and the taking of the lock are atomically, or we don't care about the overall state of the HashMap itself, we can do into_inner(). However, in the case that it is poisoned (which we know based on the lock() result we probably want to clear the the poison with clear_poison.
On the other hand, if we're not able (or don't wish) to continue with a state we can't trust, we can always re-init the HashMap, losing our state.
In either case, I'd love to see an ERROR or WARN log when the lock is poisoned, because when that happens, we probably have an issue somewhere else.
| // Try to unescape JSON-quoted strings for error detection | ||
| let result_text = serde_json::from_str::<String>(&tool_result.result) | ||
| .unwrap_or_else(|_| tool_result.result.clone()); | ||
| let tool_name = tool_name_for_metrics.clone(); |
There was a problem hiding this comment.
We probably don't need the clone here.
| McpServerConfig::Stdio { .. } => ( | ||
| "stdio", | ||
| McpHealthResult { | ||
| transport: "stdio".to_string(), |
There was a problem hiding this comment.
.to_owned() to get an owned instance of a &str.
| } | ||
|
|
||
| /// Track unique tool names globally to enforce cardinality cap. | ||
| static KNOWN_TOOLS: LazyLock<Mutex<HashSet<String>>> = LazyLock::new(|| Mutex::new(HashSet::new())); |
There was a problem hiding this comment.
These should be moved to the top.
| /// exceed this timeout are marked as unhealthy. Must be shorter than K8s | ||
| /// probe timeoutSeconds (recommend K8s timeout >= probe timeout + 5s). | ||
| #[arg(long, env = "HEALTH_CHECK_TIMEOUT_SECS", default_value = "5")] | ||
| health_check_timeout_secs: u64, |
There was a problem hiding this comment.
We have an opportunity here to have health_check_timeout as Duration, then internally we don't have to convert it. We can still name env variable as ..._SECS and convert it internally.
Same for the long option.
This way we use the type system to pass on a Duration, and no-one internally needs to worry about whether it's seconds, ...
A u64 has no provenance of whether it came from seconds or milliseconds or ...
Story about a couple of dollars lost because 2 teams used different units and stored them in a type that didn't carry that information:
An investigation attributed the failure to a measurement mismatch between two measurement systems: SI units (metric) by NASA and US customary units by spacecraft builder Lockheed Martin.
|
General remark: many places where |
| @@ -105,6 +105,16 @@ struct Args { | |||
| health_check_timeout_secs: u64, | |||
There was a problem hiding this comment.
#[arg(
env = "HEALTH_CHECK_TIMEOUT_SECS",
default_value = "5",
long = "health_check-timeout-secs",
help = "...",
value_parser = parse_duration_from_secs
)]
pub health_check_timeout: Duration,
...
fn parse_duration_from_secs(value: &str) -> Result<Duration, String> {
let seconds = value
.parse()
.map_err(|error| format!("Could not parse `{}`: {}", value, error))?;
Ok(Duration::from_secs(seconds))
}
Using this you leverage clap to do the conversion for you, and no-where in accessible code do we have access to the broadly typed version.
- metrics: Accept Duration instead of f64 for record_tool_duration - metrics: Move KNOWN_TOOLS static to top of file - metrics: Use chars().count() for UTF-8 safe length - metrics: Log warning and clear poison on lock recovery - handlers: Remove unnecessary u64 cast, avoid clone - health: Replace to_string() with to_owned() - main: Convert health check args to Duration Ref: LOG-00000 Signed-off-by: brandon-shelton-mezmo <brandon.shelton@mezmo.com>
Use clap value_parser to parse CLI args directly into Duration at the boundary. The raw u64 is never stored as a field, preventing unit mismatch bugs. Ref: LOG-00000 Signed-off-by: brandon-shelton-mezmo <brandon.shelton@mezmo.com>
36c1bcc to
22e974d
Compare
Summary
/health/live(liveness) and/health/ready(readiness) endpoints with per-subsystem connectivity verification for LLM providers, MCP servers, and vector storesHEALTH_CHECK_CACHE_TTL_SECS, default 10s)HEALTH_CHECK_TIMEOUT_SECS, default 5s)aura_health_readygauge, per-subsystemaura_health_subsystem_statusgauges,aura_health_check_duration_secondshistogram/health/*paths/healthendpoint unchanged (backward compatible)Note: Includes AURA-RM-008 (error taxonomy + prometheus metrics) commits as prerequisites. Those are also in PR #80.
Test plan
cargo fmt --checkpassescargo clippy --workspace --all-targets -- -D warningspassescargo test --workspace --lib— 48 tests pass (includes 17 new health check tests)/health/livereturns200 {"status": "alive"}/health/readyreturns200with per-subsystem JSON when healthy,503when unhealthy/healthendpoint unchanged🤖 Generated with Claude Code