feat: add deep health and readiness checks (AURA-RM-002) by brandon-shelton-mezmo · Pull Request #117 · mezmo/aura

brandon-shelton-mezmo · 2026-05-12T16:19:27Z

Summary

Add /health/live (liveness) and /health/ready (readiness) endpoints with per-subsystem connectivity verification for LLM providers, MCP servers, and vector stores
Lightweight probes: HTTP model-listing for LLM, GET for MCP, REST health for Qdrant, AWS credential resolution for Bedrock
Cached results with configurable TTL (HEALTH_CHECK_CACHE_TTL_SECS, default 10s)
Per-probe timeout (HEALTH_CHECK_TIMEOUT_SECS, default 5s)
Prometheus metrics: aura_health_ready gauge, per-subsystem aura_health_subsystem_status gauges, aura_health_check_duration_seconds histogram
Sanitized error messages in health responses (no internal IPs/hostnames)
Shutdown guard exemption for /health/* paths
Existing /health endpoint unchanged (backward compatible)

Note: Includes AURA-RM-008 (error taxonomy + prometheus metrics) commits as prerequisites. Those are also in PR #80.

Test plan

cargo fmt --check passes
cargo clippy --workspace --all-targets -- -D warnings passes
cargo test --workspace --lib — 48 tests pass (includes 17 new health check tests)
Verify /health/live returns 200 {"status": "alive"}
Verify /health/ready returns 200 with per-subsystem JSON when healthy, 503 when unhealthy
Verify existing /health endpoint unchanged
Verify Docker Compose healthcheck still works
Jenkins integration tests pass

🤖 Generated with Claude Code

Establish spec-kit directory structure with constitution, spec templates (product, architecture, quality), and 11 roadmap items (AURA-RM-001 through AURA-RM-011) covering production readiness gaps from metrics to incident response. Define the 7-phase development lifecycle with multi-agent review gates. Ref: LOG-00000 Signed-off-by: brandon.shelton <brandon.shelton@mezmo.com>

AURA-RM-008: Structured Error Taxonomy - ErrorCategory enum with 13 runtime error categories - AuraError struct separating internal messages (logs) from sanitized client-facing messages (API responses) - ErrorDetail::classified() and validation() constructors - From impls for DetectedToolError and StreamTermination - All 6 error construction sites updated with taxonomy codes - Existing error_type values frozen (Article I compliance) - 13 unit tests AURA-RM-001: Prometheus Metrics Endpoint - /metrics endpoint with Prometheus text exposition format - aura_http_request_duration_seconds histogram (25ms-300s) - aura_llm_tokens_total counter (by type/provider/agent) - aura_mcp_tool_duration_seconds histogram (10ms-60s) - aura_errors_total counter (by taxonomy category) - aura_http_requests_in_flight gauge - aura_mcp_server_connected gauge - AURA_METRICS_ENABLED kill switch - /health and /metrics exempt from shutdown_guard - Cardinality guards (100 tool cap, 64-char name limit) - 9 integration tests, 2 unit tests - Starter Grafana dashboard (examples/grafana/) Ref: LOG-00000 Signed-off-by: brandon.shelton <brandon.shelton@mezmo.com>

The metrics_test.rs used "model": "Test Assistant" (agent name) but model lookup matches against alias when set. The test config has alias = "test-assistant", so all three chat-dependent tests got 404. Ref: LOG-00000 Signed-off-by: brandon-shelton-mezmo <brandon.shelton@mezmo.com>

Add /health/live (liveness) and /health/ready (readiness) endpoints with per-subsystem connectivity verification for LLM providers, MCP servers, and vector stores. - Lightweight probes: HTTP model-listing for LLM, GET for MCP, REST health for Qdrant, AWS credential resolution for Bedrock - Cached results with configurable TTL (HEALTH_CHECK_CACHE_TTL_SECS) - Per-probe timeout (HEALTH_CHECK_TIMEOUT_SECS) - Prometheus metrics: aura_health_ready gauge, per-subsystem gauges, check duration histogram - Sanitized error messages (no internal IPs/hostnames) - Shutdown guard exemption for /health/* paths - Existing /health endpoint unchanged (backward compatible) - Full spec-driven development: product, architecture, quality specs with multi-agent review (43 findings remediated) Ref: LOG-00000 Signed-off-by: brandon-shelton-mezmo <brandon.shelton@mezmo.com>

Document the new /health/live and /health/ready endpoints from AURA-RM-002: - Liveness and readiness probe endpoints with status codes - Configuration variables for cache TTL and probe timeout - Response format with per-subsystem status breakdown - Subsystem probe methods for LLM, MCP, and vector stores - Sanitized error message categories - Health check Prometheus metrics - Kubernetes probe configuration example Ref: #117

promptless · 2026-05-12T16:26:06Z

Promptless prepared a documentation update related to this change.

Triggered by PR #117

Added docs/observability.md documenting the new /health/live and /health/ready endpoints, environment variable configuration, and Prometheus health metrics.

Review: Add observability docs

kristof-mattei · 2026-05-12T16:58:37Z

+    let duration_ms = state
+        .tool_start_times
+        .remove(&tool_result.id)
+        .map(|start| start.elapsed().as_millis() as u64)


Unneeded cast to u64, we cast to f64 below.

kristof-mattei · 2026-05-12T17:12:06Z

+    let tool_name_for_metrics = state
+        .tool_call_map
+        .get(&tool_result.id)
+        .map(|(name, _)| name.clone())
+        .unwrap_or_else(|| "unknown".to_string());


We can avoid cloning here to save an allocation:

let tool_name_for_metrics = state .tool_call_map .get(&tool_result.id) .map(|(name, _)| &**name) .unwrap_or_else(|| "unknown");

kristof-mattei · 2026-05-12T17:15:05Z

+const MAX_TOOL_NAME_LEN: usize = 64;
+
+/// Record MCP tool call duration with cardinality guards on the tool label.
+pub fn record_tool_duration(server: &str, tool: &str, status: &str, duration_secs: f64) {


Why not take in an std::time::Duration and convert internally to what you want to display? That way we can leverage the type system to avoid someone misreading the parameter name and send you seconds or milliseconds or nanoseconds.

kristof-mattei · 2026-05-12T17:18:14Z

+
+/// Record MCP tool call duration with cardinality guards on the tool label.
+pub fn record_tool_duration(server: &str, tool: &str, status: &str, duration_secs: f64) {
+    let tool_label = if tool.len() > MAX_TOOL_NAME_LEN {


.len() on a string returns the amount of bytes, which is different from the character length (i.e. the width on screen) you're testing here. And even that isn't 100% if we consider graphemes.

It probably doesn't matter in this code base, where we only have tools with ASCII names, but ideally we would do something like tool.chars().count() to check the UTF-8 character count.

But wouldn't it be better to truncate?

kristof-mattei · 2026-05-12T17:40:34Z

+    let tool_label = if tool.len() > MAX_TOOL_NAME_LEN {
+        "_other"
+    } else {
+        let mut known = KNOWN_TOOLS.lock().unwrap_or_else(|e| e.into_inner());


This requires some more thought.

When a lock is poisoned it means that somewhere in the codebase a lock was acquired and panicked before the lock was properly dropped.

Which means that this moment we cannot make any assumptions of the contents of the lock.

While the std library does it's very best to ensure that even in the case of a panic the internal state of the HashMap remains consistent, it's worth thinking about whether we want to rely on that mechanism, and we probably should document our thinking.

Equally if there are multiple code-paths accessing KNOWN_TOOLS, that do multiple operations while holding a single lock, we don't know whether these changes have been atomically committed. If we do a remove and an add within the same lock, it might be that the remove was called, but the add was not.

While the HashMap is still in a valid state from Rust's point of view, it now violates the invariants we have placed on top of it.

For 1, this is something we don't really control. The only way I can see corruption there is when the thread would forcefully get shot down, which... shouldn't happen.

For 2, If we are able to validate that the usages of KNOWN_TOOLS, and the taking of the lock are atomically, or we don't care about the overall state of the HashMap itself, we can do into_inner(). However, in the case that it is poisoned (which we know based on the lock() result we probably want to clear the the poison with clear_poison.

On the other hand, if we're not able (or don't wish) to continue with a state we can't trust, we can always re-init the HashMap, losing our state.

In either case, I'd love to see an ERROR or WARN log when the lock is poisoned, because when that happens, we probably have an issue somewhere else.

kristof-mattei · 2026-05-12T17:45:01Z

-        // Try to unescape JSON-quoted strings for error detection
-        let result_text = serde_json::from_str::<String>(&tool_result.result)
-            .unwrap_or_else(|_| tool_result.result.clone());
+        let tool_name = tool_name_for_metrics.clone();


We probably don't need the clone here.

kristof-mattei · 2026-05-12T17:48:30Z

+        McpServerConfig::Stdio { .. } => (
+            "stdio",
+            McpHealthResult {
+                transport: "stdio".to_string(),


.to_owned() to get an owned instance of a &str.

kristof-mattei · 2026-05-12T17:50:26Z

+}
+
+/// Track unique tool names globally to enforce cardinality cap.
+static KNOWN_TOOLS: LazyLock<Mutex<HashSet<String>>> = LazyLock::new(|| Mutex::new(HashSet::new()));


These should be moved to the top.

kristof-mattei · 2026-05-12T17:56:40Z

+    /// exceed this timeout are marked as unhealthy. Must be shorter than K8s
+    /// probe timeoutSeconds (recommend K8s timeout >= probe timeout + 5s).
+    #[arg(long, env = "HEALTH_CHECK_TIMEOUT_SECS", default_value = "5")]
+    health_check_timeout_secs: u64,


We have an opportunity here to have health_check_timeout as Duration, then internally we don't have to convert it. We can still name env variable as ..._SECS and convert it internally.

Same for the long option.

This way we use the type system to pass on a Duration, and no-one internally needs to worry about whether it's seconds, ...

A u64 has no provenance of whether it came from seconds or milliseconds or ...

Story about a couple of dollars lost because 2 teams used different units and stored them in a type that didn't carry that information:

An investigation attributed the failure to a measurement mismatch between two measurement systems: SI units (metric) by NASA and US customary units by spacecraft builder Lockheed Martin.

https://en.wikipedia.org/wiki/Mars_Climate_Orbiter

kristof-mattei · 2026-05-12T17:58:16Z

General remark: many places where to_string() should be replaced with to_owned().

kristof-mattei · 2026-05-12T19:13:37Z

@@ -105,6 +105,16 @@ struct Args {
    health_check_timeout_secs: u64,


#[arg( env = "HEALTH_CHECK_TIMEOUT_SECS", default_value = "5", long = "health_check-timeout-secs", help = "...", value_parser = parse_duration_from_secs )] pub health_check_timeout: Duration, ... fn parse_duration_from_secs(value: &str) -> Result<Duration, String> { let seconds = value .parse() .map_err(|error| format!("Could not parse `{}`: {}", value, error))?; Ok(Duration::from_secs(seconds)) }

Using this you leverage clap to do the conversion for you, and no-where in accessible code do we have access to the broadly typed version.

- metrics: Accept Duration instead of f64 for record_tool_duration - metrics: Move KNOWN_TOOLS static to top of file - metrics: Use chars().count() for UTF-8 safe length - metrics: Log warning and clear poison on lock recovery - handlers: Remove unnecessary u64 cast, avoid clone - health: Replace to_string() with to_owned() - main: Convert health check args to Duration Ref: LOG-00000 Signed-off-by: brandon-shelton-mezmo <brandon.shelton@mezmo.com>

Use clap value_parser to parse CLI args directly into Duration at the boundary. The raw u64 is never stored as a field, preventing unit mismatch bugs. Ref: LOG-00000 Signed-off-by: brandon-shelton-mezmo <brandon.shelton@mezmo.com>

brandon-shelton-mezmo added 2 commits May 12, 2026 11:18

brandon-shelton-mezmo requested a review from a team May 12, 2026 16:19

brandon-shelton-mezmo added 2 commits May 12, 2026 11:22

brandon-shelton-mezmo force-pushed the feature/AURA-RM-002-deep-health-checks branch from a1c1e9f to bc0d38d Compare May 12, 2026 16:22

brandon-shelton-mezmo force-pushed the feature/AURA-RM-002-deep-health-checks branch 2 times, most recently from af629ea to afbd618 Compare May 12, 2026 16:37

kristof-mattei reviewed May 12, 2026

View reviewed changes

brandon-shelton-mezmo mentioned this pull request May 12, 2026

feat: add structured error taxonomy and prometheus metrics (AURA-RM-008 + RM-001) #80

Closed

15 tasks

brandon-shelton-mezmo added 2 commits May 12, 2026 14:44

refactor: Parse health check args as Duration

22e974d

Use clap value_parser to parse CLI args directly into Duration at the boundary. The raw u64 is never stored as a field, preventing unit mismatch bugs. Ref: LOG-00000 Signed-off-by: brandon-shelton-mezmo <brandon.shelton@mezmo.com>

brandon-shelton-mezmo force-pushed the feature/AURA-RM-002-deep-health-checks branch from 36c1bcc to 22e974d Compare May 12, 2026 19:45

brandon-shelton-mezmo requested a review from kristof-mattei May 12, 2026 20:08

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: add deep health and readiness checks (AURA-RM-002)#117

feat: add deep health and readiness checks (AURA-RM-002)#117
brandon-shelton-mezmo wants to merge 6 commits into
mainfrom
feature/AURA-RM-002-deep-health-checks

brandon-shelton-mezmo commented May 12, 2026

Uh oh!

promptless Bot commented May 12, 2026

Uh oh!

kristof-mattei May 12, 2026

Uh oh!

kristof-mattei May 12, 2026

Uh oh!

kristof-mattei May 12, 2026

Uh oh!

kristof-mattei May 12, 2026

Uh oh!

kristof-mattei May 12, 2026

Uh oh!

kristof-mattei May 12, 2026

Uh oh!

kristof-mattei May 12, 2026

Uh oh!

kristof-mattei May 12, 2026

Uh oh!

kristof-mattei May 12, 2026

Uh oh!

kristof-mattei commented May 12, 2026

Uh oh!

kristof-mattei May 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

		@@ -105,6 +105,16 @@ struct Args {
		health_check_timeout_secs: u64,

Conversation

brandon-shelton-mezmo commented May 12, 2026

Summary

Test plan

Uh oh!

promptless Bot commented May 12, 2026

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

kristof-mattei commented May 12, 2026

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants