Skip to content

feat: add deep health and readiness checks (AURA-RM-002)#117

Open
brandon-shelton-mezmo wants to merge 6 commits into
mainfrom
feature/AURA-RM-002-deep-health-checks
Open

feat: add deep health and readiness checks (AURA-RM-002)#117
brandon-shelton-mezmo wants to merge 6 commits into
mainfrom
feature/AURA-RM-002-deep-health-checks

Conversation

@brandon-shelton-mezmo
Copy link
Copy Markdown

Summary

  • Add /health/live (liveness) and /health/ready (readiness) endpoints with per-subsystem connectivity verification for LLM providers, MCP servers, and vector stores
  • Lightweight probes: HTTP model-listing for LLM, GET for MCP, REST health for Qdrant, AWS credential resolution for Bedrock
  • Cached results with configurable TTL (HEALTH_CHECK_CACHE_TTL_SECS, default 10s)
  • Per-probe timeout (HEALTH_CHECK_TIMEOUT_SECS, default 5s)
  • Prometheus metrics: aura_health_ready gauge, per-subsystem aura_health_subsystem_status gauges, aura_health_check_duration_seconds histogram
  • Sanitized error messages in health responses (no internal IPs/hostnames)
  • Shutdown guard exemption for /health/* paths
  • Existing /health endpoint unchanged (backward compatible)

Note: Includes AURA-RM-008 (error taxonomy + prometheus metrics) commits as prerequisites. Those are also in PR #80.

Test plan

  • cargo fmt --check passes
  • cargo clippy --workspace --all-targets -- -D warnings passes
  • cargo test --workspace --lib — 48 tests pass (includes 17 new health check tests)
  • Verify /health/live returns 200 {"status": "alive"}
  • Verify /health/ready returns 200 with per-subsystem JSON when healthy, 503 when unhealthy
  • Verify existing /health endpoint unchanged
  • Verify Docker Compose healthcheck still works
  • Jenkins integration tests pass

🤖 Generated with Claude Code

Establish spec-kit directory structure with constitution, spec
templates (product, architecture, quality), and 11 roadmap items
(AURA-RM-001 through AURA-RM-011) covering production readiness
gaps from metrics to incident response. Define the 7-phase
development lifecycle with multi-agent review gates.

Ref: LOG-00000
Signed-off-by: brandon.shelton <brandon.shelton@mezmo.com>
AURA-RM-008: Structured Error Taxonomy
- ErrorCategory enum with 13 runtime error categories
- AuraError struct separating internal messages (logs) from
  sanitized client-facing messages (API responses)
- ErrorDetail::classified() and validation() constructors
- From impls for DetectedToolError and StreamTermination
- All 6 error construction sites updated with taxonomy codes
- Existing error_type values frozen (Article I compliance)
- 13 unit tests

AURA-RM-001: Prometheus Metrics Endpoint
- /metrics endpoint with Prometheus text exposition format
- aura_http_request_duration_seconds histogram (25ms-300s)
- aura_llm_tokens_total counter (by type/provider/agent)
- aura_mcp_tool_duration_seconds histogram (10ms-60s)
- aura_errors_total counter (by taxonomy category)
- aura_http_requests_in_flight gauge
- aura_mcp_server_connected gauge
- AURA_METRICS_ENABLED kill switch
- /health and /metrics exempt from shutdown_guard
- Cardinality guards (100 tool cap, 64-char name limit)
- 9 integration tests, 2 unit tests
- Starter Grafana dashboard (examples/grafana/)

Ref: LOG-00000
Signed-off-by: brandon.shelton <brandon.shelton@mezmo.com>
@brandon-shelton-mezmo brandon-shelton-mezmo requested a review from a team May 12, 2026 16:19
The metrics_test.rs used "model": "Test Assistant" (agent name) but
model lookup matches against alias when set. The test config has
alias = "test-assistant", so all three chat-dependent tests got 404.

Ref: LOG-00000
Signed-off-by: brandon-shelton-mezmo <brandon.shelton@mezmo.com>
Add /health/live (liveness) and /health/ready (readiness) endpoints with
per-subsystem connectivity verification for LLM providers, MCP servers,
and vector stores.

- Lightweight probes: HTTP model-listing for LLM, GET for MCP, REST
  health for Qdrant, AWS credential resolution for Bedrock
- Cached results with configurable TTL (HEALTH_CHECK_CACHE_TTL_SECS)
- Per-probe timeout (HEALTH_CHECK_TIMEOUT_SECS)
- Prometheus metrics: aura_health_ready gauge, per-subsystem gauges,
  check duration histogram
- Sanitized error messages (no internal IPs/hostnames)
- Shutdown guard exemption for /health/* paths
- Existing /health endpoint unchanged (backward compatible)
- Full spec-driven development: product, architecture, quality specs
  with multi-agent review (43 findings remediated)

Ref: LOG-00000
Signed-off-by: brandon-shelton-mezmo <brandon.shelton@mezmo.com>
@brandon-shelton-mezmo brandon-shelton-mezmo force-pushed the feature/AURA-RM-002-deep-health-checks branch from a1c1e9f to bc0d38d Compare May 12, 2026 16:22
promptless Bot added a commit that referenced this pull request May 12, 2026
Document the new /health/live and /health/ready endpoints from AURA-RM-002:
- Liveness and readiness probe endpoints with status codes
- Configuration variables for cache TTL and probe timeout
- Response format with per-subsystem status breakdown
- Subsystem probe methods for LLM, MCP, and vector stores
- Sanitized error message categories
- Health check Prometheus metrics
- Kubernetes probe configuration example

Ref: #117
@promptless
Copy link
Copy Markdown

promptless Bot commented May 12, 2026

Promptless prepared a documentation update related to this change.

Triggered by PR #117

Added docs/observability.md documenting the new /health/live and /health/ready endpoints, environment variable configuration, and Prometheus health metrics.

Review: Add observability docs

@brandon-shelton-mezmo brandon-shelton-mezmo force-pushed the feature/AURA-RM-002-deep-health-checks branch 2 times, most recently from af629ea to afbd618 Compare May 12, 2026 16:37
let duration_ms = state
.tool_start_times
.remove(&tool_result.id)
.map(|start| start.elapsed().as_millis() as u64)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unneeded cast to u64, we cast to f64 below.

Comment on lines +776 to +780
let tool_name_for_metrics = state
.tool_call_map
.get(&tool_result.id)
.map(|(name, _)| name.clone())
.unwrap_or_else(|| "unknown".to_string());
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can avoid cloning here to save an allocation:

    let tool_name_for_metrics = state
        .tool_call_map
        .get(&tool_result.id)
        .map(|(name, _)| &**name)
        .unwrap_or_else(|| "unknown");

Comment thread crates/aura-web-server/src/metrics.rs Outdated
const MAX_TOOL_NAME_LEN: usize = 64;

/// Record MCP tool call duration with cardinality guards on the tool label.
pub fn record_tool_duration(server: &str, tool: &str, status: &str, duration_secs: f64) {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not take in an std::time::Duration and convert internally to what you want to display? That way we can leverage the type system to avoid someone misreading the parameter name and send you seconds or milliseconds or nanoseconds.

Comment thread crates/aura-web-server/src/metrics.rs Outdated

/// Record MCP tool call duration with cardinality guards on the tool label.
pub fn record_tool_duration(server: &str, tool: &str, status: &str, duration_secs: f64) {
let tool_label = if tool.len() > MAX_TOOL_NAME_LEN {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

.len() on a string returns the amount of bytes, which is different from the character length (i.e. the width on screen) you're testing here. And even that isn't 100% if we consider graphemes.

It probably doesn't matter in this code base, where we only have tools with ASCII names, but ideally we would do something like tool.chars().count() to check the UTF-8 character count.

But wouldn't it be better to truncate?

Comment thread crates/aura-web-server/src/metrics.rs Outdated
let tool_label = if tool.len() > MAX_TOOL_NAME_LEN {
"_other"
} else {
let mut known = KNOWN_TOOLS.lock().unwrap_or_else(|e| e.into_inner());
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This requires some more thought.

When a lock is poisoned it means that somewhere in the codebase a lock was acquired and panicked before the lock was properly dropped.

Which means that this moment we cannot make any assumptions of the contents of the lock.

  1. While the std library does it's very best to ensure that even in the case of a panic the internal state of the HashMap remains consistent, it's worth thinking about whether we want to rely on that mechanism, and we probably should document our thinking.

  2. Equally if there are multiple code-paths accessing KNOWN_TOOLS, that do multiple operations while holding a single lock, we don't know whether these changes have been atomically committed. If we do a remove and an add within the same lock, it might be that the remove was called, but the add was not.

While the HashMap is still in a valid state from Rust's point of view, it now violates the invariants we have placed on top of it.

For 1, this is something we don't really control. The only way I can see corruption there is when the thread would forcefully get shot down, which... shouldn't happen.

For 2, If we are able to validate that the usages of KNOWN_TOOLS, and the taking of the lock are atomically, or we don't care about the overall state of the HashMap itself, we can do into_inner(). However, in the case that it is poisoned (which we know based on the lock() result we probably want to clear the the poison with clear_poison.

On the other hand, if we're not able (or don't wish) to continue with a state we can't trust, we can always re-init the HashMap, losing our state.

In either case, I'd love to see an ERROR or WARN log when the lock is poisoned, because when that happens, we probably have an issue somewhere else.

// Try to unescape JSON-quoted strings for error detection
let result_text = serde_json::from_str::<String>(&tool_result.result)
.unwrap_or_else(|_| tool_result.result.clone());
let tool_name = tool_name_for_metrics.clone();
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We probably don't need the clone here.

Comment thread crates/aura-web-server/src/health.rs Outdated
McpServerConfig::Stdio { .. } => (
"stdio",
McpHealthResult {
transport: "stdio".to_string(),
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

.to_owned() to get an owned instance of a &str.

Comment thread crates/aura-web-server/src/metrics.rs Outdated
}

/// Track unique tool names globally to enforce cardinality cap.
static KNOWN_TOOLS: LazyLock<Mutex<HashSet<String>>> = LazyLock::new(|| Mutex::new(HashSet::new()));
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These should be moved to the top.

Comment thread crates/aura-web-server/src/main.rs Outdated
/// exceed this timeout are marked as unhealthy. Must be shorter than K8s
/// probe timeoutSeconds (recommend K8s timeout >= probe timeout + 5s).
#[arg(long, env = "HEALTH_CHECK_TIMEOUT_SECS", default_value = "5")]
health_check_timeout_secs: u64,
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have an opportunity here to have health_check_timeout as Duration, then internally we don't have to convert it. We can still name env variable as ..._SECS and convert it internally.

Same for the long option.

This way we use the type system to pass on a Duration, and no-one internally needs to worry about whether it's seconds, ...

A u64 has no provenance of whether it came from seconds or milliseconds or ...

Story about a couple of dollars lost because 2 teams used different units and stored them in a type that didn't carry that information:

An investigation attributed the failure to a measurement mismatch between two measurement systems: SI units (metric) by NASA and US customary units by spacecraft builder Lockheed Martin.

https://en.wikipedia.org/wiki/Mars_Climate_Orbiter

@kristof-mattei
Copy link
Copy Markdown
Collaborator

General remark: many places where to_string() should be replaced with to_owned().

Comment thread crates/aura-web-server/src/main.rs Outdated
Comment on lines 104 to 105
@@ -105,6 +105,16 @@ struct Args {
health_check_timeout_secs: u64,
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#[arg(
    env = "HEALTH_CHECK_TIMEOUT_SECS",
    default_value = "5",
    long = "health_check-timeout-secs",
    help = "...",
    value_parser = parse_duration_from_secs
)]
pub health_check_timeout: Duration,
...


fn parse_duration_from_secs(value: &str) -> Result<Duration, String> {
    let seconds = value
        .parse()
        .map_err(|error| format!("Could not parse `{}`: {}", value, error))?;

    Ok(Duration::from_secs(seconds))
}

Using this you leverage clap to do the conversion for you, and no-where in accessible code do we have access to the broadly typed version.

- metrics: Accept Duration instead of f64 for
  record_tool_duration
- metrics: Move KNOWN_TOOLS static to top of file
- metrics: Use chars().count() for UTF-8 safe length
- metrics: Log warning and clear poison on lock recovery
- handlers: Remove unnecessary u64 cast, avoid clone
- health: Replace to_string() with to_owned()
- main: Convert health check args to Duration

Ref: LOG-00000
Signed-off-by: brandon-shelton-mezmo <brandon.shelton@mezmo.com>
Use clap value_parser to parse CLI args directly into
Duration at the boundary. The raw u64 is never stored
as a field, preventing unit mismatch bugs.

Ref: LOG-00000
Signed-off-by: brandon-shelton-mezmo <brandon.shelton@mezmo.com>
@brandon-shelton-mezmo brandon-shelton-mezmo force-pushed the feature/AURA-RM-002-deep-health-checks branch from 36c1bcc to 22e974d Compare May 12, 2026 19:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants