diff --git a/README.md b/README.md
index d8a5b0e..3ce3da4 100644
--- a/README.md
+++ b/README.md
@@ -531,10 +531,16 @@ For access issues, contact a repository administrator or organization owner.
 
 This project is licensed under the Apache License 2.0 - see the [LICENSE](./LICENSE) file for details.
 
-## Related Documentation
-
-- [Helm Chart Documentation](./charts/README.md)
-- [Contributing Guidelines](./CONTRIBUTING.md)
+## Documentation
+
+- [Architecture](https://github.com/openshift-hyperfleet/architecture) - System architecture and API documentation
+- [Metrics](docs/metrics.md) - Prometheus metric definitions and PromQL examples
+- [Alerts](docs/alerts.md) - Recommended alert rules and monitoring queries
+- [Runbook](docs/runbook.md) - Operational runbook for on-call engineers
+- [Configuration](docs/configuration.md) - Configuration reference
+- [Adapter Authoring Guide](docs/adapter-authoring-guide.md) - Guide to creating adapter task configurations
+- [Helm Chart](./charts/README.md) - Helm chart documentation
+- [Contributing Guidelines](./CONTRIBUTING.md) - Development and contribution guidelines
 
 ## Support
 
diff --git a/docs/adapter-authoring-guide.md b/docs/adapter-authoring-guide.md
index f57a392..3b9f305 100644
--- a/docs/adapter-authoring-guide.md
+++ b/docs/adapter-authoring-guide.md
@@ -1,6 +1,6 @@
 # HyperFleet Adapter Authoring Guide
 
-> A practical guide for writing adapter configurations that extend the HyperFleet cluster lifecycle platform.
+> **Audience:** Developers building adapter configurations for HyperFleet cluster lifecycle tasks.
 
 ---
 
@@ -1155,7 +1155,7 @@ clients:
 
 More information about deployment can be found in [Architecture repository - HyperFleet Adapter Framework - Deployment Guide](https://github.com/openshift-hyperfleet/architecture/blob/main/hyperfleet/components/adapter/framework/adapter-deployment.md)
 
-1. **Verify broker metrics** — the adapter automatically exposes broker metrics on the `/metrics` endpoint (port 9090). No additional configuration is needed. See [Observability](observability.md) for the full list of available metrics.
+1. **Verify broker metrics** — the adapter automatically exposes broker metrics on the `/metrics` endpoint (port 9090). No additional configuration is needed. See [Metrics](metrics.md) for the full list of available metrics.
 
 ---
 
diff --git a/docs/alerts.md b/docs/alerts.md
new file mode 100644
index 0000000..604e0d9
--- /dev/null
+++ b/docs/alerts.md
@@ -0,0 +1,118 @@
+# HyperFleet Adapter Alerts
+
+> **Audience:** SREs setting up monitoring and alerting for the adapter.
+
+This document provides recommended alerting rules and monitoring queries for the hyperfleet-adapter.
+
+For the canonical list of all metrics, labels, and descriptions, see [metrics.md](metrics.md). Metrics are served on port **9090** at `/metrics`. For health endpoint documentation, see [runbook.md#health-checks](runbook.md#health-checks).
+
+---
+
+## Recommended Alerts
+
+### Adapter Down
+
+```yaml
+alert: HyperFleetAdapterDown
+expr: >
+  hyperfleet_adapter_up == 0
+  or
+  absent(hyperfleet_adapter_up{component="hyperfleet-adapter"})
+for: 1m
+labels:
+  severity: critical
+annotations:
+  summary: "HyperFleet Adapter is down"
+  description: "Adapter {{ $labels.component }} has been down for more than 1 minute."
+```
+
+> **Note:** `hyperfleet_adapter_up` is explicitly set to 0 only during graceful shutdown. On crash (OOM, panic, node failure), the metric goes stale rather than becoming 0. The `absent()` clause covers this case. It will also fire if the metric has never been scraped (e.g., fresh Prometheus deployment) — expect initial noise until the adapter registers.
+
+### High Event Failure Rate
+
+```yaml
+alert: HyperFleetAdapterHighFailureRate
+expr: |
+  sum by (component, version) (rate(hyperfleet_adapter_events_processed_total{status="failed"}[5m]))
+  /
+  sum by (component, version) (rate(hyperfleet_adapter_events_processed_total[5m]))
+  > 0.1
+for: 5m
+labels:
+  severity: warning
+annotations:
+  summary: "High event failure rate"
+  description: "More than 10% of events are failing for {{ $labels.component }}."
+```
+
+### No Events Processed (Dead Man's Switch)
+
+```yaml
+alert: HyperFleetAdapterNoEventsProcessed
+expr: |
+  (
+    sum by (component, version) (rate(hyperfleet_adapter_events_processed_total[15m])) == 0
+    and on(component, version) hyperfleet_adapter_up == 1
+  )
+  or
+  (
+    hyperfleet_adapter_up == 1
+    unless on(component, version) hyperfleet_adapter_events_processed_total
+  )
+for: 5m
+labels:
+  severity: warning
+annotations:
+  summary: "No events processed"
+  description: "Adapter {{ $labels.component }} has not processed any events in ~20 minutes."
+```
+
+> **Timing:** `rate(...[15m])` takes ~15 minutes to reach zero after the last event, plus the `for: 5m` pending period. Total delay before firing is ~20 minutes. The `unless` clause handles fresh deployments where the counter has never been incremented — it fires when the adapter is up but no events metric exists yet.
+
+### Slow Event Processing
+
+```yaml
+alert: HyperFleetAdapterSlowProcessing
+expr: |
+  histogram_quantile(0.95,
+    sum by (component, version, le) (
+      rate(hyperfleet_adapter_event_processing_duration_seconds_bucket[5m])
+    )
+  ) > 60
+for: 5m
+labels:
+  severity: warning
+annotations:
+  summary: "Slow event processing"
+  description: "P95 event processing time exceeds 60 seconds for {{ $labels.component }}."
+```
+
+> **Note:** The `sum by (component, version, le)` aggregation merges histogram buckets across replicas before computing the quantile, giving a correct cluster-wide P95. Without this, each replica's P95 would be computed independently.
+
+### Broker Errors
+
+```yaml
+alert: HyperFleetBrokerErrors
+expr: rate(hyperfleet_broker_errors_total[5m]) > 0
+for: 5m
+labels:
+  severity: warning
+annotations:
+  summary: "Broker errors detected"
+  description: "Broker errors occurring for {{ $labels.component }}: {{ $labels.error_type }}."
+```
+
+### Rising Error Count by Type
+
+```yaml
+alert: HyperFleetAdapterErrorsRising
+expr: rate(hyperfleet_adapter_errors_total[5m]) > 0.5
+for: 5m
+labels:
+  severity: warning
+annotations:
+  summary: "Adapter errors rising"
+  description: "Error rate for {{ $labels.error_type }} exceeds 0.5/s on {{ $labels.component }}."
+```
+
+For basic metric queries and examples, see [metrics.md#example-promql-queries](metrics.md#example-promql-queries).
diff --git a/docs/configuration.md b/docs/configuration.md
index e77ab2b..e6b173a 100644
--- a/docs/configuration.md
+++ b/docs/configuration.md
@@ -1,5 +1,7 @@
 # Adapter Configuration Reference
 
+> **Audience:** Operators deploying and configuring the hyperfleet-adapter service.
+
 This document describes the deployment-level `AdapterConfig` options and how to set them in three formats: YAML, command-line flags, and environment variables.
 
diff --git a/docs/metrics.md b/docs/metrics.md
index cf6d7d1..6e5023b 100644
--- a/docs/metrics.md
+++ b/docs/metrics.md
@@ -1,173 +1,101 @@
-# HyperFleet Adapter Metrics — Alerting and Monitoring
+# HyperFleet Adapter Metrics
 
-This document provides recommended alerting rules and monitoring queries for the hyperfleet-adapter.
+> **Audience:** Developers integrating with adapter metrics and SREs building dashboards.
 
-For the canonical list of all metrics, labels, and descriptions, see [observability.md](observability.md). Metrics are served on port **9090** at `/metrics`.
+All metrics are exposed on the `/metrics` endpoint (port 9090) in Prometheus format. No additional configuration is needed.
 
----
+The Helm chart includes a **ServiceMonitor** template for automatic discovery by the [Prometheus Operator](https://github.com/prometheus-operator/prometheus-operator). It is enabled by default (`serviceMonitor.enabled: true`) and scrapes the `/metrics` endpoint every 30s with `honorLabels: true` to preserve the adapter's `component` and `version` labels. The template is only rendered when the Prometheus Operator CRDs (`monitoring.coreos.com/v1/ServiceMonitor`) are available on the cluster; otherwise it is silently skipped. See the Helm `values.yaml` for configuration options (interval, scrapeTimeout, labels, namespaceSelector).
 
-## Health Endpoints
+## Adapter Metrics
 
-Health checks are served on port **8080** (separate from metrics).
+The adapter exposes Prometheus metrics following the [HyperFleet Metrics Standard](https://github.com/openshift-hyperfleet/architecture/blob/main/hyperfleet/standards/metrics.md) with the `hyperfleet_adapter_` prefix.
 
-| Endpoint | Purpose | Healthy Response |
-|----------|---------|-----------------|
-| `/healthz` | Liveness probe | Always `200 OK` |
-| `/readyz` | Readiness probe | `200 OK` when config is loaded and broker is connected |
+All adapter metrics include `component` and `version` as constant labels.
 
-Readiness returns `503 Service Unavailable` when:
-- The adapter is shutting down
-- Config has not been loaded yet (`config` check)
-- Broker is not connected (`broker` check)
+### Baseline Metrics
 
----
+| Metric | Type | Labels | Description |
+|--------|------|--------|-------------|
+| `hyperfleet_adapter_build_info` | Gauge | `component`, `version`, `commit` | Build information (always 1) |
+| `hyperfleet_adapter_up` | Gauge | `component`, `version` | Whether the adapter is up and running (1=up, 0=shutting down) |
 
-## Recommended Alerts
+### Event Processing Metrics
 
-### Adapter Down
+| Metric | Type | Labels | Description |
+|--------|------|--------|-------------|
+| `hyperfleet_adapter_events_processed_total` | Counter | `component`, `version`, `status` | Total CloudEvents processed. Status: `success`, `failed`, `skipped` |
+| `hyperfleet_adapter_event_processing_duration_seconds` | Histogram | `component`, `version` | End-to-end event processing duration |
+| `hyperfleet_adapter_errors_total` | Counter | `component`, `version`, `error_type` | Total errors by execution phase |
 
-```yaml
-alert: HyperFleetAdapterDown
-expr: >
-  hyperfleet_adapter_up == 0
-  or
-  absent(hyperfleet_adapter_up{component="hyperfleet-adapter"})
-for: 1m
-labels:
-  severity: critical
-annotations:
-  summary: "HyperFleet Adapter is down"
-  description: "Adapter {{ $labels.component }} has been down for more than 1 minute."
-```
+#### Status Values
 
-> **Note:** `hyperfleet_adapter_up` is explicitly set to 0 only during graceful shutdown. On crash (OOM, panic, node failure), the metric goes stale rather than becoming 0. The `absent()` clause covers this case. It will also fire if the metric has never been scraped (e.g., fresh Prometheus deployment) — expect initial noise until the adapter registers.
+| Status | Description |
+|--------|-------------|
+| `success` | Event processed successfully with resources applied |
+| `skipped` | Event processed successfully but resources skipped (preconditions not met) |
+| `failed` | Event processing failed due to an error |
 
-### High Event Failure Rate
+#### Error Types
 
-```yaml
-alert: HyperFleetAdapterHighFailureRate
-expr: |
-  sum by (component, version) (rate(hyperfleet_adapter_events_processed_total{status="failed"}[5m]))
-  /
-  sum by (component, version) (rate(hyperfleet_adapter_events_processed_total[5m]))
-  > 0.1
-for: 5m
-labels:
-  severity: warning
-annotations:
-  summary: "High event failure rate"
-  description: "More than 10% of events are failing for {{ $labels.component }}."
-```
+The `error_type` label on `hyperfleet_adapter_errors_total` corresponds to the execution phase where the error occurred:
 
-### No Events Processed (Dead Man's Switch)
+| Error Type | Description |
+|------------|-------------|
+| `param_extraction` | Failed to extract parameters from the event |
+| `preconditions` | Precondition evaluation error (not the same as precondition not met) |
+| `resources` | Failed to apply Kubernetes resources |
+| `post_actions` | Failed to execute post-actions (e.g., status reporting) |
 
-```yaml
-alert: HyperFleetAdapterNoEventsProcessed
-expr: |
-  (
-    sum by (component, version) (rate(hyperfleet_adapter_events_processed_total[15m])) == 0
-    or
-    absent(hyperfleet_adapter_events_processed_total) == 1
-  )
-  and on(component, version)
-  hyperfleet_adapter_up == 1
-for: 5m
-labels:
-  severity: warning
-annotations:
-  summary: "No events processed"
-  description: "Adapter {{ $labels.component }} has not processed any events in ~20 minutes."
-```
+#### Histogram Buckets
 
-> **Timing:** `rate(...[15m])` takes ~15 minutes to reach zero after the last event, plus the `for: 5m` pending period. Total delay before firing is ~20 minutes. The `absent()` clause handles fresh deployments where the counter has never been incremented.
-
-### Slow Event Processing
-
-```yaml
-alert: HyperFleetAdapterSlowProcessing
-expr: |
-  histogram_quantile(0.95,
-    sum by (component, version, le) (
-      rate(hyperfleet_adapter_event_processing_duration_seconds_bucket[5m])
-    )
-  ) > 60
-for: 5m
-labels:
-  severity: warning
-annotations:
-  summary: "Slow event processing"
-  description: "P95 event processing time exceeds 60 seconds for {{ $labels.component }}."
-```
+The `event_processing_duration_seconds` histogram uses the following buckets (in seconds), as recommended by the [adapter metrics standard](https://github.com/openshift-hyperfleet/architecture/blob/main/hyperfleet/components/adapter/framework/adapter-metrics.md):
 
-> **Note:** The `sum by (component, version, le)` aggregation merges histogram buckets across replicas before computing the quantile, giving a correct cluster-wide P95. Without this, each replica's P95 would be computed independently.
-
-### Broker Errors
-
-```yaml
-alert: HyperFleetBrokerErrors
-expr: rate(hyperfleet_broker_errors_total[5m]) > 0
-for: 5m
-labels:
-  severity: warning
-annotations:
-  summary: "Broker errors detected"
-  description: "Broker errors occurring for {{ $labels.component }}: {{ $labels.error_type }}."
+```text
+0.1, 0.5, 1, 2, 5, 10, 30, 60, 120
 ```
 
-### Rising Error Count by Type
-
-```yaml
-alert: HyperFleetAdapterErrorsRising
-expr: rate(hyperfleet_adapter_errors_total[5m]) > 0.5
-for: 5m
-labels:
-  severity: warning
-annotations:
-  summary: "Adapter errors rising"
-  description: "Error rate for {{ $labels.error_type }} exceeds 0.5/s on {{ $labels.component }}."
-```
-
----
-
-## Example PromQL Queries
+### Example PromQL Queries
 
-### Event throughput
+Event processing success rate:
 
 ```promql
-rate(hyperfleet_adapter_events_processed_total[5m])
+(
+  sum(rate(hyperfleet_adapter_events_processed_total{status="success"}[5m]))
+  /
+  sum(rate(hyperfleet_adapter_events_processed_total[5m]))
+) * 100
 ```
 
-### Event success rate (percentage)
+p95 event processing duration:
 
 ```promql
-sum by (component, version) (rate(hyperfleet_adapter_events_processed_total{status="success"}[5m]))
-/
-sum by (component, version) (rate(hyperfleet_adapter_events_processed_total[5m]))
-* 100
+histogram_quantile(0.95,
+  sum by (component, version, le) (
+    rate(hyperfleet_adapter_event_processing_duration_seconds_bucket[5m])
+  )
+)
 ```
 
-### P50 / P95 / P99 event processing latency
+Error rate by phase:
 
 ```promql
-histogram_quantile(0.50, rate(hyperfleet_adapter_event_processing_duration_seconds_bucket[5m]))
-histogram_quantile(0.95, rate(hyperfleet_adapter_event_processing_duration_seconds_bucket[5m]))
-histogram_quantile(0.99, rate(hyperfleet_adapter_event_processing_duration_seconds_bucket[5m]))
+sum by (error_type) (rate(hyperfleet_adapter_errors_total[5m]))
 ```
 
-### Errors by type
+## Broker Metrics
 
-```promql
-sum by (error_type) (rate(hyperfleet_adapter_errors_total[5m]))
-```
+The adapter automatically registers Prometheus metrics from the [hyperfleet-broker](https://github.com/openshift-hyperfleet/hyperfleet-broker) library.
 
-### Broker message consumption rate
+### Available Metrics
 
-```promql
-rate(hyperfleet_broker_messages_consumed_total[5m])
-```
+| Metric | Type | Labels | Description |
+|--------|------|--------|-------------|
+| `hyperfleet_broker_messages_consumed_total` | Counter | `topic`, `component`, `version` | Total messages consumed from the broker |
+| `hyperfleet_broker_errors_total` | Counter | `topic`, `error_type`, `component`, `version` | Total message processing errors |
+| `hyperfleet_broker_message_duration_seconds` | Histogram | `topic`, `component`, `version` | Message processing duration |
 
-### Currently running adapter version
+These metrics use the `hyperfleet_broker_` prefix and include the adapter's `component` and `version` labels.
 
-```promql
-hyperfleet_adapter_build_info
-```
+## Alerting and Monitoring
+
+For recommended alerting rules, thresholds, and operational PromQL queries, see [alerts.md](alerts.md).
diff --git a/docs/observability.md b/docs/observability.md
deleted file mode 100644
index a49c7a9..0000000
--- a/docs/observability.md
+++ /dev/null
@@ -1,97 +0,0 @@
-# Observability
-
-All metrics are exposed on the `/metrics` endpoint (port 9090) in Prometheus format. No additional configuration is needed.
-
-The Helm chart includes a **ServiceMonitor** template for automatic discovery by the [Prometheus Operator](https://github.com/prometheus-operator/prometheus-operator). It is enabled by default (`serviceMonitor.enabled: true`) and scrapes the `/metrics` endpoint every 30s with `honorLabels: true` to preserve the adapter's `component` and `version` labels. The template is only rendered when the Prometheus Operator CRDs (`monitoring.coreos.com/v1/ServiceMonitor`) are available on the cluster; otherwise it is silently skipped. See the Helm `values.yaml` for configuration options (interval, scrapeTimeout, labels, namespaceSelector).
-
-## Adapter Metrics
-
-The adapter exposes Prometheus metrics following the [HyperFleet Metrics Standard](https://github.com/openshift-hyperfleet/architecture/blob/main/hyperfleet/standards/metrics.md) with the `hyperfleet_adapter_` prefix.
-
-All adapter metrics include `component` and `version` as constant labels.
-
-### Baseline Metrics
-
-| Metric | Type | Labels | Description |
-|--------|------|--------|-------------|
-| `hyperfleet_adapter_build_info` | Gauge | `component`, `version`, `commit` | Build information (always 1) |
-| `hyperfleet_adapter_up` | Gauge | `component`, `version` | Whether the adapter is up and running (1=up, 0=shutting down) |
-
-### Event Processing Metrics
-
-| Metric | Type | Labels | Description |
-|--------|------|--------|-------------|
-| `hyperfleet_adapter_events_processed_total` | Counter | `component`, `version`, `status` | Total CloudEvents processed. Status: `success`, `failed`, `skipped` |
-| `hyperfleet_adapter_event_processing_duration_seconds` | Histogram | `component`, `version` | End-to-end event processing duration |
-| `hyperfleet_adapter_errors_total` | Counter | `component`, `version`, `error_type` | Total errors by execution phase |
-
-#### Status Values
-
-| Status | Description |
-|--------|-------------|
-| `success` | Event processed successfully with resources applied |
-| `skipped` | Event processed successfully but resources skipped (preconditions not met) |
-| `failed` | Event processing failed due to an error |
-
-#### Error Types
-
-The `error_type` label on `hyperfleet_adapter_errors_total` corresponds to the execution phase where the error occurred:
-
-| Error Type | Description |
-|------------|-------------|
-| `param_extraction` | Failed to extract parameters from the event |
-| `preconditions` | Precondition evaluation error (not the same as precondition not met) |
-| `resources` | Failed to apply Kubernetes resources |
-| `post_actions` | Failed to execute post-actions (e.g., status reporting) |
-
-#### Histogram Buckets
-
-The `event_processing_duration_seconds` histogram uses the following buckets (in seconds), as recommended by the [adapter metrics standard](https://github.com/openshift-hyperfleet/architecture/blob/main/hyperfleet/components/adapter/framework/adapter-metrics.md):
-
-```text
-0.1, 0.5, 1, 2, 5, 10, 30, 60, 120
-```
-
-### Example PromQL Queries
-
-Event processing success rate:
-
-```promql
-(
-  sum(rate(hyperfleet_adapter_events_processed_total{status="success"}[5m]))
-  /
-  sum(rate(hyperfleet_adapter_events_processed_total[5m]))
-) * 100
-```
-
-p95 event processing duration:
-
-```promql
-histogram_quantile(0.95,
-  rate(hyperfleet_adapter_event_processing_duration_seconds_bucket[5m])
-)
-```
-
-Error rate by phase:
-
-```promql
-sum by (error_type) (rate(hyperfleet_adapter_errors_total[5m]))
-```
-
-## Broker Metrics
-
-The adapter automatically registers Prometheus metrics from the [hyperfleet-broker](https://github.com/openshift-hyperfleet/hyperfleet-broker) library.
-
-### Available Metrics
-
-| Metric | Type | Labels | Description |
-|--------|------|--------|-------------|
-| `hyperfleet_broker_messages_consumed_total` | Counter | `topic`, `component`, `version` | Total messages consumed from the broker |
-| `hyperfleet_broker_errors_total` | Counter | `topic`, `error_type`, `component`, `version` | Total message processing errors |
-| `hyperfleet_broker_message_duration_seconds` | Histogram | `topic`, `component`, `version` | Message processing duration |
-
-These metrics use the `hyperfleet_broker_` prefix and include the adapter's `component` and `version` labels.
-
-## Alerting and Monitoring
-
-For recommended alerting rules, thresholds, and operational PromQL queries, see [metrics.md](metrics.md).
diff --git a/docs/runbook.md b/docs/runbook.md
index 10fd78d..55bd77e 100644
--- a/docs/runbook.md
+++ b/docs/runbook.md
@@ -1,6 +1,6 @@
 # HyperFleet Adapter Runbook
 
-Operational runbook for on-call engineers managing the hyperfleet-adapter service.
+> **Audience:** On-call engineers managing the hyperfleet-adapter service.
 
 ---
 
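
Review note on the `sum by (component, version, le)` change to the p95 query in docs/metrics.md and docs/alerts.md: the sketch below illustrates why merging buckets before taking the quantile matters. It is a simplified Python model of bucket interpolation, not Prometheus's actual implementation. The bucket bounds come from docs/metrics.md; the two replicas (`fast`, `slow`) and the helper names `merge_buckets` and `quantile` are hypothetical and exist only for illustration.

```python
import math
from typing import Dict, List

# Bucket upper bounds listed in docs/metrics.md, plus the implicit +Inf bucket.
BUCKETS: List[float] = [0.1, 0.5, 1, 2, 5, 10, 30, 60, 120, math.inf]


def merge_buckets(replicas: List[Dict[float, float]]) -> Dict[float, float]:
    """Sum cumulative counts per upper bound, mirroring `sum by (le)`."""
    return {le: sum(replica[le] for replica in replicas) for le in BUCKETS}


def quantile(q: float, cumulative: Dict[float, float]) -> float:
    """Estimate a quantile by linear interpolation inside the target bucket."""
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = cumulative[math.inf]
    if total == 0:
        return math.nan
    rank = q * total
    lower, below = 0.0, 0.0
    for le in BUCKETS:
        count = cumulative[le]
        if count >= rank:
            if math.isinf(le):
                return lower  # fall back to the highest finite bound
            return lower + (le - lower) * (rank - below) / (count - below)
        lower, below = le, count
    return math.nan


# Hypothetical replicas with cumulative bucket counts.
# fast: 100 events, half under 0.1s and all under 0.5s.
fast = {le: (50 if le == 0.1 else 100) for le in BUCKETS}
# slow: 10 events, all between 60s and 120s.
slow = {le: (10 if le >= 120 else 0) for le in BUCKETS}

print("fast replica p95:", quantile(0.95, fast))
print("slow replica p95:", quantile(0.95, slow))
print("merged p95:", quantile(0.95, merge_buckets([fast, slow])))
```

Under these made-up numbers the per-replica estimates are roughly 0.46 s and 117 s, while the merged estimate is roughly 87 s. Neither per-replica value describes the fleet, which is the behavior the note in docs/alerts.md attributes to the unaggregated query.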