openshift-hyperfleet · xueli181114 · Mar 12, 2026
diff --git a/README.md b/README.md
@@ -531,10 +531,16 @@ For access issues, contact a repository administrator or organization owner.
 
 This project is licensed under the Apache License 2.0 - see the [LICENSE](./LICENSE) file for details.
 
-## Related Documentation
-
-- [Helm Chart Documentation](./charts/README.md)
-- [Contributing Guidelines](./CONTRIBUTING.md)
+## Documentation
+
+- [Architecture](https://github.com/openshift-hyperfleet/architecture) - System architecture and API documentation
+- [Metrics](docs/metrics.md) - Prometheus metric definitions and PromQL examples
+- [Alerts](docs/alerts.md) - Recommended alert rules and monitoring queries
+- [Runbook](docs/runbook.md) - Operational runbook for on-call engineers
+- [Configuration](docs/configuration.md) - Configuration reference
+- [Adapter Authoring Guide](docs/adapter-authoring-guide.md) - Guide to creating adapter task configurations
+- [Helm Chart](./charts/README.md) - Helm chart documentation
+- [Contributing Guidelines](./CONTRIBUTING.md) - Development and contribution guidelines
 
 ## Support
 

diff --git a/docs/adapter-authoring-guide.md b/docs/adapter-authoring-guide.md
@@ -1,6 +1,6 @@
 # HyperFleet Adapter Authoring Guide
 
-> A practical guide for writing adapter configurations that extend the HyperFleet cluster lifecycle platform.
+> **Audience:** Developers building adapter configurations for HyperFleet cluster lifecycle tasks.
 
 ---
 
@@ -1155,7 +1155,7 @@ clients:
 
 More information about deployment can be found in [Architecture repository - HyperFleet Adapter Framework - Deployment Guide](https://github.com/openshift-hyperfleet/architecture/blob/main/hyperfleet/components/adapter/framework/adapter-deployment.md)
 
-1. **Verify broker metrics** — the adapter automatically exposes broker metrics on the `/metrics` endpoint (port 9090). No additional configuration is needed. See [Observability](observability.md) for the full list of available metrics.
+1. **Verify broker metrics** — the adapter automatically exposes broker metrics on the `/metrics` endpoint (port 9090). No additional configuration is needed. See [Metrics](metrics.md) for the full list of available metrics.
 
 ---
 

diff --git a/docs/alerts.md b/docs/alerts.md
@@ -0,0 +1,118 @@
+# HyperFleet Adapter Alerts
+
+> **Audience:** SREs setting up monitoring and alerting for the adapter.
+
+This document provides recommended alerting rules and monitoring queries for the hyperfleet-adapter.
+
+For the canonical list of all metrics, labels, and descriptions, see [metrics.md](metrics.md). Metrics are served on port **9090** at `/metrics`. For health endpoint documentation, see [runbook.md#health-checks](runbook.md#health-checks).
+
+---
+
+## Recommended Alerts
+
+### Adapter Down
+
+```yaml
+alert: HyperFleetAdapterDown
+expr: >
+  hyperfleet_adapter_up == 0
+  or
+  absent(hyperfleet_adapter_up{component="hyperfleet-adapter"})
+for: 1m
+labels:
+  severity: critical
+annotations:
+  summary: "HyperFleet Adapter is down"
+  description: "Adapter {{ $labels.component }} has been down for more than 1 minute."
+```
+
+> **Note:** `hyperfleet_adapter_up` is explicitly set to 0 only during graceful shutdown. On crash (OOM, panic, node failure), the metric goes stale rather than becoming 0. The `absent()` clause covers this case. It will also fire if the metric has never been scraped (e.g., fresh Prometheus deployment) — expect initial noise until the adapter registers.
+
+### High Event Failure Rate
+
+```yaml
+alert: HyperFleetAdapterHighFailureRate
+expr: |
+  sum by (component, version) (rate(hyperfleet_adapter_events_processed_total{status="failed"}[5m]))
+  /
+  sum by (component, version) (rate(hyperfleet_adapter_events_processed_total[5m]))
+  > 0.1
+for: 5m
+labels:
+  severity: warning
+annotations:
+  summary: "High event failure rate"
+  description: "More than 10% of events are failing for {{ $labels.component }}."
+```
+
+### No Events Processed (Dead Man's Switch)
+
+```yaml
+alert: HyperFleetAdapterNoEventsProcessed
+expr: |
+  (
+    sum by (component, version) (rate(hyperfleet_adapter_events_processed_total[15m])) == 0
+    and on(component, version) hyperfleet_adapter_up == 1
+  )
+  or
+  (
+    hyperfleet_adapter_up == 1
+    unless on(component, version) hyperfleet_adapter_events_processed_total
+  )
+for: 5m
+labels:
+  severity: warning
+annotations:
+  summary: "No events processed"
+  description: "Adapter {{ $labels.component }} has not processed any events in ~20 minutes."
+```
+
+> **Timing:** `rate(...[15m])` takes ~15 minutes to reach zero after the last event, plus the `for: 5m` pending period. Total delay before firing is ~20 minutes. The `unless` clause handles fresh deployments where the counter has never been incremented — it fires when the adapter is up but no events metric exists yet.
+
+### Slow Event Processing
+
+```yaml
+alert: HyperFleetAdapterSlowProcessing
+expr: |
+  histogram_quantile(0.95,
+    sum by (component, version, le) (
+      rate(hyperfleet_adapter_event_processing_duration_seconds_bucket[5m])
+    )
+  ) > 60
+for: 5m
+labels:
+  severity: warning
+annotations:
+  summary: "Slow event processing"
+  description: "P95 event processing time exceeds 60 seconds for {{ $labels.component }}."
+```
+
+> **Note:** The `sum by (component, version, le)` aggregation merges histogram buckets across replicas before computing the quantile, giving a correct cluster-wide P95. Without this, each replica's P95 would be computed independently.
+
+### Broker Errors
+
+```yaml
+alert: HyperFleetBrokerErrors
+expr: rate(hyperfleet_broker_errors_total[5m]) > 0
+for: 5m
+labels:
+  severity: warning
+annotations:
+  summary: "Broker errors detected"
+  description: "Broker errors occurring for {{ $labels.component }}: {{ $labels.error_type }}."
+```
+
+### Rising Error Count by Type
+
+```yaml
+alert: HyperFleetAdapterErrorsRising
+expr: rate(hyperfleet_adapter_errors_total[5m]) > 0.5
+for: 5m
+labels:
+  severity: warning
+annotations:
+  summary: "Adapter errors rising"
+  description: "Error rate for {{ $labels.error_type }} exceeds 0.5/s on {{ $labels.component }}."
+```
+
+For basic metric queries and examples, see [metrics.md#example-promql-queries](metrics.md#example-promql-queries).
diff --git a/docs/configuration.md b/docs/configuration.md
@@ -1,5 +1,7 @@
 # Adapter Configuration Reference
 
+> **Audience:** Operators deploying and configuring the hyperfleet-adapter service.
+
 This document describes the deployment-level `AdapterConfig` options and how to set them
 in three formats: YAML, command-line flags, and environment variables.