14 changes: 10 additions & 4 deletions README.md
@@ -531,10 +531,16 @@ For access issues, contact a repository administrator or organization owner.

This project is licensed under the Apache License 2.0 - see the [LICENSE](./LICENSE) file for details.

## Related Documentation

- [Helm Chart Documentation](./charts/README.md)
- [Contributing Guidelines](./CONTRIBUTING.md)
## Documentation

- [Architecture](https://github.com/openshift-hyperfleet/architecture) - System architecture and API documentation
- [Metrics](docs/metrics.md) - Prometheus metric definitions and PromQL examples
- [Alerts](docs/alerts.md) - Recommended alert rules and monitoring queries
- [Runbook](docs/runbook.md) - Operational runbook for on-call engineers
- [Configuration](docs/configuration.md) - Configuration reference
- [Adapter Authoring Guide](docs/adapter-authoring-guide.md) - Guide to creating adapter task configurations
- [Helm Chart](./charts/README.md) - Helm chart documentation
- [Contributing Guidelines](./CONTRIBUTING.md) - Development and contribution guidelines

## Support

4 changes: 2 additions & 2 deletions docs/adapter-authoring-guide.md
@@ -1,6 +1,6 @@
# HyperFleet Adapter Authoring Guide

> A practical guide for writing adapter configurations that extend the HyperFleet cluster lifecycle platform.
> **Audience:** Developers building adapter configurations for HyperFleet cluster lifecycle tasks.

---

@@ -1155,7 +1155,7 @@ clients:

More information about deployment can be found in [Architecture repository - HyperFleet Adapter Framework - Deployment Guide](https://github.com/openshift-hyperfleet/architecture/blob/main/hyperfleet/components/adapter/framework/adapter-deployment.md)

1. **Verify broker metrics** — the adapter automatically exposes broker metrics on the `/metrics` endpoint (port 9090). No additional configuration is needed. See [Observability](observability.md) for the full list of available metrics.
1. **Verify broker metrics** — the adapter automatically exposes broker metrics on the `/metrics` endpoint (port 9090). No additional configuration is needed. See [Metrics](metrics.md) for the full list of available metrics.

---

118 changes: 118 additions & 0 deletions docs/alerts.md
@@ -0,0 +1,118 @@
# HyperFleet Adapter Alerts

> **Audience:** SREs setting up monitoring and alerting for the adapter.

This document provides recommended alerting rules and monitoring queries for the hyperfleet-adapter.

For the canonical list of all metrics, labels, and descriptions, see [metrics.md](metrics.md). Metrics are served on port **9090** at `/metrics`. For health endpoint documentation, see [runbook.md#health-checks](runbook.md#health-checks).
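
As a minimal sketch of how Prometheus could be pointed at that endpoint, the static scrape configuration below uses the port and path stated above. The job name and target hostname are placeholders, not values defined by this project; deployments that rely on service discovery or the Prometheus Operator would express the same settings differently.

```yaml
# prometheus.yml fragment (sketch; job name and target are hypothetical)
scrape_configs:
  - job_name: hyperfleet-adapter                      # assumed job name
    metrics_path: /metrics                            # path documented above
    static_configs:
      - targets:
          - "hyperfleet-adapter.example.svc:9090"     # placeholder host, documented port
```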

---

## Recommended Alerts

### Adapter Down

```yaml
alert: HyperFleetAdapterDown
expr: >
  hyperfleet_adapter_up == 0
  or
  absent(hyperfleet_adapter_up{component="hyperfleet-adapter"})
for: 1m
labels:
  severity: critical
annotations:
  summary: "HyperFleet Adapter is down"
  description: "Adapter {{ $labels.component }} has been down for more than 1 minute."
```

> **Note:** `hyperfleet_adapter_up` is explicitly set to 0 only during graceful shutdown. On crash (OOM, panic, node failure), the metric goes stale rather than becoming 0. The `absent()` clause covers this case. It will also fire if the metric has never been scraped (e.g., fresh Prometheus deployment) — expect initial noise until the adapter registers.
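
The alerts in this document are written as bare rule bodies. Prometheus loads rules from files in which each rule is a list item under a named group, so a rule has to be wrapped before it can be loaded. A minimal sketch using the rule above, in which the group name is an assumption:

```yaml
# Rule file sketch; the group name is hypothetical.
groups:
  - name: hyperfleet-adapter-alerts
    rules:
      - alert: HyperFleetAdapterDown
        expr: >
          hyperfleet_adapter_up == 0
          or
          absent(hyperfleet_adapter_up{component="hyperfleet-adapter"})
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "HyperFleet Adapter is down"
          description: "Adapter {{ $labels.component }} has been down for more than 1 minute."
```

A file in this form can be validated with `promtool check rules <file>` before it is deployed.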

### High Event Failure Rate

```yaml
alert: HyperFleetAdapterHighFailureRate
expr: |
  sum by (component, version) (rate(hyperfleet_adapter_events_processed_total{status="failed"}[5m]))
  /
  sum by (component, version) (rate(hyperfleet_adapter_events_processed_total[5m]))
  > 0.1
for: 5m
labels:
  severity: warning
annotations:
  summary: "High event failure rate"
  description: "More than 10% of events are failing for {{ $labels.component }}."
```
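
A ratio alert of this kind can be noisy at low traffic, where a handful of failed events is enough to exceed 10%. One possible variant, offered as a sketch rather than a project recommendation, adds a minimum-throughput guard to the expression. The `0.05` events per second floor is an arbitrary illustrative value.

```yaml
# Variant expression (sketch): same ratio, but only when there is enough traffic.
expr: |
  (
    sum by (component, version) (rate(hyperfleet_adapter_events_processed_total{status="failed"}[5m]))
    /
    sum by (component, version) (rate(hyperfleet_adapter_events_processed_total[5m]))
    > 0.1
  )
  and
  # Hypothetical floor: at least 0.05 events/s (3 per minute) in total.
  sum by (component, version) (rate(hyperfleet_adapter_events_processed_total[5m])) > 0.05
```

Both sides of `and` carry the same `component` and `version` labels, so no `on(...)` modifier is needed.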

### No Events Processed (Dead Man's Switch)

```yaml
alert: HyperFleetAdapterNoEventsProcessed
expr: |
  (
    sum by (component, version) (rate(hyperfleet_adapter_events_processed_total[15m])) == 0
    and on(component, version) hyperfleet_adapter_up == 1
  )
  or
  (
    hyperfleet_adapter_up == 1
    unless on(component, version) hyperfleet_adapter_events_processed_total
  )
for: 5m
labels:
  severity: warning
annotations:
  summary: "No events processed"
  description: "Adapter {{ $labels.component }} has not processed any events in ~20 minutes."
```

> **Timing:** `rate(...[15m])` takes ~15 minutes to reach zero after the last event, plus the `for: 5m` pending period. Total delay before firing is ~20 minutes. The `unless` clause handles fresh deployments where the counter has never been incremented — it fires when the adapter is up but no events metric exists yet.

### Slow Event Processing

```yaml
alert: HyperFleetAdapterSlowProcessing
expr: |
  histogram_quantile(0.95,
    sum by (component, version, le) (
      rate(hyperfleet_adapter_event_processing_duration_seconds_bucket[5m])
    )
  ) > 60
for: 5m
labels:
  severity: warning
annotations:
  summary: "Slow event processing"
  description: "P95 event processing time exceeds 60 seconds for {{ $labels.component }}."
```

> **Note:** The `sum by (component, version, le)` aggregation merges histogram buckets across replicas before computing the quantile, giving a single P95 per component and version. Without it, `histogram_quantile` would return a separate P95 for each replica.
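
If the aggregated bucket rate is queried in several places, it could be precomputed with a recording rule. This is a sketch only: the group name and the recorded series name are assumptions (the latter follows the common `level:metric:operations` naming pattern) and are not defined by this project.

```yaml
groups:
  - name: hyperfleet-adapter-recording   # hypothetical group name
    rules:
      # Per-second bucket rate summed across replicas. The `le` label is kept
      # so that histogram_quantile() can still be applied to the result.
      - record: component_version_le:hyperfleet_adapter_event_processing_duration_seconds_bucket:rate5m
        expr: |
          sum by (component, version, le) (
            rate(hyperfleet_adapter_event_processing_duration_seconds_bucket[5m])
          )
```

With such a rule in place, the alert expression would reduce to `histogram_quantile(0.95, component_version_le:hyperfleet_adapter_event_processing_duration_seconds_bucket:rate5m) > 60`.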

### Broker Errors

```yaml
alert: HyperFleetBrokerErrors
expr: rate(hyperfleet_broker_errors_total[5m]) > 0
for: 5m
labels:
  severity: warning
annotations:
  summary: "Broker errors detected"
  description: "Broker errors occurring for {{ $labels.component }}: {{ $labels.error_type }}."
```

### Rising Error Count by Type

```yaml
alert: HyperFleetAdapterErrorsRising
expr: rate(hyperfleet_adapter_errors_total[5m]) > 0.5
for: 5m
labels:
  severity: warning
annotations:
  summary: "Adapter errors rising"
  description: "Error rate for {{ $labels.error_type }} exceeds 0.5/s on {{ $labels.component }}."
```

For basic metric queries and examples, see [metrics.md#example-promql-queries](metrics.md#example-promql-queries).
2 changes: 2 additions & 0 deletions docs/configuration.md
@@ -1,5 +1,7 @@
# Adapter Configuration Reference

> **Audience:** Operators deploying and configuring the hyperfleet-adapter service.

This document describes the deployment-level `AdapterConfig` options and how to set them
in three formats: YAML, command-line flags, and environment variables.
