From 4ac609269c545c29828616db63da54b24ae32840 Mon Sep 17 00:00:00 2001 From: Rafael Benevides Date: Tue, 17 Mar 2026 16:27:42 -0300 Subject: [PATCH 1/3] HYPERFLEET-537 - docs: replace conditions config with message_decision Replace the old conditions-based configuration (reference_time + rules with per-rule max_age) with the new message_decision format (named params with CEL expressions + boolean result expression) across all architecture documents. --- .../architecture/architecture-summary.md | 56 ++- .../framework/adapter-flow-diagrams.md | 4 +- .../components/sentinel/sentinel-config.yaml | 27 +- .../sentinel/sentinel-deployment.md | 21 +- hyperfleet/components/sentinel/sentinel.md | 380 +++++++++++------- 5 files changed, 307 insertions(+), 181 deletions(-) diff --git a/hyperfleet/architecture/architecture-summary.md b/hyperfleet/architecture/architecture-summary.md index 7c0e778..89a49bd 100644 --- a/hyperfleet/architecture/architecture-summary.md +++ b/hyperfleet/architecture/architecture-summary.md @@ -208,7 +208,7 @@ cluster_statuses **Why**: - **Centralized Orchestration Logic**: Single component decides "when" to reconcile -- **Simple Max Age Strategy**: Time-based decisions using status.last_updated_time (updated on every adapter check) +- **Configurable Message Decision**: CEL-based decision logic with named params and boolean result expressions - **Horizontal Scalability**: Sharding via label selectors (by region, environment, etc.) - **Broker Abstraction**: Pluggable event publishers (GCP Pub/Sub, RabbitMQ, Stub) - **Self-Healing**: Continuously retries without manual intervention @@ -216,9 +216,8 @@ cluster_statuses **Responsibilities**: 1. **Fetch Resources**: Poll HyperFleet API for resources matching shard selector 2. 
**Decision Logic**: Determine if resource needs reconciliation based on: - - `status.phase` (Ready vs Not Ready) - - `status.last_updated_time` (time since last adapter check) - - Configured max age intervals (10s for not-ready, 30m for ready) + - Generation check: `resource.generation > observedGeneration` triggers immediate reconciliation + - Configurable message decision with named params (CEL expressions) and a boolean result expression 3. **Event Creation**: Create reconciliation event with resource context 4. **Event Publishing**: Publish event to configured message broker 5. **Metrics & Observability**: Expose Prometheus metrics for monitoring @@ -228,8 +227,15 @@ cluster_statuses # sentinel-config.yaml (ConfigMap) resource_type: clusters poll_interval: 5s -max_age_not_ready: 10s -max_age_ready: 30m + +message_decision: + params: + ref_time: 'conditionTime(resource, "Ready")' + is_ready: 'status(resource, "Ready") == "True"' + age_exceeded_ready: 'is_ready && now - timestamp(ref_time) > duration("30m")' + age_exceeded_not_ready: '!is_ready && now - timestamp(ref_time) > duration("10s")' + result: age_exceeded_ready OR age_exceeded_not_ready + resource_selector: - label: region value: us-east @@ -259,14 +265,14 @@ data: **Decision Algorithm**: ``` FOR EACH resource in FetchResources(resourceType, resourceSelector): - IF resource.status.phase != "Ready": - max_age = max_age_not_ready (10s) + IF resource.generation > observedGeneration: + PublishEvent(broker, CreateEvent(resource)) // immediate reconciliation ELSE: - max_age = max_age_ready (30m) + Evaluate message_decision params (CEL expressions, dependency-ordered) + Evaluate result expression (combines params with AND/OR) - IF now >= resource.status.last_updated_time + max_age: - event = CreateEvent(resource) - PublishEvent(broker, event) + IF result == true: + PublishEvent(broker, CreateEvent(resource)) ``` **Benefits**: @@ -541,7 +547,7 @@ sequenceDiagram DB-->>API: [cluster list] API-->>Sentinel: [{id, 
status, ...}] - Note over Sentinel: Decision: phase != "Ready" &&
last_updated_time + 10s < now + Note over Sentinel: Decision: generation >
observedGeneration? OR
message_decision result == true? Sentinel->>Broker: Publish event
{resourceType: "clusters",
resourceId: "cls-123"} @@ -564,12 +570,12 @@ sequenceDiagram DB-->>API: ClusterStatus saved API-->>Adapter: 201 Created - Note over Sentinel: Next poll cycle (10s later) + Note over Sentinel: Next poll cycle Sentinel->>API: GET /clusters API-->>Sentinel: [{id, status.last_updated_time = now(), ...}] - Note over Sentinel: Decision: Create event again
(cycle continues for other adapters) + Note over Sentinel: Decision: evaluate message_decision
(cycle continues for other adapters) Sentinel->>Broker: Publish event ``` @@ -610,7 +616,7 @@ sequenceDiagram DB-->>API: [cluster list with updated last_updated_time] API-->>Sentinel: [{id, status, ...}] - Note over Sentinel: Decision: Create event
if max-age expired + Note over Sentinel: Decision: generation >
observedGeneration? OR
message_decision result == true? ``` --- @@ -737,8 +743,13 @@ See [Status Guide](../docs/status-guide.md) for complete details on the status c # sentinel-us-east-config.yaml (ConfigMap) resource_type: clusters poll_interval: 5s -max_age_not_ready: 10s -max_age_ready: 30m +message_decision: + params: + ref_time: 'conditionTime(resource, "Ready")' + is_ready: 'status(resource, "Ready") == "True"' + age_exceeded_ready: 'is_ready && now - timestamp(ref_time) > duration("30m")' + age_exceeded_not_ready: '!is_ready && now - timestamp(ref_time) > duration("10s")' + result: age_exceeded_ready OR age_exceeded_not_ready resource_selector: - label: region value: us-east @@ -758,8 +769,13 @@ message_data: # sentinel-eu-west-config.yaml (ConfigMap) resource_type: clusters poll_interval: 5s -max_age_not_ready: 15s -max_age_ready: 1h +message_decision: + params: + ref_time: 'conditionTime(resource, "Ready")' + is_ready: 'status(resource, "Ready") == "True"' + age_exceeded_ready: 'is_ready && now - timestamp(ref_time) > duration("1h")' + age_exceeded_not_ready: '!is_ready && now - timestamp(ref_time) > duration("15s")' + result: age_exceeded_ready OR age_exceeded_not_ready resource_selector: - label: region value: eu-west diff --git a/hyperfleet/components/adapter/framework/adapter-flow-diagrams.md b/hyperfleet/components/adapter/framework/adapter-flow-diagrams.md index bdfd728..3d68877 100644 --- a/hyperfleet/components/adapter/framework/adapter-flow-diagrams.md +++ b/hyperfleet/components/adapter/framework/adapter-flow-diagrams.md @@ -90,9 +90,9 @@ sequenceDiagram S->>API: GET /api/hyperfleet/v1/clusters?labels=shard API-->>S: List of clusters - Note over S: For each cluster:
Check if requires event?
(10s for Not Ready, 30m for Ready) + Note over S: For each cluster:
Check generation, then evaluate
message_decision result - S->>S: Evaluate: now >= lastEventTime + max_age + S->>S: Evaluate: message_decision params + result alt Requires event S->>B: Publish CloudEvent
{resourceType: "clusters", resourceId: "cls-123"} diff --git a/hyperfleet/components/sentinel/sentinel-config.yaml b/hyperfleet/components/sentinel/sentinel-config.yaml index 8cdf326..45d2d79 100644 --- a/hyperfleet/components/sentinel/sentinel-config.yaml +++ b/hyperfleet/components/sentinel/sentinel-config.yaml @@ -10,8 +10,31 @@ resource_type: clusters # Resource to watch: clusters, nodepools, manifests, wo # === POLLING CONFIGURATION === poll_interval: 5s # How often to check the API -max_age_not_ready: 10s # Wait time for resources still provisioning -max_age_ready: 30m # Wait time for stable/ready resources + +# === MESSAGE DECISION === +# Configurable decision logic that determines when to publish reconciliation events. +# Uses named params (CEL expressions or duration literals) and a boolean result expression. +# Params can reference other params (evaluated in dependency order). +# The result expression combines params with AND/OR operators. +# All expressions are compiled at startup; invalid expressions cause fail-fast. +# +# Available CEL variables: +# resource - the resource map (id, kind, status, status.conditions, etc.) +# now - current timestamp +# +# Available CEL helper functions: +# condition(resource, type) → map - returns full condition map for a given type +# status(resource, type) → string - returns the status string of a condition +# conditionTime(resource, type) → string - returns last_updated_time (RFC3339) +# +# Duration literals (e.g., 30m, 10s) are auto-detected and made available as CEL durations. 
+message_decision: + params: + ref_time: 'conditionTime(resource, "Ready")' + is_ready: 'status(resource, "Ready") == "True"' + age_exceeded_ready: 'is_ready && now - timestamp(ref_time) > duration("30m")' + age_exceeded_not_ready: '!is_ready && now - timestamp(ref_time) > duration("10s")' + result: age_exceeded_ready OR age_exceeded_not_ready # === RESOURCE SELECTOR (optional) === # Label selector for filtering which resources this Sentinel instance monitors diff --git a/hyperfleet/components/sentinel/sentinel-deployment.md b/hyperfleet/components/sentinel/sentinel-deployment.md index 8265727..fa895a0 100644 --- a/hyperfleet/components/sentinel/sentinel-deployment.md +++ b/hyperfleet/components/sentinel/sentinel-deployment.md @@ -40,6 +40,10 @@ spec: - --config=/etc/sentinel/config.yaml # Path to YAML config file - --metrics-bind-address=:9090 - --health-probe-bind-address=:8080 + envFrom: + # Broker configuration (BROKER_TYPE, BROKER_PROJECT_ID, BROKER_HOST, etc.) + - configMapRef: + name: hyperfleet-sentinel-broker env: # HYPERFLEET_API_TOKEN="secret-token" - name: HYPERFLEET_API_TOKEN @@ -53,7 +57,7 @@ spec: configMapKeyRef: name: sentinel-config key: gcp-project-id - # BROKER_CREDENTIALS="path/to/credentials.json" + # BROKER_CREDENTIALS="path/to/credentials.json" (only if broker requires secret credentials) - name: BROKER_CREDENTIALS valueFrom: secretKeyRef: @@ -122,8 +126,15 @@ data: config.yaml: | resource_type: clusters poll_interval: 5s - max_age_not_ready: 10s - max_age_ready: 30m + + message_decision: + params: + ref_time: 'conditionTime(resource, "Ready")' + is_ready: 'status(resource, "Ready") == "True"' + age_exceeded_ready: 'is_ready && now - timestamp(ref_time) > duration("30m")' + age_exceeded_not_ready: '!is_ready && now - timestamp(ref_time) > duration("10s")' + result: age_exceeded_ready OR age_exceeded_not_ready + resource_selector: - label: region value: us-east @@ -201,7 +212,7 @@ The Sentinel service must expose the following Prometheus 
metrics: |-------------|------|--------|-------------| | `hyperfleet_sentinel_pending_resources` | Gauge | `component`, `version`, `resource_selector`, `resource_type` | Number of resources matching this selector | | `hyperfleet_sentinel_events_published_total` | Counter | `component`, `version`, `resource_selector`, `resource_type` | Total number of events published to broker | -| `hyperfleet_sentinel_resources_skipped_total` | Counter | `component`, `version`, `resource_selector`, `resource_type`, `ready_state` | Total number of resources skipped due to backoff | +| `hyperfleet_sentinel_resources_skipped_total` | Counter | `component`, `version`, `resource_selector`, `resource_type`, `reason` | Total number of resources skipped (matched rule name or "no_match" fallback) | | `hyperfleet_sentinel_poll_duration_seconds` | Histogram | `component`, `version`, `resource_selector`, `resource_type` | Time spent in each polling cycle | | `hyperfleet_sentinel_api_errors_total` | Counter | `component`, `version`, `resource_selector`, `resource_type`, `operation` | Total API errors by operation (fetch_resources, config_load) | | `hyperfleet_sentinel_broker_errors_total` | Counter | `component`, `version`, `resource_selector`, `resource_type`, `broker_type` | Total broker publishing errors | @@ -212,7 +223,7 @@ The Sentinel service must expose the following Prometheus metrics: - All metrics must include `component` and `version` labels (see [Metrics Standard](../../standards/metrics.md)) - All metrics must include `resource_selector` label (from resource_selector string) - All metrics must include `resource_type` label (from configuration resource_type field) -- `ready_state` label values: "ready" or "not_ready" +- `reason` label values: decision reason (e.g., "generation_mismatch", "message_decision" for published; param-specific reasons for skipped) - `operation` label values: "fetch_resources", "config_load" - `broker_type` label values: "pubsub", "rabbitmq" - Expose 
metrics endpoint on port 9090 at `/metrics` diff --git a/hyperfleet/components/sentinel/sentinel.md b/hyperfleet/components/sentinel/sentinel.md index 386cd20..40bb4a2 100644 --- a/hyperfleet/components/sentinel/sentinel.md +++ b/hyperfleet/components/sentinel/sentinel.md @@ -2,7 +2,7 @@ **What** -Implement a "HyperFleet Sentinel" service that continuously polls the HyperFleet API for resources (clusters, node pools, etc.) and publishes reconciliation events directly to the message broker to trigger adapter processing. The Sentinel acts as the "watchful guardian" of the HyperFleet system with simple, configurable max age intervals. Multiple Sentinel deployments can be configured via YAML configuration files to handle different shards of resources for horizontal scalability. +Implement a "HyperFleet Sentinel" service that continuously polls the HyperFleet API for resources (clusters, node pools, etc.) and publishes reconciliation events directly to the message broker to trigger adapter processing. The Sentinel acts as the "watchful guardian" of the HyperFleet system with configurable message decision logic using CEL expressions and composable boolean params. Multiple Sentinel deployments can be configured via YAML configuration files to handle different shards of resources for horizontal scalability. **Pattern Reusability**: The Sentinel is designed as a generic reconciliation service that can watch ANY HyperFleet resource type, not just clusters. 
Future deployments can include: - **Cluster Sentinel** (this epic) - watches clusters @@ -21,9 +21,9 @@ Without the Sentinel, the cluster provisioning workflow has a critical gap: The Sentinel solves these problems by: - **Closing the reconciliation loop**: Continuously polls resources and publishes events to trigger adapter evaluation - **Generation-based reconciliation**: Immediately triggers reconciliation when resource spec changes (generation increments), ensuring responsive updates -- **Uses adapter status updates**: Reads `status.last_updated_time` and `status.observed_generation` (updated by adapters on every check) to determine when to create next event -- **Smart triggering**: Two-tier decision logic prioritizes spec changes (generation mismatch) over periodic health checks (max age) -- **Simple max age intervals**: 10 seconds for non-ready resources, 30 minutes for ready resources (configurable) +- **Uses adapter status updates**: Reads `status.conditions[].last_updated_time` and `status.conditions[].observed_generation` (updated by adapters on every check) to determine when to create next event +- **Smart triggering**: Two-tier decision logic prioritizes spec changes (generation mismatch) over configurable message decision result +- **Configurable message decision**: Named CEL params and a boolean result expression define the full decision logic (e.g., different max ages for ready vs not-ready) - **Self-healing**: Automatically retries without manual intervention - **Horizontal scalability**: Resource filtering allows multiple Sentinels to handle different resource subsets - **Event-driven architecture**: Maintains decoupling by publishing CloudEvents to message broker @@ -37,13 +37,13 @@ The Sentinel solves these problems by: - Service reads configuration from YAML files with environment variable overrides - Broker configuration separated and shared with adapters - Polls HyperFleet API for resources matching resource selector criteria -- **Decision 
Engine checks `resource.generation > resource.status.observed_generation` for immediate reconciliation** +- **Decision Engine checks `resource.generation > condition(resource, "Available").observed_generation` for immediate reconciliation** - **Generation mismatch triggers immediate event publication, regardless of max age intervals** -- Uses `status.last_updated_time` and `status.observed_generation` from adapter status updates +- Uses `status.conditions[].last_updated_time` and `status.conditions[].observed_generation` from adapter status updates - Creates CloudEvents for resources based on two-tier decision logic (generation first, then max age) - CloudEvent data structure is configurable via message_data field - Publishes events directly to message broker (GCP Pub/Sub or RabbitMQ) -- Configurable max age intervals (not-ready vs ready) +- Configurable message decision via CEL expressions (params + result) - Resource filtering support via label selectors in configuration - Metrics exposed for monitoring (reconciliation rate, event publishing, errors) - **Integration tests verify generation-based triggering takes priority over max age** @@ -75,32 +75,29 @@ Adapter fails transiently ```mermaid flowchart TD - Init([Service Startup]) --> ReadConfig[Load YAML Configuration
- max_age_not_ready: 10s
- max_age_ready: 30m
- resourceSelector: region=us-east
- message_data composition
+ Load broker config separately] + Init([Service Startup]) --> ReadConfig[Load YAML Configuration
- message_decision params + result
- resourceSelector: region=us-east
- message_data composition
+ Load broker config separately] - ReadConfig --> Validate{Configuration
Valid?} + ReadConfig --> CompileCEL[Compile CEL Expressions
- Resolve param dependencies
- Compile params and result
- Fail-fast on invalid expressions] + CompileCEL --> Validate{Configuration
Valid?} Validate -->|No| Exit[Exit with Error] Validate -->|Yes| StartLoop([Start Polling Loop]) - StartLoop --> FetchClusters[Fetch Clusters with Resource Selector
GET /api/hyperfleet/v1/clusters
?labels=region=us-east] + StartLoop --> FetchClusters[Fetch Resources with Resource Selector
GET /api/hyperfleet/v1/clusters
?labels=region=us-east] - FetchClusters --> ForEach{For Each Cluster} + FetchClusters --> ForEach{For Each Resource} - ForEach --> CheckGeneration{generation >
observed_generation?} + ForEach --> CheckGeneration{generation >
Available.observed_generation?} CheckGeneration -->|Yes - Spec Changed| PublishEvent[Create CloudEvent
Publish to Broker
Reason: generation changed] - CheckGeneration -->|No - Spec Stable| CheckReady{Cluster Status
== Ready?} + CheckGeneration -->|No - Spec Stable| EvalDecision[Evaluate message_decision
1. Eval params in dependency order
2. Eval result expression] - CheckReady -->|No - NOT Ready| CheckBackoffNotReady{last_updated_time + 10s
< now?} - CheckReady -->|Yes - Ready| CheckBackoffReady{last_updated_time + 30m
< now?} + EvalDecision --> CheckResult{result == true?} - CheckBackoffNotReady -->|Yes - Expired| PublishEventMaxAge[Create CloudEvent
Publish to Broker
Reason: max age expired] - CheckBackoffNotReady -->|No - Not Expired| Skip[Skip
Max age not expired] + CheckResult -->|Yes| PublishEventResult[Create CloudEvent
Publish to Broker
Reason: message decision matched] + CheckResult -->|No| Skip[Skip
Decision result is false] - CheckBackoffReady -->|Yes - Expired| PublishEventMaxAge - CheckBackoffReady -->|No - Not Expired| Skip - - PublishEvent --> NextCluster{More Clusters?} - PublishEventMaxAge --> NextCluster + PublishEvent --> NextCluster{More Resources?} + PublishEventResult --> NextCluster Skip --> NextCluster NextCluster -->|Yes| ForEach @@ -110,15 +107,17 @@ flowchart TD style Init fill:#d4edda style ReadConfig fill:#fff3cd + style CompileCEL fill:#fff3cd style Validate fill:#fff3cd style Exit fill:#f8d7da style StartLoop fill:#e1f5e1 style PublishEvent fill:#ffe1e1 - style PublishEventMaxAge fill:#ffe1e1 + style PublishEventResult fill:#ffe1e1 style Skip fill:#e1e5ff style FetchClusters fill:#fff4e1 style Sleep fill:#f0f0f0 style CheckGeneration fill:#d4f1f4 + style EvalDecision fill:#d4f1f4 ``` **Multiple Sentinel Deployments (Resource Filtering)**: @@ -209,56 +208,85 @@ This flexibility allows you to: ### Decision Logic -The service uses a two-tier decision logic that prioritizes spec changes over periodic checks: +The service uses a two-tier decision logic that prioritizes spec changes over the configurable message decision: **Publish Event IF** (evaluated in priority order): 1. **Generation Mismatch** (HIGHEST PRIORITY - immediate reconciliation): - - Resource generation > resource.status.observed_generation + - `resource.generation` > `condition(resource, "Available").observed_generation` - Reason: User changed the spec (e.g., scaled nodes), requires immediate reconciliation - - Example: Cluster generation=2, observed_generation=1 → publish immediately + - Example: Cluster generation=2, Available.observed_generation=1 → publish immediately -2. **Max Age Expired** (periodic health checks): - - Cluster status is NOT "Ready" AND max_age_not_ready interval expired (10 seconds default) - - OR Cluster status IS "Ready" AND max_age_ready interval expired (30 minutes default) +2. 
**Message Decision Result** (configurable decision logic): + - Evaluate all `message_decision.params` in dependency order (each param is a CEL expression or duration literal) + - Params can reference other params (e.g., `is_ready` can be used in `age_exceeded_ready`) + - Evaluate `message_decision.result` boolean expression (supports AND/OR operators) + - If result is `true` → publish event + - The result expression is the **sole decision maker** — all time-based checks, condition evaluations, and max age logic are encoded within the params **Skip IF**: -- Generation matches observed_generation AND max age not expired +- Generation matches `Available.observed_generation` AND message decision result is `false` **Key Insight - Generation-Based Reconciliation**: This implements the Kubernetes controller pattern where: - `resource.generation` tracks user's desired state version (increments when spec changes) -- `resource.status.observed_generation` tracks which generation was last reconciled by adapters +- `condition(resource, "Available").observed_generation` tracks which generation was last reconciled by adapters (the API aggregates this as `min(observed_generation)` across all required adapters) - Mismatch indicates new spec changes that require immediate reconciliation, regardless of max age intervals **Important**: Max age intervals are for periodic health checks when the spec is stable. Spec changes (generation increments) should trigger immediate reconciliation. -### Max Age Strategy (MVP Simple) +### Message Decision + +Rather than hardcoding which conditions to evaluate and what max age intervals to apply, the Sentinel uses a `message_decision` configuration with named **params** and a boolean **result** expression. + +**How It Works**: + +1. **Params** are named variables defined as CEL expressions or duration literals +2. Params can reference other params (evaluated in dependency order via topological sort) +3. 
The **result** expression combines params using `AND`/`OR` operators to produce a boolean +4. If result is `true`, a reconciliation event is published + +**Default Configuration** (equivalent to previous behavior): + +| Param Name | Type | Expression | Purpose | +|------------|------|------------|---------| +| `ref_time` | CEL → string | `conditionTime(resource, "Ready")` | Reference timestamp for age calculation | +| `is_ready` | CEL → bool | `status(resource, "Ready") == "True"` | Whether resource is ready | +| `age_exceeded_ready` | CEL → bool | `is_ready && now - timestamp(ref_time) > duration("30m")` | Ready resource with stale check | +| `age_exceeded_not_ready` | CEL → bool | `!is_ready && now - timestamp(ref_time) > duration("10s")` | Not-ready resource needing re-check | -The service uses two configurable max age intervals: +**Result**: `age_exceeded_ready OR age_exceeded_not_ready` -| Cluster State | Max Age Time | Reason | -|---------------|--------------|--------| -| NOT Ready | 10 seconds | Cluster being provisioned - check frequently | -| Ready | 30 minutes | Cluster stable - periodic health check | +**Key Design Decisions**: +- All CEL expressions are compiled at startup (fail-fast on invalid configuration) +- Params are evaluated in dependency order (topological sort); circular dependencies are rejected at startup +- Duration literals (e.g., `30m`, `10s`) in params are auto-detected and converted to CEL duration values +- The `now` variable (current timestamp) is available in all expressions +- The `result` is the **sole decision maker** — all time-based checks and condition evaluations are encoded in params +- `AND`/`OR` in the result are converted to `&&`/`||` for CEL evaluation +- Custom helper functions (`condition()`, `status()`, `conditionTime()`) are available for convenience +- This aligns with the adapter framework's preconditions pattern (CEL-based evaluation) **Configuration** (via YAML files): ```yaml # File: sentinel-config.yaml -# See 
hyperfleet/components/sentinel/sentinel-config.yaml for a complete example template -# Sentinel-specific configuration resource_type: clusters # Resource to watch: clusters, nodepools, manifests, workloads # Polling configuration poll_interval: 5s -max_age_not_ready: 10s # Max age when resource status != "Ready" -max_age_ready: 30m # Max age when resource status == "Ready" + +# Message decision - configurable decision logic +message_decision: + params: + ref_time: 'conditionTime(resource, "Ready")' + is_ready: 'status(resource, "Ready") == "True"' + age_exceeded_ready: 'is_ready && now - timestamp(ref_time) > duration("30m")' + age_exceeded_not_ready: '!is_ready && now - timestamp(ref_time) > duration("10s")' + result: age_exceeded_ready OR age_exceeded_not_ready # Resource selector - only process resources matching these labels -# Note: NOT true sharding, just label-based filtering -# Format: List of label/value pairs (AND logic - all must match) resource_selector: - label: region value: us-east @@ -267,7 +295,6 @@ resource_selector: hyperfleet_api: endpoint: http://hyperfleet-api.hyperfleet-system.svc.cluster.local:8080 timeout: 10s - # token: Override via HYPERFLEET_API_TOKEN="secret-token" # Message data composition - define CloudEvent data payload structure message_data: @@ -289,8 +316,6 @@ metadata: data: BROKER_TYPE: "pubsub" BROKER_PROJECT_ID: "hyperfleet-prod" - # Note: Sentinel publishes to topic (implicit default topic per project) - # Adapters use BROKER_SUBSCRIPTION_ID to consume --- # RabbitMQ Example: @@ -306,7 +331,6 @@ data: BROKER_VHOST: "/" BROKER_EXCHANGE: "hyperfleet-events" BROKER_EXCHANGE_TYPE: "fanout" - # Note: Sentinel publishes to exchange, Adapters consume from queues bound to this exchange ``` > **Note:** For topic naming conventions and multi-tenant isolation strategies, see [Naming Strategy](./sentinel-naming-strategy.md). 
@@ -382,12 +406,24 @@ The Sentinel uses multiple fields from the resource's status to make intelligent ```json { "id": "cls-123", - "generation": 2, // User's desired state version (increments on spec changes) + "generation": 2, "status": { - "phase": "Provisioning", - "observed_generation": 1, // Which generation was last reconciled by adapters - "last_transition_time": "2025-10-21T10:00:00Z", // When status changed to "Provisioning" - "last_updated_time": "2025-10-21T12:00:00Z" // When adapter last checked this resource + "conditions": [ + { + "type": "Available", + "status": "True", + "observed_generation": 1, + "last_updated_time": "2025-10-21T12:00:00Z", + "last_transition_time": "2025-10-21T10:00:00Z" + }, + { + "type": "Ready", + "status": "False", + "observed_generation": 2, + "last_updated_time": "2025-10-21T12:00:00Z", + "last_transition_time": "2025-10-21T10:00:00Z" + } + ] } } ``` @@ -396,17 +432,17 @@ The Sentinel uses multiple fields from the resource's status to make intelligent - **`generation`**: User's desired state version. Increments when the resource spec changes (e.g., user scales nodes from 3 to 5). This is the "what the user wants" field. -- **`status.observed_generation`**: Which generation was last reconciled by adapters. Updated by adapters when they successfully process a resource. This is the "what we've reconciled" field. +- **`condition.observed_generation`**: Which generation was last reconciled. The `Available` condition's `observed_generation` is computed by the API as `min(observed_generation)` across all required adapters — it represents the most conservative view of what has been reconciled. 
-- **`last_transition_time`**: Updates ONLY when the status.phase changes (e.g., Provisioning → Ready) +- **`condition.last_transition_time`**: Updates ONLY when the condition status changes (e.g., Ready False → True) -- **`last_updated_time`**: Updates EVERY time an adapter checks the resource, regardless of whether status changed +- **`condition.last_updated_time`**: Updates EVERY time an adapter checks the resource, regardless of whether status changed **Why generation/observed_generation matters for reconciliation:** When a user changes the cluster spec (e.g., scales nodes), `generation` increments (1 → 2). The Sentinel compares: -- If `generation > observed_generation` (e.g., 2 > 1): **User made changes that haven't been reconciled yet** → Publish event immediately -- If `generation == observed_generation` (e.g., 2 == 2): **Spec is stable, reconciliation is current** → Use max age intervals for periodic health checks +- If `generation > Available.observed_generation` (e.g., 2 > 1): **User made changes that haven't been reconciled yet** → Publish event immediately +- If `generation == Available.observed_generation` (e.g., 2 == 2): **Spec is stable, reconciliation is current** → Use max age intervals for periodic health checks This implements the Kubernetes controller pattern and ensures: 1. **Responsive reconciliation**: Spec changes trigger immediate events (no waiting 30 minutes) @@ -418,7 +454,7 @@ This implements the Kubernetes controller pattern and ensures: If a cluster stays in "Provisioning" state for 2 hours, `last_transition_time` would remain at the time it entered "Provisioning" (e.g., 10:00), even though adapters check it at 11:00, 11:30, 12:00. Using `last_transition_time` for max age calculation would incorrectly trigger events too frequently. Using `last_updated_time` ensures max age is calculated from the last adapter check, not the last status change. 
**For complete details on generation and observed_generation semantics, see:** -- [HyperFleet Status Guide](../../docs/status-guide.md) - Complete documentation of the status contract, including how adapters report `observed_generation` +- [HyperFleet Status Guide](../../docs/status-guide.md) - Complete documentation of the status contract, including how adapters report `observed_generation` and how the API aggregates it into `Available.observed_generation` ### Resource Filtering Architecture @@ -451,8 +487,13 @@ If a cluster stays in "Provisioning" state for 2 hours, `last_transition_time` w # Deployment 1: US East clusters resource_type: clusters poll_interval: 5s -max_age_not_ready: 10s -max_age_ready: 30m +message_decision: + params: + ref_time: 'conditionTime(resource, "Ready")' + is_ready: 'status(resource, "Ready") == "True"' + age_exceeded_ready: 'is_ready && now - timestamp(ref_time) > duration("30m")' + age_exceeded_not_ready: '!is_ready && now - timestamp(ref_time) > duration("10s")' + result: age_exceeded_ready OR age_exceeded_not_ready resource_selector: - label: region value: us-east @@ -461,21 +502,23 @@ hyperfleet_api: endpoint: http://hyperfleet-api.hyperfleet-system.svc.cluster.local:8080 timeout: 10s -# Message data composition message_data: resource_id: .id resource_type: .kind region: .metadata.labels.region -# Note: Broker config is in separate sentinel-broker-config.yaml ConfigMap - --- # File: sentinel-us-west-config.yaml -# Deployment 2: US West clusters (different config!) +# Deployment 2: US West clusters (different intervals!) resource_type: clusters poll_interval: 5s -max_age_not_ready: 15s # Different max age! -max_age_ready: 1h # Different max age! +message_decision: + params: + ref_time: 'conditionTime(resource, "Ready")' + is_ready: 'status(resource, "Ready") == "True"' + age_exceeded_ready: 'is_ready && now - timestamp(ref_time) > duration("1h")' # Different! 
+    age_exceeded_not_ready: '!is_ready && now - timestamp(ref_time) > duration("15s")'  # Different!
+  result: age_exceeded_ready OR age_exceeded_not_ready
 resource_selector:
   - label: region
     value: us-west
@@ -494,9 +537,13 @@ message_data:
 # Future: NodePool Sentinel (different resource type!)
 resource_type: nodepools
 poll_interval: 5s
-max_age_not_ready: 5s
-max_age_ready: 10m
-# resource_selector: [] # Watch all node pools (empty list matches all)
+message_decision:
+  params:
+    ref_time: 'conditionTime(resource, "Ready")'
+    is_ready: 'status(resource, "Ready") == "True"'
+    age_exceeded_ready: 'is_ready && now - timestamp(ref_time) > duration("10m")'
+    age_exceeded_not_ready: '!is_ready && now - timestamp(ref_time) > duration("5s")'
+  result: age_exceeded_ready OR age_exceeded_not_ready

 hyperfleet_api:
   endpoint: http://hyperfleet-api.hyperfleet-system.svc.cluster.local:8080
@@ -510,7 +557,6 @@ message_data:
 ---
 # File: sentinel-broker-config.yaml (Same across all Sentinel deployments)
 # Choose one of the following based on your environment:
-# Note: Adapters have their own broker ConfigMap with different fields

 # Google Cloud Pub/Sub:
 apiVersion: v1
@@ -554,7 +600,9 @@ data:

 **Implementation Requirements**:
 - Load Sentinel configuration from YAML file path specified via command-line flag
-- Parse duration strings (max_age_not_ready, max_age_ready, poll_interval, timeout)
+- Parse duration strings (poll_interval, timeout)
+- Parse `message_decision` section: params (CEL expressions or duration literals) and result expression
+- Resolve param dependencies (topological sort) and compile all CEL expressions at startup for fail-fast validation
 - Parse resource_type field to determine which HyperFleet resources to fetch
 - Parse message_data configuration for composable CloudEvent data structure
 - Load broker configuration separately (from environment variables or shared ConfigMap)
@@ -574,36 +622,35 @@ The Resource Watcher uses the API's condition-based search to selectively query

 ### 3. Decision Engine

-**Responsibility**: Generation-aware decision logic with time-based fallback
+**Responsibility**: Generation-aware decision logic with configurable message decision

 **Key Functions**:
 - `Evaluate(resource, now)` - Determine if resource needs an event

 **Decision Logic** (evaluated in priority order):
 1. **Check for generation mismatch** (HIGHEST PRIORITY):
-   - Compare `resource.generation` with `resource.status.conditions.Ready.observed_generation`
-   - If `resource.generation > resource.status.conditions.Ready.observed_generation`:
+   - Compare `resource.generation` with `condition(resource, "Available").observed_generation`
+   - If `resource.generation > condition(resource, "Available").observed_generation`:
      - Return: `{ShouldPublish: true, Reason: "generation changed - new spec to reconcile"}`
    - This ensures immediate reconciliation when users change the spec

-2. **Check resource status conditions** (fallback to max age intervals):
-   - Select appropriate max age interval based on the `Ready` condition:
-     - If `status.conditions.Ready='True'` → use `max_age_ready` (30 minutes)
-     - If `status.conditions.Ready='False'` → use `max_age_not_ready` (10 seconds)
-
-3. **Check if max age expired**:
-   - Get `status.conditions.Ready.last_updated_time` (updated by adapters every time they report status)
-   - Calculate `nextEventTime = last_updated_time + max_age`
-   - If `now >= nextEventTime` → publish event
-   - Otherwise → skip (max age not expired)
+2. **Evaluate message decision** (configurable decision logic):
+   - Evaluate all `message_decision.params` in dependency order, building an activation map
+   - Duration literal params (e.g., `30m`) are converted to CEL duration values
+   - CEL expression params are evaluated with access to `resource`, `now`, and previously evaluated params
+   - Evaluate `message_decision.result` boolean expression with all params in scope
+   - `AND`/`OR` operators in result are converted to `&&`/`||` for CEL evaluation

-4. **Return decision with reason for logging**
+3. **Return decision based on result**:
+   - If result is `true` → publish event (reason: "message decision matched")
+   - If result is `false` → skip (reason: "message decision not matched")

 **Implementation Requirements**:
-- Priority-based decision logic: generation check first, then max age
-- Use `resource.generation` and `resource.status.conditions.Ready.observed_generation` for spec change detection
-- Use `status.conditions.Ready.last_updated_time` from adapter status updates (NOT `last_transition_time`) for max age calculations
-- Clear logging of decision reasoning (which condition triggered the event)
+- Priority-based decision logic: generation check first, then message decision result
+- All CEL expressions compiled at startup (fail-fast on invalid expressions)
+- Param dependencies resolved via topological sort; circular dependencies rejected at startup
+- Use `resource.generation` and `condition(resource, "Available").observed_generation` for spec change detection
+- Clear logging of decision reasoning

 ### 4. Message Publisher

@@ -716,7 +763,7 @@ message_data:
 1. **Load Configuration**:
    - Load Sentinel configuration from YAML file specified via command-line flag
    - Load broker configuration from environment or shared ConfigMap
-   - Parse max age intervals, resource selector, message_data, and resource type
+   - Parse message_decision params/result, resource selector, message_data, and resource type
    - Apply environment variable overrides for sensitive fields
    - Initialize MessagePublisher with broker config
    - Log configuration details and validate all required fields
@@ -746,7 +793,7 @@ message_data:
    - Repeat the loop

 **Service Architecture**:
-- **Single-phase initialization**: Load configuration once during startup, fail fast if invalid
+- **Single-phase initialization**: Load configuration once during startup, resolve param dependencies, compile CEL expressions, fail fast if invalid
 - **Stateless polling loop**: No configuration reloading during runtime
 - **Simple service model**: No Kubernetes controller pattern, just periodic polling
 - **Graceful shutdown**: Support clean termination on SIGTERM/SIGINT (see [Graceful Shutdown Standard](../../standards/graceful-shutdown.md))
@@ -760,90 +807,88 @@ message_data:

 ## Decision Engine Test Scenarios

-The following test scenarios ensure the Decision Engine correctly implements generation-based reconciliation and max age behavior:
+The following test scenarios ensure the Decision Engine correctly implements generation-based reconciliation and message decision behavior:

 ### Generation-Based Reconciliation Tests

-**Test 1: Ready cluster with generation mismatch → publish immediately**
+**Test 1: Ready resource with generation mismatch → publish immediately**
 ```
 Given:
-  - Cluster status.phase: Ready
-  - cluster.generation = 2
-  - cluster.status.observed_generation = 1
-  - cluster.status.last_updated_time = now() - 5s
-  - max_age_ready = 30m
+  - Resource Ready condition status: True
+  - resource.generation = 2
+  - Available.observed_generation = 1
 Then:
   - Decision: PUBLISH
   - Reason: "generation changed - new spec to reconcile"
-  - Max age NOT checked (generation takes priority)
+  - Message decision NOT evaluated (generation takes priority)
 ```

-**Test 2: Ready cluster with generation match → wait for max age**
+**Test 2: Ready resource with generation match → message decision result is false**
 ```
 Given:
-  - Cluster status.phase: Ready
-  - cluster.generation = 2
-  - cluster.status.observed_generation = 2
-  - cluster.status.last_updated_time = now() - 5m
-  - max_age_ready = 30m
+  - Resource Ready condition status: True
+  - resource.generation = 2
+  - Available.observed_generation = 2
+  - conditionTime("Ready") = now() - 5m (age < 30m)
 Then:
   - Decision: SKIP
-  - Reason: "max age not expired"
-  - Next event: now() + 25m
+  - Reason: "message decision not matched"
+  - Params evaluated: ref_time, is_ready=true, age_exceeded_ready=false, age_exceeded_not_ready=false
+  - Result: false OR false = false
 ```

-**Test 3: Not-Ready cluster with generation mismatch → publish immediately**
+**Test 3: Not-Ready resource with generation mismatch → publish immediately**
 ```
 Given:
-  - Cluster status.phase: NotReady
-  - cluster.generation = 3
-  - cluster.status.observed_generation = 2
-  - cluster.status.last_updated_time = now() - 2s
-  - max_age_not_ready = 10s
+  - Resource Ready condition status: False
+  - resource.generation = 3
+  - Available.observed_generation = 2
 Then:
   - Decision: PUBLISH
   - Reason: "generation changed - new spec to reconcile"
-  - Max age NOT checked (generation takes priority)
+  - Message decision NOT evaluated (generation takes priority)
 ```

-**Test 4: Not-Ready cluster with generation match and max age expired → publish**
+**Test 4: Not-Ready resource with generation match and age exceeded → publish**
 ```
 Given:
-  - Cluster status.phase: NotReady
-  - cluster.generation = 1
-  - cluster.status.observed_generation = 1
-  - cluster.status.last_updated_time = now() - 15s
-  - max_age_not_ready = 10s
+  - Resource Ready condition status: False
+  - resource.generation = 1
+  - Available.observed_generation = 1
+  - conditionTime("Ready") = now() - 15s (age > 10s)
 Then:
   - Decision: PUBLISH
-  - Reason: "max age expired (not ready)"
+  - Reason: "message decision matched"
+  - Params evaluated: ref_time, is_ready=false, age_exceeded_ready=false, age_exceeded_not_ready=true
+  - Result: false OR true = true
 ```

-**Test 5: Not-Ready cluster with generation match and max age not expired → skip**
+**Test 5: Not-Ready resource with generation match and age not exceeded → skip**
 ```
 Given:
-  - Cluster status.phase: NotReady
-  - cluster.generation = 1
-  - cluster.status.observed_generation = 1
-  - cluster.status.last_updated_time = now() - 5s
-  - max_age_not_ready = 10s
+  - Resource Ready condition status: False
+  - resource.generation = 1
+  - Available.observed_generation = 1
+  - conditionTime("Ready") = now() - 5s (age < 10s)
 Then:
   - Decision: SKIP
-  - Reason: "max age not expired"
-  - Next event: now() + 5s
+  - Reason: "message decision not matched"
+  - Params evaluated: ref_time, is_ready=false, age_exceeded_ready=false, age_exceeded_not_ready=false
+  - Result: false OR false = false
 ```

-**Test 6: Ready cluster with generation match and max age expired → publish**
+**Test 6: Ready resource with generation match and age exceeded → publish**
 ```
 Given:
-  - Cluster status.phase: Ready
-  - cluster.generation = 1
-  - cluster.status.observed_generation = 1
-  - cluster.status.last_updated_time = now() - 31m
-  - max_age_ready = 30m
+  - Resource Ready condition status: True
+  - resource.generation = 1
+  - Available.observed_generation = 1
+  - conditionTime("Ready") = now() - 31m (age > 30m)
 Then:
   - Decision: PUBLISH
-  - Reason: "max age expired (ready)"
+  - Reason: "message decision matched"
+  - Params evaluated: ref_time, is_ready=true, age_exceeded_ready=true, age_exceeded_not_ready=false
+  - Result: true OR false = true
 ```

 ### Edge Cases
@@ -851,48 +896,79 @@ Then:
 **Test 7: observed_generation ahead of generation (should not happen, but handle gracefully)**
 ```
 Given:
-  - Cluster status.phase: Ready
-  - cluster.generation = 1
-  - cluster.status.observed_generation = 2
+  - Resource Ready condition status: True
+  - resource.generation = 1
+  - Available.observed_generation = 2
 Then:
-  - Decision: SKIP (treat as match)
+  - Decision: evaluate message_decision (treat as generation match)
   - Log warning: "observed_generation ahead of generation - potential API issue"
 ```

 **Test 8: Missing observed_generation (initial state)**
 ```
 Given:
-  - Cluster status.phase: NotReady
-  - cluster.generation = 1
-  - cluster.status.observed_generation = 0 (or nil)
-  - cluster.status.last_updated_time = now() - 2s
-  - max_age_not_ready = 10s
+  - Resource Ready condition status: False
+  - resource.generation = 1
+  - Available.observed_generation = 0 (or nil)
 Then:
   - Decision: PUBLISH
   - Reason: "generation changed - new spec to reconcile"
 ```

+**Test 9: Param evaluation failure (condition not found on resource)**
+```
+Given:
+  - Resource has no Ready condition
+  - resource.generation = 1
+  - Available.observed_generation = 1
+Then:
+  - Param evaluation fails (ref_time returns error)
+  - Decision: PUBLISH (fail-safe — publish on evaluation error)
+```
+
+**Test 10: CEL expression compilation failure at startup**
+```
+Given:
+  - Configuration contains invalid CEL expression in params or result
+Then:
+  - Sentinel exits with error at startup (fail-fast)
+  - Clear error message indicating which param/expression failed
+```
+
+**Test 11: Circular param dependency at startup**
+```
+Given:
+  - Param A references Param B, and Param B references Param A
+Then:
+  - Sentinel exits with error at startup (fail-fast)
+  - Clear error message indicating the circular dependency
+```
+
 ### Test Requirements

 **Unit Tests** (Decision Engine):

 The Decision Engine logic should be tested with unit tests covering:
-- All decision paths: generation check → max age check → skip
-- All reasons logged correctly (which condition triggered the event)
+- All decision paths: generation check → param evaluation → result evaluation → skip
+- CEL expression compilation (valid and invalid expressions)
+- CEL custom function behavior (condition, status, conditionTime)
+- Param dependency resolution (topological sort, cycle detection)
+- Duration literal detection and conversion
+- Result expression AND/OR conversion
+- Param evaluation failure handling (fail-safe publish)
 - Edge cases handled gracefully (observed_generation ahead, missing, etc.)
-- 100% code coverage on the Decision Engine package

 **Integration Tests** (End-to-End):

 Integration tests should verify the complete Sentinel workflow:

-1. **Event Publishing**: Sentinel successfully publishes CloudEvents to the message broker when decision logic indicates an event should be published
+1. **Event Publishing**: Sentinel successfully publishes CloudEvents to the message broker when message decision result is true

-2. **Generation-based triggering priority**: When a cluster spec changes (generation increments), Sentinel publishes an event immediately regardless of max age intervals
+2. **Generation-based triggering priority**: When a resource spec changes (generation increments), Sentinel publishes an event immediately regardless of message decision

-3. **Max age intervals**: When generation matches observed_generation, Sentinel respects max age intervals before publishing the next event
+3. **Message decision evaluation**: When generation matches `Available.observed_generation`, Sentinel evaluates message_decision params and result to determine whether to publish

-4. **Adapter feedback loop**: Adapters receive events, process resources, and update observed_generation correctly, which Sentinel reads in subsequent polls
+4. **Adapter feedback loop**: Adapters receive events, process resources, and update `observed_generation` correctly, which the API aggregates into `Available.observed_generation` for Sentinel to read in subsequent polls

 ---

From eed076752b41cf791ee6fdf955f165f24ca0c0f9 Mon Sep 17 00:00:00 2001
From: Rafael Benevides
Date: Tue, 17 Mar 2026 16:29:57 -0300
Subject: [PATCH 2/3] HYPERFLEET-537 - docs: remove sentinel-config.yaml example file

Remove standalone config file example from architecture repo to avoid sync issues. Config examples remain inline in sentinel.md and architecture-summary.md. Detailed config lives in the sentinel repo.
---
 .../components/sentinel/sentinel-config.yaml | 72 -------------------
 1 file changed, 72 deletions(-)
 delete mode 100644 hyperfleet/components/sentinel/sentinel-config.yaml

diff --git a/hyperfleet/components/sentinel/sentinel-config.yaml b/hyperfleet/components/sentinel/sentinel-config.yaml
deleted file mode 100644
index 45d2d79..0000000
--- a/hyperfleet/components/sentinel/sentinel-config.yaml
+++ /dev/null
@@ -1,72 +0,0 @@
-# Sentinel Configuration
-# Simple YAML configuration for HyperFleet Sentinel service
-# This replaces the previous CRD-based configuration approach
-#
-# Note: Broker configuration is in a separate file (broker-config.yaml)
-# to allow sharing the same broker config between Sentinel and Adapters
-
-# === RESOURCE MONITORING ===
-resource_type: clusters  # Resource to watch: clusters, nodepools, manifests, workloads
-
-# === POLLING CONFIGURATION ===
-poll_interval: 5s  # How often to check the API
-
-# === MESSAGE DECISION ===
-# Configurable decision logic that determines when to publish reconciliation events.
-# Uses named params (CEL expressions or duration literals) and a boolean result expression.
-# Params can reference other params (evaluated in dependency order).
-# The result expression combines params with AND/OR operators.
-# All expressions are compiled at startup; invalid expressions cause fail-fast.
-#
-# Available CEL variables:
-#   resource - the resource map (id, kind, status, status.conditions, etc.)
-#   now - current timestamp
-#
-# Available CEL helper functions:
-#   condition(resource, type) → map - returns full condition map for a given type
-#   status(resource, type) → string - returns the status string of a condition
-#   conditionTime(resource, type) → string - returns last_updated_time (RFC3339)
-#
-# Duration literals (e.g., 30m, 10s) are auto-detected and made available as CEL durations.
-message_decision:
-  params:
-    ref_time: 'conditionTime(resource, "Ready")'
-    is_ready: 'status(resource, "Ready") == "True"'
-    age_exceeded_ready: 'is_ready && now - timestamp(ref_time) > duration("30m")'
-    age_exceeded_not_ready: '!is_ready && now - timestamp(ref_time) > duration("10s")'
-  result: age_exceeded_ready OR age_exceeded_not_ready
-
-# === RESOURCE SELECTOR (optional) ===
-# Label selector for filtering which resources this Sentinel instance monitors
-# Note: This is NOT true sharding - there is no component ensuring that all
-# sentinels collectively select all resources without gaps or overlaps.
-# Multiple Sentinels can watch overlapping or different resource sets.
-#
-# Format: List of label/value pairs
-# Each entry specifies a label key and value that resources must match
-# All labels specified must match (AND logic)
-resource_selector:
-  - label: region
-    value: us-east
-
-# === HYPERFLEET API ===
-hyperfleet_api:
-  endpoint: https://api.hyperfleet.example.com
-  timeout: 30s
-  # token: Override via HYPERFLEET_API_TOKEN="secret-token"
-
-# === MESSAGE DATA COMPOSITION ===
-# Defines how to construct the CloudEvent data payload from resource fields
-# This allows Sentinel to be generic across different resource types
-# Uses Go template syntax with dot notation for field access
-#
-# Template supports dot notation for nested fields:
-#   - .id → Direct field from resource
-#   - .ownerResource.id → Nested field
-#   - .metadata.labels.region → Label value
-message_data:
-  resource_id: .id
-  resource_type: .kind
-  cluster_id: .ownerResource.id  # For nodepools: link to parent cluster
-  generation: .generation  # Track resource version for stale event handling
-  region: .metadata.labels.region
\ No newline at end of file

From 5a0836cf63c7e92651d07fc36789ec6e3c6c7fcd Mon Sep 17 00:00:00 2001
From: Rafael Benevides
Date: Tue, 17 Mar 2026 16:33:17 -0300
Subject: [PATCH 3/3] HYPERFLEET-537 - docs: remove sentinel-deployment.md to avoid drift

Remove deployment doc from architecture repo. Operational details live in the sentinel repo's operator guide. Update references in metrics.md, health-endpoints.md, and sentinel.md to point there.
---
 .../sentinel/sentinel-deployment.md | 233 ------------------
 hyperfleet/components/sentinel/sentinel.md | 9 +-
 hyperfleet/standards/health-endpoints.md | 2 +-
 hyperfleet/standards/metrics.md | 2 +-
 4 files changed, 3 insertions(+), 243 deletions(-)
 delete mode 100644 hyperfleet/components/sentinel/sentinel-deployment.md

diff --git a/hyperfleet/components/sentinel/sentinel-deployment.md b/hyperfleet/components/sentinel/sentinel-deployment.md
deleted file mode 100644
index fa895a0..0000000
--- a/hyperfleet/components/sentinel/sentinel-deployment.md
+++ /dev/null
@@ -1,233 +0,0 @@
-# Sentinel Service Deployment
-
-This document provides Kubernetes deployment manifests and configuration examples for the HyperFleet Sentinel service.
-
-For the main Sentinel architecture and design documentation, see [sentinel.md](./sentinel.md).
-
----
-
-## Kubernetes Deployment (Single Replica, No Leader Election)
-
-```yaml
-apiVersion: apps/v1
-kind: Deployment
-metadata:
-  name: cluster-sentinel
-  namespace: hyperfleet-system
-  labels:
-    app: cluster-sentinel
-    app.kubernetes.io/name: hyperfleet-sentinel
-    sentinel.hyperfleet.io/resource-type: clusters
-spec:
-  replicas: 1  # Single replica per resource selector
-  selector:
-    matchLabels:
-      app: cluster-sentinel
-  template:
-    metadata:
-      labels:
-        app: cluster-sentinel
-    spec:
-      serviceAccountName: hyperfleet-sentinel
-      terminationGracePeriodSeconds: 30
-      containers:
-        - name: sentinel
-          image: quay.io/hyperfleet/sentinel:v1.0.0
-          imagePullPolicy: IfNotPresent
-          command:
-            - /sentinel
-          args:
-            - --config=/etc/sentinel/config.yaml  # Path to YAML config file
-            - --metrics-bind-address=:9090
-            - --health-probe-bind-address=:8080
-          envFrom:
-            # Broker configuration (BROKER_TYPE, BROKER_PROJECT_ID, BROKER_HOST, etc.)
-            - configMapRef:
-                name: hyperfleet-sentinel-broker
-          env:
-            # HYPERFLEET_API_TOKEN="secret-token"
-            - name: HYPERFLEET_API_TOKEN
-              valueFrom:
-                secretKeyRef:
-                  name: sentinel-secrets
-                  key: api-token
-            # GCP_PROJECT_ID="production-project"
-            - name: GCP_PROJECT_ID
-              valueFrom:
-                configMapKeyRef:
-                  name: sentinel-config
-                  key: gcp-project-id
-            # BROKER_CREDENTIALS="path/to/credentials.json" (only if broker requires secret credentials)
-            - name: BROKER_CREDENTIALS
-              valueFrom:
-                secretKeyRef:
-                  name: sentinel-secrets
-                  key: broker-credentials
-          volumeMounts:
-            - name: config
-              mountPath: /etc/sentinel
-              readOnly: true
-            - name: gcp-credentials
-              mountPath: /var/secrets/google
-              readOnly: true
-          ports:
-            - containerPort: 9090
-              name: metrics
-              protocol: TCP
-            - containerPort: 8080
-              name: health
-              protocol: TCP
-          livenessProbe:
-            httpGet:
-              path: /healthz
-              port: health
-            initialDelaySeconds: 15
-            periodSeconds: 20
-          readinessProbe:
-            httpGet:
-              path: /readyz
-              port: health
-            initialDelaySeconds: 5
-            periodSeconds: 10
-          resources:
-            requests:
-              cpu: 50m
-              memory: 64Mi
-            limits:
-              cpu: 100m
-              memory: 128Mi
-      volumes:
-        - name: config
-          configMap:
-            name: sentinel-config
-        - name: gcp-credentials
-          secret:
-            secretName: gcp-pubsub-credentials
-```
-
----
-
-## ServiceAccount and RBAC
-
-```yaml
-apiVersion: v1
-kind: ServiceAccount
-metadata:
-  name: hyperfleet-sentinel
-  namespace: hyperfleet-system
----
-# ConfigMap with sentinel configuration
-apiVersion: v1
-kind: ConfigMap
-metadata:
-  name: sentinel-config
-  namespace: hyperfleet-system
-data:
-  config.yaml: |
-    resource_type: clusters
-    poll_interval: 5s
-
-    message_decision:
-      params:
-        ref_time: 'conditionTime(resource, "Ready")'
-        is_ready: 'status(resource, "Ready") == "True"'
-        age_exceeded_ready: 'is_ready && now - timestamp(ref_time) > duration("30m")'
-        age_exceeded_not_ready: '!is_ready && now - timestamp(ref_time) > duration("10s")'
-      result: age_exceeded_ready OR age_exceeded_not_ready
-
-    resource_selector:
-      - label: region
-        value: us-east
-
-    hyperfleet_api:
-      endpoint: http://hyperfleet-api.hyperfleet-system.svc.cluster.local:8080
-      timeout: 30s
-
-    message_data:
-      resource_id: .id
-      resource_type: .kind
-      generation: .generation
-      region: .metadata.labels.region
-  gcp-project-id: "hyperfleet-prod"
----
-# Broker configuration (Sentinel-specific)
-# Choose one based on your environment:
-
-# Google Cloud Pub/Sub:
-apiVersion: v1
-kind: ConfigMap
-metadata:
-  name: hyperfleet-sentinel-broker
-  namespace: hyperfleet-system
-data:
-  BROKER_TYPE: "pubsub"
-  BROKER_PROJECT_ID: "hyperfleet-prod"
-
----
-# RabbitMQ:
-apiVersion: v1
-kind: ConfigMap
-metadata:
-  name: hyperfleet-sentinel-broker
-  namespace: hyperfleet-system
-data:
-  BROKER_TYPE: "rabbitmq"
-  BROKER_HOST: "rabbitmq.hyperfleet-system.svc.cluster.local"
-  BROKER_PORT: "5672"
-  BROKER_VHOST: "/"
-  BROKER_EXCHANGE: "hyperfleet-events"
-  BROKER_EXCHANGE_TYPE: "fanout"
-
----
-# Secret with sensitive configuration
-apiVersion: v1
-kind: Secret
-metadata:
-  name: sentinel-secrets
-  namespace: hyperfleet-system
-type: Opaque
-data:
-  api-token:
-  broker-credentials:
-```
-
-**Note**: No RBAC needed since the service only reads configuration from mounted ConfigMap and Secret volumes. No Kubernetes API access required.
-
-**Broker Configuration**: Sentinel uses a separate `hyperfleet-sentinel-broker` ConfigMap. Adapters have their own broker ConfigMap (`hyperfleet-adapter-broker`) with different fields:
-- **Sentinel** (publisher): Uses BROKER_TOPIC, BROKER_EXCHANGE to publish events
-- **Adapters** (consumers): Use BROKER_SUBSCRIPTION_ID, BROKER_QUEUE_NAME to consume events
-- **Common fields** (BROKER_TYPE, BROKER_PROJECT_ID, BROKER_HOST) are duplicated in both ConfigMaps for simplicity
-
-> **Note:** For topic naming conventions and multi-tenant isolation strategies, see [Naming Strategy](./sentinel-naming-strategy.md).
-
----
-
-## Metrics and Observability
-
-### Prometheus Metrics
-
-The Sentinel service must expose the following Prometheus metrics:
-
-| Metric Name | Type | Labels | Description |
-|-------------|------|--------|-------------|
-| `hyperfleet_sentinel_pending_resources` | Gauge | `component`, `version`, `resource_selector`, `resource_type` | Number of resources matching this selector |
-| `hyperfleet_sentinel_events_published_total` | Counter | `component`, `version`, `resource_selector`, `resource_type` | Total number of events published to broker |
-| `hyperfleet_sentinel_resources_skipped_total` | Counter | `component`, `version`, `resource_selector`, `resource_type`, `reason` | Total number of resources skipped (matched rule name or "no_match" fallback) |
-| `hyperfleet_sentinel_poll_duration_seconds` | Histogram | `component`, `version`, `resource_selector`, `resource_type` | Time spent in each polling cycle |
-| `hyperfleet_sentinel_api_errors_total` | Counter | `component`, `version`, `resource_selector`, `resource_type`, `operation` | Total API errors by operation (fetch_resources, config_load) |
-| `hyperfleet_sentinel_broker_errors_total` | Counter | `component`, `version`, `resource_selector`, `resource_type`, `broker_type` | Total broker publishing errors |
-| `hyperfleet_sentinel_config_loads_total` | Counter | `component`, `version`, `resource_selector`, `resource_type` | Total configuration loads at startup |
-
-**Implementation Requirements**:
-- Use standard Prometheus Go client library
-- All metrics must include `component` and `version` labels (see [Metrics Standard](../../standards/metrics.md))
-- All metrics must include `resource_selector` label (from resource_selector string)
-- All metrics must include `resource_type` label (from configuration resource_type field)
-- `reason` label values: decision reason (e.g., "generation_mismatch", "message_decision" for published; param-specific reasons for skipped)
-- `operation` label values: "fetch_resources", "config_load"
-- `broker_type` label values: "pubsub", "rabbitmq"
-- Expose metrics endpoint on port 9090 at `/metrics`
-
-For complete health and readiness endpoint standards, see [Health Endpoints Specification](../../standards/health-endpoints.md).
-
-For cross-component metrics conventions, see [HyperFleet Metrics Standard](../../standards/metrics.md).
diff --git a/hyperfleet/components/sentinel/sentinel.md b/hyperfleet/components/sentinel/sentinel.md
index 40bb4a2..4b52750 100644
--- a/hyperfleet/components/sentinel/sentinel.md
+++ b/hyperfleet/components/sentinel/sentinel.md
@@ -974,14 +974,7 @@ Integration tests should verify the complete Sentinel workflow:

 ## Service Deployment

-For complete Kubernetes deployment manifests, configuration examples, and observability setup, see [sentinel-deployment.md](./sentinel-deployment.md).
-
-The deployment documentation includes:
-- Kubernetes Deployment manifests
-- ServiceAccount and RBAC configuration
-- ConfigMap examples for Sentinel and broker configuration
-- Prometheus metrics specification
-- Health probe configuration
+For complete Kubernetes deployment manifests, configuration examples, and observability setup, see the [Sentinel Operator Guide](https://github.com/openshift-hyperfleet/hyperfleet-sentinel/blob/main/docs/sentinel-operator-guide.md) in the sentinel repository.

 For logging configuration standards, see [Logging Specification](../../standards/logging-specification.md).

diff --git a/hyperfleet/standards/health-endpoints.md b/hyperfleet/standards/health-endpoints.md
index d4ffd7e..fca107d 100644
--- a/hyperfleet/standards/health-endpoints.md
+++ b/hyperfleet/standards/health-endpoints.md
@@ -138,7 +138,7 @@ Or on failure:

 **Required Metrics**: See component-specific documentation:

-- [Sentinel Deployment](../components/sentinel/sentinel-deployment.md) - Sentinel metrics
+- [Sentinel Operator Guide](https://github.com/openshift-hyperfleet/hyperfleet-sentinel/blob/main/docs/sentinel-operator-guide.md) - Sentinel metrics
 - [Adapter Metrics](../components/adapter/framework/adapter-metrics.md) - Adapter metrics

 ---
diff --git a/hyperfleet/standards/metrics.md b/hyperfleet/standards/metrics.md
index 7ff6f4d..cd4fe10 100644
--- a/hyperfleet/standards/metrics.md
+++ b/hyperfleet/standards/metrics.md
@@ -230,7 +230,7 @@ All metrics MUST be compatible with OpenMetrics format. The Prometheus Go client

 For detailed metrics definitions per component, see:

-- **Sentinel**: [Sentinel Deployment](../components/sentinel/sentinel-deployment.md#metrics-and-observability)
+- **Sentinel**: [Sentinel Operator Guide](https://github.com/openshift-hyperfleet/hyperfleet-sentinel/blob/main/docs/sentinel-operator-guide.md)
 - **Adapters**: [Adapter Metrics](../components/adapter/framework/adapter-metrics.md)

 ---