Hyperfleet-557 : Document Sentinel Reliability and Observability by tirthct · Pull Request #73 · openshift-hyperfleet/hyperfleet-sentinel

tirthct · 2026-03-12T18:27:07Z

Changed

Add documentation for Sentinel reliability and observability


openshift-ci · 2026-03-12T18:27:16Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign yasun1 for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

rafabene · 2026-03-12T18:50:14Z

docs/runbook.md

+
+**Implementation**:
+
+**Liveness Probe** (`/healthz`):


The health check descriptions here don't match the actual implementation:

/healthz: The code (internal/health/health.go) only checks poll staleness (time since last
successful poll vs 3 * PollInterval). It does not check broker connection status — that check
is registered on /readyz. Suggestion:

- Checks poll staleness (dead man's switch)
- Returns 200 OK if last successful poll is within threshold (3 × poll_interval)
- Returns 200 OK before first poll completes (grace period)

Fixed the health check description

rafabene · 2026-03-12T18:50:46Z

docs/runbook.md

+- **Failure threshold**: 3 consecutive failures
+- **Period**: 20 seconds
+
+**Readiness Probe** (`/readyz`):


/readyz: The code only registers two checks: "broker" (calls pub.Health(ctx)) and
"sentinel_poll" (verifies at least one successful poll). There is no explicit "configuration
loaded" or "API connectivity" check. Suggestion:

- Checks broker connection health
- Verifies at least one successful poll cycle has completed
- Returns 200 OK when both checks pass

Ah, yes. Modified the description

rafabene · 2026-03-12T18:52:16Z

docs/runbook.md

+
+**What**: Sentinel responds to SIGTERM/SIGINT signals with controlled shutdown.
+
+**Implementation**:


Priority: Bug

Line 46: "Publishes any pending events before shutdown" — this doesn't match the shutdown
implementation. The signal handler calls cancel() on the context immediately, then shuts down
the HTTP servers with a 20s timeout. There is no explicit event flush/drain step.

If a polling cycle is mid-execution when shutdown is triggered, in-flight publishes may
complete (since cancel() doesn't interrupt running goroutines mid-function), but they may also
fail because downstream calls receive a cancelled context. This is best-effort, not
guaranteed.

Suggested replacement:

- Listens for termination signals during main polling loop
- If a polling cycle is in progress, in-flight operations may complete (best-effort)
- Maximum shutdown time: 20 seconds for HTTP server shutdown
- Cleans up broker connections gracefully

Removed "Publishes any pending events before shutdown"


rafabene · 2026-03-12T18:58:34Z

docs/alerts.md

+    - alert: SentinelNoEventsPublished
+      expr: |
+        rate(hyperfleet_sentinel_events_published_total[15m]) == 0
+        AND hyperfleet_sentinel_pending_resources > 0


Suggested change

AND hyperfleet_sentinel_pending_resources > 0

rafabene · 2026-03-12T18:59:53Z

docs/alerts.md

+
+## Alert Rules Reference
+
+The following 8 alert rules provide comprehensive monitoring for production Sentinel deployments.


The doc says "8 alert rules" but the PrometheusRule manifest only includes 6. Missing:
SentinelDown and SentinelHighSkipRatio. An operator deploying from this manifest will have
incomplete coverage.

I intended this snippet as the "alert examples" mentioned above, since the actual alert YAML is defined under charts/templates/prometheusrule.yaml. But yes, it can be confusing if someone deploys from this manifest. Added the remaining rules here.


rafabene · 2026-03-12T19:03:46Z

docs/runbook.md

+
+## Purpose
+
+This runbook provides operational guidance for teams deploying and managing HyperFleet Sentinel in production environments. It serves as the primary reference for:


Priority: Pattern

The Purpose section has a dangling colon with no list items — looks like the content got
truncated. Either add the intended list or rephrase the sentence:

Suggested change

This runbook provides operational guidance for teams deploying and managing HyperFleet Sentinel in production environments. It serves as the primary reference for:

This runbook provides operational guidance for teams deploying and managing HyperFleet Sentinel in production environments. It serves as the primary reference for:
- Understanding built-in reliability features
- Configuring health probes and monitoring
- Diagnosing and recovering from common failure modes

Added the description


rafabene · 2026-03-13T16:08:42Z

docs/running-sentinel.md

+  region: "resource.labels.region" # CEL expression accessing nested labels.region field
+```
+
+### Multi-Region Configuration


Since the audience for this document is developers who want to run Sentinel locally, I think we can drop this "Multi-Region Configuration" section or move it, maybe to sentinel-operator-guide.md.

I think lines 553 to 573 of sentinel-operator-guide.md cover it.

In fact, I see that this is already documented in multi-instance-deployment.md.

Maybe it is worth a note in sentinel-operator-guide.md:

### Multi-Region Configuration

For multi-region deployment examples using `resource_selector`, see [Resource Selector Strategies](multi-instance-deployment.md#resource-selector-strategies).

tirthct added 2 commits March 12, 2026 13:22

Hyperfleet-542 : HYPERFLEET-557: Document Sentinel Reliability and Ob…

7b937d0

…servability

HYPERFLEET-557: Document Remove unnecessary section

ee12151

openshift-ci bot requested review from jsell-rh and vkareh March 12, 2026 18:27


HYPERFLEET-557: Remove incorrect healthz and authz statement, fix pro…

f075f94

…mQL query and add component to each alert

