Skip to content

Hyperfleet-557 : Document Sentinel Reliability and Observability#73

Open
tirthct wants to merge 3 commits intoopenshift-hyperfleet:mainfrom
tirthct:hyperfleet-557
Open

Hyperfleet-557 : Document Sentinel Reliability and Observability#73
tirthct wants to merge 3 commits intoopenshift-hyperfleet:mainfrom
tirthct:hyperfleet-557

Conversation

@tirthct
Copy link
Contributor

@tirthct tirthct commented Mar 12, 2026

Changed

  • Add documentation for Sentinel reliability and observability

@openshift-ci openshift-ci bot requested review from jsell-rh and vkareh March 12, 2026 18:27
@openshift-ci
Copy link

openshift-ci bot commented Mar 12, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign yasun1 for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


**Implementation**:

**Liveness Probe** (`/healthz`):
Copy link
Contributor

@rafabene rafabene Mar 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The health check descriptions here don't match the actual implementation:

/healthz: The code (internal/health/health.go) only checks poll staleness (time since last
successful poll vs 3 * PollInterval). It does not check broker connection status — that check
is registered on /readyz. Suggestion:

  - Checks poll staleness (dead man's switch)
  - Returns 200 OK if last successful poll is within threshold (3 × poll_interval)
  - Returns 200 OK before first poll completes (grace period)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed the health check description

- **Failure threshold**: 3 consecutive failures
- **Period**: 20 seconds

**Readiness Probe** (`/readyz`):
Copy link
Contributor

@rafabene rafabene Mar 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/readyz: The code only registers two checks: "broker" (calls pub.Health(ctx)) and
"sentinel_poll" (verifies at least one successful poll). There is no explicit "configuration
loaded" or "API connectivity" check. Suggestion:

  - Checks broker connection health
  - Verifies at least one successful poll cycle has completed
  - Returns 200 OK when both checks pass

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, yes. Modified the description


**What**: Sentinel responds to SIGTERM/SIGINT signals with controlled shutdown.

**Implementation**:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Priority: Bug

Line 46: "Publishes any pending events before shutdown" — this doesn't match the shutdown
implementation. The signal handler calls cancel() on the context immediately, then shuts down
the HTTP servers with a 20s timeout. There is no explicit event flush/drain step.

If a polling cycle is mid-execution when shutdown is triggered, in-flight publishes may
complete (since cancel() doesn't interrupt running goroutines mid-function), but they may also
fail because downstream calls receive a cancelled context. This is best-effort, not
guaranteed.

Suggested replacement:

  - Listens for termination signals during main polling loop
  - If a polling cycle is in progress, in-flight operations may complete (best-effort)
  - Maximum shutdown time: 20 seconds for HTTP server shutdown
  - Cleans up broker connections gracefully

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed "Publishes any pending events before shutdown"

docs/alerts.md Outdated
- alert: SentinelNoEventsPublished
expr: |
rate(hyperfleet_sentinel_events_published_total[15m]) == 0
AND hyperfleet_sentinel_pending_resources > 0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
AND hyperfleet_sentinel_pending_resources > 0

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed


## Alert Rules Reference

The following 8 alert rules provide comprehensive monitoring for production Sentinel deployments.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The doc says "8 alert rules" but the PrometheusRule manifest only includes 6. Missing:
SentinelDown and SentinelHighSkipRatio. An operator deploying from this manifest will have
incomplete coverage.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought this snippet from the document served as "alert examples" (mentioned above) as we have alerts yaml defined under charts/templates/prometheusrule.yaml but yes, this can be confusing if someone tries to use the same manifest. Added the remaining rules here


## Purpose

This runbook provides operational guidance for teams deploying and managing HyperFleet Sentinel in production environments. It serves as the primary reference for:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Priority: Pattern

The Purpose section has a dangling colon with no list items — looks like the content got
truncated. Either add the intended list or rephrase the sentence:

Suggested change
This runbook provides operational guidance for teams deploying and managing HyperFleet Sentinel in production environments. It serves as the primary reference for:
This runbook provides operational guidance for teams deploying and managing HyperFleet Sentinel in production environments. It serves as the primary reference for:
- Understanding built-in reliability features
- Configuring health probes and monitoring
- Diagnosing and recovering from common failure modes

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added the description

region: "resource.labels.region" # CEL expression accessing nested labels.region field
```

### Multi-Region Configuration
Copy link
Contributor

@rafabene rafabene Mar 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As the audience for this document is for developers that want to run it locally, I think we can drop or move this "Multi-region configuration" to maybe sentinel-operator-guide.md.

I think lines 553 to 573 of sentinel-operator.guide.md cover it

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In fact I see that this is already documented in multi-instance-deployment.md.

Maybe it worths a note in sentinel-operator-guide.md:

  ### Multi-Region Configuration

  For multi-region deployment examples using `resource_selector`, see [Resource Selector
  Strategies](multi-instance-deployment.md#resource-selector-strategies).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants