CNF-23436: Add liveness and readiness probes to operator deployment#417
CNF-23436: Add liveness and readiness probes to operator deployment#417sebrandon1 wants to merge 1 commit into
Conversation
|
@sebrandon1: This pull request references CNF-23436 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "5.0.0" version, but no target version was set. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Repository: openshift/coderabbit/.coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (2)
🚧 Files skipped from review as they are similar to previous changes (2)
WalkthroughAdds HTTPS liveness ( ChangesHealth Probes Configuration
Estimated code review effort🎯 2 (Simple) | ⏱️ ~10 minutes Suggested labels: Suggested reviewers:
🚥 Pre-merge checks | ✅ 12✅ Passed checks (12 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: sebrandon1 The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
e2f1df3 to
cc2910e
Compare
There was a problem hiding this comment.
🧹 Nitpick comments (1)
config/manager/manager.yaml (1)
114-122: ⚡ Quick winTune readiness probe for faster drain on shutdown.
To better align with graceful termination, Line 120 and Line 122 are a bit slow (
10s * 3worst-case before NotReady). Consider faster readiness failure so endpoints stop routing sooner.Suggested tweak
readinessProbe: httpGet: path: /readyz port: https scheme: HTTPS initialDelaySeconds: 5 - periodSeconds: 10 + periodSeconds: 5 timeoutSeconds: 5 - failureThreshold: 3 + failureThreshold: 1🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@config/manager/manager.yaml` around lines 114 - 122, The readinessProbe for the manager (httpGet path "/readyz", scheme HTTPS) is too slow to mark Pod NotReady during shutdown; adjust readinessProbe settings to fail faster by lowering periodSeconds (e.g., from 10 to 2–3), reducing failureThreshold (e.g., from 3 to 1–2) and/or decreasing timeoutSeconds to ensure the probe transitions to NotReady quickly so endpoints are drained sooner; update the readinessProbe block (httpGet path /readyz, initialDelaySeconds, periodSeconds, timeoutSeconds, failureThreshold) accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@config/manager/manager.yaml`:
- Around line 114-122: The readinessProbe for the manager (httpGet path
"/readyz", scheme HTTPS) is too slow to mark Pod NotReady during shutdown;
adjust readinessProbe settings to fail faster by lowering periodSeconds (e.g.,
from 10 to 2–3), reducing failureThreshold (e.g., from 3 to 1–2) and/or
decreasing timeoutSeconds to ensure the probe transitions to NotReady quickly so
endpoints are drained sooner; update the readinessProbe block (httpGet path
/readyz, initialDelaySeconds, periodSeconds, timeoutSeconds, failureThreshold)
accordingly.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: f1db3899-53e4-40c3-a38c-c2fc93c4f11f
📒 Files selected for processing (2)
bundle/manifests/cert-manager-operator.clusterserviceversion.yamlconfig/manager/manager.yaml
cc2910e to
d9d40bd
Compare
|
@sebrandon1: all tests passed! Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
Summary
The operator deployment currently has no health probes, so Kubernetes cannot detect if the operator process is stuck or not yet ready to serve. All cert-manager operands (controller, webhook, cainjector, trust-manager, istio-csr) already have probes configured — the operator itself is the only component missing them.
The library-go
controllercmdframework already serves/healthzand/readyzover HTTPS on port 8443 via its GenericAPIServer, so no Go code changes are needed./healthz(ping, log, post-start hooks)/readyz(same checks +shutdown, so the pod drains traffic during graceful termination)Test plan
Tested locally against an OCP 4.22 cluster:
/healthzand/readyzreturn 200 when operator is healthySummary by CodeRabbit