CNF-23436: Add liveness and readiness probes to operator deployment by sebrandon1 · Pull Request #417 · openshift/cert-manager-operator

sebrandon1 · 2026-05-01T16:41:20Z

Summary

The operator deployment currently has no health probes, so Kubernetes cannot detect if the operator process is stuck or not yet ready to serve. All cert-manager operands (controller, webhook, cainjector, trust-manager, istio-csr) already have probes configured — the operator itself is the only component missing them.

The library-go controllercmd framework already serves /healthz and /readyz over HTTPS on port 8443 via its GenericAPIServer, so no Go code changes are needed.

Liveness → /healthz (ping, log, post-start hooks)
Readiness → /readyz (same checks + shutdown, so the pod drains traffic during graceful termination)
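
As a rough sketch, the two probes on the operator container could look like the fragment below. The readiness values mirror those quoted in the CodeRabbit review of config/manager/manager.yaml in this thread. The liveness timing values are not shown anywhere in the thread and are assumed here to match the readiness ones. The named port `https` is taken to map to 8443, as the walkthrough states.

```yaml
# Illustrative fragment of the cert-manager-operator container spec.
# This is a sketch, not the exact diff from the PR.
livenessProbe:
  httpGet:
    path: /healthz
    port: https              # named container port (8443)
    scheme: HTTPS
  initialDelaySeconds: 5     # assumed: liveness timings are not quoted in the thread
  periodSeconds: 10          # assumed
  timeoutSeconds: 5          # assumed
  failureThreshold: 3        # assumed
readinessProbe:
  httpGet:
    path: /readyz
    port: https
    scheme: HTTPS
  initialDelaySeconds: 5     # readiness values as quoted in the review diff
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 3
```

Referring to the port by name rather than by number keeps the probes valid if the container port number changes.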

Test plan

Tested locally against an OCP 4.22 cluster:

$ curl -sk "https://localhost:8443/healthz?verbose"
[+]ping ok
[+]log ok
[+]poststarthook/max-in-flight-filter ok
[+]poststarthook/storage-object-count-tracker-hook ok
healthz check passed

$ curl -sk "https://localhost:8443/readyz?verbose"
[+]ping ok
[+]log ok
[+]poststarthook/max-in-flight-filter ok
[+]poststarthook/storage-object-count-tracker-hook ok
[+]shutdown ok
readyz check passed

- Operator deploys and reports ready
- /healthz and /readyz return 200 when operator is healthy
- Pod is restarted by kubelet when liveness probe fails
- Pod is removed from service endpoints during graceful shutdown via readyz shutdown check

Summary by CodeRabbit

Chores
- Added HTTPS liveness (/healthz) and readiness (/readyz) probes to the cert-manager-operator container on the existing HTTPS port. The probes set initial delay, interval, timeout, and failure-threshold values to improve health monitoring, report readiness accurately during startup, and reduce false failure detections during rolling updates.

openshift-ci-robot · 2026-05-01T16:41:24Z

@sebrandon1: This pull request references CNF-23436 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "5.0.0" version, but no target version was set.



coderabbitai · 2026-05-01T16:41:32Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: bc6779f3-f1c8-4f09-9c49-4d2c142511ac

📥 Commits

Reviewing files that changed from the base of the PR and between cc2910e and d9d40bd.

📒 Files selected for processing (2)

bundle/manifests/cert-manager-operator.clusterserviceversion.yaml
config/manager/manager.yaml

🚧 Files skipped from review as they are similar to previous changes (2)

bundle/manifests/cert-manager-operator.clusterserviceversion.yaml
config/manager/manager.yaml

Walkthrough

Adds HTTPS liveness (/healthz) and readiness (/readyz) probes to the cert-manager-operator container in two Kubernetes manifests, targeting the existing named https port and specifying probe timing parameters (initialDelaySeconds, periodSeconds, timeoutSeconds, failureThreshold).

Changes

Health Probes Configuration

| Layer / File(s) | Summary |
| --- | --- |
| Container probe spec (dev/run manager) `config/manager/manager.yaml` | Adds `livenessProbe` (HTTPS GET `/healthz`) and `readinessProbe` (HTTPS GET `/readyz`) to the `cert-manager-operator` container. Probes use port name `https` and include `initialDelaySeconds`, `periodSeconds`, `timeoutSeconds`, and `failureThreshold`. |
| Container probe spec (CSV bundle) `bundle/manifests/cert-manager-operator.clusterserviceversion.yaml` | Same `livenessProbe` and `readinessProbe` blocks added to the CSV container spec, using HTTPS on the named `https` port (8443) with matching timing parameters. |
| CSV container field ordering `bundle/manifests/cert-manager-operator.clusterserviceversion.yaml` | Reorders container fields so `name` and `ports` are positioned adjacent to the newly inserted probe blocks. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Suggested labels: lgtm, qe-approved, approved, px-approved

Suggested reviewers:

swghosh
TrilokGeer
chiragkyal
bharath-b-rh

🚥 Pre-merge checks | ✅ 12

✅ Passed checks (12 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately and concisely summarizes the main change: adding liveness and readiness probes to operator deployment. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Stable And Deterministic Test Names | ✅ Passed | PR contains only Kubernetes manifest changes (health probes added to deployment config). No Ginkgo tests were modified or added. The custom check is not applicable to this PR. |
| Test Structure And Quality | ✅ Passed | PR contains only YAML manifest changes. The check for Ginkgo test code quality is not applicable since no test code was modified. |
| Microshift Test Compatibility | ✅ Passed | Custom check not applicable. The PR adds health probes to deployment manifests only (YAML changes), not Ginkgo e2e tests. Check specifically applies to new test code additions. |
| Single Node Openshift (Sno) Test Compatibility | ✅ Passed | Tests added use OpenShift APIs available on SNO. Scheduling tests verify config is applied to deployment spec only, not actual pod placement. SNO has control-plane node with required label. |
| Topology-Aware Scheduling Compatibility | ✅ Passed | PR adds only health probes; no topology-unfriendly scheduling constraints introduced. Compatible with all OpenShift topologies. |
| Ote Binary Stdout Contract | ✅ Passed | PR modifies only Kubernetes manifests (YAML files). No Go source code, process-level code, or logging configuration changes. The check is not applicable. |
| Ipv6 And Disconnected Network Test Compatibility | ✅ Passed | Custom check is not applicable: This PR contains only Kubernetes manifest changes (YAML files) with no new Ginkgo e2e tests. Check requires tests to be present. |


openshift-ci · 2026-05-01T16:41:41Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: sebrandon1
Once this PR has been reviewed and has the lgtm label, please assign swghosh for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

coderabbitai

🧹 Nitpick comments (1)

config/manager/manager.yaml (1)

114-122: ⚡ Quick win

Tune readiness probe for faster drain on shutdown.

To better align with graceful termination, consider a faster readiness failure so endpoints stop routing sooner. As configured, `periodSeconds` (line 120) and `failureThreshold` (line 122) are a bit slow: 10s × 3 means up to roughly 30s before the pod is marked NotReady.

Suggested tweak

           readinessProbe:
             httpGet:
               path: /readyz
               port: https
               scheme: HTTPS
             initialDelaySeconds: 5
-            periodSeconds: 10
+            periodSeconds: 5
             timeoutSeconds: 5
-            failureThreshold: 3
+            failureThreshold: 1
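
The worst-case figures behind this suggestion can be worked out roughly as follows. This assumes the delay before the pod is marked NotReady is about `periodSeconds` times `failureThreshold`, and it ignores probe timeouts and endpoint propagation delay:

```latex
\begin{align*}
t_{\text{NotReady}} &\approx \texttt{periodSeconds} \times \texttt{failureThreshold} \\
\text{current:}\quad t_{\text{NotReady}} &\approx 10\,\text{s} \times 3 = 30\,\text{s} \\
\text{suggested:}\quad t_{\text{NotReady}} &\approx 5\,\text{s} \times 1 = 5\,\text{s}
\end{align*}
```

The trade-off is that with `failureThreshold: 1` a single slow or dropped probe is enough to mark the pod NotReady.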

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@config/manager/manager.yaml` around lines 114 - 122, The readinessProbe for
the manager (httpGet path "/readyz", scheme HTTPS) is too slow to mark Pod
NotReady during shutdown; adjust readinessProbe settings to fail faster by
lowering periodSeconds (e.g., from 10 to 2–3), reducing failureThreshold (e.g.,
from 3 to 1–2) and/or decreasing timeoutSeconds to ensure the probe transitions
to NotReady quickly so endpoints are drained sooner; update the readinessProbe
block (httpGet path /readyz, initialDelaySeconds, periodSeconds, timeoutSeconds,
failureThreshold) accordingly.


ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: f1db3899-53e4-40c3-a38c-c2fc93c4f11f

📥 Commits

Reviewing files that changed from the base of the PR and between e2f1df3 and cc2910e.

📒 Files selected for processing (2)

bundle/manifests/cert-manager-operator.clusterserviceversion.yaml
config/manager/manager.yaml

openshift-ci · 2026-05-14T21:57:59Z

@sebrandon1: all tests passed!

Full PR test history. Your PR dashboard.


openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label May 1, 2026

openshift-ci Bot requested review from TrilokGeer and swghosh May 1, 2026 16:41

sebrandon1 force-pushed the add-operator-health-probes branch from e2f1df3 to cc2910e Compare May 5, 2026 22:37

coderabbitai Bot reviewed May 5, 2026


CNF-23436: Add liveness and readiness probes to operator deployment

d9d40bd

sebrandon1 force-pushed the add-operator-health-probes branch from cc2910e to d9d40bd Compare May 14, 2026 19:07

CNF-23436: Add liveness and readiness probes to operator deployment #417
sebrandon1 wants to merge 1 commit into openshift:master from sebrandon1:add-operator-health-probes

2 participants
