OCPBUGS-82110: fix service-ca-controller CrashLoop on MicroShift#344
Conversation
|
@sanchezl: This pull request references Jira Issue OCPBUGS-82110, which is invalid:
Comment The bug has been updated to refer to the pull request using the external bug tracker. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (11)
✅ Files skipped from review due to trivial changes (1)
🚧 Files skipped from review as they are similar to previous changes (6)
WalkthroughRefactors feature-gate handling to inject a map of enabled gates into the operator and controllers (via CLI args) instead of detecting gates at runtime via informers/CRDs; several function signatures and call sites now accept/propagate Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
|
/jira refresh |
|
@sanchezl: This pull request references Jira Issue OCPBUGS-82110, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug
DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
@sanchezl: The specified target(s) for The following commands are available to trigger optional jobs: Use DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
@sanchezl: The specified target(s) for The following commands are available to trigger optional jobs: Use DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
There was a problem hiding this comment.
🧹 Nitpick comments (3)
pkg/controller/servingcert/controller/secret_updating_controller.go (1)
45-60: Keep the forwarded feature-gate state at this boundary.Collapsing the map back to
configurablePKIEnabledhere means the next controller-side gate needed in this layer will force another constructor signature change through the serving-cert stack. Passing the forwarded map through one more hop (or a small typed config derived from it) would keep this refactor extensible. As per coding guidelines, "Controller receives feature gates asmap[string]boolviapkg/cmd/controller/cmd.goand threads through call chain".🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/controller/servingcert/controller/secret_updating_controller.go` around lines 45 - 60, The constructor currently collapses the forwarded feature-gate map into a single configurablePKIEnabled bool; instead accept and carry the feature-gates map[string]bool through this boundary (change the parameter to e.g. featureGates map[string]bool) and store that map (or a small typed config derived from it) on the serviceServingCertUpdateController/servingCertIssuer so downstream callers can read additional gates without another signature change; update references to configurablePKIEnabled to read from the map (or typed config) and preserve the existing boolean behavior by deriving it from featureGates["ConfigurablePKI"] during the transition.pkg/operator/sync_common_test.go (1)
115-120: Cover ordering and disabled-gate filtering in this table.
pkg/operator/sync_common.gonow sorts enabled gates and dropsfalseentries, but this table only asserts the single-gate happy path. Adding a multi-gate case and afalse-value case would protect the deployment arg formatting from quiet regressions.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/operator/sync_common_test.go` around lines 115 - 120, Add table-driven test cases to pkg/operator/sync_common_test.go covering multiple enabled gates and filtering out false-valued gates: extend the existing test table (the entry using deployment(...).withFeatureGates(...).valueOrDie()) with one case where enabledFeatureGates contains multiple keys in unsorted order (e.g., {"B":true,"A":true}) and assert withFeatureGates expects sorted "A=true","B=true", and another case where a gate is set to false (e.g., {"A":true,"B":false}) and assert that only "A=true" appears; locate and update the test rows that construct expectedDeployment using withFeatureGates to reflect these scenarios so the test verifies both sorting and false-value filtering implemented in sync_common.go.pkg/operator/starter.go (1)
152-175: Consider documenting the startup-only feature-gate snapshot behavior.Feature gates are read once at startup (via
InitialFeatureGatesObserved()), captured inenabledFeatureGates, and used to conditionally configure informers and the PKI provider. This is correct and necessary—informer lists and PKI providers cannot be reconfigured after operator startup. However, this constraint should be explicitly documented (e.g., in code comments) to clarify that later FeatureGate changes will not be picked up until the operator restarts.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/operator/starter.go` around lines 152 - 175, Add a clear comment near the startup feature-gate snapshot code (around the creation of enabledFeatureGates and the call to InitialFeatureGatesObserved()/featureGates usage) stating that feature gates are read once at operator startup, that enabledFeatureGates is a static snapshot used to configure informer lists and the PKI provider, and that subsequent FeatureGate changes will not be applied until the operator restarts; reference the enabledFeatureGates map and the NewServiceCAOperator call so readers understand this startup-only behavior affects informer configuration and the PKI provider.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@pkg/controller/servingcert/controller/secret_updating_controller.go`:
- Around line 45-60: The constructor currently collapses the forwarded
feature-gate map into a single configurablePKIEnabled bool; instead accept and
carry the feature-gates map[string]bool through this boundary (change the
parameter to e.g. featureGates map[string]bool) and store that map (or a small
typed config derived from it) on the
serviceServingCertUpdateController/servingCertIssuer so downstream callers can
read additional gates without another signature change; update references to
configurablePKIEnabled to read from the map (or typed config) and preserve the
existing boolean behavior by deriving it from featureGates["ConfigurablePKI"]
during the transition.
In `@pkg/operator/starter.go`:
- Around line 152-175: Add a clear comment near the startup feature-gate
snapshot code (around the creation of enabledFeatureGates and the call to
InitialFeatureGatesObserved()/featureGates usage) stating that feature gates are
read once at operator startup, that enabledFeatureGates is a static snapshot
used to configure informer lists and the PKI provider, and that subsequent
FeatureGate changes will not be applied until the operator restarts; reference
the enabledFeatureGates map and the NewServiceCAOperator call so readers
understand this startup-only behavior affects informer configuration and the PKI
provider.
In `@pkg/operator/sync_common_test.go`:
- Around line 115-120: Add table-driven test cases to
pkg/operator/sync_common_test.go covering multiple enabled gates and filtering
out false-valued gates: extend the existing test table (the entry using
deployment(...).withFeatureGates(...).valueOrDie()) with one case where
enabledFeatureGates contains multiple keys in unsorted order (e.g.,
{"B":true,"A":true}) and assert withFeatureGates expects sorted
"A=true","B=true", and another case where a gate is set to false (e.g.,
{"A":true,"B":false}) and assert that only "A=true" appears; locate and update
the test rows that construct expectedDeployment using withFeatureGates to
reflect these scenarios so the test verifies both sorting and false-value
filtering implemented in sync_common.go.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: e2d4043b-8422-4a4e-8c9d-3a4b76c991a0
📒 Files selected for processing (11)
CLAUDE.mdpkg/cmd/controller/cmd.gopkg/controller/servingcert/controller/secret_creating_controller.gopkg/controller/servingcert/controller/secret_updating_controller.gopkg/controller/servingcert/starter/starter.gopkg/controller/starter.gopkg/operator/operator.gopkg/operator/rotate.gopkg/operator/starter.gopkg/operator/sync_common.gopkg/operator/sync_common_test.go
|
/payload-job periodic-ci-openshift-release-main-ci-4.22-e2e-aws-ovn-techpreview |
|
/payload-job periodic-ci-openshift-microshift-release-4.22-periodics-e2e-aws-ovn-ocp-conformance |
|
@sanchezl: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/b3637960-337f-11f1-9c82-aa9cfbc3ba61-0 |
|
@sanchezl: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/b5e8ebc0-337f-11f1-853b-7dbce084b4eb-0 |
Replace individual feature gate bools (shortCertRotationEnabled, configurablePKIEnabled) with a single enabledFeatureGates map[string]bool throughout the operator. The operator detects enabled gates and builds a single --feature-gates=Key1=true,Key2=true CLI arg for the controller Deployment. This improves scalability: adding a new feature gate no longer requires plumbing a new bool parameter through multiple function signatures.
The controller process was using FeatureGateAccess with ClusterVersion and FeatureGate informers to detect enabled feature gates at runtime. On MicroShift, these CRDs do not exist, causing the informers to fail repeatedly and the controller to crash after a 1-minute timeout. Instead, use the feature gates already forwarded via --feature-gates CLI args from the operator process. Config informers for the PKI resource are now only created when ConfigurablePKI is explicitly enabled, avoiding any dependency on CRDs that may not exist. Fixes: OCPBUGS-82110
69aba4c to
d3b5a4d
Compare
|
/payload-job periodic-ci-openshift-release-main-ci-4.22-e2e-aws-ovn-techpreview |
|
/payload-job periodic-ci-openshift-microshift-release-4.22-periodics-e2e-aws-ovn-ocp-conformance |
|
@sanchezl: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/3b289870-3381-11f1-9d1b-0cd4a2b38680-0 |
|
@sanchezl: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/3ba6a210-3381-11f1-819c-08aea589ab2e-0 |
|
@sanchezl: all tests passed! Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
|
/lgtm |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: benluddy, sanchezl The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
/verified by CI |
|
@sanchezl: This PR has been marked as verified by DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
@sanchezl: This pull request references Jira Issue OCPBUGS-82110, which is valid. 3 validation(s) were run on this bug
DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
@sanchezl: Jira Issue OCPBUGS-82110: Some pull requests linked via external trackers have merged: The following pull request, linked via external tracker, has not merged:
All associated pull requests must be merged or unlinked from the Jira bug in order for it to move to the next state. Once unlinked, request a bug refresh with Jira Issue OCPBUGS-82110 has not been moved to the MODIFIED state. This PR is marked as verified. If the remaining PRs listed above are marked as verified before merging, the issue will automatically be moved to VERIFIED after all of the changes from the PRs are available in an accepted nightly payload. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
Summary
--feature-gatesCLI arg built from amap[string]bool, replacing individual bool parametersConfigurablePKIis explicitly enabledRoot Cause
PR #327 added
FeatureGateAccessinitialization to the controller process (pkg/controller/servingcert/starter/starter.go) that creates informers forClusterVersionandFeatureGateCRDs. On MicroShift, these CRDs do not exist, so the informers fail repeatedly and the controller crashes after a 1-minute timeout.Test Plan
make buildmake test-unitClusterVersion/FeatureGateCRDs