CORENET-7154: Fix debounce timer for OperatorConfig level being incorrectly cleared#3011
CORENET-7154: Fix debounce timer for OperatorConfig level being incorrectly cleared#3011tpantelis wants to merge 1 commit into
Conversation
|
@tpantelis: This pull request references CORENET-7154 which is a valid jira issue. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Repository: openshift/coderabbit/.coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (2)
💤 Files with no reviewable changes (2)
WalkthroughStatusManager.set now immediately marks ClusterOperator degraded with reason NoOperConfig and message "Failed to get networks.operator.openshift.io cluster" when operator config is missing. Tests were updated to assert the immediate degraded condition instead of using fake-clock debounce checks. ChangesOperator Config Missing Debounce Tracking
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes 🚥 Pre-merge checks | ✅ 11 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (11 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: tpantelis The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
|
I don't understand the description. What's the exact scenario and what's the incorrect behavior in that scenario? What is the user-visible impact of the bug? |
I outlined the scenario in https://redhat.atlassian.net/browse/CORENET-7154. |
| CertificateSigner | ||
| InfrastructureConfig | ||
| DashboardConfig | ||
| operConfigMissing |
There was a problem hiding this comment.
OK, so the problem with this is that operconfig should never actually be missing, and if it was, we'd be fine with immediately going degraded. According to the discussion in #2896 (comment), it seemed like some other error was occurring, which the code was mistakenly classifying as "operator config doesn't exist", and this sporadically happened in some real job, but in the example @jluhrsen linked to, the pod logs have been deleted now so there's no further information.
Anyway, I think we should remove the debouncing for this case, reproduce the failure that led to Jamo adding this check, and then figure out the right fix for it (because it may actually indicate a real problem with the cluster which we are just masking by delaying the degraded status).
There was a problem hiding this comment.
(you can remove operConfigMissing now too)
62ad0d0 to
014baef
Compare
|
/assign @jluhrsen we can test this to try to reproduce the original failure here by doing payload jobs, can't we? |
The debounce timer for the OperatorConfig status level was being incorrectly cleared when the operator config exists. The code was using the OperatorConfig StatusLevel internally to track the debounce timer for when the operator config is missing. However it also blindly cleared the timer when the operator config exists, which is the normal case, potentially interfering with a caller's usage of the OperatorConfig StatusLevel. This commit removes the debouncing for this case as the operconfig should never actually be missing so it was likely some other error that occurred that facilitated adding the debounce. Since it is unknown what caused the original scenario, we shouldn't assume that delaying the degraded status is the correct solution. Signed-off-by: Tom Pantelis <tompantelis@gmail.com>
014baef to
1c1868b
Compare
|
@tpantelis: The following tests failed, say
Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
The debounce timer for the
OperatorConfigstatus level was being incorrectly cleared when the operator config exists. The code was using theOperatorConfigstatus level internally to track the debounce timer for when the operator config is missing. However it also blindly cleared the timer when the operator config exists, which is the normal case, potentially interfering with a caller's usage of theOperatorConfigstatus level.This commit removes the debouncing for this case as the operconfig should never actually be missing so it was likely some other error that occurred that facilitated adding the debounce. Since it is unknown what caused the original scenario, we shouldn't assume that delaying the degraded status is the correct solution.
Summary by CodeRabbit
Bug Fixes
Tests