-
Notifications
You must be signed in to change notification settings - Fork 216
OCPBUGS-9050: Alert on failing conditional update risk evaluation #961
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
OCPBUGS-9050: Alert on failing conditional update risk evaluation #961
Conversation
|
@petr-muller: This pull request references Jira Issue OCPBUGS-9050, which is invalid:
Comment The bug has been updated to refer to the pull request using the external bug tracker. DetailsIn response to this: Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
|
Skipping CI for Draft Pull Request. |
|
/test all |
384f24b to
4261ee7
Compare
|
/test all |
4261ee7 to
d032eeb
Compare
|
/test all |
d032eeb to
bc442fd
Compare
|
/test all |
|
/retest |
bc442fd to
14199a3
Compare
|
/test all |
|
/jira refresh |
|
@petr-muller: This pull request references Jira Issue OCPBUGS-9050, which is invalid:
Comment DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
|
Needs #964 first |
|
/jira refresh |
|
@petr-muller: This pull request references Jira Issue OCPBUGS-9050, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug
Requesting review from QA contact: DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
14199a3 to
db1a301
Compare
- Used `max by` in the alert to drop irrelevant labels from alert - Replaced further `"Recommended"` literals with new constant - Do not `ToLower` booleanish string labels - Report metrics labelled with `condition` and `status` instead of a specialized `recommended` label. We only export the `Recommended` condition status still. - Naming and code structure tweaks
ed62379 to
e30c47c
Compare
If the CVO is just starting up, it should populate its "known state" of available updates from the `ClusterVersion` status, if its contains them. Previous CVO may have evaluated the same graph data like the the current one is about to do and if it did, there are likely existing conditions in the status that we need to respect (for example, do not bump a `lastTransitionTime` field on a condition on a conditinal update that was already evaluated with the same result.
e30c47c to
983b3d0
Compare
|
This should be now ready for another round of review and testing. The underlying metrics was pivoted to be timestamp-based rather than duration based (this is a followup of the Slack conversation where we discovered this upstream recommendation). New commit 983b3d0 makes CVO persist conditions known about conditional updates from the |
| } | ||
| } | ||
|
|
||
| // Collect collects metrics from the operator into the channel ch |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: truncated comment
wking
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/lgtm
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: wking The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
/label backport-risk-assessed |
|
/hold cancel Lot of failures, nothing indicates CVO |
|
/hold This was held because it waits for testing (the |
|
Pre-merge testing // Patch dummy cincy // Recommended=Unknown update condition is present // After 1 hour, the alert CannotEvaluateConditionalUpdates fires // Delete CVO pod // The alert survives Looks good. |
|
/label qe-approved |
|
/hold cancel |
|
@petr-muller: all tests passed! Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
|
@petr-muller: Jira Issue OCPBUGS-9050: All pull requests linked via external trackers have merged: Jira Issue OCPBUGS-9050 has been moved to the MODIFIED state. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
|
Fix included in accepted release 4.15.0-0.nightly-2023-10-27-135451 |

Add a new
cluster_version_conditional_updates_recommended_conditions_secondswhich, for each conditional update known to CVO, reports how long (in seconds) is theRecommendedcondition on that update in the current state (this is the time since the conditions'lastTransitionTime). The metric is labelled with version, reason and status. Note thatlastTransitionTimewas not correctly maintained by CVO until fixed in #964.Using this metric, we create an alert that fires when there is an update that is in an
Unknownstatus for more than 50 minutes and this state is maintained for 10 minutes.The metric was created slightly more generic than the alert would have needed. Technically the CVO could compute the "bad" state right away and just flap a 0/1 metric, but I believe the metrics are more useful when slightly generic.