Convert PromQL risks to Always risks in releases older than 8 weeks #2968
blocked-edges/4.10.0-fc.2-modified-aws-load-balancer-service.yaml
This works around a CVO bug/design decision where we only evaluate one newly enumerated risk every 10 minutes, in an effort to avoid overwhelming the monitoring stack with requests. However, this creates a bad UX: if there are N risks to evaluate in the set of available update paths, it could be (N-1) * 10 minutes before the set of recommended updates is computed.
This preserves the notification of the issue while largely being a no-op, because clusters have, currently, had better update paths for at least 12 weeks.
We intend to fix the CVO bug, but that won't fix the issue in the deployed fleet.
See: https://issues.redhat.com/browse/OCPBUGS-5469
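To make the (N-1) * 10 minute figure concrete, here is a minimal Python sketch of the arithmetic. It assumes the first risk is evaluated immediately and each further one waits a full interval; the function name and the constant are illustrative and are not taken from the CVO source.

```python
from datetime import timedelta

# Assumption: one newly enumerated risk is evaluated per interval,
# with the 10-minute interval taken from the PR description.
EVAL_INTERVAL = timedelta(minutes=10)


def worst_case_delay(num_risks: int, interval: timedelta = EVAL_INTERVAL) -> timedelta:
    """Return the wait until the last of num_risks risks has been evaluated.

    Hypothetical helper: the first risk is assumed to be evaluated right
    away, so only the remaining num_risks - 1 risks add a delay.
    """
    if num_risks < 0:
        raise ValueError("num_risks must be non-negative")
    return max(num_risks - 1, 0) * interval


if __name__ == "__main__":
    for n in (1, 2, 7):
        print(n, "risks ->", worst_case_delay(n))
```

Under these assumptions a single risk adds no delay, while seven distinct PromQL risks would hold back the recommended-update set for an hour. An Always risk needs no query, which is why converting old PromQL risks shortens that queue.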
abdd39b to
370f8a7
Compare
topk(1,
  label_replace(group(ceph_health_status), "ceph", "yes", "", "")
  or
  label_replace(0 * group(cluster_version), "ceph", "no", "", "")
Possibly shift this PromQL into the message? Or into the linked bug (although this one links https://bugzilla.redhat.com/show_bug.cgi?id=2076312#c9, which seems to be private)? The current message is phrased as if we know (or suspect) the cluster is exposed, while with Always this will also trip for clusters we know are not exposed. Or 🤷, maybe that's more polish than we care about for 4.10.z targets as old as these.
I'd prefer to just leave it as is. Only 17% of the 4.10 fleet has upgrades to < 4.10.17 and those all have paths to better edges listed more prominently.
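The conversion discussed in this thread can be sketched roughly as below. This is an illustration only, not the script used for this PR: the dict layout (`matchingRules`, `type`, `promql`) is assumed to mirror the blocked-edges YAML, and `convert_risk` and its parameters are hypothetical names. Note that the message is left untouched, which is the trade-off raised above: after conversion the risk is reported to every cluster, exposed or not.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the age cutoff from the PR title.
MAX_AGE = timedelta(weeks=8)


def convert_risk(risk: dict, released: datetime, now: datetime) -> dict:
    """Return a copy of risk with PromQL rules replaced by a single Always rule.

    Hypothetical helper. Risks on releases no older than MAX_AGE, and risks
    with no PromQL rule, are returned unchanged.
    """
    if now - released <= MAX_AGE:
        return risk
    rules = risk.get("matchingRules", [])
    if not any(rule.get("type") == "PromQL" for rule in rules):
        return risk
    # Keep name, url, and message so the notification is preserved.
    return {**risk, "matchingRules": [{"type": "Always"}]}


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    example = {
        "name": "ExampleRisk",  # placeholder, not a real risk name
        "message": "Example message.",
        "matchingRules": [{"type": "PromQL", "promql": {"promql": "vector(1)"}}],
    }
    print(convert_risk(example, now - timedelta(weeks=10), now))
```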
@sdodson: all tests passed! Full PR test history. Your PR dashboard.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
wking
left a comment
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: sdodson, wking
The full list of commands accepted by this bot can be found here. The pull request process is described here.