Adjust alert severities #520
Conversation
The alerts that are currently "critical" do not actually describe situations that jeopardize the cluster's immediate health or its ability to run workloads. No one needs to be paged in the middle of the night for these alerts. This commit reduces their severity to warning to reflect a degraded state.
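For illustration only, this is the shape such a severity change takes in a Prometheus alerting rule; the alert name, expression, and duration below are assumptions for the sketch, not necessarily the exact rules shipped by this repository:

```yaml
# Illustrative sketch of a severity downgrade in a Prometheus rule file.
# Alert name, expression, and duration are assumptions, not this repo's exact rules.
groups:
- name: cluster-version
  rules:
  - alert: ClusterOperatorDegraded
    # Fires while an operator reports Degraded=True for the whole window.
    expr: cluster_operator_conditions{condition="Degraded"} == 1
    for: 10m
    labels:
      severity: warning   # was: critical
    annotations:
      message: Cluster operator {{ $labels.name }} has been degraded for 10 minutes.
```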
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull request has been approved by: michaelgugino. The full list of commands accepted by this bot can be found here. Needs approval from an approver in each of these files. Approvers can indicate their approval by writing `/approve` in a comment.
/hold
I'm not really interested in getting in the middle here. @smarterclayton was pushing reasonably firmly for
As per discussions in the alert enhancement, in Slack, and in general around "when are Available and Degraded important": degraded generally means 'still working but not making progress / unable to give full status'. By that definition degraded =~ warning, and the timing should be "in the morning". By that measure Available=false = critical. I am ok (if Trevor is) with splitting ClusterOperatorDown to only cover available-and-reachable and be critical, then making ClusterOperatorDegraded cover the Degraded condition with a longer `for`, as sketched below. Finally, if we do that, suppressing operator condition alerts until install is complete (until the operator reaches level) may also be useful.
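A minimal sketch of how that split could be expressed as Prometheus rules; the metric names (`cluster_operator_up`, `cluster_operator_conditions`), expressions, and durations here are assumptions for illustration, not necessarily what this repository ships:

```yaml
# Sketch of the proposed split (metric names and durations are assumptions):
# ClusterOperatorDown covers only availability/reachability and stays critical;
# ClusterOperatorDegraded covers Degraded with a longer `for` and only warns.
groups:
- name: cluster-operators
  rules:
  - alert: ClusterOperatorDown
    expr: cluster_operator_up == 0
    for: 10m
    labels:
      severity: critical   # get out of bed: operator unavailable/unreachable
  - alert: ClusterOperatorDegraded
    expr: cluster_operator_conditions{condition="Degraded"} == 1
    for: 30m               # longer window; "in the morning" urgency
    labels:
      severity: warning
```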
These CVO conditions clearly don't meet the criteria of 'get out of bed in the middle of the night.' Workloads will continue to run, new workloads will continue to be scheduled, existing operators will continue to run. The CVO being completely down could persist indefinitely with no adverse impact on workloads.
I've spun this portion out into #554. I'm more interested in the longer `for`.
@michaelgugino: The following test failed, say `/retest` to rerun it:
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
Issues go stale after 90d of inactivity. Mark the issue as fresh by commenting `/remove-lifecycle stale`. If this issue is safe to close now please do so with `/close`. /lifecycle stale
Stale issues rot after 30d of inactivity. Mark the issue as fresh by commenting `/remove-lifecycle rotten`. If this issue is safe to close now please do so with `/close`. /lifecycle rotten
Rotten issues close after 30d of inactivity. Reopen the issue by commenting `/reopen`. /close
@openshift-bot: Closed this PR. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.