Adjust alert severities #520
Conversation
The alerts that are currently "critical" do not actually describe situations that jeopardize the cluster's immediate health or its ability to run workloads. No one needs to be paged in the middle of the night for these alerts. This commit reduces their severity to warning to reflect a degraded state.
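For illustration only, this is the shape such a severity change takes in a Prometheus alerting rule; the alert name, expression, and duration below are assumptions for the sketch, not necessarily the exact rules shipped by this repository:

```yaml
# Illustrative sketch of a severity downgrade in a Prometheus rule file.
# Alert name, expression, and duration are assumptions, not this repo's exact rules.
groups:
- name: cluster-version
  rules:
  - alert: ClusterOperatorDegraded
    # Fires while an operator reports Degraded=True for the whole window.
    expr: cluster_operator_conditions{condition="Degraded"} == 1
    for: 10m
    labels:
      severity: warning   # was: critical
    annotations:
      message: Cluster operator {{ $labels.name }} has been degraded for 10 minutes.
```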
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull request has been approved by: michaelgugino. The full list of commands accepted by this bot can be found here. Needs approval from an approver in each of these files. Approvers can indicate their approval by writing `/approve` in a comment.
/hold
I'm not really interested in getting in the middle here. @smarterclayton was pushing reasonably firmly for
As per discussions in the alert enhancement, in Slack, and in general around "when are Available and Degraded important": degraded generally means 'still working but not making progress / unable to give full status'. By that definition degraded =~ warning, and the timing should be "in the morning". By that measure Available=false = critical. I am ok (if Trevor is) with splitting ClusterOperatorDown to only cover available-and-reachable and be critical, then making ClusterOperatorDegraded cover the Degraded condition with a longer `for`, as sketched below. Finally, if we do that, suppressing operator condition alerts until install is complete (until the operator reaches level) may also be useful.
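A minimal sketch of how that split could be expressed as Prometheus rules; the metric names (`cluster_operator_up`, `cluster_operator_conditions`), expressions, and durations here are assumptions for illustration, not necessarily what this repository ships:

```yaml
# Sketch of the proposed split (metric names and durations are assumptions):
# ClusterOperatorDown covers only availability/reachability and stays critical;
# ClusterOperatorDegraded covers Degraded with a longer `for` and only warns.
groups:
- name: cluster-operators
  rules:
  - alert: ClusterOperatorDown
    expr: cluster_operator_up == 0
    for: 10m
    labels:
      severity: critical   # get out of bed: operator unavailable/unreachable
  - alert: ClusterOperatorDegraded
    expr: cluster_operator_conditions{condition="Degraded"} == 1
    for: 30m               # longer window; "in the morning" urgency
    labels:
      severity: warning
```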
These CVO conditions clearly don't meet the criteria of 'get out of bed in the middle of the night.' Workloads will continue to run, new workloads will continue to be scheduled, existing operators will continue to run. The CVO being completely down could persist indefinitely with no adverse impact on workloads.
I've spun this portion out into #554. I'm more interested in the longer `for`.
@michaelgugino: The following test failed, say `/retest` to rerun it:
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
Issues go stale after 90d of inactivity. Mark the issue as fresh by commenting `/remove-lifecycle stale`. If this issue is safe to close now please do so with `/close`. /lifecycle stale
Stale issues rot after 30d of inactivity. Mark the issue as fresh by commenting `/remove-lifecycle rotten`. If this issue is safe to close now please do so with `/close`. /lifecycle rotten
Rotten issues close after 30d of inactivity. Reopen the issue by commenting `/reopen`. /close
@openshift-bot: Closed this PR. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.