Conversation

@lilic (Contributor) commented Jun 15, 2020

Because our tests run only for a short amount of time, they might not
catch alerts that start firing when some components are started later
in the cluster run; at test time those alerts might only be in the
pending state. This test hopes to catch alerts that would eventually
end up in the firing state. Ideally, out of the box we should not have
any alerts in the pending or firing states anyway.

Let's see if this passes; if not, I will open a Bugzilla for each alert in the pending state and add it to a temporary allowlist.
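
For illustration, a minimal sketch of the idea, using the same query-table pattern that the diff below uses; runQueries is a hypothetical stand-in for whatever helper in test/extended/prometheus actually runs the map against the in-cluster Prometheus, and the allowlist shown is only an example:

tests := map[string]bool{
	// Expect no series, i.e. no alert outside the example allowlist spent
	// any time in the pending state during the last two hours.
	`count_over_time(ALERTS{alertname!~"Watchdog|AlertmanagerReceiversNotConfigured", alertstate="pending"}[2h])`: false,
}
runQueries(tests) // hypothetical runner; fails the test if any expectation is violated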

cc @wking

@openshift-ci-robot added the approved label (indicates a PR has been approved by an approver from all required OWNERS files) on Jun 15, 2020
@lilic (Contributor, Author) commented Jun 15, 2020

cc @openshift/openshift-team-monitoring

@wking (Member) commented Jun 15, 2020

Can we add this to the upgrade suite too?

@lilic (Contributor, Author) commented Jun 15, 2020

> Can we add this to the upgrade suite too?

I would start with this first, and if it works we can add it to the upgrade suite in a separate PR, sound good? I want to give this test some runs to make sure it's not going to be flaky.

@wking (Member) commented Jun 15, 2020

Yeah, we could punt. But we want the logic in a helper function, so we can easily use it in both places, right?

@lilic (Contributor, Author) commented Jun 15, 2020

> But we want the logic in a helper function, so we can easily use it in both places, right?

It's a PromQL query; I'm not sure it needs a helper function for that?

@wking (Member) commented Jun 15, 2020

It's presumably going to end up with an exclusion list, right?

@wking (Member) commented Jun 15, 2020

Never mind, looks like the precedent set by #24786 is to have separate PromQL for upgrade and non-upgrade tests. So I'm +1 to this as it stands (assuming CI doesn't turn up any existing pending alerts, as you said earlier).

}()

tests := map[string]bool{
`count_over_time(ALERTS{alertstate="pending"}[2h])`: false,

@openshift-ci-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: lilic
To complete the pull request process, please assign smarterclayton
You can assign the PR to them by writing /assign @smarterclayton in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot removed the approved label (indicates a PR has been approved by an approver from all required OWNERS files) on Jun 15, 2020
lilic added 2 commits June 15, 2020 17:13
Because our tests run only for a short amount of time, they might not
catch alerts that start firing when some components are started later
in the cluster run; at test time those alerts might only be in the
pending state. This test hopes to catch alerts that would eventually
end up in the firing state. Ideally, out of the box we should not have
any alerts in the pending or firing states anyway.
@lilic force-pushed the no-pending-alerts branch from 92253a8 to 46e18b8 on June 15, 2020 15:13
e2e.Logf("Watchdog alert is firing")
})

g.It("should not have any alerts in pending state the entire cluster run", func() {
A contributor commented:

Could you move it closer to https://github.com/openshift/origin/blob/master/test/extended/prometheus/prometheus.go#L54? Those tests are very similar and need almost the same exclusion list.

@openshift-ci-robot

@lilic: The following tests failed, say /retest to rerun all failed tests:

Test name             Commit    Details   Rerun command
ci/prow/unit          46e18b8   link      /test unit
ci/prow/e2e-gcp       46e18b8   link      /test e2e-gcp
ci/prow/e2e-aws-fips  46e18b8   link      /test e2e-aws-fips

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

}()

tests := map[string]bool{
`count_over_time(ALERTS{alertname!~"Watchdog|AlertmanagerReceiversNotConfigured|KubeAPILatencyHigh", alertstate="pending"}[2h])`: false,
A member commented:

Do we even need the alertstate filter here? I'd rather have:

`count_over_time(ALERTS{alertname!~"Watchdog|AlertmanagerReceiversNotConfigured|KubeAPILatencyHigh"}[2h])`: false

for "there are no surprising alerts in any state over the entire cluster run", to ensure we don't have something like a pending alert that slips through the firing test and then starts firing and slips through a pending-specific test.

@lilic (Contributor, Author) commented Oct 6, 2020

Closing, as it's not as easy as first thought: some alerts can end up in the pending state due to the nature of our e2e tests but never actually fire, so the allowlist for those would be too long. The general idea was to catch alerts that are in the pending state and eventually start firing, which our current tests sometimes miss, but it's not that simple. If someone comes up with an idea to make sure we catch all firing alerts, I'm happy to review!

/close

@openshift-ci-robot

@lilic: Closed this PR.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
