
Conversation

@smarterclayton
Contributor

It is unacceptable to fire alerts during normal upgrades. Tighten
the constraints on the upgrade tests to ensure that any alerts
fail the upgrade test. If the test is skipped, continue to report
which alerts fired. Test over the entire duration of the test like
we do in the post-run variation.

@openshift-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: smarterclayton

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot added the "approved" label (Indicates a PR has been approved by an approver from all required OWNERS files.) on Feb 17, 2021
@smarterclayton
Contributor Author

/retest

@smarterclayton
Contributor Author

/test e2e-gcp-upgrade

@smarterclayton
Contributor Author

/retest

@smarterclayton force-pushed the alert_tighten branch 4 times, most recently from d2e549b to 9353bfd on February 22, 2021 19:46
time.Sleep(alertCheckSleep)
cancel()
g.By("Waiting before checking for alerts")
time.Sleep(2 * time.Minute)
Member

It's not clear to me why we're reducing from alertCheckSleepMinutes = 5 to 2m here. I don't have strong feelings either way, but it's nice when folks who feel a constant needs tweaking explain their motivation, so the rest of us don't forget to consider it in future changes.

Contributor Author

Actually, I'm wondering if we should be checking for zero pending alerts here, since the wait is basically arbitrary.

Contributor

Pending alerts might flap. In the case where we suppress an alert while the control plane is not ready (or, in the future, while a particular node is down), then as soon as the control plane is ready or the node is back up, the alert will go pending if a pod can't come up within the monitoring interval.

Contributor

I originally made this 5 minutes based on my test runs, as a value I thought would always be long enough to ensure the upgrade is complete without being excessively long.
The use of constants here seems different from what I'm used to, although not necessarily wrong.

Member

Things happen during updates that may trip alerts into pending without anything actually concerning happening, and I suspect we want to allow that. But even if we decide we don't want to allow pending alerts, this pivot to blocking almost all firing alerts is already big enough that I'd prefer to punt the pending discussion to follow-up work.

Contributor Author

I'm changing this so that after upgrade there should be no pending alerts. I see none that aren't serious bugs in the few samples I have here.

I'm changing the interval to 1m and requiring no pending.

@smarterclayton
Contributor Author

I changed the pending check; it didn't make sense (you can't know by looking at a Prometheus instance when it was up, because you can't tell the difference between a planned outage and a crash). Now the check is "is the Watchdog alert firing continuously during the upgrade, EXCEPT when this other rule is not firing". The real invariant we're trying to maintain is "if you query thanos-querier for the range of time over the upgrade, the Watchdog query shows up", which is really "thanos-querier sees all the data and doesn't violate its SLO to be up continuously across safe rolling updates", which is a product SLO.

// when the cluster is running. We do not use the prometheus_* metrics because they can't be reviewed
// historically.
// TODO: switch to using thanos querier and verify that we can query the entire history of the upgrade
watchdogQuery := fmt.Sprintf(`sum_over_time((ALERTS{alertstate="firing",alertname="Watchdog",severity="none"} * scalar(absent(max(openshift:prometheus_tsdb_head_samples_appended_total:sum*0+1))))[%s:1s]) > 0`, testDuration)
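
A minimal reading aid for the query above, assuming a hypothetical 75-minute test duration (the real testDuration value is computed by the test):

// Sketch only: with an assumed testDuration of "75m", the format string above
// yields the query below.  The [75m:1s] subquery re-evaluates the inner ALERTS
// expression once per second across the whole window, sum_over_time adds up
// those per-second samples, and the trailing "> 0" requires the total to be
// positive, encoding the invariant described in the comment above: Watchdog
// fires continuously during the upgrade, except when the other (guard) rule
// is not firing.
watchdogQuery := `sum_over_time((ALERTS{alertstate="firing",alertname="Watchdog",severity="none"} * scalar(absent(max(openshift:prometheus_tsdb_head_samples_appended_total:sum*0+1))))[75m:1s]) > 0`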
Member

Poking at the CI job with PromeCIeus: update from 2021-02-22T20:35:58Z to 2021-02-22T21:40:29Z, and Prom was down from 21:24 to 21:26:

[PromeCIeus screenshot: graph showing the gap in Prometheus data from 21:24 to 21:26]

Is Prom going down like that mid-update acceptable? If it is, I think we should drop the Watchdog guard. If not, I think we should drop the scalar(absent(... business.

Contributor Author

This is the thanos-querier change I mentioned in Slack.

// Query to check for any critical severity alerts that have occurred within the last alertPeriodCheckMinutes.
criticalAlertQuery := fmt.Sprintf(`count_over_time(ALERTS{alertname!~"Watchdog|AlertmanagerReceiversNotConfigured|KubeAPILatencyHigh",alertstate="firing",severity="critical"}[%dm]) >= 1`, alertPeriodCheckMinutes)
// There should be no pending alerts 1m after the upgrade completes
pendingAlertQuery := `ALERTS{alertname!~"Watchdog|AlertmanagerReceiversNotConfigured",alertstate="pending",severity!="info"}`
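
A minimal sketch of how instant queries like these might be evaluated, assuming the upstream Prometheus Go client and a placeholder endpoint; the real test authenticates to the in-cluster monitoring stack (bearer token, cluster CA), which is omitted here, and an empty result set is a pass:

package alertcheck // hypothetical package name, for illustration only

import (
    "context"
    "fmt"
    "time"

    "github.com/prometheus/client_golang/api"
    promv1 "github.com/prometheus/client_golang/api/prometheus/v1"
    "github.com/prometheus/common/model"
)

// checkNoResults runs an instant query and returns an error describing any
// series it returned.  For queries like criticalAlertQuery or
// pendingAlertQuery above, any returned series is an alert that fired or went
// pending when it should not have.
func checkNoResults(ctx context.Context, address, query string) error {
    client, err := api.NewClient(api.Config{Address: address})
    if err != nil {
        return err
    }
    result, _, err := promv1.NewAPI(client).Query(ctx, query, time.Now())
    if err != nil {
        return err
    }
    vector, ok := result.(model.Vector)
    if !ok {
        return fmt.Errorf("unexpected result type %T", result)
    }
    if len(vector) > 0 {
        return fmt.Errorf("query returned %d unexpected alert series, e.g. %s", len(vector), vector[0].Metric)
    }
    return nil
}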
Member

nit: I think we can drop Watchdog here, because it should never be pending.

@wking
Member

wking commented Feb 23, 2021

The update job hit the "hostPort: Invalid value: 10301: Host ports are not allowed to be used" thing, and also failed the Watchdog guard (empty results []) and the firing guard (ClusterOperatorDegraded for authentication and openshift-apiserver, among others).

@smarterclayton force-pushed the alert_tighten branch 2 times, most recently from 33e96a8 to eb03751 on February 23, 2021 19:08
@smarterclayton
Contributor Author

Last failure was an actual upgrade failure (nodes were still upgrading after the cluster allegedly reached level = bad).

/test e2e-gcp-upgrade

@smarterclayton
Contributor Author

smarterclayton commented Feb 23, 2021

Specifically excluded etcdMemberCommunicationSlow from the pending check. It is reasonable for it to go pending: the node OS upgrade causes I/O latency, which raises disk latency and in turn etcd network latency (the last master upgrades right before the upgrade completes, causing an instantaneous etcd spike; the etcd alert uses a 5m rate to smooth peaks; we test 1-2m after; we hit the alert).
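
A minimal sketch of what that exclusion might look like in the pending-alert selector; the exact expression in the PR may differ:

// Hypothetical sketch: etcdMemberCommunicationSlow added to the exclusion
// regex of the pending-alert query.  The PR may express this exception
// differently, e.g. via a catalogued exception list.
pendingAlertQuery := `ALERTS{alertname!~"Watchdog|AlertmanagerReceiversNotConfigured|etcdMemberCommunicationSlow",alertstate="pending",severity!="info"}`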

@smarterclayton
Contributor Author

/test e2e-gcp-upgrade
/test e2e-gcp
/test e2e-aws-fips

@smarterclayton force-pushed the alert_tighten branch 2 times, most recently from 23359d3 to 7a1b7f8 on February 24, 2021 19:50
@smarterclayton
Contributor Author

/retest

@smarterclayton
Contributor Author

/test e2e-gcp-upgrade

@smarterclayton
Contributor Author

/retest

2 similar comments
@smarterclayton
Contributor Author

/retest

@smarterclayton
Contributor Author

/retest

This ensures that default queries aggregate results from the running
instances (during upgrades) and helps us test invariants like "there
is always an alert firing, even during upgrades". Note that this makes
reconstructing a test more complex, because we can no longer exactly
reproduce a query (by downloading the Prometheus data from a job), but
that was already true because we would get round-robined to one of the
instances and some data is implicitly missing.
@smarterclayton
Contributor Author

/test e2e-gcp-upgrade

@smarterclayton
Contributor Author

/test e2e-gcp-upgrade

That was the pods-going-back-to-pending issue.

@smarterclayton
Contributor Author

/retest

I think this is ready for merge (the pending failures are always flakes, and we have all the exceptions catalogued with bugs); I will push it through.

@smarterclayton added the "lgtm" label (Indicates that a PR is ready to be merged.) and removed the "do-not-merge/hold" label (Indicates that a PR should not merge because someone has issued a /hold command.) on Mar 17, 2021
@smarterclayton
Contributor Author

/retest

@smarterclayton removed the "lgtm" label (Indicates that a PR is ready to be merged.) on Mar 17, 2021
@smarterclayton
Contributor Author

/retest

It is unacceptable to fire alerts during normal upgrades. Tighten
the constraints on the upgrade tests to ensure that any alerts
fail the upgrade test. If the test is skipped, continue to report
which alerts fired. Test over the entire duration of the test like
we do in the post-run variation.
@smarterclayton
Contributor Author

/retest

@openshift-ci
Contributor

openshift-ci bot commented Mar 18, 2021

@smarterclayton: The following tests failed, say /retest to rerun all failed tests:

Test name                        Commit   Rerun command
ci/prow/e2e-metal-ipi-ovn-ipv6   59ef04b  /test e2e-metal-ipi-ovn-ipv6
ci/prow/e2e-gcp-upgrade          59ef04b  /test e2e-gcp-upgrade

Full PR test history. Your PR dashboard.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@smarterclayton
Contributor Author

I am going to merge this to start preventing regressions from new alerts. All known alerts over the last ~20 or so runs are either covered by an exception or fixed.

Goodbye alerts during upgrade.

@smarterclayton merged commit c277e20 into openshift:master on Mar 18, 2021
wking added a commit to wking/origin that referenced this pull request Jun 15, 2021
Since pendingAlertQuery landed in 59ef04b (test: Require no alerts
during upgrades, 2021-03-17, openshift#25904), it has been measuring pending
alerts firing when the query was run, and the value has always been 1.
That would get rendered as:

  alert ... pending for 1 seconds with labels...

regardless of how long the alert had been pending before the query was
run.  This commit replaces the query with a new one that:

* Uses:

    ALERTS{...} unless ALERTS offset 1s

  to trigger on edges where the alert began firing.

* Uses last_over_time to find the most recent edge trigger.
* Uses:

    time() - last_over_time(time() * ...)

  to figure out how long ago that most-recent trigger was.

* Multiplies by ALERTS again to exclude alerts that are no longer
  pending by the time the query runs.
wking added a commit to wking/origin that referenced this pull request Jun 15, 2021
Since pendingAlertQuery landed in 59ef04b (test: Require no alerts
during upgrades, 2021-03-17, openshift#25904), it has been measuring pending
alerts firing when the query was run, and the value has always been 1.
That would get rendered as:

  alert ... pending for 1 seconds with labels...

regardless of how long the alert had been pending before the query was
run.  This commit replaces the query with a new one whose value is
actually the number of seconds that a currently-pending alert has been
firing.  The new query:

* Uses:

    ALERTS{...} unless ALERTS offset 1s

  to trigger on edges where the alert began firing.

* Uses last_over_time to find the most recent edge trigger.
* Uses:

    time() - last_over_time(time() * ...)

  to figure out how long ago that most-recent trigger was.

* Multiplies by ALERTS again to exclude alerts that are no longer
  pending by the time the query runs.
wking added a commit to wking/origin that referenced this pull request Jun 15, 2021
Since pendingAlertQuery landed in 59ef04b (test: Require no alerts
during upgrades, 2021-03-17, openshift#25904), it has been measuring pending
alerts firing when the query was run, and the value has always been 1.
That would get rendered as:

  alert ... pending for 1 seconds with labels...

regardless of how long the alert had been pending before the query was
run.  This commit replaces the query with a new one whose value is
actually the number of seconds that a currently-pending alert has been
firing.  The new query:

* Uses:

    ALERTS{...} unless ALERTS offset 1s

  to trigger on edges where the alert began firing.

* Uses last_over_time to find the most recent edge trigger.
* Uses:

    time() - last_over_time(time() * ...)

  to figure out how long ago that most-recent trigger was.

* Multiplies by ALERTS again to exclude alerts that are no longer
  pending by the time the query runs.
wking added a commit to wking/origin that referenced this pull request Jun 15, 2021
Since pendingAlertQuery landed in 59ef04b (test: Require no alerts
during upgrades, 2021-03-17, openshift#25904), it has been measuring pending
alerts firing when the query was run, and the value has always been 1.
That would get rendered as:

  alert ... pending for 1 seconds with labels...

regardless of how long the alert had been pending before the query was
run.  This commit replaces the query with a new one whose value is
actually the number of seconds that a currently-pending alert has been
firing.  The new query:

* Uses:

    ALERTS{...} unless ALERTS offset 1s

  to trigger on edges where the alert began firing.

* Uses last_over_time to find the most recent edge trigger.
* Uses:

    time() + 1 - last_over_time(time() * ...)

  to figure out how long ago that most-recent trigger was.  The +1
  sets the floor at 1, for cases where one second ago we were not
  firing, but when the query is run we are firing.

* Multiplies by ALERTS again to exclude alerts that are no longer
  pending by the time the query runs.

* Uses sort_desc so we complain first about the alerts that have been
  pending the longest.
wking added a commit to wking/origin that referenced this pull request Jun 15, 2021
Since pendingAlertQuery landed in 59ef04b (test: Require no alerts
during upgrades, 2021-03-17, openshift#25904), it has been measuring pending
alerts pending when the query was run, and the value has always been 1.
That would get rendered as:

  alert ... pending for 1 seconds with labels...

regardless of how long the alert had been pending before the query was
run.  This commit replaces the query with a new one whose value is
actually the number of seconds that a currently-pending alert has been
pending.  The new query:

* Uses:

    ALERTS{...} unless ALERTS offset 1s

  to trigger on edges where the alert began pending.

* Uses last_over_time to find the most recent edge trigger.
* Uses:

    time() + 1 - last_over_time(time() * ...)

  to figure out how long ago that most-recent trigger was.  The +1
  sets the floor at 1, for cases where one second ago we were not
  pending, but when the query is run we are pending.

* Multiplies by ALERTS again to exclude alerts that are no longer
  pending by the time the query runs.

* Uses sort_desc so we complain first about the alerts that have been
  pending the longest.
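
Putting those bullet points together, one possible assembly of the described query is sketched below; the 60m subquery window and the exact label selector are assumptions for illustration and may differ from what the follow-up commit actually uses:

// Hypothetical assembly of the steps described in the commit message above.
// ALERTS{...} unless ALERTS offset 1s keeps only evaluation steps where an
// alert has just begun pending; time() * (...) stamps each such edge with its
// timestamp; last_over_time over the subquery picks the most recent edge;
// time() + 1 minus that edge gives the seconds pending, floored at 1;
// multiplying by ALERTS again drops alerts that are no longer pending; and
// sort_desc puts the longest-pending alerts first.
pendingAlertQuery := `sort_desc(
  (time() + 1 - last_over_time((
    time() * (ALERTS{alertstate="pending",severity!="info"} unless ALERTS offset 1s)
  )[60m:1s]))
  * ALERTS{alertstate="pending",severity!="info"})`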