test: Use last_over_time in pendingAlertQuery #26233
Conversation
Force-pushed from a6a4b71 to aef63ac (Compare)
Since pendingAlertQuery landed in 59ef04b (test: Require no alerts during upgrades, 2021-03-17, openshift#25904), it has been measuring which alerts were pending at the moment the query was run, and the value has always been 1. That would get rendered as:

    alert ... pending for 1 seconds with labels...

regardless of how long the alert had been pending before the query was run. This commit replaces the query with a new one whose value is actually the number of seconds that a currently-pending alert has been pending. The new query (assembled in the sketch below):

* Uses ALERTS{...} unless ALERTS offset 1s to trigger on edges where the alert began pending.
* Uses last_over_time to find the most recent edge trigger.
* Uses time() + 1 - last_over_time(time() * ...) to figure out how long ago that most-recent trigger was. The + 1 sets the floor at 1, for cases where one second ago we were not pending, but when the query is run we are pending.
* Multiplies by ALERTS again to exclude alerts that are no longer pending by the time the query runs.
* Uses sort_desc so we complain first about the alerts that have been pending the longest.
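For reference, this is the query as actually sent in the CI-log check quoted below, URL-decoded here with comments added for readability; the [1h15m23s:1s] subquery range is the monitoring window computed for that particular run, not a fixed value:

```promql
sort_desc(
  # Value: time-of-query, multiplied by ALERTS so alerts that are no longer
  # pending drop out, plus the 1-second floor, minus the timestamp of the
  # most recent "began pending" edge.
  time() * ALERTS + 1
  -
  last_over_time((
    # Edge trigger: the pending series exists now but had no matching sample
    # one second earlier, i.e. the alert just began pending.
    time() * ALERTS{alertname!~"Watchdog|AlertmanagerReceiversNotConfigured",alertstate="pending",severity!="info"}
    unless
    ALERTS offset 1s
  )[1h15m23s:1s])  # subquery over this run's monitored window, at 1s resolution
)
```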
Force-pushed from aef63ac to e38910e (Compare)
PromeCIeus shows the alert pending from 1:17:17Z to 1:30:16Z. Checking when the query was run:

$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/26233/pull-ci-openshift-origin-master-e2e-gcp-upgrade/1404941980663812096/build-log.txt | grep 'Running.*alertstate.*pending'
Jun 16 01:20:40.861: INFO: Running '/usr/bin/kubectl --server=https://api.ci-op-vkq5vm52-db044.*********************************:6443 --kubeconfig=/tmp/kubeconfig-491656715 --namespace=e2e-test-check-for-alerts-gxfzz exec execpod -- /bin/sh -x -c curl --retry 15 --max-time 2 --retry-delay 1 -s -k -H 'Authorization: Bearer ...' "https://thanos-querier.openshift-monitoring.svc:9091/api/v1/query?query=%0Asort_desc%28%0A++time%28%29+%2A+ALERTS+%2B+1%0A++-%0A++last_over_time%28%28%0A++++time%28%29+%2A+ALERTS%7Balertname%21~%22Watchdog%7CAlertmanagerReceiversNotConfigured%22%2Calertstate%3D%22pending%22%2Cseverity%21%3D%22info%22%7D%0A++++unless%0A++++ALERTS+offset+1s%0A++%29%5B1h15m23s%3A1s%5D%29%0A%29%0A"'

and 1:20:40 is 203 seconds after 1:17:17, so hooray :)
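As a rough sanity check of what the new query would have reported at that point, assuming the most recent "began pending" edge is the 1:17:17Z start PromeCIeus shows:

$$
01{:}20{:}40 - 01{:}17{:}17 = 203\,\text{s}, \qquad \text{so } \mathtt{time()} + 1 - \mathtt{last\_over\_time(\ldots)} \approx 204\,\text{s pending}.
$$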
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: smarterclayton, wking

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Details: Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment.
/retest Please review the full test history for this PR and help us cut down flakes.
12 similar comments