@wking
Member

@wking wking commented Jun 15, 2021

Since pendingAlertQuery landed in 59ef04b (#25904), it has been measuring pending alerts when the query was run, and the value has always been 1. That would get rendered as:

alert ... pending for 1 seconds with labels...

regardless of how long the alert had been pending before the query was run. This commit replaces the query with a new one whose value is actually the number of seconds that a currently-pending alert has been pending. The new query:

  • Uses:

      ALERTS{...} unless ALERTS offset 1s
    

    to trigger on edges where the alert began pending.

  • Uses last_over_time to find the most recent edge trigger.

  • Uses:

      time() + 1 - last_over_time(time() * ...)
    

    to figure out how long ago that most-recent trigger was. The +1 sets the floor at 1, for cases where one second ago we were not pending, but when the query is run we are pending.

  • Multiplies by ALERTS again to exclude alerts that are no longer pending by the time the query runs.

  • Uses sort_desc so we complain first about the alerts that have been pending the longest.
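The steps above can be modelled outside Prometheus. The following is a minimal sketch, not the PromQL itself: it assumes each alert is a list of per-second booleans (a hypothetical stand-in for the `ALERTS` series), ignores the bounded lookback window of the real subquery, and uses illustrative names (`pending_seconds`, `longest_pending_first`) that do not appear in the PR.

```python
from typing import Dict, List, Optional, Sequence, Tuple


def pending_seconds(pending: Sequence[bool], now: int) -> Optional[int]:
    """Model the query for one alert sampled once per second.

    pending[t] is True when the alert was pending at second t. This input
    shape is an assumption made for illustration only.
    """
    # "Multiplies by ALERTS again": no result unless pending right now.
    if not pending[now]:
        return None

    # "ALERTS{...} unless ALERTS offset 1s": pending at t, not pending at t-1.
    # Second 0 has no earlier sample, so it counts as an edge if pending.
    edges = [
        t for t in range(now + 1)
        if pending[t] and (t == 0 or not pending[t - 1])
    ]

    # last_over_time(time() * ...): timestamp of the most recent edge.
    last_edge = edges[-1]

    # time() + 1 - last edge: the +1 gives a floor of 1 when the edge is
    # the current second.
    return now + 1 - last_edge


def longest_pending_first(
    series: Dict[str, Sequence[bool]], now: int
) -> List[Tuple[str, int]]:
    """Model sort_desc: currently-pending alerts, longest duration first."""
    results = []
    for name, pending in series.items():
        seconds = pending_seconds(pending, now)
        if seconds is not None:
            results.append((name, seconds))
    return sorted(results, key=lambda item: item[1], reverse=True)
```

An alert that stopped pending and then started again reports only the time since its most recent edge, which is the behaviour the `last_over_time` step is meant to give.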

@openshift-ci openshift-ci bot requested review from mrogers950 and paulfantom June 15, 2021 23:01
@wking wking force-pushed the pending-alert-times branch 2 times, most recently from a6a4b71 to aef63ac Compare June 15, 2021 23:17
Since pendingAlertQuery landed in 59ef04b (test: Require no alerts
during upgrades, 2021-03-17, openshift#25904), it has been measuring
alerts pending when the query was run, and the value has always been 1.
That would get rendered as:

  alert ... pending for 1 seconds with labels...

regardless of how long the alert had been pending before the query was
run.  This commit replaces the query with a new one whose value is
actually the number of seconds that a currently-pending alert has been
pending.  The new query:

* Uses:

    ALERTS{...} unless ALERTS offset 1s

  to trigger on edges where the alert began pending.

* Uses last_over_time to find the most recent edge trigger.
* Uses:

    time() + 1 - last_over_time(time() * ...)

  to figure out how long ago that most-recent trigger was.  The +1
  sets the floor at 1, for cases where one second ago we were not
  pending, but when the query is run we are pending.

* Multiplies by ALERTS again to exclude alerts that are no longer
  pending by the time the query runs.

* Uses sort_desc so we complain first about the alerts that have been
  pending the longest.
@wking wking force-pushed the pending-alert-times branch from aef63ac to e38910e Compare June 15, 2021 23:20
@wking
Member Author

wking commented Jun 16, 2021

e2e-gcp-upgrade:

alert KubePodCrashLooping pending for 205.1050000190735 seconds with labels: {container="apiserver-watcher", endpoint="https-main", job="kube-state-metrics", namespace="kube-system", pod="apiserver-watcher-ci-op-vkq5vm52-db044-j9jtn-master-2", service="kube-state-metrics", severity="warning"}
...

PromeCIeus shows:

ALERTS{container="apiserver-watcher", endpoint="https-main", job="kube-state-metrics", namespace="kube-system", pod="apiserver-watcher-ci-op-vkq5vm52-db044-j9jtn-master-2", service="kube-state-metrics", severity="warning"}

from 1:17:17Z to 1:30:16. Checking when the query was run:

$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/26233/pull-ci-openshift-origin-master-e2e-gcp-upgrade/1404941980663812096/build-log.txt | grep 'Running.*alertstate.*pending'
Jun 16 01:20:40.861: INFO: Running '/usr/bin/kubectl --server=https://api.ci-op-vkq5vm52-db044.*********************************:6443 --kubeconfig=/tmp/kubeconfig-491656715 --namespace=e2e-test-check-for-alerts-gxfzz exec execpod -- /bin/sh -x -c curl --retry 15 --max-time 2 --retry-delay 1 -s -k -H 'Authorization: Bearer ...' "https://thanos-querier.openshift-monitoring.svc:9091/api/v1/query?query=%0Asort_desc%28%0A++time%28%29+%2A+ALERTS+%2B+1%0A++-%0A++last_over_time%28%28%0A++++time%28%29+%2A+ALERTS%7Balertname%21~%22Watchdog%7CAlertmanagerReceiversNotConfigured%22%2Calertstate%3D%22pending%22%2Cseverity%21%3D%22info%22%7D%0A++++unless%0A++++ALERTS+offset+1s%0A++%29%5B1h15m23s%3A1s%5D%29%0A%29%0A"'

and 1:20:40 is 203 seconds after 1:17:17, so hooray :)
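For readers who want to see the query that was actually sent, the URL-encoded `query=` parameter from the log line above can be decoded with the Python standard library. This is only a convenience sketch: the encoded string is copied from the log, and the variable names are illustrative. The same snippet rechecks the elapsed-time arithmetic using the two clock times quoted above.

```python
from datetime import datetime
from urllib.parse import unquote_plus

# The query= parameter exactly as it appears in the build log.
encoded = (
    "%0Asort_desc%28%0A++time%28%29+%2A+ALERTS+%2B+1%0A++-%0A"
    "++last_over_time%28%28%0A++++time%28%29+%2A+ALERTS%7Balertname"
    "%21~%22Watchdog%7CAlertmanagerReceiversNotConfigured%22%2C"
    "alertstate%3D%22pending%22%2Cseverity%21%3D%22info%22%7D%0A"
    "++++unless%0A++++ALERTS+offset+1s%0A++%29%5B1h15m23s%3A1s%5D%29%0A%29%0A"
)

# unquote_plus turns '+' into spaces and %XX escapes into characters.
query = unquote_plus(encoded)
print(query)

# Seconds between the alert starting to pend and the query being run.
began = datetime.strptime("01:17:17", "%H:%M:%S")
ran = datetime.strptime("01:20:40", "%H:%M:%S")
elapsed = int((ran - began).total_seconds())
print(elapsed)  # 203
```

The decoded text shows the `unless ALERTS offset 1s` edge detection wrapped in a `[1h15m23s:1s]` subquery, matching the structure described in the PR description.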

@smarterclayton
Contributor

/lgtm
/retest

@openshift-ci
Contributor

openshift-ci bot commented Jun 16, 2021

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: smarterclayton, wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added lgtm Indicates that a PR is ready to be merged. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Jun 16, 2021
@openshift-bot
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

12 similar comments

@openshift-merge-robot openshift-merge-robot merged commit 1b2795a into openshift:master Jun 18, 2021
@wking wking deleted the pending-alert-times branch November 18, 2021 06:05