test/extended/prometheus: Add test for no alerts in pending state #25112
Conversation
cc @openshift/openshift-team-monitoring

Can we add this to the upgrade suite too?

I would start with this first, and if it works we can do the upgrade suite in a separate PR. Sound good to you? I want to give this test some runs to make sure it's not going to be flaky.

Yeah, we could punt. But we want the logic in a helper function, so we can easily use it in both places, right?

It's a PromQL query; I'm not sure it needs a helper function?

It's presumably going to end up with an exclusion list, right?

Never mind, it looks like the precedent set by #24786 is to have separate PromQL for the upgrade and non-upgrade tests. So I'm +1 to this as it stands (assuming CI doesn't turn up any existing pending alerts, as you said earlier).
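For illustration, the shared helper floated earlier in this thread might look something like the following — a minimal sketch only; the function name, signature, and exclusion slice are hypothetical, not code from this PR (which inlines the query string instead):

```go
package prometheus

import (
	"fmt"
	"strings"
)

// pendingAlertQuery builds the PromQL that both the conformance and
// upgrade variants of the test could share, excluding known-noisy
// alerts. Hypothetical sketch, not the actual PR code.
func pendingAlertQuery(window string, excluded []string) string {
	if len(excluded) == 0 {
		return fmt.Sprintf(`count_over_time(ALERTS{alertstate="pending"}[%s])`, window)
	}
	return fmt.Sprintf(
		`count_over_time(ALERTS{alertname!~"%s",alertstate="pending"}[%s])`,
		strings.Join(excluded, "|"), window,
	)
}
```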
```go
tests := map[string]bool{
	`count_over_time(ALERTS{alertstate="pending"}[2h])`: false,
}
```
You need to exclude at least the same alerts as in https://github.com/openshift/origin/blob/master/test/extended/prometheus/prometheus.go#L69
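For context on how such a map is consumed: each key is a PromQL expression and the value says whether it should return any series. A minimal, self-contained sketch of that pattern against the plain Prometheus HTTP API (the origin suite actually routes queries through an exec pod and its own helpers, which are omitted here):

```go
package prometheustest

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/url"
)

// queryResponse mirrors only the fields of a Prometheus
// /api/v1/query response that this check inspects.
type queryResponse struct {
	Data struct {
		Result []json.RawMessage `json:"result"`
	} `json:"data"`
}

// checkQueries evaluates each expression and compares "did it return
// any series" against the expected bool (false = expect no series).
func checkQueries(promURL string, tests map[string]bool) error {
	for query, expectSeries := range tests {
		resp, err := http.Get(promURL + "/api/v1/query?query=" + url.QueryEscape(query))
		if err != nil {
			return err
		}
		var qr queryResponse
		err = json.NewDecoder(resp.Body).Decode(&qr)
		resp.Body.Close()
		if err != nil {
			return err
		}
		if got := len(qr.Data.Result) > 0; got != expectSeries {
			return fmt.Errorf("query %q: expected series=%t, got %d results",
				query, expectSeries, len(qr.Data.Result))
		}
	}
	return nil
}
```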
[APPROVALNOTIFIER] This PR is NOT APPROVED.

This pull-request has been approved by: lilic. The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files. Approvers can indicate their approval by writing /approve in a comment.
```go
	e2e.Logf("Watchdog alert is firing")
})

g.It("should not have any alerts in pending state the entire cluster run", func() {
```
Could you move it closer to https://github.com/openshift/origin/blob/master/test/extended/prometheus/prometheus.go#L54, as those tests are very similar and need almost the same exclusion list?
@lilic: The following tests failed, say /retest to rerun all failed tests:

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
```go
tests := map[string]bool{
	`count_over_time(ALERTS{alertname!~"Watchdog|AlertmanagerReceiversNotConfigured|KubeAPILatencyHigh", alertstate="pending"}[2h])`: false,
}
```
Do we even need the alertstate filter here? I'd rather have:

```go
`count_over_time(ALERTS{alertname!~"Watchdog|AlertmanagerReceiversNotConfigured|KubeAPILatencyHigh"}[2h])`: false
```

for "there are no surprising alerts in any state over the entire cluster run", to ensure we don't have something like a pending alert that slips through the firing test and then starts firing and slips through a pending-specific test.
Closing, as it's not as easy as first thought: some alerts can be in pending state due to the nature of our e2e tests but never eventually fire, so the list allowing those would be too long. The general idea was to catch any alerts that go pending and eventually start firing, which our current tests sometimes miss, but it's not that simple. If someone manages to come up with an idea to make sure we catch all firing alerts, I'm happy to review!

/close
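For what it's worth, one speculative way to express "pending alerts that eventually fired" as a single query would be a PromQL set intersection — a sketch of the idea only, not something this PR attempted:

```go
tests := map[string]bool{
	// Select alerts that fired during the window AND also spent time
	// pending; ignoring(alertstate) lets the two selectors match even
	// though their alertstate labels differ. Speculative sketch.
	`count_over_time(ALERTS{alertstate="firing"}[2h]) and ignoring(alertstate) count_over_time(ALERTS{alertstate="pending"}[2h])`: false,
}
```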
@lilic: Closed this PR.

In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Because our tests run only for a short amount of time, they might not catch alerts that start firing when some components get started later in the cluster run; those might only show up as alerts in pending state. This test hopes to catch alerts that would potentially end up in firing state. But also, ideally, we should not have any alerts in pending or firing state out of the box.
Let's see if this passes; if not, I will open a Bugzilla for each alert in pending state and add it to a temporary allowlist.
cc @wking