Conversation

@lilic (Contributor) commented Jun 15, 2020

Because our tests run only for a short amount of time, they might not
catch alerts that start firing when some components are started later
in the cluster run; at test time those alerts might only be in the
pending state. This test hopes to catch alerts that would eventually
end up in the firing state. Ideally, out of the box we should not have
any alerts in the pending or firing states anyway.

Let's see if this passes; if not, I will open a Bugzilla for each alert in the pending state and add it to a temporary allowlist.
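
For illustration, a minimal sketch of the idea, using the same query-table pattern that the diff below uses; runQueries is a hypothetical stand-in for whatever helper in test/extended/prometheus actually runs the map against the in-cluster Prometheus, and the allowlist shown is only an example:

tests := map[string]bool{
	// Expect no series, i.e. no alert outside the example allowlist spent
	// any time in the pending state during the last two hours.
	`count_over_time(ALERTS{alertname!~"Watchdog|AlertmanagerReceiversNotConfigured", alertstate="pending"}[2h])`: false,
}
runQueries(tests) // hypothetical runner; fails the test if any expectation is violated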

cc @wking

@openshift-ci-robot added the approved label (indicates a PR has been approved by an approver from all required OWNERS files) on Jun 15, 2020
@lilic (Contributor, Author) commented Jun 15, 2020

cc @openshift/openshift-team-monitoring

@wking (Member) commented Jun 15, 2020

Can we add this to the upgrade suite too?

@lilic (Contributor, Author) commented Jun 15, 2020

> Can we add this to the upgrade suite too?

I would start with this first, and if it works we can add it to the upgrade suite in a separate PR, sound good? I want to give this test some runs to make sure it's not going to be flaky.

@wking (Member) commented Jun 15, 2020

Yeah, we could punt. But we want the logic in a helper function, so we can easily use it in both places, right?

@lilic (Contributor, Author) commented Jun 15, 2020

> But we want the logic in a helper function, so we can easily use it in both places, right?

It's a PromQL query; I'm not sure it needs a helper function for that?

@wking (Member) commented Jun 15, 2020

It's presumably going to end up with an exclusion list, right?

@wking (Member) commented Jun 15, 2020

Never mind, looks like the precedent set by #24786 is to have separate PromQL for upgrade and non-upgrade tests. So I'm +1 to this as it stands (assuming CI doesn't turn up any existing pending alerts, as you said earlier).

}()

tests := map[string]bool{
`count_over_time(ALERTS{alertstate="pending"}[2h])`: false,

@openshift-ci-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: lilic
To complete the pull request process, please assign smarterclayton
You can assign the PR to them by writing /assign @smarterclayton in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot removed the approved label (indicates a PR has been approved by an approver from all required OWNERS files) on Jun 15, 2020
lilic added 2 commits June 15, 2020 17:13
Because our tests run only for a short amount of time, they might not
catch alerts that start firing when some components are started later
in the cluster run; at test time those alerts might only be in the
pending state. This test hopes to catch alerts that would eventually
end up in the firing state. Ideally, out of the box we should not have
any alerts in the pending or firing states anyway.
@lilic force-pushed the no-pending-alerts branch from 92253a8 to 46e18b8 on June 15, 2020 15:13
e2e.Logf("Watchdog alert is firing")
})

g.It("should not have any alerts in pending state the entire cluster run", func() {
A contributor commented:

Could you move it closer to https://github.com/openshift/origin/blob/master/test/extended/prometheus/prometheus.go#L54? Those tests are very similar and need almost the same exclusion list.

@openshift-ci-robot

@lilic: The following tests failed, say /retest to rerun all failed tests:

Test name             Commit    Details   Rerun command
ci/prow/unit          46e18b8   link      /test unit
ci/prow/e2e-gcp       46e18b8   link      /test e2e-gcp
ci/prow/e2e-aws-fips  46e18b8   link      /test e2e-aws-fips

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

}()

tests := map[string]bool{
`count_over_time(ALERTS{alertname!~"Watchdog|AlertmanagerReceiversNotConfigured|KubeAPILatencyHigh", alertstate="pending"}[2h])`: false,
A member commented:

Do we even need the alertstate filter here? I'd rather have:

`count_over_time(ALERTS{alertname!~"Watchdog|AlertmanagerReceiversNotConfigured|KubeAPILatencyHigh"}[2h])`: false

for "there are no surprising alerts in any state over the entire cluster run", to ensure we don't have something like a pending alert that slips through the firing test and then starts firing and slips through a pending-specific test.

@lilic (Contributor, Author) commented Oct 6, 2020

Closing, as it's not as easy as first thought: some alerts can end up in the pending state due to the nature of our e2e tests but never actually fire, so the allowlist for those would be too long. The general idea was to catch alerts that are in the pending state and eventually start firing, which our current tests sometimes miss, but it's not that simple. If someone comes up with an idea to make sure we catch all firing alerts, I'm happy to review!

/close

@openshift-ci-robot

@lilic: Closed this PR.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
