-
Notifications
You must be signed in to change notification settings - Fork 4.8k
TRT-1539: Do not let loki alerts fail tests #28627
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
TRT-1539: Do not let loki alerts fail tests #28627
Conversation
We saw yesterday with loki in full outage, this single test would fail reporting an alert firing in openshift-e2e-loki due to daemon set rollout stuck. Promtail pods would run but would never go ready because they couldn't communicate with loki. Filter out any alerts from openshift-e2e-loki in this test so we can pass all tests even when loki is down. We were very close otherwise as no other problems popped up.
|
@dgoodwin: This pull request references TRT-1539 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the bug to target the "4.16.0" version, but no target version was set. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
| tests := map[string]bool{ | ||
| fmt.Sprintf(`ALERTS{alertname!~"%s",alertstate="firing",severity!="info"} >= 1`, strings.Join(allowedAlertNames, "|")): false, | ||
| // openshift-e2e-loki alerts should never fail this test, we've seen this happen on daemon set rollout stuck when CI loki was down. | ||
| fmt.Sprintf(`ALERTS{alertname!~"%s",alertstate="firing",severity!="info",namespace!="openshift-e2e-loki"} >= 1`, strings.Join(allowedAlertNames, "|")): false, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have confirmed that namespace!="openshift-e2e-loki" will still match alerts that do not have a namespace with some testing on our alertmanager / prom instance.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, not having a label is like having it set to empty string.
nit: It was out of date before this change, but if you could adjust the desc (g.It("shouldn't report any alerts in firing state apart...) of the test, it'd be great.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah sorry I did not see that in time. Renaming a test actually requires additional steps outside this repo so that component readiness can continue to compare results across releases, so it's likely more effort than it's worth.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good to know, thanks.
|
/lgtm |
|
/lgtm |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: dgoodwin, machine424, stbenjam The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
/retest |
|
/override ci/prow/e2e-aws-ovn-fips Known issue. Not related to this PR. |
|
@dgoodwin: Overrode contexts on behalf of dgoodwin: ci/prow/e2e-aws-ovn-fips, ci/prow/e2e-gcp-ovn DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
|
@dgoodwin: The following tests failed, say
Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
|
[ART PR BUILD NOTIFIER] This PR has been included in build openshift-enterprise-tests-container-v4.16.0-202402291846.p0.gbb34cbc.assembly.stream.el8 for distgit openshift-enterprise-tests. |
We saw yesterday with loki in full outage, this single test would fail
reporting an alert firing in openshift-e2e-loki due to daemon set
rollout stuck. Promtail pods would run but would never go ready because they
couldn't communicate with loki.
Filter out any alerts from openshift-e2e-loki in this test so we can
pass all tests even when loki is down. We were very close otherwise as
no other problems popped up.