@dgoodwin
Contributor

Yesterday, with Loki in full outage, we saw this single test fail,
reporting an alert firing in openshift-e2e-loki because the daemon set
rollout was stuck. Promtail pods would run but never become ready because
they couldn't communicate with Loki.

Filter out any alerts from openshift-e2e-loki in this test so we can
pass all tests even when Loki is down. We were very close otherwise, as
no other problems popped up.

@dgoodwin dgoodwin changed the title from "Do not let loki alerts fail tests" to "TRT-1539: Do not let loki alerts fail tests" on Feb 28, 2024
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference label ("Indicates that this PR references a valid Jira ticket of any type.") on Feb 28, 2024
@openshift-ci-robot

openshift-ci-robot commented Feb 28, 2024

@dgoodwin: This pull request references TRT-1539 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the bug to target the "4.16.0" version, but no target version was set.


@openshift-ci openshift-ci bot requested review from jan--f and machine424 February 28, 2024 12:31
@openshift-ci openshift-ci bot added the approved label ("Indicates a PR has been approved by an approver from all required OWNERS files.") on Feb 28, 2024
tests := map[string]bool{
fmt.Sprintf(`ALERTS{alertname!~"%s",alertstate="firing",severity!="info"} >= 1`, strings.Join(allowedAlertNames, "|")): false,
// openshift-e2e-loki alerts should never fail this test, we've seen this happen on daemon set rollout stuck when CI loki was down.
fmt.Sprintf(`ALERTS{alertname!~"%s",alertstate="firing",severity!="info",namespace!="openshift-e2e-loki"} >= 1`, strings.Join(allowedAlertNames, "|")): false,
Contributor Author

I have confirmed, by testing against our Alertmanager / Prometheus instance, that namespace!="openshift-e2e-loki" still matches alerts that have no namespace label.
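To make the filter under discussion concrete, here is a minimal sketch of how the query string is rendered. The `fmt.Sprintf` format string is taken from the diff; the helper name `buildFiringAlertsQuery` and the contents of the allow-list are assumptions made purely for illustration and are not from this PR.

```go
package main

import (
	"fmt"
	"strings"
)

// buildFiringAlertsQuery is a hypothetical helper (not part of the PR) that
// renders the PromQL expression from the diff: firing, non-info alerts whose
// name is not in the allow-list and whose namespace is not the ignored one.
func buildFiringAlertsQuery(allowedAlertNames []string, ignoredNamespace string) string {
	return fmt.Sprintf(
		`ALERTS{alertname!~"%s",alertstate="firing",severity!="info",namespace!="%s"} >= 1`,
		strings.Join(allowedAlertNames, "|"), ignoredNamespace,
	)
}

func main() {
	// Assumed allow-list for illustration; the real list lives in the test.
	allowed := []string{"Watchdog", "AlertmanagerReceiversNotConfigured"}
	fmt.Println(buildFiringAlertsQuery(allowed, "openshift-e2e-loki"))
}
```

The allow-list is joined with `|` so that it forms a regular-expression alternation for the `!~` matcher, while the namespace uses a plain `!=` matcher.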

Contributor

Yes, a missing label is treated the same as a label set to the empty string.
Nit: the test description (g.It("shouldn't report any alerts in firing state apart...) was already out of date before this change, but it would be great if you could adjust it.
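The rule stated above (an absent label behaves like an empty-string label) can be sketched in a few lines of Go. This is only an illustration of the matching semantics, not Prometheus code; the function `matchesNotEqual` and the sample alert label sets are hypothetical.

```go
package main

import "fmt"

// matchesNotEqual mimics a `name!="value"` matcher. Reading a missing key
// from a Go map yields "", which mirrors the rule that an absent label is
// equivalent to a label with an empty value.
func matchesNotEqual(labels map[string]string, name, value string) bool {
	return labels[name] != value
}

func main() {
	// Hypothetical alert label sets, for illustration only.
	alerts := []map[string]string{
		{"alertname": "NoNamespaceAlert"},
		{"alertname": "LokiAlert", "namespace": "openshift-e2e-loki"},
		{"alertname": "OtherAlert", "namespace": "openshift-monitoring"},
	}
	for _, a := range alerts {
		fmt.Println(a["alertname"], matchesNotEqual(a, "namespace", "openshift-e2e-loki"))
	}
	// prints:
	// NoNamespaceAlert true
	// LokiAlert false
	// OtherAlert true
}
```

Only the alert in openshift-e2e-loki is excluded; the alert without a namespace label still matches and can therefore still fail the test.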

Contributor Author

Ah, sorry, I did not see that in time. Renaming a test requires additional steps outside this repo so that Component Readiness can continue to compare results across releases, so it's likely more effort than it's worth.

Contributor

Good to know, thanks.

@stbenjam
Member

/lgtm

@openshift-ci openshift-ci bot added the lgtm label ("Indicates that a PR is ready to be merged.") on Feb 28, 2024
@machine424
Contributor

/lgtm

@openshift-ci
Contributor

openshift-ci bot commented Feb 28, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dgoodwin, machine424, stbenjam


@openshift-ci-robot

/retest-required

Remaining retests: 0 against base HEAD 6ec17d7 and 2 for PR HEAD e38745d in total

@dgoodwin
Contributor Author

/retest

@dgoodwin
Contributor Author

/override ci/prow/e2e-aws-ovn-fips
/override ci/prow/e2e-gcp-ovn

Known issue. Not related to this PR.

@openshift-ci
Contributor

openshift-ci bot commented Feb 29, 2024

@dgoodwin: Overrode contexts on behalf of dgoodwin: ci/prow/e2e-aws-ovn-fips, ci/prow/e2e-gcp-ovn


@openshift-ci
Contributor

openshift-ci bot commented Feb 29, 2024

@dgoodwin: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-aws-ovn-single-node-upgrade | e38745d | link | false | /test e2e-aws-ovn-single-node-upgrade |
| ci/prow/e2e-aws-ovn-single-node | e38745d | link | false | /test e2e-aws-ovn-single-node |
| ci/prow/e2e-openstack-ovn | e38745d | link | false | /test e2e-openstack-ovn |
| ci/prow/e2e-metal-ipi-sdn | e38745d | link | false | /test e2e-metal-ipi-sdn |
| ci/prow/e2e-aws-ovn-single-node-serial | e38745d | link | false | /test e2e-aws-ovn-single-node-serial |
| ci/prow/e2e-aws-ovn-cgroupsv2 | e38745d | link | false | /test e2e-aws-ovn-cgroupsv2 |


@openshift-merge-bot openshift-merge-bot bot merged commit bb34cbc into openshift:master Feb 29, 2024
@dgoodwin dgoodwin deleted the ignore-loki-alerts branch February 29, 2024 13:51
@openshift-bot
Contributor

[ART PR BUILD NOTIFIER]

This PR has been included in build openshift-enterprise-tests-container-v4.16.0-202402291846.p0.gbb34cbc.assembly.stream.el8 for distgit openshift-enterprise-tests.
All builds following this will include this PR.
