
Conversation

@Miciah (Contributor) commented May 19, 2020

Follow-up to #396.

  • test/e2e/operator_test.go (TestRouteAdmissionPolicy): Use waitForDeploymentComplete instead of waitForIngressControllerCondition.
    (waitForDeploymentComplete): New function. Wait for the given deployment to complete its rollout. (A sketch of such a helper appears below.)
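
For reference, here is a minimal sketch of what such a rollout-wait helper could look like, assuming the controller-runtime client used elsewhere in these e2e tests; the exact signature and completion checks in the PR may differ:

package e2e

import (
    "context"
    "testing"
    "time"

    appsv1 "k8s.io/api/apps/v1"
    "k8s.io/apimachinery/pkg/types"
    "k8s.io/apimachinery/pkg/util/wait"
    "sigs.k8s.io/controller-runtime/pkg/client"
)

// waitForDeploymentComplete polls until the deployment's observed generation
// has caught up with its spec and all replicas are updated and available.
// Illustrative sketch only; not necessarily the PR's implementation.
func waitForDeploymentComplete(t *testing.T, cl client.Client, name types.NamespacedName, timeout time.Duration) error {
    t.Helper()
    deployment := &appsv1.Deployment{}
    return wait.PollImmediate(1*time.Second, timeout, func() (bool, error) {
        if err := cl.Get(context.TODO(), name, deployment); err != nil {
            t.Logf("failed to get deployment %s: %v", name, err)
            return false, nil
        }
        // The rollout is complete when the controller has observed the
        // latest spec and all replicas are updated and available.
        if deployment.Generation != deployment.Status.ObservedGeneration {
            return false, nil
        }
        replicas := int32(1)
        if deployment.Spec.Replicas != nil {
            replicas = *deployment.Spec.Replicas
        }
        return deployment.Status.UpdatedReplicas == replicas &&
            deployment.Status.AvailableReplicas == replicas, nil
    })
}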

@openshift-ci-robot added the bugzilla/severity-unspecified (Referenced Bugzilla bug's severity is unspecified for the PR.) and bugzilla/valid-bug (Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting.) labels on May 19, 2020
@openshift-ci-robot (Contributor)

@Miciah: This pull request references Bugzilla bug 1835025, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.5.0) matches configured target release for branch (4.5.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST)

In response to this:

Bug 1835025: TestRouteAdmissionPolicy: Fix wait for deployment

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot added the approved (Indicates a PR has been approved by an approver from all required OWNERS files.) label on May 19, 2020
@Miciah (Contributor, Author) commented May 19, 2020

/test e2e-aws

1 similar comment
@Miciah (Contributor, Author) commented May 19, 2020

/test e2e-aws

@Miciah force-pushed the BZ1835025-TestRouteAdmissionPolicy-fix-wait-for-deployment branch from 4faecca to 520060b on May 20, 2020 at 21:40
@danehans (Contributor) commented Jun 4, 2020

e2e-aws-operator fails due to:

--- FAIL: TestInternalLoadBalancer (91.04s)
    operator_test.go:566: failed to observe expected conditions: Get https://api.ci-op-7sg17nf6-43abb.origin-ci-int-aws.dev.rhcloud.com:6443/apis/operator.openshift.io/v1/namespaces/openshift-ingress

/test e2e-aws-operator

@Miciah force-pushed the BZ1835025-TestRouteAdmissionPolicy-fix-wait-for-deployment branch from 520060b to f016df2 on June 8, 2020 at 00:59
@Miciah (Contributor, Author) commented Jun 8, 2020

TestRouteHTTP2EnableAndDisableIngressController timed out because the default ingresscontroller never became available, as the Route 53 API was being throttled:

The DNS provider failed to ensure the record: failed to update alias in zone Z09357351CVW2O185N0TP: couldn't update DNS record in zone Z09357351CVW2O185N0TP: Throttling: Rate exceeded

/test e2e-aws-operator

@Miciah (Contributor, Author) commented Jun 8, 2020

  * could not run steps: step e2e-aws-operator failed: failed to acquire lease: resources not found

/test e2e-aws-operator

@Miciah (Contributor, Author) commented Jun 8, 2020

Failed to provision the cluster:

Error: Unable to find matching route for Route Table (rtb-0182f841594046c3b) and destination CIDR block (0.0.0.0/0).

/test e2e-aws-operator

@Miciah (Contributor, Author) commented Jun 9, 2020

The test passed. Let's see whether it passes consistently.

/test e2e-aws-operator

@Miciah (Contributor, Author) commented Jun 9, 2020

Passed again.

/test e2e-aws-operator

@Miciah (Contributor, Author) commented Jun 10, 2020

@danehans, the test has passed 3 times in a row now, and the earlier e2e-aws-operator failures since the last push were caused by other tests or provisioning failures, so I am feeling fairly confident that it resolves the problem. Could you take another look please?

deployment = &appsv1.Deployment{}
err := wait.PollImmediate(1*time.Second, timeout, func() (bool, error) {
    if err := cl.Get(context.TODO(), name, deployment); err != nil {
        return false, nil
Contributor

Should false, err be returned here, instead of false, nil?

Contributor Author

Probably, yeah. Generally, I think we should either t.Logf(..., err) or return false, err. In this case, the deployment should already exist, and we're only watching for a mutation, so using return false, err would make sense; on the other hand, doing so would make the test more likely to break if there are transient network or API issues. Would it make most sense in general to t.Logf(..., err) and return false, nil to decrease the likelihood that tests fail due to networking or API failures unrelated to the test?

Contributor Author

https://bugzilla.redhat.com/show_bug.cgi?id=1828618 is related to the general question (maybe it prompted your comment here).

Contributor

Yup, that BZ definitely prompted my comment.

I definitely agree that we should either log the error responsibly and return false, nil or just flat out fail by returning false, err in these situations.

To counter your point about transient network or API issues, if any of these errors were to persist for a length of time close to the timeout, returning false, err and failing the test immediately would be beneficial. Additionally, parts of the ingress operator's e2e tests already fail on unsuccessful API calls (for example, https://github.com/openshift/cluster-ingress-operator/blob/master/test/e2e/operator_test.go#L193). So I'm wondering how much we would really be decreasing the likelihood of the tests failing from transient issues by using the responsible logging and return false, nil strategy. For this reason, I am leaning towards returning false, err and outright failing. Thoughts?

Contributor Author

To counter your point about transient network or API issues, if any of these errors were to persist for a length of time close to the timeout, returning false, err and failing the test immediately would be beneficial.

I disagree—if networking or API is broken, other tests should report that breakage. Having this component's tests fail due to other components' breakage makes it more difficult to determine which component is at fault and leads to misdirected Bugzilla reports.

Additionally, parts of the ingress operator's e2e tests already fail on unsuccessful API calls (for example, https://github.com/openshift/cluster-ingress-operator/blob/master/test/e2e/operator_test.go#L193). So I'm wondering how much we would really be decreasing the likelihood of the tests failing from transient issues by using the responsible logging and return false, nil strategy. For this reason, I am leaning towards returning false, err and outright failing. Thoughts?

That is why I am speaking generally—I believe this component's tests should not fail on errors that are caused by other components, so I am suggesting that as a general principle, we should log and retry on errors that are not caused by our component (and transient errors are very unlikely to be caused by our component). Does that make sense?

Contributor

That does make sense, and I think you've convinced me. I agree that logging and retrying transient errors, when applicable, makes the most sense. Let's include this conversation in https://bugzilla.redhat.com/show_bug.cgi?id=1828618 and continue the discussion there. I'd also like to hear the rest of the team's thoughts, if that's OK.
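
To make the two strategies discussed above concrete, here is a hypothetical condition function; it reuses the cl, name, deployment, and timeout names from the hunk under review, and the rollout check is illustrative rather than the PR's exact logic:

conditionFn := func() (bool, error) {
    if err := cl.Get(context.TODO(), name, deployment); err != nil {
        // Log-and-retry: tolerate transient network or API failures and
        // let the poll timeout bound the wait.
        t.Logf("failed to get deployment %s: %v", name, err)
        return false, nil
        // Fail-fast alternative: surface the error and abort the poll
        // immediately.
        // return false, err
    }
    return deployment.Generation == deployment.Status.ObservedGeneration, nil
}
if err := wait.PollImmediate(1*time.Second, timeout, conditionFn); err != nil {
    t.Fatalf("deployment %s did not complete rollout: %v", name, err)
}

With wait.PollImmediate, the fail-fast variant turns any single Get error into a test failure, while log-and-retry only fails if the condition never becomes true within the timeout.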

Follow-up to commit 17c6350.

* test/e2e/operator_test.go (TestRouteAdmissionPolicy): Use
waitForDeploymentComplete instead of waitForIngressControllerCondition.
(waitForDeploymentComplete): New function.  Wait for the given deployment
to complete its rollout.
@Miciah force-pushed the BZ1835025-TestRouteAdmissionPolicy-fix-wait-for-deployment branch from f016df2 to 16ddfad on June 18, 2020 at 20:21
@Miciah (Contributor, Author) commented Jun 18, 2020

Failed to launch the cluster.

/test e2e-aws-operator

@danehans (Contributor)

fail [github.com/openshift/origin/test/e2e/upgrade/upgrade.go:138]: during upgrade to registry.svc.ci.openshift.org/ci-op-3ht8lmy3/release@sha256:aed90ddf207d20929117ab1d13151e712d0281e899631bc703c656cc44f1fdad
Unexpected error:
    <*errors.errorString | 0xc002f35780>: {
        s: "Cluster did not complete upgrade: timed out waiting for the condition: Cluster operator openshift-apiserver is reporting a failure: APIServerDeploymentDegraded: 1 of 3 requested instances are unavailable for apiserver.openshift-apiserver",
    }
    Cluster did not complete upgrade: timed out waiting for the condition: Cluster operator openshift-apiserver is reporting a failure: APIServerDeploymentDegraded: 1 of 3 requested instances are unavailable for apiserver.openshift-apiserver
occurred

I've seen this with other PRs recently.

/test e2e-aws-upgrade

@danehans (Contributor)

/lgtm

@openshift-ci-robot added the lgtm (Indicates that a PR is ready to be merged.) label on Jun 25, 2020
@openshift-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: danehans, Miciah

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-bot (Contributor)

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot (Contributor)

/retest

Please review the full test history for this PR and help us cut down flakes.

1 similar comment
@openshift-bot (Contributor)

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-merge-robot merged commit 8e67623 into openshift:master on Jun 26, 2020
@openshift-ci-robot (Contributor)

@Miciah: All pull requests linked via external trackers have merged: openshift/cluster-ingress-operator#400, openshift/cluster-ingress-operator#396. Bugzilla bug 1835025 has been moved to the MODIFIED state.


In response to this:

Bug 1835025: TestRouteAdmissionPolicy: Fix wait for deployment

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
