
Conversation

@Miciah (Contributor) commented May 19, 2020

Follow-up to #396.

  • test/e2e/operator_test.go (TestRouteAdmissionPolicy): Use waitForDeploymentComplete instead of waitForIngressControllerCondition.
    (waitForDeploymentComplete): New function. Wait for the given deployment to complete its rollout. (A sketch of such a helper appears below.)
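
For reference, here is a minimal sketch of what such a rollout-wait helper could look like, assuming the controller-runtime client used elsewhere in these e2e tests; the exact signature and completion checks in the PR may differ:

package e2e

import (
    "context"
    "testing"
    "time"

    appsv1 "k8s.io/api/apps/v1"
    "k8s.io/apimachinery/pkg/types"
    "k8s.io/apimachinery/pkg/util/wait"
    "sigs.k8s.io/controller-runtime/pkg/client"
)

// waitForDeploymentComplete polls until the deployment's observed generation
// has caught up with its spec and all replicas are updated and available.
// Illustrative sketch only; not necessarily the PR's implementation.
func waitForDeploymentComplete(t *testing.T, cl client.Client, name types.NamespacedName, timeout time.Duration) error {
    t.Helper()
    deployment := &appsv1.Deployment{}
    return wait.PollImmediate(1*time.Second, timeout, func() (bool, error) {
        if err := cl.Get(context.TODO(), name, deployment); err != nil {
            t.Logf("failed to get deployment %s: %v", name, err)
            return false, nil
        }
        // The rollout is complete when the controller has observed the
        // latest spec and all replicas are updated and available.
        if deployment.Generation != deployment.Status.ObservedGeneration {
            return false, nil
        }
        replicas := int32(1)
        if deployment.Spec.Replicas != nil {
            replicas = *deployment.Spec.Replicas
        }
        return deployment.Status.UpdatedReplicas == replicas &&
            deployment.Status.AvailableReplicas == replicas, nil
    })
}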

@openshift-ci-robot added the bugzilla/severity-unspecified (Referenced Bugzilla bug's severity is unspecified for the PR.) and bugzilla/valid-bug (Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting.) labels on May 19, 2020
@openshift-ci-robot (Contributor)

@Miciah: This pull request references Bugzilla bug 1835025, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.5.0) matches configured target release for branch (4.5.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST)

In response to this:

Bug 1835025: TestRouteAdmissionPolicy: Fix wait for deployment

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot added the approved (Indicates a PR has been approved by an approver from all required OWNERS files.) label on May 19, 2020
@Miciah (Contributor, Author) commented May 19, 2020

/test e2e-aws

1 similar comment
@Miciah (Contributor, Author) commented May 19, 2020

/test e2e-aws

@Miciah force-pushed the BZ1835025-TestRouteAdmissionPolicy-fix-wait-for-deployment branch from 4faecca to 520060b on May 20, 2020 at 21:40
@danehans (Contributor) commented Jun 4, 2020

e2e-aws-operator fails due to:

--- FAIL: TestInternalLoadBalancer (91.04s)
    operator_test.go:566: failed to observe expected conditions: Get https://api.ci-op-7sg17nf6-43abb.origin-ci-int-aws.dev.rhcloud.com:6443/apis/operator.openshift.io/v1/namespaces/openshift-ingress

/test e2e-aws-operator

@Miciah force-pushed the BZ1835025-TestRouteAdmissionPolicy-fix-wait-for-deployment branch from 520060b to f016df2 on June 8, 2020 at 00:59
@Miciah (Contributor, Author) commented Jun 8, 2020

TestRouteHTTP2EnableAndDisableIngressController timed out because the default ingresscontroller never became available, as the Route 53 API was being throttled:

The DNS provider failed to ensure the record: failed to update alias in zone Z09357351CVW2O185N0TP: couldn't update DNS record in zone Z09357351CVW2O185N0TP: Throttling: Rate exceeded

/test e2e-aws-operator

@Miciah (Contributor, Author) commented Jun 8, 2020

  * could not run steps: step e2e-aws-operator failed: failed to acquire lease: resources not found

/test e2e-aws-operator

@Miciah (Contributor, Author) commented Jun 8, 2020

Failed to provision the cluster:

Error: Unable to find matching route for Route Table (rtb-0182f841594046c3b) and destination CIDR block (0.0.0.0/0).

/test e2e-aws-operator

@Miciah (Contributor, Author) commented Jun 9, 2020

The test passed. Let's see whether it passes consistently.

/test e2e-aws-operator

@Miciah (Contributor, Author) commented Jun 9, 2020

Passed again.

/test e2e-aws-operator

@Miciah (Contributor, Author) commented Jun 10, 2020

@danehans, the test has passed 3 times in a row now, and the earlier e2e-aws-operator failures since the last push were caused by other tests or provisioning failures, so I am feeling fairly confident that it resolves the problem. Could you take another look please?

deployment = &appsv1.Deployment{}
err := wait.PollImmediate(1*time.Second, timeout, func() (bool, error) {
    if err := cl.Get(context.TODO(), name, deployment); err != nil {
        return false, nil
Contributor

Should false, err be returned here, instead of false, nil?

Contributor Author

Probably, yeah. Generally, I think we should either t.Logf(..., err) or return false, err. In this case, the deployment should already exist, and we're only watching for a mutation, so using return false, err would make sense; on the other hand, doing so would make the test more likely to break if there are transient network or API issues. Would it make most sense in general to t.Logf(..., err) and return false, nil to decrease the likelihood that tests fail due to networking or API failures unrelated to the test?

Contributor Author

https://bugzilla.redhat.com/show_bug.cgi?id=1828618 is related to the general question (maybe it prompted your comment here).

Contributor

Yup, that BZ definitely prompted my comment.

I definitely agree that we should either log the error responsibly and return false, nil or just flat out fail by returning false, err in these situations.

To counter your point about transient network or API issues, if any of these errors were to persist for a length of time close to the timeout, returning false, err and failing the test immediately would be beneficial. Additionally, parts of the ingress operator's e2e tests already fail on unsuccessful API calls (for example, https://github.com/openshift/cluster-ingress-operator/blob/master/test/e2e/operator_test.go#L193). So I'm wondering how much we would really be decreasing the likelihood of the tests failing from transient issues by using the responsible logging and return false, nil strategy. For this reason, I am leaning towards returning false, err and outright failing. Thoughts?

Contributor Author

To counter your point about transient network or API issues, if any of these errors were to persist for a length of time close to the timeout, returning false, err and failing the test immediately would be beneficial.

I disagree—if networking or API is broken, other tests should report that breakage. Having this component's tests fail due to other components' breakage makes it more difficult to determine which component is at fault and leads to misdirected Bugzilla reports.

Additionally, parts of the ingress operator's e2e tests already fail on unsuccessful API calls (for example, https://github.com/openshift/cluster-ingress-operator/blob/master/test/e2e/operator_test.go#L193). So I'm wondering how much we would really be decreasing the likelihood of the tests failing from transient issues by using the responsible logging and return false, nil strategy. For this reason, I am leaning towards returning false, err and outright failing. Thoughts?

That is why I am speaking generally—I believe this component's tests should not fail on errors that are caused by other components, so I am suggesting that as a general principle, we should log and retry on errors that are not caused by our component (and transient errors are very unlikely to be caused by our component). Does that make sense?

Contributor

That does make sense, and I think you've convinced me. I agree that logging and retrying transient errors, when applicable, makes the most sense. Let's include this conversation in https://bugzilla.redhat.com/show_bug.cgi?id=1828618 and continue the discussion there. I'd also like to hear the rest of the team's thoughts, if that's OK.
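
To make the two strategies discussed above concrete, here is a hypothetical condition function; it reuses the cl, name, deployment, and timeout names from the hunk under review, and the rollout check is illustrative rather than the PR's exact logic:

conditionFn := func() (bool, error) {
    if err := cl.Get(context.TODO(), name, deployment); err != nil {
        // Log-and-retry: tolerate transient network or API failures and
        // let the poll timeout bound the wait.
        t.Logf("failed to get deployment %s: %v", name, err)
        return false, nil
        // Fail-fast alternative: surface the error and abort the poll
        // immediately.
        // return false, err
    }
    return deployment.Generation == deployment.Status.ObservedGeneration, nil
}
if err := wait.PollImmediate(1*time.Second, timeout, conditionFn); err != nil {
    t.Fatalf("deployment %s did not complete rollout: %v", name, err)
}

With wait.PollImmediate, the fail-fast variant turns any single Get error into a test failure, while log-and-retry only fails if the condition never becomes true within the timeout.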

Follow-up to commit 17c6350.

* test/e2e/operator_test.go (TestRouteAdmissionPolicy): Use
waitForDeploymentComplete instead of waitForIngressControllerCondition.
(waitForDeploymentComplete): New function.  Wait for the given deployment
to complete its rollout.
@Miciah force-pushed the BZ1835025-TestRouteAdmissionPolicy-fix-wait-for-deployment branch from f016df2 to 16ddfad on June 18, 2020 at 20:21
@Miciah (Contributor, Author) commented Jun 18, 2020

Failed to launch the cluster.

/test e2e-aws-operator

@danehans (Contributor)

fail [github.com/openshift/origin/test/e2e/upgrade/upgrade.go:138]: during upgrade to registry.svc.ci.openshift.org/ci-op-3ht8lmy3/release@sha256:aed90ddf207d20929117ab1d13151e712d0281e899631bc703c656cc44f1fdad
Unexpected error:
    <*errors.errorString | 0xc002f35780>: {
        s: "Cluster did not complete upgrade: timed out waiting for the condition: Cluster operator openshift-apiserver is reporting a failure: APIServerDeploymentDegraded: 1 of 3 requested instances are unavailable for apiserver.openshift-apiserver",
    }
    Cluster did not complete upgrade: timed out waiting for the condition: Cluster operator openshift-apiserver is reporting a failure: APIServerDeploymentDegraded: 1 of 3 requested instances are unavailable for apiserver.openshift-apiserver
occurred

I've seen this with other PRs recently.

/test e2e-aws-upgrade

@danehans (Contributor)

/lgtm

@openshift-ci-robot added the lgtm (Indicates that a PR is ready to be merged.) label on Jun 25, 2020
@openshift-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: danehans, Miciah

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-bot (Contributor)

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot (Contributor)

/retest

Please review the full test history for this PR and help us cut down flakes.

1 similar comment
@openshift-bot (Contributor)

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-merge-robot merged commit 8e67623 into openshift:master on Jun 26, 2020
@openshift-ci-robot (Contributor)

@Miciah: All pull requests linked via external trackers have merged: openshift/cluster-ingress-operator#400, openshift/cluster-ingress-operator#396. Bugzilla bug 1835025 has been moved to the MODIFIED state.


In response to this:

Bug 1835025: TestRouteAdmissionPolicy: Fix wait for deployment

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
