-
Notifications
You must be signed in to change notification settings - Fork 220
Bug 1835025: TestRouteAdmissionPolicy: Fix wait for deployment #400
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Bug 1835025: TestRouteAdmissionPolicy: Fix wait for deployment #400
Conversation
|
@Miciah: This pull request references Bugzilla bug 1835025, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker. 3 validation(s) were run on this bug
DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
|
/test e2e-aws |
1 similar comment
|
/test e2e-aws |
4faecca to
520060b
Compare
|
/test e2e-aws-operator |
520060b to
f016df2
Compare
|
/test e2e-aws-operator |
/test e2e-aws-operator |
|
Failed to provision the cluster: /test e2e-aws-operator |
|
The test passed. Let's see whether it passes consistently. /test e2e-aws-operator |
|
Passed again. /test e2e-aws-operator |
|
@danehans, the test has passed 3 times in a row now, and the earlier e2e-aws-operator failures since the last push were caused by other tests or provisioning failures, so I am feeling fairly confident that it resolves the problem. Could you take another look please? |
| deployment = &appsv1.Deployment{} | ||
| err := wait.PollImmediate(1*time.Second, timeout, func() (bool, error) { | ||
| if err := cl.Get(context.TODO(), name, deployment); err != nil { | ||
| return false, nil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should false, err be returned here, instead of false, nil?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably, yeah. Generally, I think we should either t.Logf(..., err) or return false, err. In this case, the deployment should already exist, and we're only watching for a mutation, so using return false, err would make sense; on the other hand, doing so would make the test more likely to break if there are transient network or API issues. Would it make most sense in general to t.Logf(..., err) and return false, nil to decrease the likelihood that tests fail due to networking or API failures unrelated to the test?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
https://bugzilla.redhat.com/show_bug.cgi?id=1828618 is related to the general question (maybe it prompted your comment here).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup, that BZ definitely prompted my comment.
I definitely agree that we should either log the error responsibly and return false, nil or just flat out fail by returning return false, err in these situations.
To counter your point about transient network or API issues, if any of these issues errors were to persist for a length of time close to the timeout, returning false, err and failing the test immediately would be beneficial. Additionally, parts of the ingress operator's e2e test already fail on unsuccessful API calls (for example, https://github.com/openshift/cluster-ingress-operator/blob/master/test/e2e/operator_test.go#L193). So I'm wondering how much we would really be decreasing the likelihood of the tests failing from transient issues by using the responsible logging and return false, nil strategy. For this reason, I am leaning towards turning false, err and outright failing. Thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To counter your point about transient network or API issues, if any of these issues errors were to persist for a length of time close to the timeout, returning
false, errand failing the test immediately would be beneficial.
I disagree—if networking or API is broken, other tests should report that breakage. Having this component's tests fail due to other components' breakage makes it more difficult to determine which component is at fault and leads to misdirected Bugzilla reports.
Additionally, parts of the ingress operator's e2e test already fail on unsuccessful API calls (for example, https://github.com/openshift/cluster-ingress-operator/blob/master/test/e2e/operator_test.go#L193). So I'm wondering how much we would really be decreasing the likelihood of the tests failing from transient issues by using the responsible logging and return
false, nilstrategy. For this reason, I am leaning towards turningfalse, errand outright failing. Thoughts?
That is why I am speaking generally—I believe this component's tests should not fail on errors that are caused by other components, so I am suggesting that as a general principle, we should log and retry on errors that are not caused by our component (and transient errors are very unlikely to be caused by our component). Does that make sense?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That does make sense, and I think you've convinced me. I agree that logging and retrying transient errors, when applicable, makes the most sense. Let's include this conversation in https://bugzilla.redhat.com/show_bug.cgi?id=1828618 and continue the conversation there. Would like to here the rest of the teams thoughts if thats OK too.
Follow-up to commit 17c6350. * test/e2e/operator_test.go (TestRouteAdmissionPolicy): Use waitForDeploymentComplete instead of waitForIngressControllerCondition. (waitForDeploymentComplete): New function. Wait for the given deployment to complete its rollout.
f016df2 to
16ddfad
Compare
|
Failed to launch the cluster. /test e2e-aws-operator |
I've seen this with other PRs recently. /test e2e-aws-upgrade |
|
/lgtm |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: danehans, Miciah The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
/retest Please review the full test history for this PR and help us cut down flakes. |
|
/retest Please review the full test history for this PR and help us cut down flakes. |
1 similar comment
|
/retest Please review the full test history for this PR and help us cut down flakes. |
|
@Miciah: All pull requests linked via external trackers have merged: openshift/cluster-ingress-operator#400, openshift/cluster-ingress-operator#396. Bugzilla bug 1835025 has been moved to the MODIFIED state. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
Follow-up to #396.
test/e2e/operator_test.go(TestRouteAdmissionPolicy): UsewaitForDeploymentCompleteinstead ofwaitForIngressControllerCondition.(
waitForDeploymentComplete): New function. Wait for the given deployment to complete its rollout.