add Eventually() to retryable k8s E2E operations#2123
Conversation
6bb157a to
3b16de9
Compare
3b16de9 to
1eb9bad
Compare
|
/test ls |
|
@jackfrancis: The specified target(s) for
The following commands are available to trigger optional jobs:
Use
DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
|
/test pull-cluster-api-provider-azure-e2e-optional |
1eb9bad to
e2c6631
Compare
|
/test pull-cluster-api-provider-azure-e2e-optional |
e2c6631 to
9c57bbf
Compare
|
/test pull-cluster-api-provider-azure-e2e-optional |
|
/assign @CecileRobertMichon |
3a00057 to
536440a
Compare
|
/retest |
536440a to
5199f80
Compare
|
/test pull-cluster-api-provider-azure-e2e-optional |
|
/retest |
bc4774f to
50cf7c9
Compare
| return helper.Patch(ctx, owningMachinePool) | ||
| }) | ||
|
|
||
| By(fmt.Sprintf("checking for a machine to start draining for machine pool: %s/%s", amp.Namespace, amp.Name)) |
There was a problem hiding this comment.
This test was never actually running, because the Eventually() was not actually running according to any success criteria. When I enabled the test using a retry + success criteria, it was failing consistently. After investigation, it seems pretty clear that evaluating for "started but hasn't yet finished draining" is a small enough window that it will regularly fail. Let's just get rid of this altogether and run the next test ("check for successful drain").
See https://kubernetes.slack.com/archives/CEX9HENG7/p1646686318723399
There was a problem hiding this comment.
Actually it's the "finished draining but hasn't yet deleted" window that is too small.
|
/test pull-cluster-api-provider-azure-e2e-optional |
731692c to
8aeef17
Compare
|
|
||
| return errors.New("no machine has finished draining") | ||
| }) | ||
| // TODO setup a watcher to detect the terminal drain success state |
There was a problem hiding this comment.
Rather than checking for the expected state transitions serially, we can improve this in the future by setting a watch on the machine in the process of deleting. I'm making the judgment that doing that work is out of scope for this PR (this PR is about standardizing the usage of Eventually() for retryable operations).
Bottom line, this test coverage was never actually running due to the lack of .Should(Succeed()), so adding the "validate that drain begins after delete" is already net additive coverage.
There was a problem hiding this comment.
can we open an issue for this rather than leaving a TODO in the code? Also does this mean #2120 isn't fully fixed?
There was a problem hiding this comment.
The referenced issue documents a flake that should be addressed w/ the new Eventually() block in L191 below
|
/test pull-cluster-api-provider-azure-e2e-optional |
|
/retest |
| if err := input.Getter.Get(ctx, types.NamespacedName{Namespace: input.Namespace, Name: ref.Name}, | ||
| ownerMachinePool); err != nil { | ||
| Logf("Failed to get machinePool: %+v", err) | ||
| LogWarningf("Failed to get machinePool: %+v", err) |
There was a problem hiding this comment.
this seems like an actual failure and it's not in a retry block, why log it as a warning?
There was a problem hiding this comment.
There is no existent error equivalent of Logf Do we want to create one?
There was a problem hiding this comment.
I see, Logf is actually Info which is even less acurrate. I thought we were decreasing severity.
| err = servicesClient.Delete(ctx, ilbService.Name, metav1.DeleteOptions{}) | ||
| Expect(err).NotTo(HaveOccurred()) | ||
| Eventually(func() error { | ||
| err := servicesClient.Delete(ctx, ilbService.Name, metav1.DeleteOptions{}) |
There was a problem hiding this comment.
based on our previous conversation in slack, my understanding was that this Delete doesn't actually wait for the service to be deleted, it just waits for the client go Delete to return, which doesn't fully wait for the service to be gone. Is that correct?
If so, waiting 20 minutes doesn't seem appropriate. We should change it to actually wait for the service to be gone so that we avoid running into flakes later due to cloud provider being in the middle of reconciling the deleted service when we try to create a new service.
There was a problem hiding this comment.
It's actually not clear (to me) in the client-go code if the delete operation blocks on success here...
I don't see a "wait" or "do not wait" option here
There was a problem hiding this comment.
Based on the timing in logs it seems very likely that the delete is not blocking:
Mar 9 20:30:50.686: INFO: job default/curl-to-ilb-joboupqs is complete, took 10.069781536s
�[1mSTEP�[0m: deleting the ilb test resources
Mar 9 20:30:50.686: INFO: deleting the ilb service: webg7snfx-ilb
Mar 9 20:30:50.741: INFO: deleting the ilb job: curl-to-ilb-joboupqs
�[1mSTEP�[0m: creating an external Load Balancer service
Mar 9 20:30:50.776: INFO: starting to create an external Load Balancer service
We should add a check to make sure the service is gone before we proceed. We don't have to fix it in this PR but let's at least reduce the timeout, 20 minutes is too long if it's just for the action of starting the delete (and not the delete actually completing).
| }, waitForDrainOperationTimeout, waitForDrainSleepBetweenRetries).Should(Succeed()) | ||
|
|
||
| for _, machine := range ampmls { | ||
| if conditions.Has(&machine, clusterv1.DrainingSucceededCondition) && conditions.IsTrue(&machine, clusterv1.DrainingSucceededCondition) { |
There was a problem hiding this comment.
if the issue is that the machine gets deleted too quickly, I wonder if we could add a finalizer to it before the test starts (in the setup above) so it doesn't get deleted until we've done all the checks and we're ready to let it delete
There was a problem hiding this comment.
What's the proper scope for this PR? Given that these existing drain tests don't actually work at all. Should we simply abandon any changes here and follow-up with targeted improvements to drain testing?
8aeef17 to
9629164
Compare
|
/test pull-cluster-api-provider-azure-e2e-optional |
|
/lgtm let's merge this and follow up with some targetted improvements for LB and VMSS drain tests |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: CecileRobertMichon The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
What type of PR is this?
/kind failing-test
What this PR does / why we need it:
This PR introduces retries to E2E tests so that transient errors can be absorbed.
Which issue(s) this PR fixes (optional, in
fixes #<issue number>(, fixes #<issue_number>, ...)format, will close the issue(s) when PR gets merged):Fixes #2120
Special notes for your reviewer:
Please confirm that if this PR changes any image versions, then that's the sole change this PR makes.
TODOs:
Release note: