add Eventually() to retryable k8s E2E operations by jackfrancis · Pull Request #2123 · kubernetes-sigs/cluster-api-provider-azure

jackfrancis · 2022-02-25T01:29:48Z

What type of PR is this?

/kind failing-test

What this PR does / why we need it:

This PR introduces retries to E2E tests so that transient errors can be absorbed.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #2120

Special notes for your reviewer:

Please confirm that if this PR changes any image versions, then that's the sole change this PR makes.

TODOs:

squashed commits
includes documentation
adds unit tests

Release note:

NONE

jackfrancis · 2022-02-28T21:44:12Z

/test ls

k8s-ci-robot · 2022-02-28T21:44:14Z

@jackfrancis: The specified target(s) for /test were not found.
The following commands are available to trigger required jobs:

/test pull-cluster-api-provider-azure-build
/test pull-cluster-api-provider-azure-e2e
/test pull-cluster-api-provider-azure-e2e-windows-dockershim
/test pull-cluster-api-provider-azure-test
/test pull-cluster-api-provider-azure-verify

The following commands are available to trigger optional jobs:

/test pull-cluster-api-provider-azure-apidiff
/test pull-cluster-api-provider-azure-apiversion-upgrade
/test pull-cluster-api-provider-azure-capi-e2e
/test pull-cluster-api-provider-azure-ci-entrypoint
/test pull-cluster-api-provider-azure-conformance
/test pull-cluster-api-provider-azure-conformance-with-ci-artifacts
/test pull-cluster-api-provider-azure-coverage
/test pull-cluster-api-provider-azure-e2e-exp
/test pull-cluster-api-provider-azure-e2e-optional
/test pull-cluster-api-provider-azure-e2e-workload-upgrade
/test pull-cluster-api-provider-azure-upstream-windows-dockershim
/test pull-cluster-api-provider-azure-windows-containerd-upstream-with-ci-artifacts
/test pull-cluster-api-provider-azure-windows-containerd-upstream-with-ci-artifacts-serial-slow

Use /test all to run the following jobs that were automatically triggered:

pull-cluster-api-provider-azure-apidiff
pull-cluster-api-provider-azure-build
pull-cluster-api-provider-azure-ci-entrypoint
pull-cluster-api-provider-azure-coverage
pull-cluster-api-provider-azure-e2e
pull-cluster-api-provider-azure-e2e-exp
pull-cluster-api-provider-azure-e2e-windows-dockershim
pull-cluster-api-provider-azure-test
pull-cluster-api-provider-azure-verify

Details

In response to this:

/test ls

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

jackfrancis · 2022-02-28T21:44:31Z

/test pull-cluster-api-provider-azure-e2e-optional

jackfrancis · 2022-02-28T23:20:57Z

/test pull-cluster-api-provider-azure-e2e-optional

jackfrancis · 2022-03-01T00:11:32Z

/test pull-cluster-api-provider-azure-e2e-optional

jackfrancis · 2022-03-01T01:23:53Z

/assign @CecileRobertMichon

jackfrancis · 2022-03-02T00:43:40Z

/retest

jackfrancis · 2022-03-02T03:55:39Z

/test pull-cluster-api-provider-azure-e2e-optional

jackfrancis · 2022-03-02T05:22:27Z

/retest

jackfrancis · 2022-03-07T21:59:16Z

 		return helper.Patch(ctx, owningMachinePool)
-	})
-
-	By(fmt.Sprintf("checking for a machine to start draining for machine pool: %s/%s", amp.Namespace, amp.Name))


This test was never actually running, because the Eventually() was never followed by an assertion, so no success criteria were ever evaluated. When I enabled the test using a retry plus success criteria, it failed consistently. After investigation, it seems pretty clear that the window for observing "started but hasn't yet finished draining" is small enough that the check will regularly fail. Let's just get rid of this altogether and run the next test ("check for successful drain").

See https://kubernetes.slack.com/archives/CEX9HENG7/p1646686318723399

Actually it's the "finished draining but hasn't yet deleted" window that is too small.
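
The pitfall described above (an Eventually() with no terminal .Should(Succeed()), so the polled function never runs) can be sketched without Gomega. This is a minimal illustration only: `asyncAssertion`, `eventually`, `ShouldSucceed`, and `errTransient` are hypothetical stand-ins written for this example, not Gomega's API and not code from this PR.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errTransient is a hypothetical error used to simulate a flaky operation.
var errTransient = errors.New("transient error")

// asyncAssertion is a hypothetical stand-in for the value an Eventually-style
// helper returns. It only records what to poll; nothing runs yet.
type asyncAssertion struct {
	fn       func() error
	timeout  time.Duration
	interval time.Duration
}

// eventually builds the assertion and returns immediately without calling fn.
func eventually(fn func() error, timeout, interval time.Duration) *asyncAssertion {
	return &asyncAssertion{fn: fn, timeout: timeout, interval: interval}
}

// ShouldSucceed polls fn until it returns nil or the timeout elapses.
// It plays the role that .Should(Succeed()) plays in the real tests.
func (a *asyncAssertion) ShouldSucceed() error {
	deadline := time.Now().Add(a.timeout)
	for {
		err := a.fn()
		if err == nil {
			return nil
		}
		if time.Now().After(deadline) {
			return fmt.Errorf("timed out after %s: %w", a.timeout, err)
		}
		time.Sleep(a.interval)
	}
}

func main() {
	calls := 0
	flaky := func() error {
		calls++
		if calls < 3 {
			return errTransient
		}
		return nil
	}

	// No terminal call: the function is never invoked, so nothing is tested.
	eventually(flaky, time.Second, 10*time.Millisecond)
	fmt.Println("calls without assertion:", calls)

	// With the terminal call, the function is retried until it succeeds.
	err := eventually(flaky, time.Second, 10*time.Millisecond).ShouldSucceed()
	fmt.Println("calls with assertion:", calls, "err:", err)
}
```

The design point is that building the assertion and evaluating it are separate steps, so a test that omits the second step passes without checking anything.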

jackfrancis · 2022-03-07T22:43:19Z

/test pull-cluster-api-provider-azure-e2e-optional

jackfrancis · 2022-03-08T17:54:52Z

-
-		return errors.New("no machine has finished draining")
-	})
+	// TODO setup a watcher to detect the terminal drain success state


Rather than checking for the expected state transitions serially, we can improve this in the future by setting a watch on the machine in the process of deleting. I'm making the judgment that doing that work is out of scope for this PR (this PR is about standardizing the usage of Eventually() for retryable operations).

Bottom line, this test coverage was never actually running due to the lack of .Should(Succeed()), so adding the "validate that drain begins after delete" is already net additive coverage.

can we open an issue for this rather than leaving a TODO in the code? Also does this mean #2120 isn't fully fixed?

The referenced issue documents a flake that should be addressed w/ the new Eventually() block in L191 below

jackfrancis · 2022-03-08T18:47:30Z

/test pull-cluster-api-provider-azure-e2e-optional

jackfrancis · 2022-03-08T21:10:16Z

/retest

CecileRobertMichon · 2022-03-09T19:04:06Z

 				if err := input.Getter.Get(ctx, types.NamespacedName{Namespace: input.Namespace, Name: ref.Name},
 					ownerMachinePool); err != nil {
-					Logf("Failed to get machinePool: %+v", err)
+					LogWarningf("Failed to get machinePool: %+v", err)


this seems like an actual failure and it's not in a retry block, why log it as a warning?

There is no existing error equivalent of Logf. Do we want to create one?

I see, Logf is actually Info, which is even less accurate. I thought we were decreasing severity.

CecileRobertMichon · 2022-03-09T19:09:27Z

-			err = servicesClient.Delete(ctx, ilbService.Name, metav1.DeleteOptions{})
-			Expect(err).NotTo(HaveOccurred())
+			Eventually(func() error {
+				err := servicesClient.Delete(ctx, ilbService.Name, metav1.DeleteOptions{})


Based on our previous conversation in Slack, my understanding was that this Delete doesn't actually wait for the service to be deleted; it just waits for the client-go Delete call to return, which doesn't guarantee the service is gone. Is that correct?

If so, waiting 20 minutes doesn't seem appropriate. We should change it to actually wait for the service to be gone so that we avoid running into flakes later due to cloud provider being in the middle of reconciling the deleted service when we try to create a new service.

It's actually not clear (to me) in the client-go code if the delete operation blocks on success here...

https://github.com/kubernetes/client-go/blob/v0.23.0/kubernetes/typed/core/v1/service.go#L160

I don't see a "wait" or "do not wait" option here

https://github.com/kubernetes/apimachinery/blob/v0.23.0/pkg/apis/meta/v1/types.go#L482

Based on the timing in logs it seems very likely that the delete is not blocking:

Mar 9 20:30:50.686: INFO: job default/curl-to-ilb-joboupqs is complete, took 10.069781536s
STEP: deleting the ilb test resources
Mar 9 20:30:50.686: INFO: deleting the ilb service: webg7snfx-ilb
Mar 9 20:30:50.741: INFO: deleting the ilb job: curl-to-ilb-joboupqs
STEP: creating an external Load Balancer service
Mar 9 20:30:50.776: INFO: starting to create an external Load Balancer service

We should add a check to make sure the service is gone before we proceed. We don't have to fix it in this PR but let's at least reduce the timeout, 20 minutes is too long if it's just for the action of starting the delete (and not the delete actually completing).
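
One way to express the "make sure the service is gone before we proceed" check is to poll a Get after the Delete returns and treat a not-found result as success. The sketch below is an assumption-laden illustration, not the follow-up change itself: `serviceGetter`, `waitForServiceGone`, `fakeClient`, and `errNotFound` are hypothetical names. With client-go, the equivalent test on the error would be `apierrors.IsNotFound` from `k8s.io/apimachinery/pkg/api/errors`.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errNotFound stands in for the API server's NotFound error in this sketch.
var errNotFound = errors.New("not found")

// serviceGetter is a hypothetical, minimal view of a Services client.
type serviceGetter interface {
	Get(name string) error
}

// waitForServiceGone polls Get until it reports errNotFound or the timeout elapses.
func waitForServiceGone(c serviceGetter, name string, timeout, interval time.Duration) error {
	deadline := time.Now().Add(timeout)
	for {
		err := c.Get(name)
		if errors.Is(err, errNotFound) {
			return nil // the object is really gone, not just marked for deletion
		}
		if time.Now().After(deadline) {
			return fmt.Errorf("service %q still present after %s (last error: %v)", name, timeout, err)
		}
		time.Sleep(interval)
	}
}

// fakeClient reports the service as present for the first `remaining` Gets,
// simulating a delete that completes asynchronously.
type fakeClient struct{ remaining int }

func (f *fakeClient) Get(string) error {
	if f.remaining > 0 {
		f.remaining--
		return nil
	}
	return errNotFound
}

func main() {
	c := &fakeClient{remaining: 2}
	err := waitForServiceGone(c, "web-ilb", time.Second, 10*time.Millisecond)
	fmt.Println("gone:", err == nil)
}
```

Separating the two waits also allows separate timeouts: a short one for issuing the delete, and a longer one for the object to actually disappear.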

CecileRobertMichon · 2022-03-09T19:17:34Z

+	}, waitForDrainOperationTimeout, waitForDrainSleepBetweenRetries).Should(Succeed())

-		for _, machine := range ampmls {
-			if conditions.Has(&machine, clusterv1.DrainingSucceededCondition) && conditions.IsTrue(&machine, clusterv1.DrainingSucceededCondition) {


if the issue is that the machine gets deleted too quickly, I wonder if we could add a finalizer to it before the test starts (in the setup above) so it doesn't get deleted until we've done all the checks and we're ready to let it delete

What's the proper scope for this PR, given that these existing drain tests don't actually work at all? Should we simply abandon any changes here and follow up with targeted improvements to drain testing?

jackfrancis · 2022-03-09T20:24:30Z

/test pull-cluster-api-provider-azure-e2e-optional

CecileRobertMichon · 2022-03-09T23:52:32Z

/lgtm
/approve

let's merge this and follow up with some targeted improvements for LB and VMSS drain tests

k8s-ci-robot · 2022-03-09T23:52:45Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: CecileRobertMichon

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~OWNERS~~ [CecileRobertMichon]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Feb 25, 2022

jackfrancis changed the title ~~add Eventually() to retryable k8s E2E operations~~ WIP add Eventually() to retryable k8s E2E operations Feb 25, 2022

k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Feb 25, 2022

k8s-ci-robot requested review from CecileRobertMichon and juan-lee February 25, 2022 01:30

k8s-ci-robot added area/provider/azure Issues or PRs related to azure provider sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. labels Feb 25, 2022

CecileRobertMichon reviewed Feb 25, 2022

Comment thread test/e2e/azure_machinepool_drain.go Outdated

jackfrancis force-pushed the e2e-eventually branch 5 times, most recently from 6bb157a to 3b16de9 Compare February 28, 2022 19:00

jackfrancis changed the title ~~WIP add Eventually() to retryable k8s E2E operations~~ add Eventually() to retryable k8s E2E operations Feb 28, 2022

k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 28, 2022

jackfrancis force-pushed the e2e-eventually branch from 3b16de9 to 1eb9bad Compare February 28, 2022 21:07

jackfrancis force-pushed the e2e-eventually branch from 1eb9bad to e2c6631 Compare February 28, 2022 22:36

jackfrancis force-pushed the e2e-eventually branch from e2c6631 to 9c57bbf Compare March 1, 2022 00:08

k8s-ci-robot assigned CecileRobertMichon Mar 1, 2022

jackfrancis force-pushed the e2e-eventually branch 2 times, most recently from 3a00057 to 536440a Compare March 1, 2022 23:15

jackfrancis force-pushed the e2e-eventually branch from 536440a to 5199f80 Compare March 2, 2022 03:52

jackfrancis force-pushed the e2e-eventually branch 2 times, most recently from bc4774f to 50cf7c9 Compare March 7, 2022 21:56

jackfrancis commented Mar 7, 2022

jackfrancis force-pushed the e2e-eventually branch from 731692c to 8aeef17 Compare March 8, 2022 17:50

jackfrancis commented Mar 8, 2022

CecileRobertMichon reviewed Mar 9, 2022

Comment thread test/e2e/aks.go

CecileRobertMichon reviewed Mar 9, 2022

Comment thread test/e2e/azure_machinepool_drain.go Outdated

CecileRobertMichon reviewed Mar 9, 2022

Comment thread test/e2e/azure_machinepool_drain.go Outdated

CecileRobertMichon reviewed Mar 9, 2022

add Eventually() to retryable k8s E2E operations

9629164

jackfrancis force-pushed the e2e-eventually branch from 8aeef17 to 9629164 Compare March 9, 2022 19:52

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 9, 2022

k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 9, 2022

k8s-ci-robot merged commit 22309f3 into kubernetes-sigs:main Mar 9, 2022

k8s-ci-robot added this to the v1.3 milestone Mar 9, 2022

CecileRobertMichon mentioned this pull request Mar 10, 2022

E2E: waiting for services to be deleted before proceeding #2157

Merged

3 tasks

jackfrancis deleted the e2e-eventually branch December 9, 2022 22:51

CecileRobertMichon mentioned this pull request Feb 21, 2023

Add proposal for Azure Service Operator #3113

Merged

3 tasks
