add Eventually() to retryable k8s E2E operations#2123

Merged
k8s-ci-robot merged 1 commit into
kubernetes-sigs:mainfrom
jackfrancis:e2e-eventually
Mar 9, 2022

@jackfrancis
Contributor

What type of PR is this?

/kind failing-test

What this PR does / why we need it:

This PR introduces retries to E2E tests so that transient errors can be absorbed.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #2120

Special notes for your reviewer:

Please confirm that if this PR changes any image versions, then that's the sole change this PR makes.

TODOs:

  • squashed commits
  • includes documentation
  • adds unit tests

Release note:

NONE

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Feb 25, 2022
@jackfrancis jackfrancis changed the title add Eventually() to retryable k8s E2E operations WIP add Eventually() to retryable k8s E2E operations Feb 25, 2022
@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Feb 25, 2022
@k8s-ci-robot k8s-ci-robot added area/provider/azure Issues or PRs related to azure provider sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. labels Feb 25, 2022
Comment thread test/e2e/azure_machinepool_drain.go Outdated
@jackfrancis jackfrancis force-pushed the e2e-eventually branch 5 times, most recently from 6bb157a to 3b16de9 Compare February 28, 2022 19:00
@jackfrancis jackfrancis changed the title WIP add Eventually() to retryable k8s E2E operations add Eventually() to retryable k8s E2E operations Feb 28, 2022
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 28, 2022
@jackfrancis
Contributor Author

/test ls

@k8s-ci-robot
Contributor

@jackfrancis: The specified target(s) for /test were not found.
The following commands are available to trigger required jobs:

  • /test pull-cluster-api-provider-azure-build
  • /test pull-cluster-api-provider-azure-e2e
  • /test pull-cluster-api-provider-azure-e2e-windows-dockershim
  • /test pull-cluster-api-provider-azure-test
  • /test pull-cluster-api-provider-azure-verify

The following commands are available to trigger optional jobs:

  • /test pull-cluster-api-provider-azure-apidiff
  • /test pull-cluster-api-provider-azure-apiversion-upgrade
  • /test pull-cluster-api-provider-azure-capi-e2e
  • /test pull-cluster-api-provider-azure-ci-entrypoint
  • /test pull-cluster-api-provider-azure-conformance
  • /test pull-cluster-api-provider-azure-conformance-with-ci-artifacts
  • /test pull-cluster-api-provider-azure-coverage
  • /test pull-cluster-api-provider-azure-e2e-exp
  • /test pull-cluster-api-provider-azure-e2e-optional
  • /test pull-cluster-api-provider-azure-e2e-workload-upgrade
  • /test pull-cluster-api-provider-azure-upstream-windows-dockershim
  • /test pull-cluster-api-provider-azure-windows-containerd-upstream-with-ci-artifacts
  • /test pull-cluster-api-provider-azure-windows-containerd-upstream-with-ci-artifacts-serial-slow

Use /test all to run the following jobs that were automatically triggered:

  • pull-cluster-api-provider-azure-apidiff
  • pull-cluster-api-provider-azure-build
  • pull-cluster-api-provider-azure-ci-entrypoint
  • pull-cluster-api-provider-azure-coverage
  • pull-cluster-api-provider-azure-e2e
  • pull-cluster-api-provider-azure-e2e-exp
  • pull-cluster-api-provider-azure-e2e-windows-dockershim
  • pull-cluster-api-provider-azure-test
  • pull-cluster-api-provider-azure-verify

In response to this:

/test ls

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@jackfrancis
Contributor Author

/test pull-cluster-api-provider-azure-e2e-optional

@jackfrancis
Contributor Author

/test pull-cluster-api-provider-azure-e2e-optional

@jackfrancis
Contributor Author

/test pull-cluster-api-provider-azure-e2e-optional

@jackfrancis
Contributor Author

/assign @CecileRobertMichon

@jackfrancis jackfrancis force-pushed the e2e-eventually branch 2 times, most recently from 3a00057 to 536440a Compare March 1, 2022 23:15
@jackfrancis
Contributor Author

/retest

@jackfrancis
Contributor Author

/test pull-cluster-api-provider-azure-e2e-optional

@jackfrancis
Contributor Author

/retest

@jackfrancis jackfrancis force-pushed the e2e-eventually branch 2 times, most recently from bc4774f to 50cf7c9 Compare March 7, 2022 21:56
return helper.Patch(ctx, owningMachinePool)
})

By(fmt.Sprintf("checking for a machine to start draining for machine pool: %s/%s", amp.Namespace, amp.Name))
Contributor Author

This test was never actually running, because the Eventually() was not actually running according to any success criteria. When I enabled the test using a retry + success criteria, it was failing consistently. After investigation, it seems pretty clear that evaluating for "started but hasn't yet finished draining" is a small enough window that it will regularly fail. Let's just get rid of this altogether and run the next test ("check for successful drain").

See https://kubernetes.slack.com/archives/CEX9HENG7/p1646686318723399

Contributor Author

Actually it's the "finished draining but hasn't yet deleted" window that is too small.

@jackfrancis
Contributor Author

/test pull-cluster-api-provider-azure-e2e-optional


return errors.New("no machine has finished draining")
})
// TODO setup a watcher to detect the terminal drain success state
Contributor Author

Rather than checking for the expected state transitions serially, we can improve this in the future by setting a watch on the machine in the process of deleting. I'm making the judgment that doing that work is out of scope for this PR (this PR is about standardizing the usage of Eventually() for retryable operations).

Bottom line, this test coverage was never actually running due to the lack of .Should(Succeed()), so adding the "validate that drain begins after delete" is already net additive coverage.

Contributor

can we open an issue for this rather than leaving a TODO in the code? Also does this mean #2120 isn't fully fixed?

Contributor Author

The referenced issue documents a flake that should be addressed w/ the new Eventually() block in L191 below

@jackfrancis
Contributor Author

/test pull-cluster-api-provider-azure-e2e-optional

@jackfrancis
Contributor Author

/retest

Comment thread test/e2e/aks.go
Comment thread test/e2e/aks.go
if err := input.Getter.Get(ctx, types.NamespacedName{Namespace: input.Namespace, Name: ref.Name},
ownerMachinePool); err != nil {
Logf("Failed to get machinePool: %+v", err)
LogWarningf("Failed to get machinePool: %+v", err)
Contributor

this seems like an actual failure and it's not in a retry block, why log it as a warning?

Contributor Author

There is no existing error-level equivalent of Logf. Do we want to create one?

Contributor

I see, Logf is actually Info, which is even less accurate. I thought we were decreasing severity.

Comment thread test/e2e/azure_machinepool_drain.go Outdated
Comment thread test/e2e/azure_lb.go
err = servicesClient.Delete(ctx, ilbService.Name, metav1.DeleteOptions{})
Expect(err).NotTo(HaveOccurred())
Eventually(func() error {
err := servicesClient.Delete(ctx, ilbService.Name, metav1.DeleteOptions{})
Contributor

based on our previous conversation in slack, my understanding was that this Delete doesn't actually wait for the service to be deleted; it only waits for the client-go Delete call to return, which happens before the service is gone. Is that correct?

If so, waiting 20 minutes doesn't seem appropriate. We should change it to actually wait for the service to be gone so that we avoid running into flakes later due to cloud provider being in the middle of reconciling the deleted service when we try to create a new service.

Contributor Author

It's actually not clear (to me) in the client-go code if the delete operation blocks on success here...

https://github.com/kubernetes/client-go/blob/v0.23.0/kubernetes/typed/core/v1/service.go#L160

I don't see a "wait" or "do not wait" option here

https://github.com/kubernetes/apimachinery/blob/v0.23.0/pkg/apis/meta/v1/types.go#L482

Contributor

Based on the timing in logs it seems very likely that the delete is not blocking:

Mar  9 20:30:50.686: INFO: job default/curl-to-ilb-joboupqs is complete, took 10.069781536s
STEP: deleting the ilb test resources
Mar  9 20:30:50.686: INFO: deleting the ilb service: webg7snfx-ilb
Mar  9 20:30:50.741: INFO: deleting the ilb job: curl-to-ilb-joboupqs
STEP: creating an external Load Balancer service
Mar  9 20:30:50.776: INFO: starting to create an external Load Balancer service

We should add a check to make sure the service is gone before we proceed. We don't have to fix it in this PR but let's at least reduce the timeout, 20 minutes is too long if it's just for the action of starting the delete (and not the delete actually completing).

Comment thread test/e2e/azure_machinepool_drain.go Outdated
Comment thread test/e2e/azure_machinepool_drain.go Outdated
}, waitForDrainOperationTimeout, waitForDrainSleepBetweenRetries).Should(Succeed())

for _, machine := range ampmls {
if conditions.Has(&machine, clusterv1.DrainingSucceededCondition) && conditions.IsTrue(&machine, clusterv1.DrainingSucceededCondition) {
Contributor

if the issue is that the machine gets deleted too quickly, I wonder if we could add a finalizer to it before the test starts (in the setup above) so it doesn't get deleted until we've done all the checks and we're ready to let it delete
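The finalizer idea relies on general Kubernetes behavior: deleting an object that still has finalizers only marks it for deletion, and the object is removed once the last finalizer is gone. The toy model below illustrates that behavior only. `store`, `object`, and the `test.example/hold` finalizer name are hypothetical and do not come from this PR or from any Kubernetes library.

```go
package main

import "fmt"

// holdFinalizer is a made-up finalizer name used only in this sketch.
const holdFinalizer = "test.example/hold"

// object is a hypothetical, minimal model of a Kubernetes object.
type object struct {
	name       string
	finalizers []string
	deleting   bool // stands in for a non-nil deletionTimestamp
}

// store is a hypothetical in-memory stand-in for the API server.
type store struct {
	objects map[string]*object
}

// Get returns the object and whether it still exists.
func (s *store) Get(name string) (*object, bool) {
	o, ok := s.objects[name]
	return o, ok
}

// Delete marks the object as deleting; it is removed only if no
// finalizers remain.
func (s *store) Delete(name string) {
	if o, ok := s.objects[name]; ok {
		o.deleting = true
		s.reap(name)
	}
}

// RemoveFinalizer drops one finalizer and removes the object if it was
// already pending deletion and no finalizers are left.
func (s *store) RemoveFinalizer(name, finalizer string) {
	o, ok := s.objects[name]
	if !ok {
		return
	}
	kept := o.finalizers[:0]
	for _, f := range o.finalizers {
		if f != finalizer {
			kept = append(kept, f)
		}
	}
	o.finalizers = kept
	s.reap(name)
}

// reap removes an object that is pending deletion and has no finalizers.
func (s *store) reap(name string) {
	if o := s.objects[name]; o != nil && o.deleting && len(o.finalizers) == 0 {
		delete(s.objects, name)
	}
}

func main() {
	s := &store{objects: map[string]*object{
		"machine-0": {name: "machine-0", finalizers: []string{holdFinalizer}},
	}}

	s.Delete("machine-0")
	o, ok := s.Get("machine-0")
	fmt.Println("still present after Delete:", ok && o.deleting)

	// The test can inspect drain conditions here before releasing the object.
	s.RemoveFinalizer("machine-0", holdFinalizer)
	_, ok = s.Get("machine-0")
	fmt.Println("present after finalizer removed:", ok)
}
```

Under that model, the test controls when the object disappears, so the "finished draining but not yet deleted" window can be held open for as long as the checks need.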

Contributor Author

What's the proper scope for this PR? Given that these existing drain tests don't actually work at all. Should we simply abandon any changes here and follow-up with targeted improvements to drain testing?

@jackfrancis
Contributor Author

/test pull-cluster-api-provider-azure-e2e-optional

@CecileRobertMichon
Contributor

/lgtm
/approve

let's merge this and follow up with some targeted improvements for LB and VMSS drain tests

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 9, 2022
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: CecileRobertMichon

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 9, 2022
@k8s-ci-robot k8s-ci-robot merged commit 22309f3 into kubernetes-sigs:main Mar 9, 2022
@k8s-ci-robot k8s-ci-robot added this to the v1.3 milestone Mar 9, 2022
@jackfrancis jackfrancis deleted the e2e-eventually branch December 9, 2022 22:51
Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. area/provider/azure Issues or PRs related to azure provider cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note-none Denotes a PR that doesn't merit a release note. sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

VMSS e2e test flake during drain

4 participants