Skip to content

e2e: wait 5 mins for LB create#2162

Closed
jackfrancis wants to merge 1 commit into
kubernetes-sigs:mainfrom
jackfrancis:e2e-lb-wait-for-create
Closed

e2e: wait 5 mins for LB create#2162
jackfrancis wants to merge 1 commit into
kubernetes-sigs:mainfrom
jackfrancis:e2e-lb-wait-for-create

Conversation

@jackfrancis
Copy link
Copy Markdown
Contributor

@jackfrancis jackfrancis commented Mar 10, 2022

What type of PR is this?

/kind flake

What this PR does / why we need it:

Observed a flake waiting for LB create here:

https://storage.googleapis.com/kubernetes-jenkins/pr-logs/pull/kubernetes-sigs_cluster-api-provider-azure/2160/pull-cluster-api-provider-azure-e2e/1501991622571724800/build-log.txt

This PR increases the timeout tolerance for the creation of a LoadBalancer to 5 mins from 30 seconds.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #

Special notes for your reviewer:

Please confirm that if this PR changes any image versions, then that's the sole change this PR makes.

TODOs:

  • squashed commits
  • includes documentation
  • adds unit tests

Release note:

NONE

@k8s-ci-robot k8s-ci-robot added do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Mar 10, 2022
@k8s-ci-robot
Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please assign neolit123 after the PR has been reviewed.
You can assign the PR to them by writing /assign @neolit123 in a comment when ready.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added area/provider/azure Issues or PRs related to azure provider sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. release-note-none Denotes a PR that doesn't merit a release note. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. and removed do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Mar 10, 2022
Comment thread test/e2e/azure_lb.go
}
return nil
}, retryableOperationTimeout, retryableOperationSleepBetweenRetries).Should(Succeed())
}, waitForLBCreateTimeout, retryableOperationSleepBetweenRetries).Should(Succeed())
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at the logs Mar 10 19:25:52.827: INFO: waiting for service default/web-windowsarn1ku-elb to be available it looks like the flake is actually at the step above WaitForServiceAvailable https://github.com/kubernetes-sigs/cluster-api-provider-azure/pull/2162/files#diff-b1cadbd362567a300bb6df435abeafb8f91e45cfbca237a17b37253120d87aa5R143


    �[91mFull Stack Trace�[0m
    sigs.k8s.io/cluster-api-provider-azure/test/e2e.WaitForServiceAvailable({0x25dc9c0, 0xc0000560d0}, {{0x259ebe0, 0xc000bb4e20}, 0xc000d84780, 0xc000bd6900}, {0xc00047e760, 0x2, 0x2})
    	/home/prow/go/src/sigs.k8s.io/cluster-api-provider-azure/test/e2e/helpers.go:262 +0x49

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right.

This is super confusing:

sigs.k8s.io/cluster-api-provider-azure/test/e2e.AzureLBSpec({0x25dc9c0, 0xc0000560d0}, 0x6)
    	/home/prow/go/src/sigs.k8s.io/cluster-api-provider-azure/test/e2e/azure_lb.go:220 +0x15aa

L229 on the branch that produced this flake are in a different flow:

In any event... it seems like we had a flake where the LB was not ready in 15 mins :(

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

was this after #2157 merged?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Prow does rebase before it runs jobs and looks like it did include the change since it has "waiting for the ilb service to be deleted":

�[1mSTEP�[0m: waiting for service default/web-windowsarn1ku-ilb to be available
Mar 10 19:24:09.332: INFO: waiting for service default/web-windowsarn1ku-ilb to be available
Mar 10 19:25:20.174: INFO: service default/web-windowsarn1ku-ilb is available, took 1m10.842282999s
�[1mSTEP�[0m: connecting to the internal LB service from a curl pod
Mar 10 19:25:20.279: INFO: starting to create a curl to ilb job
�[1mSTEP�[0m: waiting for job default/curl-to-ilb-jobx0ek1 to be complete
Mar 10 19:25:20.389: INFO: waiting for job default/curl-to-ilb-jobx0ek1 to be complete
Mar 10 19:25:30.598: INFO: job default/curl-to-ilb-jobx0ek1 is complete, took 10.209170132s
�[1mSTEP�[0m: deleting the ilb test resources
Mar 10 19:25:30.598: INFO: starting to delete the ilb service: web-windowsarn1ku-ilb
Mar 10 19:25:30.745: INFO: waiting for the ilb service to be deleted: web-windowsarn1ku-ilb
Mar 10 19:25:52.588: INFO: deleting the ilb job: curl-to-ilb-jobx0ek1
�[1mSTEP�[0m: creating an external Load Balancer service
Mar 10 19:25:52.697: INFO: starting to create an external Load Balancer service
�[1mSTEP�[0m: waiting for service default/web-windowsarn1ku-elb to be available
Mar 10 19:25:52.827: INFO: waiting for service default/web-windowsarn1ku-elb to be available

I would look at cloud provider logs to try and figure out what happened

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like some sort of transient error happened:

E0310 19:40:56.433186       1 azure_backoff.go:332] InterfacesClient.CreateOrUpdate(capz-e2e-42b9n-nic) failed: Retriable: true, RetryAfter: 0s, HTTPStatusCode: -1, RawError: Future#WaitForCompletion: context has been cancelled: StatusCode=200 -- Original Error: context deadline exceeded
E0310 19:40:56.433265       1 azure_loadbalancer.go:185] reconcileLoadBalancer(default/web-windowsarn1ku-elb) failed: Retriable: true, RetryAfter: 0s, HTTPStatusCode: -1, RawError: Future#WaitForCompletion: context has been cancelled: StatusCode=200 -- Original Error: context deadline exceeded
I0310 19:40:56.433304       1 controller.go:848] Finished syncing service "default/web-windowsarn1ku-elb" (15m3.642610408s)
E0310 19:40:56.433327       1 controller.go:310] error processing service default/web-windowsarn1ku-elb (will retry): failed to ensure load balancer: Retriable: true, RetryAfter: 0s, HTTPStatusCode: -1, RawError: Future#WaitForCompletion: context has been cancelled: StatusCode=200 -- Original Error: context deadline exceeded
I0310 19:40:56.433911       1 event.go:291] "Event occurred" object="default/web-windowsarn1ku-elb" kind="Service" apiVersion="v1" type="Warning" reason="SyncLoadBalancerFailed" message="Error syncing load balancer: failed to ensure load balancer: Retriable: true, RetryAfter: 0s, HTTPStatusCode: -1, RawError: Future#WaitForCompletion: context has been cancelled: StatusCode=200 -- Original Error: context deadline exceeded"

and then it eventually succeeded:

I0310 19:41:07.302039       1 event.go:291] "Event occurred" object="default/web-windowsarn1ku-elb" kind="Service" apiVersion="v1" type="Normal" reason="EnsuredLoadBalancer" message="Ensured load balancer"

19:41 is over 15 minutes from 19:25 which explains the timeout.

https://storage.googleapis.com/kubernetes-jenkins/pr-logs/pull/kubernetes-sigs_cluster-api-provider-azure/2160/pull-cluster-api-provider-azure-e2e/1501991622571724800/artifacts/clusters/capz-e2e-7ehh4v-ha/kube-system/kube-controller-manager-capz-e2e-7ehh4v-ha-control-plane-bk7q7/kube-controller-manager.log

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was just coming to the same conclusion

I0310 19:25:52.790705 1 controller.go:403] Ensuring load balancer for service default/web-windowsarn1ku-elb
...
I0310 19:41:07.440858 1 controller.go:848] Finished syncing service "default/web-windowsarn1ku-elb" (6.007310418s)

@k8s-ci-robot
Copy link
Copy Markdown
Contributor

@jackfrancis: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-cluster-api-provider-azure-ci-entrypoint 86f39f5 link false /test pull-cluster-api-provider-azure-ci-entrypoint

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/provider/azure Issues or PRs related to azure provider cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. release-note-none Denotes a PR that doesn't merit a release note. sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants