e2e: wait 5 mins for LB create#2162
Conversation
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
| } | ||
| return nil | ||
| }, retryableOperationTimeout, retryableOperationSleepBetweenRetries).Should(Succeed()) | ||
| }, waitForLBCreateTimeout, retryableOperationSleepBetweenRetries).Should(Succeed()) |
There was a problem hiding this comment.
Looking at the logs Mar 10 19:25:52.827: INFO: waiting for service default/web-windowsarn1ku-elb to be available it looks like the flake is actually at the step above WaitForServiceAvailable https://github.com/kubernetes-sigs/cluster-api-provider-azure/pull/2162/files#diff-b1cadbd362567a300bb6df435abeafb8f91e45cfbca237a17b37253120d87aa5R143
�[91mFull Stack Trace�[0m
sigs.k8s.io/cluster-api-provider-azure/test/e2e.WaitForServiceAvailable({0x25dc9c0, 0xc0000560d0}, {{0x259ebe0, 0xc000bb4e20}, 0xc000d84780, 0xc000bd6900}, {0xc00047e760, 0x2, 0x2})
/home/prow/go/src/sigs.k8s.io/cluster-api-provider-azure/test/e2e/helpers.go:262 +0x49
There was a problem hiding this comment.
You're right.
This is super confusing:
sigs.k8s.io/cluster-api-provider-azure/test/e2e.AzureLBSpec({0x25dc9c0, 0xc0000560d0}, 0x6)
/home/prow/go/src/sigs.k8s.io/cluster-api-provider-azure/test/e2e/azure_lb.go:220 +0x15aa
L229 on the branch that produced this flake are in a different flow:
In any event... it seems like we had a flake where the LB was not ready in 15 mins :(
There was a problem hiding this comment.
based on commit history of this branch, no
There was a problem hiding this comment.
Prow does rebase before it runs jobs and looks like it did include the change since it has "waiting for the ilb service to be deleted":
�[1mSTEP�[0m: waiting for service default/web-windowsarn1ku-ilb to be available
Mar 10 19:24:09.332: INFO: waiting for service default/web-windowsarn1ku-ilb to be available
Mar 10 19:25:20.174: INFO: service default/web-windowsarn1ku-ilb is available, took 1m10.842282999s
�[1mSTEP�[0m: connecting to the internal LB service from a curl pod
Mar 10 19:25:20.279: INFO: starting to create a curl to ilb job
�[1mSTEP�[0m: waiting for job default/curl-to-ilb-jobx0ek1 to be complete
Mar 10 19:25:20.389: INFO: waiting for job default/curl-to-ilb-jobx0ek1 to be complete
Mar 10 19:25:30.598: INFO: job default/curl-to-ilb-jobx0ek1 is complete, took 10.209170132s
�[1mSTEP�[0m: deleting the ilb test resources
Mar 10 19:25:30.598: INFO: starting to delete the ilb service: web-windowsarn1ku-ilb
Mar 10 19:25:30.745: INFO: waiting for the ilb service to be deleted: web-windowsarn1ku-ilb
Mar 10 19:25:52.588: INFO: deleting the ilb job: curl-to-ilb-jobx0ek1
�[1mSTEP�[0m: creating an external Load Balancer service
Mar 10 19:25:52.697: INFO: starting to create an external Load Balancer service
�[1mSTEP�[0m: waiting for service default/web-windowsarn1ku-elb to be available
Mar 10 19:25:52.827: INFO: waiting for service default/web-windowsarn1ku-elb to be available
I would look at cloud provider logs to try and figure out what happened
There was a problem hiding this comment.
Looks like some sort of transient error happened:
E0310 19:40:56.433186 1 azure_backoff.go:332] InterfacesClient.CreateOrUpdate(capz-e2e-42b9n-nic) failed: Retriable: true, RetryAfter: 0s, HTTPStatusCode: -1, RawError: Future#WaitForCompletion: context has been cancelled: StatusCode=200 -- Original Error: context deadline exceeded
E0310 19:40:56.433265 1 azure_loadbalancer.go:185] reconcileLoadBalancer(default/web-windowsarn1ku-elb) failed: Retriable: true, RetryAfter: 0s, HTTPStatusCode: -1, RawError: Future#WaitForCompletion: context has been cancelled: StatusCode=200 -- Original Error: context deadline exceeded
I0310 19:40:56.433304 1 controller.go:848] Finished syncing service "default/web-windowsarn1ku-elb" (15m3.642610408s)
E0310 19:40:56.433327 1 controller.go:310] error processing service default/web-windowsarn1ku-elb (will retry): failed to ensure load balancer: Retriable: true, RetryAfter: 0s, HTTPStatusCode: -1, RawError: Future#WaitForCompletion: context has been cancelled: StatusCode=200 -- Original Error: context deadline exceeded
I0310 19:40:56.433911 1 event.go:291] "Event occurred" object="default/web-windowsarn1ku-elb" kind="Service" apiVersion="v1" type="Warning" reason="SyncLoadBalancerFailed" message="Error syncing load balancer: failed to ensure load balancer: Retriable: true, RetryAfter: 0s, HTTPStatusCode: -1, RawError: Future#WaitForCompletion: context has been cancelled: StatusCode=200 -- Original Error: context deadline exceeded"
and then it eventually succeeded:
I0310 19:41:07.302039 1 event.go:291] "Event occurred" object="default/web-windowsarn1ku-elb" kind="Service" apiVersion="v1" type="Normal" reason="EnsuredLoadBalancer" message="Ensured load balancer"
19:41 is over 15 minutes from 19:25 which explains the timeout.
There was a problem hiding this comment.
I was just coming to the same conclusion
I0310 19:25:52.790705 1 controller.go:403] Ensuring load balancer for service default/web-windowsarn1ku-elb
...
I0310 19:41:07.440858 1 controller.go:848] Finished syncing service "default/web-windowsarn1ku-elb" (6.007310418s)
|
@jackfrancis: The following test failed, say
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
What type of PR is this?
/kind flake
What this PR does / why we need it:
Observed a flake waiting for LB create here:
https://storage.googleapis.com/kubernetes-jenkins/pr-logs/pull/kubernetes-sigs_cluster-api-provider-azure/2160/pull-cluster-api-provider-azure-e2e/1501991622571724800/build-log.txt
This PR increases the timeout tolerance for the creation of a LoadBalancer to 5 mins from 30 seconds.
Which issue(s) this PR fixes (optional, in
fixes #<issue number>(, fixes #<issue_number>, ...)format, will close the issue(s) when PR gets merged):Fixes #
Special notes for your reviewer:
Please confirm that if this PR changes any image versions, then that's the sole change this PR makes.
TODOs:
Release note: