Skip to content
Closed
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
6 changes: 4 additions & 2 deletions test/e2e/azure_lb.go
Original file line number Diff line number Diff line change
Expand Up @@ -42,6 +42,8 @@ import (
"sigs.k8s.io/cluster-api-provider-azure/test/e2e/kubernetes/windows"
)

const waitForLBCreateTimeout = 5 * time.Minute

// AzureLBSpecInput is the input for AzureLBSpec.
type AzureLBSpecInput struct {
BootstrapClusterProxy framework.ClusterProxy
Expand Down Expand Up @@ -151,7 +153,7 @@ func AzureLBSpec(ctx context.Context, inputGetter func() AzureLBSpecInput) {
return err
}
return nil
}, retryableOperationTimeout, retryableOperationSleepBetweenRetries).Should(Succeed())
}, waitForLBCreateTimeout, retryableOperationSleepBetweenRetries).Should(Succeed())
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at the logs Mar 10 19:25:52.827: INFO: waiting for service default/web-windowsarn1ku-elb to be available it looks like the flake is actually at the step above WaitForServiceAvailable https://github.com/kubernetes-sigs/cluster-api-provider-azure/pull/2162/files#diff-b1cadbd362567a300bb6df435abeafb8f91e45cfbca237a17b37253120d87aa5R143


    �[91mFull Stack Trace�[0m
    sigs.k8s.io/cluster-api-provider-azure/test/e2e.WaitForServiceAvailable({0x25dc9c0, 0xc0000560d0}, {{0x259ebe0, 0xc000bb4e20}, 0xc000d84780, 0xc000bd6900}, {0xc00047e760, 0x2, 0x2})
    	/home/prow/go/src/sigs.k8s.io/cluster-api-provider-azure/test/e2e/helpers.go:262 +0x49

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right.

This is super confusing:

sigs.k8s.io/cluster-api-provider-azure/test/e2e.AzureLBSpec({0x25dc9c0, 0xc0000560d0}, 0x6)
    	/home/prow/go/src/sigs.k8s.io/cluster-api-provider-azure/test/e2e/azure_lb.go:220 +0x15aa

L229 on the branch that produced this flake are in a different flow:

In any event... it seems like we had a flake where the LB was not ready in 15 mins :(

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

was this after #2157 merged?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Prow does rebase before it runs jobs and looks like it did include the change since it has "waiting for the ilb service to be deleted":

�[1mSTEP�[0m: waiting for service default/web-windowsarn1ku-ilb to be available
Mar 10 19:24:09.332: INFO: waiting for service default/web-windowsarn1ku-ilb to be available
Mar 10 19:25:20.174: INFO: service default/web-windowsarn1ku-ilb is available, took 1m10.842282999s
�[1mSTEP�[0m: connecting to the internal LB service from a curl pod
Mar 10 19:25:20.279: INFO: starting to create a curl to ilb job
�[1mSTEP�[0m: waiting for job default/curl-to-ilb-jobx0ek1 to be complete
Mar 10 19:25:20.389: INFO: waiting for job default/curl-to-ilb-jobx0ek1 to be complete
Mar 10 19:25:30.598: INFO: job default/curl-to-ilb-jobx0ek1 is complete, took 10.209170132s
�[1mSTEP�[0m: deleting the ilb test resources
Mar 10 19:25:30.598: INFO: starting to delete the ilb service: web-windowsarn1ku-ilb
Mar 10 19:25:30.745: INFO: waiting for the ilb service to be deleted: web-windowsarn1ku-ilb
Mar 10 19:25:52.588: INFO: deleting the ilb job: curl-to-ilb-jobx0ek1
�[1mSTEP�[0m: creating an external Load Balancer service
Mar 10 19:25:52.697: INFO: starting to create an external Load Balancer service
�[1mSTEP�[0m: waiting for service default/web-windowsarn1ku-elb to be available
Mar 10 19:25:52.827: INFO: waiting for service default/web-windowsarn1ku-elb to be available

I would look at cloud provider logs to try and figure out what happened

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like some sort of transient error happened:

E0310 19:40:56.433186       1 azure_backoff.go:332] InterfacesClient.CreateOrUpdate(capz-e2e-42b9n-nic) failed: Retriable: true, RetryAfter: 0s, HTTPStatusCode: -1, RawError: Future#WaitForCompletion: context has been cancelled: StatusCode=200 -- Original Error: context deadline exceeded
E0310 19:40:56.433265       1 azure_loadbalancer.go:185] reconcileLoadBalancer(default/web-windowsarn1ku-elb) failed: Retriable: true, RetryAfter: 0s, HTTPStatusCode: -1, RawError: Future#WaitForCompletion: context has been cancelled: StatusCode=200 -- Original Error: context deadline exceeded
I0310 19:40:56.433304       1 controller.go:848] Finished syncing service "default/web-windowsarn1ku-elb" (15m3.642610408s)
E0310 19:40:56.433327       1 controller.go:310] error processing service default/web-windowsarn1ku-elb (will retry): failed to ensure load balancer: Retriable: true, RetryAfter: 0s, HTTPStatusCode: -1, RawError: Future#WaitForCompletion: context has been cancelled: StatusCode=200 -- Original Error: context deadline exceeded
I0310 19:40:56.433911       1 event.go:291] "Event occurred" object="default/web-windowsarn1ku-elb" kind="Service" apiVersion="v1" type="Warning" reason="SyncLoadBalancerFailed" message="Error syncing load balancer: failed to ensure load balancer: Retriable: true, RetryAfter: 0s, HTTPStatusCode: -1, RawError: Future#WaitForCompletion: context has been cancelled: StatusCode=200 -- Original Error: context deadline exceeded"

and then it eventually succeeded:

I0310 19:41:07.302039       1 event.go:291] "Event occurred" object="default/web-windowsarn1ku-elb" kind="Service" apiVersion="v1" type="Normal" reason="EnsuredLoadBalancer" message="Ensured load balancer"

19:41 is over 15 minutes from 19:25 which explains the timeout.

https://storage.googleapis.com/kubernetes-jenkins/pr-logs/pull/kubernetes-sigs_cluster-api-provider-azure/2160/pull-cluster-api-provider-azure-e2e/1501991622571724800/artifacts/clusters/capz-e2e-7ehh4v-ha/kube-system/kube-controller-manager-capz-e2e-7ehh4v-ha-control-plane-bk7q7/kube-controller-manager.log

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was just coming to the same conclusion

I0310 19:25:52.790705 1 controller.go:403] Ensuring load balancer for service default/web-windowsarn1ku-elb
...
I0310 19:41:07.440858 1 controller.go:848] Finished syncing service "default/web-windowsarn1ku-elb" (6.007310418s)

ilbIP := extractServiceIp(svc)

ilbJob := job.CreateCurlJobResourceSpec("curl-to-ilb-job", ilbIP)
Expand Down Expand Up @@ -222,7 +224,7 @@ func AzureLBSpec(ctx context.Context, inputGetter func() AzureLBSpecInput) {
return err
}
return nil
}, retryableOperationTimeout, retryableOperationSleepBetweenRetries).Should(Succeed())
}, waitForLBCreateTimeout, retryableOperationSleepBetweenRetries).Should(Succeed())

elbIP := extractServiceIp(svc)
Log("starting to create curl-to-elb job")
Expand Down