cmd/openshift-install/create: Allow hung watch #615
Merged
Conversation
We usually lose our watch when bootkube goes down. From job [1]:

```
$ oc project ci-op-9g1vhqtz
$ oc logs -f --timestamps e2e-aws -c setup | tee /tmp/setup.log
...
2018-11-03T04:19:55.121757935Z level=debug msg="added openshift-master-controllers.1563825412a4d77b: controller-manager-b5v49 became leader"
...
2018-11-03T04:20:14.679215171Z level=warning msg="RetryWatcher - getting event failed! Re-creating the watcher. Last RV: 3069"
2018-11-03T04:20:16.539967372Z level=debug msg="added bootstrap-complete: cluster bootstrapping has completed"
2018-11-03T04:20:16.540030121Z level=info msg="Destroying the bootstrap resources..."
...
```

And simultaneously:

```
$ ssh -i libra.pem [email protected] journalctl -f | tee /tmp/bootstrap.log
...
Nov 03 04:20:14 ip-10-0-10-86 bootkube.sh[1033]: All self-hosted control plane components successfully started
Nov 03 04:20:14 ip-10-0-10-86 bootkube.sh[1033]: Tearing down temporary bootstrap control plane...
Nov 03 04:20:15 ip-10-0-10-86 hyperkube[840]: E1103 04:20:15.968877 840 kuberuntime_container.go:65] Can't make a ref to pod "bootstrap-cluster-version-operator-ip-10-0-10-86_openshift-cluster-version(99ccfef8309f84bf88a0ca4a277097ac)", container cluster-version-operator: selfLink was empty, can't make reference
Nov 03 04:20:15 ip-10-0-10-86 hyperkube[840]: E1103 04:20:15.975624 840 kuberuntime_container.go:65] Can't make a ref to pod "bootstrap-kube-apiserver-ip-10-0-10-86_kube-system(427a4a342e137b5a9bb39a0feff24625)", container kube-apiserver: selfLink was empty, can't make reference
Nov 03 04:20:16 ip-10-0-10-86 hyperkube[840]: W1103 04:20:16.002990 840 pod_container_deletor.go:75] Container "0c647bcd6317ac2e06b625c44151aa6a3487aa1c47c5f1468213756f9a48ef91" not found in pod's containers
Nov 03 04:20:16 ip-10-0-10-86 hyperkube[840]: W1103 04:20:16.005146 840 pod_container_deletor.go:75] Container "bc47233151f1c0afaaee9e7abcfec9a515fe0a720ed2251fd7a51602c59060c5" not found in pod's containers
Nov 03 04:20:16 ip-10-0-10-86 systemd[1]: progress.service holdoff time over, scheduling restart.
Nov 03 04:20:16 ip-10-0-10-86 systemd[1]: Starting Report the completion of the cluster bootstrap process...
Nov 03 04:20:16 ip-10-0-10-86 systemd[1]: Started Report the completion of the cluster bootstrap process.
Nov 03 04:20:16 ip-10-0-10-86 report-progress.sh[6828]: Reporting install progress...
Nov 03 04:20:16 ip-10-0-10-86 report-progress.sh[6828]: event/bootstrap-complete created
Nov 03 04:20:16 ip-10-0-10-86 hyperkube[840]: I1103 04:20:16.719526 840 reconciler.go:181] operationExecutor.UnmountVolume started for volume "secrets" (UniqueName: "kubernetes.io/host-path/427a4a342e137b5a9bb39a0feff24625-secrets") pod "427a4a342e137b5a9bb39a0feff24625" (UID: "427a4a342e137b5a9bb39a0feff24625")
...
```

The `resourceVersion` watch will usually allow the re-created watcher to pick up where its predecessor left off [2]. But in at least some cases, that doesn't seem to be happening [3]:

```
2018/11/02 23:30:00 Running pod e2e-aws
2018/11/02 23:48:00 Container test in pod e2e-aws completed successfully
2018/11/02 23:51:52 Container teardown in pod e2e-aws completed successfully
2018/11/03 00:08:33 Copying artifacts from e2e-aws into /logs/artifacts/e2e-aws
level=debug msg="Fetching \"Terraform Variables\"..."
...
level=debug msg="added openshift-master-controllers.1563734e367132e0: controller-manager-xlw62 became leader"
level=warning msg="RetryWatcher - getting event failed! Re-creating the watcher. Last RV: 3288"
level=fatal msg="Error executing openshift-install: waiting for bootstrap-complete: timed out waiting for the condition"
2018/11/03 00:08:33 Container setup in pod e2e-aws failed, exit code 1, reason Error
```

That's unfortunately missing timestamps for the setup logs, but in the successful logs from ci-op-9g1vhqtz, you can see the controller-manager becoming a leader around 20 seconds before `bootstrap-complete`. That means the `bootstrap-complete` event probably fired around when the watcher dropped, which was probably well before the setup container timed out (~38 minutes after it was launched). The setup container timing out was probably the watch re-connect hitting the 30-minute `eventContext` timeout.

I think what's happening is something like:

1. The pods that bootkube is waiting for come up.
2. Bootkube tears itself down.
3. Our initial watch breaks.
4. The API becomes unstable.
5. Watch reconnects here hang forever.
6. The API stabilizes around the production control plane.
7. Watch reconnects here succeed and pick up where the broken watch left off.

The instability is probably due to the load balancer, because a few rounds of health checks [4] are needed before the target enters deregistration [5]. A client connecting at that point that happens to be routed [6] to the bootstrap node after its bootkube goes down may hang forever. With HTTP/1.1 connections, this sort of hang might hit `TLSHandshakeTimeout` [7] (which is 10 seconds for `DefaultTransport`). But for HTTP/2 connections, the watch re-connect might reuse an existing connection and therefore not need to renegotiate TLS.

I haven't been able to work up a reliable way to re-connect without hanging, so this commit punts and makes failed watches acceptable errors. We'll now exit zero regardless of whether the watch succeeds, although we only auto-delete the bootstrap node when the watch succeeds. This should buy us some time to work out reliable re-connects.

[1]: https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/openshift_installer/595/pull-ci-openshift-installer-master-e2e-aws/1173
[2]: https://github.com/kubernetes/community/blob/master/contributors/devel/api-conventions.md#concurrency-control-and-consistency
[3]: https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_cluster-version-operator/45/pull-ci-openshift-cluster-version-operator-master-e2e-aws/58/build-log.txt
[4]: https://docs.aws.amazon.com/elasticloadbalancing/latest/network/target-group-health-checks.html
[5]: https://docs.aws.amazon.com/elasticloadbalancing/latest/network/load-balancer-target-groups.html#deregistration-delay
[6]: https://docs.aws.amazon.com/elasticloadbalancing/latest/userguide/how-elastic-load-balancing-works.html#routing-algorithm
[7]: https://golang.org/pkg/net/http/#Transport.TLSHandshakeTimeout
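For context on the resumption behavior described above, here is a minimal sketch of a watch loop that re-creates its watch from the last observed `resourceVersion`, so a dropped connection picks up where its predecessor left off [2]. This is illustrative only, not the installer's actual code; the clientset construction, the `kube-system` namespace, and matching the `bootstrap-complete` event by name are assumptions based on the logs above.

```go
// Hypothetical sketch: resume an event watch from the last observed
// resourceVersion. Not the installer's real code.
package main

import (
	"context"
	"fmt"
	"os"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// waitForBootstrapComplete re-creates the watch whenever it drops, starting
// from the last resourceVersion seen so no events are missed in between.
func waitForBootstrapComplete(ctx context.Context, client kubernetes.Interface) error {
	lastRV := "" // empty string: start from the server's current state
	for {
		w, err := client.CoreV1().Events("kube-system").Watch(ctx, metav1.ListOptions{
			ResourceVersion: lastRV,
		})
		if err != nil {
			// Transient API trouble (e.g. the bootstrap control plane going
			// away): back off briefly and try to re-create the watch.
			select {
			case <-ctx.Done():
				return ctx.Err()
			case <-time.After(2 * time.Second):
				continue
			}
		}
		for ev := range w.ResultChan() {
			event, ok := ev.Object.(*corev1.Event)
			if !ok {
				continue
			}
			lastRV = event.ResourceVersion
			if event.Name == "bootstrap-complete" {
				w.Stop()
				return nil
			}
		}
		// ResultChan closed: the connection dropped. Loop and re-watch from lastRV.
	}
}

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", os.Getenv("KUBECONFIG"))
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	client, err := kubernetes.NewForConfig(config)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Minute)
	defer cancel()
	if err := waitForBootstrapComplete(ctx, client); err != nil {
		fmt.Fprintln(os.Stderr, "waiting for bootstrap-complete:", err)
		os.Exit(1)
	}
}
```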
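The HTTP/1.1 versus HTTP/2 distinction can be made concrete with a small, purely illustrative transport sketch. This is not what the commit does (the commit punts on re-connects entirely); it only shows why a fresh dial against a dead backend fails within `TLSHandshakeTimeout`, while a reused HTTP/2 connection never renegotiates TLS and can hang.

```go
package main

import (
	"net/http"
	"time"
)

// watchTransport mirrors http.DefaultTransport's 10-second TLS handshake
// timeout, which only helps when a request actually performs a new handshake.
var watchTransport = &http.Transport{
	TLSHandshakeTimeout: 10 * time.Second,
}

// freshClient drops any pooled (possibly half-dead) connections before a
// re-connect, so the next request dials and handshakes from scratch instead
// of reusing an existing connection to the torn-down bootstrap node. This is
// one conceivable mitigation, not the approach taken by this commit.
func freshClient() *http.Client {
	watchTransport.CloseIdleConnections()
	return &http.Client{Transport: watchTransport}
}

func main() {
	_ = freshClient() // placeholder usage; a real watch re-connect would use this client
}
```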
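Finally, the "acceptable error" behavior this commit describes (warn, exit zero, and only auto-delete the bootstrap node when the watch actually succeeded) looks roughly like the following. The function names and wiring are hypothetical stand-ins, not the installer's real helpers; only the 30-minute timeout and the warn-then-exit-zero flow come from the description above.

```go
package main

import (
	"context"
	"log"
	"time"
)

// waitForBootstrapComplete and destroyBootstrapResources are hypothetical
// stand-ins for the installer's real helpers (see the watch sketch above for
// one way the wait could be implemented).
func waitForBootstrapComplete(ctx context.Context) error { return nil }
func destroyBootstrapResources() error                   { return nil }

func main() {
	// The bootstrap-complete event gets up to 30 minutes (the eventContext
	// timeout mentioned above).
	eventCtx, cancel := context.WithTimeout(context.Background(), 30*time.Minute)
	defer cancel()

	if err := waitForBootstrapComplete(eventCtx); err != nil {
		// A hung or failed watch is now an acceptable error: warn and exit
		// zero, but leave the bootstrap node in place rather than tearing it
		// down on an unconfirmed install.
		log.Printf("warning: waiting for bootstrap-complete: %v", err)
		return // exit status 0 even though the watch failed
	}

	// Only auto-delete the bootstrap resources when the watch succeeded.
	if err := destroyBootstrapResources(); err != nil {
		log.Fatalf("destroying bootstrap resources: %v", err)
	}
	log.Print("Install complete!")
}
```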
Contributor

/lgtm
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: crawford, wking

The full list of commands accepted by this bot can be found here. The pull request process is described here.
Member
Author
/retest
wking added a commit to wking/openshift-installer that referenced this pull request on Nov 27, 2018:
This reverts commit 6dc1bf6, openshift#615. 109531f (cmd/openshift-install/create: Retry watch connections, 2018-11-02, openshift#606) made the watch re-connects reliable, so make watch timeouts fatal again. This avoids confusing users by showing "Install complete!" messages when they may actually have a hung bootstrap process.
flaper87 pushed a commit to flaper87/installer that referenced this pull request on Nov 29, 2018, with the same revert commit message as above.