Conversation

@wking (Member) commented Nov 5, 2018

We usually lose our watch when bootkube goes down. From this job [1]:

$ oc project ci-op-9g1vhqtz
$ oc logs -f --timestamps e2e-aws -c setup | tee /tmp/setup.log
...
2018-11-03T04:19:55.121757935Z level=debug msg="added openshift-master-controllers.1563825412a4d77b: controller-manager-b5v49 became leader"
...
2018-11-03T04:20:14.679215171Z level=warning msg="RetryWatcher - getting event failed! Re-creating the watcher. Last RV: 3069"
2018-11-03T04:20:16.539967372Z level=debug msg="added bootstrap-complete: cluster bootstrapping has completed"
2018-11-03T04:20:16.540030121Z level=info msg="Destroying the bootstrap resources..."
...

And simultaneously:

$ ssh -i libra.pem [email protected] journalctl -f | tee /tmp/bootstrap.log
...
Nov 03 04:20:14 ip-10-0-10-86 bootkube.sh[1033]: All self-hosted control plane components successfully started
Nov 03 04:20:14 ip-10-0-10-86 bootkube.sh[1033]: Tearing down temporary bootstrap control plane...
Nov 03 04:20:15 ip-10-0-10-86 hyperkube[840]: E1103 04:20:15.968877     840 kuberuntime_container.go:65] Can't make a ref to pod "bootstrap-cluster-version-operator-ip-10-0-10-86_openshift-cluster-version(99ccfef8309f84bf88a0ca4a277097ac)", container cluster-version-operator: selfLink was empty, can't make reference
Nov 03 04:20:15 ip-10-0-10-86 hyperkube[840]: E1103 04:20:15.975624     840 kuberuntime_container.go:65] Can't make a ref to pod "bootstrap-kube-apiserver-ip-10-0-10-86_kube-system(427a4a342e137b5a9bb39a0feff24625)", container kube-apiserver: selfLink was empty, can't make reference
Nov 03 04:20:16 ip-10-0-10-86 hyperkube[840]: W1103 04:20:16.002990     840 pod_container_deletor.go:75] Container "0c647bcd6317ac2e06b625c44151aa6a3487aa1c47c5f1468213756f9a48ef91" not found in pod's containers
Nov 03 04:20:16 ip-10-0-10-86 hyperkube[840]: W1103 04:20:16.005146     840 pod_container_deletor.go:75] Container "bc47233151f1c0afaaee9e7abcfec9a515fe0a720ed2251fd7a51602c59060c5" not found in pod's containers
Nov 03 04:20:16 ip-10-0-10-86 systemd[1]: progress.service holdoff time over, scheduling restart.
Nov 03 04:20:16 ip-10-0-10-86 systemd[1]: Starting Report the completion of the cluster bootstrap process...
Nov 03 04:20:16 ip-10-0-10-86 systemd[1]: Started Report the completion of the cluster bootstrap process.
Nov 03 04:20:16 ip-10-0-10-86 report-progress.sh[6828]: Reporting install progress...
Nov 03 04:20:16 ip-10-0-10-86 report-progress.sh[6828]: event/bootstrap-complete created
Nov 03 04:20:16 ip-10-0-10-86 hyperkube[840]: I1103 04:20:16.719526     840 reconciler.go:181] operationExecutor.UnmountVolume started for volume "secrets" (UniqueName: "kubernetes.io/host-path/427a4a342e137b5a9bb39a0feff24625-secrets") pod "427a4a342e137b5a9bb39a0feff24625" (UID: "427a4a342e137b5a9bb39a0feff24625")
...

The resourceVersion-based watch will usually allow the re-created watcher to pick up where its predecessor left off [2]. But in at least some cases, that doesn't seem to be happening [3]:

2018/11/02 23:30:00 Running pod e2e-aws
2018/11/02 23:48:00 Container test in pod e2e-aws completed successfully
2018/11/02 23:51:52 Container teardown in pod e2e-aws completed successfully
2018/11/03 00:08:33 Copying artifacts from e2e-aws into /logs/artifacts/e2e-aws
level=debug msg="Fetching \"Terraform Variables\"..."
...
level=debug msg="added openshift-master-controllers.1563734e367132e0: controller-manager-xlw62 became leader"
level=warning msg="RetryWatcher - getting event failed! Re-creating the watcher. Last RV: 3288"
level=fatal msg="Error executing openshift-install: waiting for bootstrap-complete: timed out waiting for the condition"
2018/11/03 00:08:33 Container setup in pod e2e-aws failed, exit code 1, reason Error

That log is unfortunately missing timestamps for the setup container, but in the successful logs from ci-op-9g1vhqtz you can see the controller-manager becoming leader around 20 seconds before bootstrap-complete. So the bootstrap-complete event probably fired around the time the watcher dropped, which was probably well before the setup container timed out (~38 minutes after it was launched). The timeout itself was probably the re-created watch hanging until it hit the 30-minute eventContext deadline.
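
For concreteness, the resume-from-resourceVersion pattern the retrying watcher relies on looks roughly like the sketch below. This is a minimal illustration against a recent client-go, not the installer's actual code; the kube-system namespace, the 30-minute deadline, the kubeconfig path, and the function names are assumptions drawn from the logs above.

// watch_sketch.go: resume an event watch from the last observed
// resourceVersion under an overall deadline.  A rough sketch, not the
// installer's real implementation.
package main

import (
	"context"
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func waitForBootstrapComplete(kubeconfig string) error {
	config, err := clientcmd.BuildConfigFromFlags("", kubeconfig)
	if err != nil {
		return err
	}
	client, err := kubernetes.NewForConfig(config)
	if err != nil {
		return err
	}

	// Overall deadline, mirroring the 30-minute eventContext mentioned above.
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Minute)
	defer cancel()

	lastRV := "" // resourceVersion of the last event we observed

	for {
		// (Re-)create the watch.  Passing the last resourceVersion lets the
		// new watch pick up exactly where the broken one left off.
		w, err := client.CoreV1().Events("kube-system").Watch(ctx, metav1.ListOptions{
			ResourceVersion: lastRV,
		})
		if err != nil {
			if ctx.Err() != nil {
				return fmt.Errorf("waiting for bootstrap-complete: %w", ctx.Err())
			}
			time.Sleep(2 * time.Second) // back off, then retry the watch
			continue
		}

		for watchEvent := range w.ResultChan() {
			event, ok := watchEvent.Object.(*corev1.Event)
			if !ok {
				continue
			}
			lastRV = event.ResourceVersion
			if event.Name == "bootstrap-complete" {
				w.Stop()
				return nil
			}
		}
		// ResultChan closed: the watch broke (e.g. the bootstrap API server
		// went away); loop and re-create it from lastRV.
	}
}

func main() {
	// "/path/to/kubeconfig" is a placeholder for the cluster's admin kubeconfig.
	if err := waitForBootstrapComplete("/path/to/kubeconfig"); err != nil {
		fmt.Println(err)
	}
}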

I think what's happening is something like:

  1. The pods that bootkube is waiting for come up.
  2. Bootkube tears itself down.
  3. Our initial watch breaks.
  4. The API becomes unstable.
  5. Watch re-connects attempted at this point hang forever.
  6. The API stabilizes around the production control plane.
  7. Watch re-connects attempted at this point succeed and pick up where the broken watch left off.

The instability is probably due to the load balancer, because a few rounds of health checks [4] are needed before the target enters deregistration [5]. A client that connects during that window and happens to be routed [6] to the bootstrap node after its bootkube goes down may hang forever. With HTTP/1.1 connections, this sort of hang might hit TLSHandshakeTimeout [7] (which is 10 seconds for DefaultTransport). But for HTTP/2 connections the watch re-connect might reuse an existing connection and therefore not need to renegotiate TLS.
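
If we wanted the re-connect to fail fast instead of hanging, it would probably take something like the transport settings below: explicit dial, TLS, and response-header timeouts, plus dropping pooled connections before retrying so the retry cannot silently reuse a connection into the dying bootstrap node. A rough sketch, not the installer's configuration; the timeout values are assumptions.

// A sketch of client-side transport settings that keep a watch re-connect
// from hanging silently.
package main

import (
	"net"
	"net/http"
	"time"
)

func newReconnectFriendlyTransport() *http.Transport {
	return &http.Transport{
		// Fail fast when the bootstrap node stops answering new connections.
		DialContext: (&net.Dialer{
			Timeout:   10 * time.Second,
			KeepAlive: 30 * time.Second,
		}).DialContext,
		TLSHandshakeTimeout: 10 * time.Second, // same default as http.DefaultTransport
		// Bound the wait for response headers.  A watch streams its body, so
		// this only covers the initial exchange with the API server.
		ResponseHeaderTimeout: 30 * time.Second,
	}
}

func main() {
	transport := newReconnectFriendlyTransport()
	client := &http.Client{Transport: transport}

	// Before re-creating a broken watch, drop pooled connections so the retry
	// performs a fresh dial (and TLS handshake) instead of reusing an idle
	// connection into the dying bootstrap node.
	transport.CloseIdleConnections()

	_ = client // the real code would issue the watch request with this client
}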

I haven't been able to work up a reliable way to re-connect without hanging (#606), so this commit punts and makes failed watches acceptable errors. We'll now exit zero regardless of whether the watch succeeds, although we only auto-delete the bootstrap node when the watch succeeds. This should buy us some time to work out reliable re-connects.
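
In other words, the flow after this commit is roughly the sketch below: warn on a failed watch, skip the automatic teardown, and still exit zero. The function names, the placeholder kubeconfig path, and the stub bodies are illustrative, not the installer's real identifiers.

// A sketch of the behaviour described above: a failed or timed-out watch
// becomes a warning rather than a fatal error, and the bootstrap node is only
// auto-deleted when the watch actually saw bootstrap-complete.
package main

import (
	"errors"

	"github.com/sirupsen/logrus"
)

// destroyBootstrap stands in for the installer's bootstrap teardown.
func destroyBootstrap() error {
	return errors.New("teardown not implemented in this sketch")
}

// waitForBootstrapComplete stands in for the event watch from the earlier sketch.
func waitForBootstrapComplete(kubeconfig string) error {
	return errors.New("watch not implemented in this sketch")
}

func main() {
	if err := waitForBootstrapComplete("/path/to/kubeconfig"); err != nil {
		// Previously fatal; now we warn, skip the automatic teardown, and
		// still exit zero so a hung watch no longer fails the install.
		logrus.Warnf("waiting for bootstrap-complete: %v", err)
		return
	}

	logrus.Info("Destroying the bootstrap resources...")
	if err := destroyBootstrap(); err != nil {
		logrus.Fatalf("destroying bootstrap resources: %v", err)
	}
}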

[1]: https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/openshift_installer/595/pull-ci-openshift-installer-master-e2e-aws/1173
[2]: https://github.com/kubernetes/community/blob/master/contributors/devel/api-conventions.md#concurrency-control-and-consistency
[3]: https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_cluster-version-operator/45/pull-ci-openshift-cluster-version-operator-master-e2e-aws/58/build-log.txt
[4]: https://docs.aws.amazon.com/elasticloadbalancing/latest/network/target-group-health-checks.html
[5]: https://docs.aws.amazon.com/elasticloadbalancing/latest/network/load-balancer-target-groups.html#deregistration-delay
[6]: https://docs.aws.amazon.com/elasticloadbalancing/latest/userguide/how-elastic-load-balancing-works.html#routing-algorithm
[7]: https://golang.org/pkg/net/http/#Transport.TLSHandshakeTimeout
@openshift-ci-robot added the size/XS label on Nov 5, 2018
@openshift-ci-robot added the approved label on Nov 5, 2018
@crawford (Contributor) commented Nov 5, 2018

/lgtm

@openshift-ci-robot added the lgtm label on Nov 5, 2018
@openshift-ci-robot (Contributor) commented

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: crawford, wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@wking (Member, Author) commented Nov 5, 2018

images:

2018/11/05 22:29:52 Copied 0.87Mi of artifacts from release-latest to /logs/artifacts/release-latest
rpc error: code = 2 desc = oci runtime error: exec failed: cannot exec a container that has run and stopped

2018/11/05 22:29:52 error: unable to signal to artifacts container to terminate in pod release-latest, triggering deletion: could not run remote command: command terminated with exit code 126
panic: runtime error: index out of range

goroutine 107 [running]:
github.com/openshift/ci-operator/pkg/steps.waitForPodCompletion(0x1285700, 0xc4205e6980, 0xc4205c2100, 0xe, 0x1275cc0, 0xc420340550, 0x7ffe2db3dee7, 0xf, 0x0, 0x444d7f)
	/go/src/github.com/openshift/ci-operator/pkg/steps/template.go:572 +0x3f7
github.com/openshift/ci-operator/pkg/steps.(*podStep).Run(0xc4205bc0e0, 0x1275a40, 0xc420053a80, 0x0, 0x5, 0xc42038d200)
	/go/src/github.com/openshift/ci-operator/pkg/steps/test.go:120 +0xc4f
github.com/openshift/ci-operator/pkg/steps/release.(*assembleReleaseStep).Run(0xc42036fae0, 0x1275a40, 0xc420053a80, 0xbef04ee4e19f7900, 0x30177c306f, 0x1974400)
	/go/src/github.com/openshift/ci-operator/pkg/steps/release/create_release.go:83 +0x773
github.com/openshift/ci-operator/pkg/steps.runStep(0x1275a40, 0xc420053a80, 0xc4202cff20, 0xc4203e0120, 0x0)
	/go/src/github.com/openshift/ci-operator/pkg/steps/run.go:105 +0x92
created by github.com/openshift/ci-operator/pkg/steps.Run
	/go/src/github.com/openshift/ci-operator/pkg/steps/run.go:71 +0x3a1

/retest

@openshift-merge-robot merged commit 573af51 into openshift:master Nov 5, 2018
@wking deleted the allow-hung-watch branch November 7, 2018 17:23
wking added a commit to wking/openshift-installer that referenced this pull request Nov 27, 2018
This reverts commit 6dc1bf6, openshift#615.

109531f (cmd/openshift-install/create: Retry watch connections,
2018-11-02, openshift#606) made the watch re-connects reliable, so make watch
timeouts fatal again.  This avoids confusing users by showing "Install
complete!" messages when they may actually have a hung bootstrap
process.
flaper87 pushed a commit to flaper87/installer that referenced this pull request Nov 29, 2018
This reverts commit 6dc1bf6, openshift#615.

109531f (cmd/openshift-install/create: Retry watch connections,
2018-11-02, openshift#606) made the watch re-connects reliable, so make watch
timeouts fatal again.  This avoids confusing users by showing "Install
complete!" messages when they may actually have a hung bootstrap
process.