Conversation

@wking (Member) commented Nov 5, 2018

We usually lose our watch when bootkube goes down. From this job [1]:

$ oc project ci-op-9g1vhqtz
$ oc logs -f --timestamps e2e-aws -c setup | tee /tmp/setup.log
...
2018-11-03T04:19:55.121757935Z level=debug msg="added openshift-master-controllers.1563825412a4d77b: controller-manager-b5v49 became leader"
...
2018-11-03T04:20:14.679215171Z level=warning msg="RetryWatcher - getting event failed! Re-creating the watcher. Last RV: 3069"
2018-11-03T04:20:16.539967372Z level=debug msg="added bootstrap-complete: cluster bootstrapping has completed"
2018-11-03T04:20:16.540030121Z level=info msg="Destroying the bootstrap resources..."
...

And simultaneously:

$ ssh -i libra.pem [email protected] journalctl -f | tee /tmp/bootstrap.log
...
Nov 03 04:20:14 ip-10-0-10-86 bootkube.sh[1033]: All self-hosted control plane components successfully started
Nov 03 04:20:14 ip-10-0-10-86 bootkube.sh[1033]: Tearing down temporary bootstrap control plane...
Nov 03 04:20:15 ip-10-0-10-86 hyperkube[840]: E1103 04:20:15.968877     840 kuberuntime_container.go:65] Can't make a ref to pod "bootstrap-cluster-version-operator-ip-10-0-10-86_openshift-cluster-version(99ccfef8309f84bf88a0ca4a277097ac)", container cluster-version-operator: selfLink was empty, can't make reference
Nov 03 04:20:15 ip-10-0-10-86 hyperkube[840]: E1103 04:20:15.975624     840 kuberuntime_container.go:65] Can't make a ref to pod "bootstrap-kube-apiserver-ip-10-0-10-86_kube-system(427a4a342e137b5a9bb39a0feff24625)", container kube-apiserver: selfLink was empty, can't make reference
Nov 03 04:20:16 ip-10-0-10-86 hyperkube[840]: W1103 04:20:16.002990     840 pod_container_deletor.go:75] Container "0c647bcd6317ac2e06b625c44151aa6a3487aa1c47c5f1468213756f9a48ef91" not found in pod's containers
Nov 03 04:20:16 ip-10-0-10-86 hyperkube[840]: W1103 04:20:16.005146     840 pod_container_deletor.go:75] Container "bc47233151f1c0afaaee9e7abcfec9a515fe0a720ed2251fd7a51602c59060c5" not found in pod's containers
Nov 03 04:20:16 ip-10-0-10-86 systemd[1]: progress.service holdoff time over, scheduling restart.
Nov 03 04:20:16 ip-10-0-10-86 systemd[1]: Starting Report the completion of the cluster bootstrap process...
Nov 03 04:20:16 ip-10-0-10-86 systemd[1]: Started Report the completion of the cluster bootstrap process.
Nov 03 04:20:16 ip-10-0-10-86 report-progress.sh[6828]: Reporting install progress...
Nov 03 04:20:16 ip-10-0-10-86 report-progress.sh[6828]: event/bootstrap-complete created
Nov 03 04:20:16 ip-10-0-10-86 hyperkube[840]: I1103 04:20:16.719526     840 reconciler.go:181] operationExecutor.UnmountVolume started for volume "secrets" (UniqueName: "kubernetes.io/host-path/427a4a342e137b5a9bb39a0feff24625-secrets") pod "427a4a342e137b5a9bb39a0feff24625" (UID: "427a4a342e137b5a9bb39a0feff24625")
...

The resourceVersion-based watch will usually allow the re-created watcher to pick up where its predecessor left off [2]. But in at least some cases, that doesn't seem to be happening [3]:

2018/11/02 23:30:00 Running pod e2e-aws
2018/11/02 23:48:00 Container test in pod e2e-aws completed successfully
2018/11/02 23:51:52 Container teardown in pod e2e-aws completed successfully
2018/11/03 00:08:33 Copying artifacts from e2e-aws into /logs/artifacts/e2e-aws
level=debug msg="Fetching \"Terraform Variables\"..."
...
level=debug msg="added openshift-master-controllers.1563734e367132e0: controller-manager-xlw62 became leader"
level=warning msg="RetryWatcher - getting event failed! Re-creating the watcher. Last RV: 3288"
level=fatal msg="Error executing openshift-install: waiting for bootstrap-complete: timed out waiting for the condition"
2018/11/03 00:08:33 Container setup in pod e2e-aws failed, exit code 1, reason Error

That log is unfortunately missing timestamps for the setup container, but in the successful logs from ci-op-9g1vhqtz you can see the controller-manager becoming leader around 20 seconds before bootstrap-complete. So the bootstrap-complete event probably fired around the time the watcher dropped, which was probably well before the setup container timed out (~38 minutes after it was launched). The timeout itself was probably the re-created watch hanging until it hit the 30-minute eventContext deadline.
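
For concreteness, the resume-from-resourceVersion pattern the retrying watcher relies on looks roughly like the sketch below. This is a minimal illustration against a recent client-go, not the installer's actual code; the kube-system namespace, the 30-minute deadline, the kubeconfig path, and the function names are assumptions drawn from the logs above.

// watch_sketch.go: resume an event watch from the last observed
// resourceVersion under an overall deadline.  A rough sketch, not the
// installer's real implementation.
package main

import (
	"context"
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func waitForBootstrapComplete(kubeconfig string) error {
	config, err := clientcmd.BuildConfigFromFlags("", kubeconfig)
	if err != nil {
		return err
	}
	client, err := kubernetes.NewForConfig(config)
	if err != nil {
		return err
	}

	// Overall deadline, mirroring the 30-minute eventContext mentioned above.
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Minute)
	defer cancel()

	lastRV := "" // resourceVersion of the last event we observed

	for {
		// (Re-)create the watch.  Passing the last resourceVersion lets the
		// new watch pick up exactly where the broken one left off.
		w, err := client.CoreV1().Events("kube-system").Watch(ctx, metav1.ListOptions{
			ResourceVersion: lastRV,
		})
		if err != nil {
			if ctx.Err() != nil {
				return fmt.Errorf("waiting for bootstrap-complete: %w", ctx.Err())
			}
			time.Sleep(2 * time.Second) // back off, then retry the watch
			continue
		}

		for watchEvent := range w.ResultChan() {
			event, ok := watchEvent.Object.(*corev1.Event)
			if !ok {
				continue
			}
			lastRV = event.ResourceVersion
			if event.Name == "bootstrap-complete" {
				w.Stop()
				return nil
			}
		}
		// ResultChan closed: the watch broke (e.g. the bootstrap API server
		// went away); loop and re-create it from lastRV.
	}
}

func main() {
	// "/path/to/kubeconfig" is a placeholder for the cluster's admin kubeconfig.
	if err := waitForBootstrapComplete("/path/to/kubeconfig"); err != nil {
		fmt.Println(err)
	}
}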

I think what's happening is something like:

  1. The pods that bootkube is waiting for come up.
  2. Bootkube tears itself down.
  3. Our initial watch breaks.
  4. The API becomes unstable.
  5. Watch re-connects attempted at this point hang forever.
  6. The API stabilizes around the production control plane.
  7. Watch re-connects attempted at this point succeed and pick up where the broken watch left off.

The instability is probably due to the load balancer, because a few rounds of health checks [4] are needed before the target enters deregistration [5]. A client that connects during that window and happens to be routed [6] to the bootstrap node after its bootkube goes down may hang forever. With HTTP/1.1 connections, this sort of hang might hit TLSHandshakeTimeout [7] (which is 10 seconds for DefaultTransport). But for HTTP/2 connections the watch re-connect might reuse an existing connection and therefore not need to renegotiate TLS.
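
If we wanted the re-connect to fail fast instead of hanging, it would probably take something like the transport settings below: explicit dial, TLS, and response-header timeouts, plus dropping pooled connections before retrying so the retry cannot silently reuse a connection into the dying bootstrap node. A rough sketch, not the installer's configuration; the timeout values are assumptions.

// A sketch of client-side transport settings that keep a watch re-connect
// from hanging silently.
package main

import (
	"net"
	"net/http"
	"time"
)

func newReconnectFriendlyTransport() *http.Transport {
	return &http.Transport{
		// Fail fast when the bootstrap node stops answering new connections.
		DialContext: (&net.Dialer{
			Timeout:   10 * time.Second,
			KeepAlive: 30 * time.Second,
		}).DialContext,
		TLSHandshakeTimeout: 10 * time.Second, // same default as http.DefaultTransport
		// Bound the wait for response headers.  A watch streams its body, so
		// this only covers the initial exchange with the API server.
		ResponseHeaderTimeout: 30 * time.Second,
	}
}

func main() {
	transport := newReconnectFriendlyTransport()
	client := &http.Client{Transport: transport}

	// Before re-creating a broken watch, drop pooled connections so the retry
	// performs a fresh dial (and TLS handshake) instead of reusing an idle
	// connection into the dying bootstrap node.
	transport.CloseIdleConnections()

	_ = client // the real code would issue the watch request with this client
}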

I haven't been able to work up a reliable way to re-connect without hanging (#606), so this commit punts and makes failed watches acceptable errors. We'll now exit zero regardless of whether the watch succeeds, although we only auto-delete the bootstrap node when the watch succeeds. This should buy us some time to work out reliable re-connects.
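
In other words, the flow after this commit is roughly the sketch below: warn on a failed watch, skip the automatic teardown, and still exit zero. The function names, the placeholder kubeconfig path, and the stub bodies are illustrative, not the installer's real identifiers.

// A sketch of the behaviour described above: a failed or timed-out watch
// becomes a warning rather than a fatal error, and the bootstrap node is only
// auto-deleted when the watch actually saw bootstrap-complete.
package main

import (
	"errors"

	"github.com/sirupsen/logrus"
)

// destroyBootstrap stands in for the installer's bootstrap teardown.
func destroyBootstrap() error {
	return errors.New("teardown not implemented in this sketch")
}

// waitForBootstrapComplete stands in for the event watch from the earlier sketch.
func waitForBootstrapComplete(kubeconfig string) error {
	return errors.New("watch not implemented in this sketch")
}

func main() {
	if err := waitForBootstrapComplete("/path/to/kubeconfig"); err != nil {
		// Previously fatal; now we warn, skip the automatic teardown, and
		// still exit zero so a hung watch no longer fails the install.
		logrus.Warnf("waiting for bootstrap-complete: %v", err)
		return
	}

	logrus.Info("Destroying the bootstrap resources...")
	if err := destroyBootstrap(); err != nil {
		logrus.Fatalf("destroying bootstrap resources: %v", err)
	}
}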

[1]: https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/openshift_installer/595/pull-ci-openshift-installer-master-e2e-aws/1173
[2]: https://github.com/kubernetes/community/blob/master/contributors/devel/api-conventions.md#concurrency-control-and-consistency
[3]: https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_cluster-version-operator/45/pull-ci-openshift-cluster-version-operator-master-e2e-aws/58/build-log.txt
[4]: https://docs.aws.amazon.com/elasticloadbalancing/latest/network/target-group-health-checks.html
[5]: https://docs.aws.amazon.com/elasticloadbalancing/latest/network/load-balancer-target-groups.html#deregistration-delay
[6]: https://docs.aws.amazon.com/elasticloadbalancing/latest/userguide/how-elastic-load-balancing-works.html#routing-algorithm
[7]: https://golang.org/pkg/net/http/#Transport.TLSHandshakeTimeout
@openshift-ci-robot added the size/XS label on Nov 5, 2018
@openshift-ci-robot added the approved label on Nov 5, 2018
@crawford (Contributor) commented Nov 5, 2018

/lgtm

@openshift-ci-robot added the lgtm label on Nov 5, 2018
@openshift-ci-robot (Contributor) commented

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: crawford, wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@wking (Member, Author) commented Nov 5, 2018

images:

2018/11/05 22:29:52 Copied 0.87Mi of artifacts from release-latest to /logs/artifacts/release-latest
rpc error: code = 2 desc = oci runtime error: exec failed: cannot exec a container that has run and stopped

2018/11/05 22:29:52 error: unable to signal to artifacts container to terminate in pod release-latest, triggering deletion: could not run remote command: command terminated with exit code 126
panic: runtime error: index out of range

goroutine 107 [running]:
github.com/openshift/ci-operator/pkg/steps.waitForPodCompletion(0x1285700, 0xc4205e6980, 0xc4205c2100, 0xe, 0x1275cc0, 0xc420340550, 0x7ffe2db3dee7, 0xf, 0x0, 0x444d7f)
	/go/src/github.com/openshift/ci-operator/pkg/steps/template.go:572 +0x3f7
github.com/openshift/ci-operator/pkg/steps.(*podStep).Run(0xc4205bc0e0, 0x1275a40, 0xc420053a80, 0x0, 0x5, 0xc42038d200)
	/go/src/github.com/openshift/ci-operator/pkg/steps/test.go:120 +0xc4f
github.com/openshift/ci-operator/pkg/steps/release.(*assembleReleaseStep).Run(0xc42036fae0, 0x1275a40, 0xc420053a80, 0xbef04ee4e19f7900, 0x30177c306f, 0x1974400)
	/go/src/github.com/openshift/ci-operator/pkg/steps/release/create_release.go:83 +0x773
github.com/openshift/ci-operator/pkg/steps.runStep(0x1275a40, 0xc420053a80, 0xc4202cff20, 0xc4203e0120, 0x0)
	/go/src/github.com/openshift/ci-operator/pkg/steps/run.go:105 +0x92
created by github.com/openshift/ci-operator/pkg/steps.Run
	/go/src/github.com/openshift/ci-operator/pkg/steps/run.go:71 +0x3a1

/retest

@openshift-merge-robot merged commit 573af51 into openshift:master Nov 5, 2018
@wking deleted the allow-hung-watch branch November 7, 2018 17:23
wking added a commit to wking/openshift-installer that referenced this pull request Nov 27, 2018
This reverts commit 6dc1bf6, openshift#615.

109531f (cmd/openshift-install/create: Retry watch connections,
2018-11-02, openshift#606) made the watch re-connects reliable, so make watch
timeouts fatal again.  This avoids confusing users by showing "Install
complete!" messages when they may actually have a hung bootstrap
process.
flaper87 pushed a commit to flaper87/installer that referenced this pull request Nov 29, 2018
This reverts commit 6dc1bf6, openshift#615.

109531f (cmd/openshift-install/create: Retry watch connections,
2018-11-02, openshift#606) made the watch re-connects reliable, so make watch
timeouts fatal again.  This avoids confusing users by showing "Install
complete!" messages when they may actually have a hung bootstrap
process.