-
Notifications
You must be signed in to change notification settings - Fork 2.1k
ci-operator/templates/openshift/installer/cluster-launch-installer-upi-e2e: Gather on bootstrap failure #4734
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ci-operator/templates/openshift/installer/cluster-launch-installer-upi-e2e: Gather on bootstrap failure #4734
Conversation
4e5f77c to
676133b
Compare
…i-e2e: Gather on bootstrap failure
Currently UPI bootstrap failures die with [1]:
time="2019-08-13T20:38:56Z" level=debug msg="Still waiting for the Kubernetes API: the server could not find the requested resource"
time="2019-08-13T20:39:12Z" level=info msg="Use the following commands to gather logs from the cluster"
time="2019-08-13T20:39:12Z" level=info msg="openshift-install gather bootstrap --help"
time="2019-08-13T20:39:12Z" level=fatal msg="waiting for Kubernetes API: context deadline exceeded"
but don't actually gather those recommended logs [2]. With this
commit, I've added a setup-script global GATHER_BOOTSTRAP_ARGS which
the various per-platform flows can populate as they create resources.
Then if the wait-for-bootstrap command dies and that variable is
non-empty, we run the gather to store the logs in the installer's
artifact directory.
We can't use:
--master ${CONTROL_PLANE_0_IP},${CONTROL_PLANE_1_IP},${CONTROL_PLANE_2_IP}
because the backing installer code [3] uses StringArrayVar [4], which
does not perform StringSliceVar's [5] comma-splitting.
The GATHER_BOOTSTRAP_ARGS approach is a bit of a cludge, because the
expansion in gather-bootstrap-and-fail is not quoted; relying instead
on a lack of shell-sensitive characters in the IP arguments. That's
likely fine in practice, but if we wanted to tighten it down we could
switch the script from sh to Bash and use an array variable. For now;
I'm punting that to future work.
There's also crufy Terraform business around this in the teardown
container, which I've left alone for now.
[1]: https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_release/4719/rehearse-4719-pull-ci-openshift-installer-master-e2e-aws-proxy/5/artifacts/e2e-aws-proxy/installer/.openshift_install.log
[2]: https://gcsweb-ci.svc.ci.openshift.org/gcs/origin-ci-test/pr-logs/pull/openshift_release/4719/rehearse-4719-pull-ci-openshift-installer-master-e2e-aws-proxy/5/artifacts/e2e-aws-proxy/installer/
[3]: https://github.com/openshift/installer/blob/8f972b45987a32cc91bc61c39a727e9a1224693d/cmd/openshift-install/gather.go#L71
[4]: https://godoc.org/github.com/spf13/pflag#FlagSet.StringArrayVar
[5]: https://godoc.org/github.com/spf13/pflag#FlagSet.StringSliceVar
676133b to
42f9edf
Compare
42f9edf to
3013859
Compare
|
Successful gather here using 42f9edfcb80b4d1386c3a617c2163b3e2feb315e: $ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_release/4734/rehearse-4734-pull-ci-openshift-installer-master-e2e-aws-upi/3/artifacts/e2e-aws-upi/installer/log-bundle-20190814033211.tar | tar -zt
./
./bootstrap/
./bootstrap/journals/
./bootstrap/journals/bootkube.log
./bootstrap/journals/openshift.log
./bootstrap/journals/kubelet.log
./bootstrap/journals/crio.log
./bootstrap/journals/approve-csr.log
./bootstrap/containers/
./bootstrap/containers/kube-apiserver-insecure-readyz-ef859e6d6c9d720b8bf20b69529cdcc4bcb1fe2e6ea397d801982417d76a8f3d.log
./bootstrap/containers/kube-apiserver-insecure-readyz-ef859e6d6c9d720b8bf20b69529cdcc4bcb1fe2e6ea397d801982417d76a8f3d.inspect
...
./control-plane/
./control-plane/10.0.52.86/
./control-plane/10.0.52.86/containers/
./control-plane/10.0.52.86/containers/pruner-8da8ad5a9be44b1b0fc8a64a62960a7c60c1697530e69b5c547e6104428425a3.log
./control-plane/10.0.52.86/containers/pruner-8da8ad5a9be44b1b0fc8a64a62960a7c60c1697530e69b5c547e6104428425a3.inspect
...
./resources/openapi.json.gzI've dropped the WIP from the PR subject and the testing commit from the PR, and I think we're ready to merge :). CC @abhinavdahiya for review, although anyone dropping an |
|
/test pj-rehearse |
2 similar comments
|
/test pj-rehearse |
|
/test pj-rehearse |
|
@abhinavdahiya: can you take a look at this? It's hard to debug the black-holed-proxy work in #4719 without gathered bootstrap logs when we hang waiting for the Kubernetes API. |
| GATHER_BOOTSTRAP_ARGS= | ||
| function gather_bootstrap_and_fail() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why gather and fail. I don't think it's a great collection..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We only gather when installation fails. By failing in the function regardless of gather success, I don't need to figure out how a || b || c groups in shell ;)
| GATHER_BOOTSTRAP_ARGS= | ||
| function gather_bootstrap_and_fail() { | ||
| if test -n "${GATHER_BOOTSTRAP_ARGS}"; then |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not pass argument to the function?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could. But we need a variable to gather these as you add a bootstrap machine and control planes machines. So inheritance seemed like a convenient way to pass them in. I can pass them explicitly if you prefer.
|
bump |
|
/lgtm |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: abhinavdahiya, wking The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
/retest Please review the full test history for this PR and help us cut down flakes. |
|
@wking: Updated the following 2 configmaps:
DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
|
@wking: The following tests failed, say
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
This is what we do in CI since openshift/release@3013859bdc (ci-operator/templates/openshift/installer/cluster-launch-installer-upi-e2e: Gather on bootstrap failure, 2019-08-13, openshift/release#4734), so we know it works. Comma-delimited machines won't work, as described in that release commit. I'd be surprised if space-delimited worked.
This is what we do in CI since openshift/release@3013859bdc (ci-operator/templates/openshift/installer/cluster-launch-installer-upi-e2e: Gather on bootstrap failure, 2019-08-13, openshift/release#4734), so we know it works. Comma-delimited machines won't work, as described in that release commit. I'd be surprised if space-delimited worked.
This is what we do in CI since openshift/release@3013859bdc (ci-operator/templates/openshift/installer/cluster-launch-installer-upi-e2e: Gather on bootstrap failure, 2019-08-13, openshift/release#4734), so we know it works. Comma-delimited machines won't work, as described in that release commit. I'd be surprised if space-delimited worked.
Currently UPI bootstrap failures die with:
but don't actually gather those recommended logs. With this PR, I've added a setup-script global
GATHER_BOOTSTRAP_ARGSwhich the various per-platform flows can populate as they create resources. Then if the wait-for-bootstrap command dies and that variable is non-empty, we run the gather to store the logs in the installer's artifact directory.We can't use:
because the backing installer code uses
StringArrayVar, which does not performStringSliceVar's comma-splitting.The
GATHER_BOOTSTRAP_ARGSapproach is a bit of a cludge, because the expansion in gather-bootstrap-and-fail is not quoted; relying instead on a lack of shell-sensitive characters in the IP arguments. That's likely fine in practice, but if we wanted to tighten it down we could switch the script from sh to Bash and use an array variable. For now; I'm punting that to future work.There's also crufy Terraform business around this in the
teardowncontainer, which I've left alone for now.The
WIPis because I'm forcing bootstrap failure to test the new gathering logic.CC @ewolinetz