Skip to content

Conversation

@wking
Copy link
Member

@wking wking commented Aug 13, 2019

Currently UPI bootstrap failures die with:

time="2019-08-13T20:38:56Z" level=debug msg="Still waiting for the Kubernetes API: the server could not find the requested resource"
time="2019-08-13T20:39:12Z" level=info msg="Use the following commands to gather logs from the cluster"
time="2019-08-13T20:39:12Z" level=info msg="openshift-install gather bootstrap --help"
time="2019-08-13T20:39:12Z" level=fatal msg="waiting for Kubernetes API: context deadline exceeded"

but don't actually gather those recommended logs. With this PR, I've added a setup-script global GATHER_BOOTSTRAP_ARGS which the various per-platform flows can populate as they create resources. Then if the wait-for-bootstrap command dies and that variable is non-empty, we run the gather to store the logs in the installer's artifact directory.

We can't use:

--master ${CONTROL_PLANE_0_IP},${CONTROL_PLANE_1_IP},${CONTROL_PLANE_2_IP}

because the backing installer code uses StringArrayVar, which does not perform StringSliceVar's comma-splitting.

The GATHER_BOOTSTRAP_ARGS approach is a bit of a cludge, because the expansion in gather-bootstrap-and-fail is not quoted; relying instead on a lack of shell-sensitive characters in the IP arguments. That's likely fine in practice, but if we wanted to tighten it down we could switch the script from sh to Bash and use an array variable. For now; I'm punting that to future work.

There's also crufy Terraform business around this in the teardown container, which I've left alone for now.

The WIP is because I'm forcing bootstrap failure to test the new gathering logic.

CC @ewolinetz

@openshift-ci-robot openshift-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Aug 13, 2019
@openshift-ci-robot openshift-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Aug 13, 2019
@wking wking force-pushed the upi-failed-install-gather branch from 4e5f77c to 676133b Compare August 13, 2019 23:21
…i-e2e: Gather on bootstrap failure

Currently UPI bootstrap failures die with [1]:

  time="2019-08-13T20:38:56Z" level=debug msg="Still waiting for the Kubernetes API: the server could not find the requested resource"
  time="2019-08-13T20:39:12Z" level=info msg="Use the following commands to gather logs from the cluster"
  time="2019-08-13T20:39:12Z" level=info msg="openshift-install gather bootstrap --help"
  time="2019-08-13T20:39:12Z" level=fatal msg="waiting for Kubernetes API: context deadline exceeded"

but don't actually gather those recommended logs [2].  With this
commit, I've added a setup-script global GATHER_BOOTSTRAP_ARGS which
the various per-platform flows can populate as they create resources.
Then if the wait-for-bootstrap command dies and that variable is
non-empty, we run the gather to store the logs in the installer's
artifact directory.

We can't use:

  --master ${CONTROL_PLANE_0_IP},${CONTROL_PLANE_1_IP},${CONTROL_PLANE_2_IP}

because the backing installer code [3] uses StringArrayVar [4], which
does not perform StringSliceVar's [5] comma-splitting.

The GATHER_BOOTSTRAP_ARGS approach is a bit of a cludge, because the
expansion in gather-bootstrap-and-fail is not quoted; relying instead
on a lack of shell-sensitive characters in the IP arguments.  That's
likely fine in practice, but if we wanted to tighten it down we could
switch the script from sh to Bash and use an array variable.  For now;
I'm punting that to future work.

There's also crufy Terraform business around this in the teardown
container, which I've left alone for now.

[1]: https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_release/4719/rehearse-4719-pull-ci-openshift-installer-master-e2e-aws-proxy/5/artifacts/e2e-aws-proxy/installer/.openshift_install.log
[2]: https://gcsweb-ci.svc.ci.openshift.org/gcs/origin-ci-test/pr-logs/pull/openshift_release/4719/rehearse-4719-pull-ci-openshift-installer-master-e2e-aws-proxy/5/artifacts/e2e-aws-proxy/installer/
[3]: https://github.com/openshift/installer/blob/8f972b45987a32cc91bc61c39a727e9a1224693d/cmd/openshift-install/gather.go#L71
[4]: https://godoc.org/github.com/spf13/pflag#FlagSet.StringArrayVar
[5]: https://godoc.org/github.com/spf13/pflag#FlagSet.StringSliceVar
@wking wking force-pushed the upi-failed-install-gather branch from 676133b to 42f9edf Compare August 14, 2019 02:51
@wking wking changed the title WIP: ci-operator/templates/openshift/installer/cluster-launch-installer-upi-e2e: Gather on bootstrap failure ci-operator/templates/openshift/installer/cluster-launch-installer-upi-e2e: Gather on bootstrap failure Aug 14, 2019
@openshift-ci-robot openshift-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Aug 14, 2019
@wking wking force-pushed the upi-failed-install-gather branch from 42f9edf to 3013859 Compare August 14, 2019 03:51
@wking
Copy link
Member Author

wking commented Aug 14, 2019

Successful gather here using 42f9edfcb80b4d1386c3a617c2163b3e2feb315e:

$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_release/4734/rehearse-4734-pull-ci-openshift-installer-master-e2e-aws-upi/3/artifacts/e2e-aws-upi/installer/log-bundle-20190814033211.tar | tar -zt
./
./bootstrap/
./bootstrap/journals/
./bootstrap/journals/bootkube.log
./bootstrap/journals/openshift.log
./bootstrap/journals/kubelet.log
./bootstrap/journals/crio.log
./bootstrap/journals/approve-csr.log
./bootstrap/containers/
./bootstrap/containers/kube-apiserver-insecure-readyz-ef859e6d6c9d720b8bf20b69529cdcc4bcb1fe2e6ea397d801982417d76a8f3d.log
./bootstrap/containers/kube-apiserver-insecure-readyz-ef859e6d6c9d720b8bf20b69529cdcc4bcb1fe2e6ea397d801982417d76a8f3d.inspect
...
./control-plane/
./control-plane/10.0.52.86/
./control-plane/10.0.52.86/containers/
./control-plane/10.0.52.86/containers/pruner-8da8ad5a9be44b1b0fc8a64a62960a7c60c1697530e69b5c547e6104428425a3.log
./control-plane/10.0.52.86/containers/pruner-8da8ad5a9be44b1b0fc8a64a62960a7c60c1697530e69b5c547e6104428425a3.inspect
...
./resources/openapi.json.gz

I've dropped the WIP from the PR subject and the testing commit from the PR, and I think we're ready to merge :). CC @abhinavdahiya for review, although anyone dropping an /lgtm would be sufficient.

@ewolinetz
Copy link
Contributor

/test pj-rehearse

2 similar comments
@openshift-merge-robot
Copy link
Contributor

/test pj-rehearse

@openshift-merge-robot
Copy link
Contributor

/test pj-rehearse

@wking
Copy link
Member Author

wking commented Aug 21, 2019

@abhinavdahiya: can you take a look at this? It's hard to debug the black-holed-proxy work in #4719 without gathered bootstrap logs when we hang waiting for the Kubernetes API.

GATHER_BOOTSTRAP_ARGS=
function gather_bootstrap_and_fail() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why gather and fail. I don't think it's a great collection..

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We only gather when installation fails. By failing in the function regardless of gather success, I don't need to figure out how a || b || c groups in shell ;)

GATHER_BOOTSTRAP_ARGS=
function gather_bootstrap_and_fail() {
if test -n "${GATHER_BOOTSTRAP_ARGS}"; then
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not pass argument to the function?

Copy link
Member Author

@wking wking Aug 22, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could. But we need a variable to gather these as you add a bootstrap machine and control planes machines. So inheritance seemed like a convenient way to pass them in. I can pass them explicitly if you prefer.

@ewolinetz
Copy link
Contributor

bump

@abhinavdahiya
Copy link
Contributor

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Aug 23, 2019
@openshift-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: abhinavdahiya, wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-merge-robot openshift-merge-robot merged commit f611a11 into openshift:master Aug 23, 2019
@openshift-ci-robot
Copy link
Contributor

@wking: Updated the following 2 configmaps:

  • prow-job-cluster-launch-installer-upi-e2e configmap in namespace ci using the following files:
    • key cluster-launch-installer-upi-e2e.yaml using file ci-operator/templates/openshift/installer/cluster-launch-installer-upi-e2e.yaml
  • prow-job-cluster-launch-installer-upi-e2e configmap in namespace ci-stg using the following files:
    • key cluster-launch-installer-upi-e2e.yaml using file ci-operator/templates/openshift/installer/cluster-launch-installer-upi-e2e.yaml
Details

In response to this:

Currently UPI bootstrap failures die with:

time="2019-08-13T20:38:56Z" level=debug msg="Still waiting for the Kubernetes API: the server could not find the requested resource"
time="2019-08-13T20:39:12Z" level=info msg="Use the following commands to gather logs from the cluster"
time="2019-08-13T20:39:12Z" level=info msg="openshift-install gather bootstrap --help"
time="2019-08-13T20:39:12Z" level=fatal msg="waiting for Kubernetes API: context deadline exceeded"

but don't actually gather those recommended logs. With this PR, I've added a setup-script global GATHER_BOOTSTRAP_ARGS which the various per-platform flows can populate as they create resources. Then if the wait-for-bootstrap command dies and that variable is non-empty, we run the gather to store the logs in the installer's artifact directory.

We can't use:

--master ${CONTROL_PLANE_0_IP},${CONTROL_PLANE_1_IP},${CONTROL_PLANE_2_IP}

because the backing installer code uses StringArrayVar, which does not perform StringSliceVar's comma-splitting.

The GATHER_BOOTSTRAP_ARGS approach is a bit of a cludge, because the expansion in gather-bootstrap-and-fail is not quoted; relying instead on a lack of shell-sensitive characters in the IP arguments. That's likely fine in practice, but if we wanted to tighten it down we could switch the script from sh to Bash and use an array variable. For now; I'm punting that to future work.

There's also crufy Terraform business around this in the teardown container, which I've left alone for now.

The WIP is because I'm forcing bootstrap failure to test the new gathering logic.

CC @ewolinetz

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot
Copy link
Contributor

@wking: The following tests failed, say /retest to rerun them all:

Test name Commit Details Rerun command
ci/rehearse/openshift/installer/master/e2e-aws-upi 3013859 link /test pj-rehearse
ci/rehearse/openshift/installer/master/e2e-vsphere 3013859 link /test pj-rehearse
ci/prow/pj-rehearse 3013859 link /test pj-rehearse

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

wking added a commit to wking/openshift-docs that referenced this pull request Nov 8, 2019
This is what we do in CI since openshift/release@3013859bdc
(ci-operator/templates/openshift/installer/cluster-launch-installer-upi-e2e:
Gather on bootstrap failure, 2019-08-13, openshift/release#4734), so
we know it works.  Comma-delimited machines won't work, as described
in that release commit.  I'd be surprised if space-delimited worked.
openshift-cherrypick-robot pushed a commit to openshift-cherrypick-robot/openshift-docs that referenced this pull request Nov 8, 2019
This is what we do in CI since openshift/release@3013859bdc
(ci-operator/templates/openshift/installer/cluster-launch-installer-upi-e2e:
Gather on bootstrap failure, 2019-08-13, openshift/release#4734), so
we know it works.  Comma-delimited machines won't work, as described
in that release commit.  I'd be surprised if space-delimited worked.
openshift-cherrypick-robot pushed a commit to openshift-cherrypick-robot/openshift-docs that referenced this pull request Nov 8, 2019
This is what we do in CI since openshift/release@3013859bdc
(ci-operator/templates/openshift/installer/cluster-launch-installer-upi-e2e:
Gather on bootstrap failure, 2019-08-13, openshift/release#4734), so
we know it works.  Comma-delimited machines won't work, as described
in that release commit.  I'd be surprised if space-delimited worked.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. size/S Denotes a PR that changes 10-29 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants