New gather subcommand to assist debugging bootstrap failures. #1627

jstuever · 2019-04-16T16:29:31Z

Adds a new subcommand to assist with debugging a failed bootstrap by collecting logs from the cluster. Currently, it outputs the shell commands to run in order to gather these logs. It can later be extended to gather the logs itself. It is called automatically from the create cluster subcommand on a failed bootstrap.

https://jira.coreos.com/browse/CORS-1050

cmd/openshift-install/create.go

cmd/openshift-install/gather.go

cmd/openshift-install/waitfor.go

abhinavdahiya · 2019-04-17T18:07:49Z

cmd/openshift-install/gather.go

I would expect that we copy the bundle to rootOpts.dir like all other files for installer?

I would leave it to the end user to decide where they want to copy the file, if not pwd.

abhinavdahiya · 2019-04-17T19:37:13Z

LGTM.
Ping @wking, do you mind taking another look?
Can you squash the commits (I'm okay with just one) after ^^, and then we can move ahead with this.

…failure

This change adds a new subcommand to assist with debugging a failed bootstrap by collecting logs from the cluster. Currently, it outputs the shell commands to run in order to gather these logs. It can later be extended to gather the logs itself. It is called automatically from the create cluster subcommand on a failed bootstrap.

jstuever · 2019-04-18T15:26:12Z

@abhinavdahiya Squashed, just waiting on review labels now.

abhinavdahiya · 2019-04-18T16:20:46Z

/lgtm

we can fix nits from @wking as we iterate on this.

openshift-ci-robot · 2019-04-18T16:20:59Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: abhinavdahiya, jstuever

Needs approval from an approver in each of these files:

~~OWNERS~~ [abhinavdahiya]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

wking · 2019-04-18T16:55:29Z

data/data/bootstrap/files/usr/local/bin/installer-gather.sh

@@ -1,7 +1,6 @@
 #!/usr/bin/env bash
-set -eo pipefail


This completely removes error checking, no? What happens if, for example, mkdir -p "${ARTIFACTS}/bootstrap/journals" fails and we continue to launch all the commands that attempt to fill it in?

I think we should try to make sure this script never fails and gathers as much as it can.

Also, the bootstrap node is a very controlled environment, so things like mkdir failing are something we'll probably catch in our CI ;)

If we look and notice that things are missing, that is. I'm unlikely to be that observant unless quite a lot is missing ;).
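One way to reconcile the two positions above (gather as much as possible, but make missing pieces easy to notice) is to run each step through a small wrapper that records failures instead of aborting. The following is a minimal sketch under stated assumptions, not the `installer-gather.sh` from this PR: the `try` helper, the `failures.log` file name, the `mktemp` default for `ARTIFACTS`, and the placeholder steps are all hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical sketch, not the script from this PR.
# `set -e` is deliberately absent so one failing step does not stop the run.
set -uo pipefail

# Assumed default for illustration only; the real script defines its own ARTIFACTS.
ARTIFACTS="${ARTIFACTS:-$(mktemp -d)}"
FAILURES="${ARTIFACTS}/failures.log"   # hypothetical file name

# Without the artifact root there is nowhere to write, so this one failure is fatal.
mkdir -p "${ARTIFACTS}" || exit 1
: > "${FAILURES}"

# try DESCRIPTION COMMAND [ARGS...]
# Runs the command. On failure, appends the description and exit code to
# ${FAILURES}, then returns 0 so later steps still run.
try() {
    local desc="$1"
    shift
    "$@"
    local rc=$?
    if [ "${rc}" -ne 0 ]; then
        echo "${desc}: exit ${rc}" >> "${FAILURES}"
    fi
    return 0
}

try "create journal directory" mkdir -p "${ARTIFACTS}/bootstrap/journals"
try "placeholder step that fails" false
try "placeholder step that succeeds" touch "${ARTIFACTS}/bootstrap/journals/example.log"

# Surface the failures so nobody has to spot missing files by eye.
if [ -s "${FAILURES}" ]; then
    echo "Some gather steps failed; see ${FAILURES}" >&2
fi
```

With this shape, a failed `mkdir` no longer goes unnoticed: the steps that depend on it also fail, and each one leaves a line in the failure log that ships with the rest of the artifacts.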

…e fails

Folks might wish to wait longer, possibly after trying to manually recover some cluster component. Personally I'd rather drop the install-complete timeout entirely and have callers supply their own timeout like:

  $ timeout 1h openshift-install create cluster

but Stephen Benjamin feels that the current installer output is not sufficiently clear to allow users to make informed decisions about whether waiting longer or not makes sense. Potentially product improvements like alerting on stuck-in-Provisioned compute machines and installer logging of firing alerts would help in this space. But until we can drop the timeout, pointing folks at the wait-for command makes that safety valve more discoverable.

The "Use the following command..." language is originally from 07aa0e0 (cmd: add gather bootstrap subcommand for gathering logs on bootstrap failure, 2019-04-12, openshift#1627), so I'm just rolling forward with that approach instead of porting it to use argv[0] or something vs. its current assumption that the installer command will be "openshift-install".
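The caller-supplied timeout mentioned in this commit message can be sketched with GNU coreutils `timeout`, which exits with status 124 when the time limit is reached. The `run_with_deadline` wrapper below is a hypothetical helper for illustration and is not part of the installer.

```bash
#!/usr/bin/env bash
# Hypothetical helper: run a command under a caller-chosen time limit and
# report when the limit, rather than the command itself, ended the run.
# Assumes GNU coreutils `timeout` is available.
run_with_deadline() {
    local limit="$1"
    shift
    timeout "${limit}" "$@"
    local rc=$?
    if [ "${rc}" -eq 124 ]; then
        echo "command timed out after ${limit}: $*" >&2
    fi
    return "${rc}"
}

# Example invocation (not executed here):
#   run_with_deadline 1h openshift-install create cluster
```

Returning the original exit status lets the caller distinguish a timeout (124) from a failure of the wrapped command and decide whether waiting longer makes sense.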

openshift-ci-robot added the do-not-merge/work-in-progress and size/L labels Apr 16, 2019

openshift-ci-robot requested review from abhinavdahiya and hardys April 16, 2019 16:29

jstuever force-pushed the cors1050 branch from c74dce8 to 88ab26e Compare April 16, 2019 16:35

abhinavdahiya reviewed Apr 16, 2019

cmd/openshift-install/create.go (outdated)

abhinavdahiya reviewed Apr 16, 2019

cmd/openshift-install/create.go (outdated)

jstuever force-pushed the cors1050 branch from 88ab26e to cb0e923 Compare April 16, 2019 17:41

wking reviewed Apr 16, 2019

cmd/openshift-install/gather.go (outdated)

wking reviewed Apr 16, 2019

cmd/openshift-install/gather.go (outdated)

wking reviewed Apr 16, 2019

cmd/openshift-install/gather.go (outdated)

jstuever force-pushed the cors1050 branch from cb0e923 to 78f34d4 Compare April 16, 2019 19:55

sferich888 reviewed Apr 16, 2019

cmd/openshift-install/waitfor.go (outdated)

abhinavdahiya reviewed Apr 16, 2019

cmd/openshift-install/create.go (outdated)

abhinavdahiya reviewed Apr 16, 2019

cmd/openshift-install/waitfor.go (outdated)

jstuever force-pushed the cors1050 branch from 78f34d4 to 3b83b0f Compare April 17, 2019 12:59

abhinavdahiya reviewed Apr 17, 2019