-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New gather subcommand to assist debugging bootstrap failures. #1627
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
cmd/openshift-install/gather.go
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would expect that we copy the bundle to rootOpts.dir like all other files for installer?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would leave it to the end user to decide where they want to copy the file, if not pwd.
13bc608 to
0e2c80e
Compare
|
LGTM |
…failure This change adds a new subcommand to assists with debugging a failed bootstrap by collecting logs from the cluster. Currently, it outputs the shell commands to run in order to gather these logs. It can later be extended to gather the logs itself. It is called automatically from the create cluster subcommand on a failed bootstrap.
|
@abhinavdahiya Squashed, just waiting on review labels now. |
/lgtm we can fix nits from @wking as we iterate on this. |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: abhinavdahiya, jstuever The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
| @@ -1,7 +1,6 @@ | |||
| #!/usr/bin/env bash | |||
| set -eo pipefail | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This completely removes error checking, no? What happens if, for example, mkdir -p "${ARTIFACTS}/bootstrap/journals" fails and we continue to launch all the commands that attempt to fill it in?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should try to make sure this script never fails and gathers as much as it can.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also the bootstrap node is very controlled env so things like mkdir failing, we'll probably catch in our CI ;)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also the bootstrap node is very controlled env so things like mkdir failing, we'll probably catch in our CI ;)
If we look and notice that things are missing. I'm unlikely to be that observant unless quite a lot is missing ;).
…e fails Folks might wish to wait longer, possibly after trying to manually recover some cluster component. Personally I'd rather drop the install-complete timeout entirely and have callers supply their own timeout like: $ timeout 1h openshift-install create cluster but Stephen Benjamin feels that the current installer output is not sufficiently clear to allow users to make informed decisions about whether waiting longer or not makes sense. Potentially product improvements like alerting on stuck-in-Provisioned compute machines and installer logging of firing alerts would help in this space. But until we can drop the timeout, pointing folks at the wait-for command makes that safety valve more discoverable. The "Use the following command..." language is originally from 07aa0e0 (cmd: add gather bootstrap subcommand for gathering logs on bootstrap failure, 2019-04-12, openshift#1627), so I'm just rolling forward with that approach instead of porting it to use argv[0] or something vs. it's current assumption that the installer command will be "openshift-install".
Adds a new subcommand to assists with debugging a failed bootstrap by collecting logs from the cluster. Currently, it outputs the shell commands to run in order to gather these logs. It can later be extended to gather the logs itself. It is called automatically from the create cluster subcommand on a failed bootstrap.
https://jira.coreos.com/browse/CORS-1050