Skip to content

Conversation

@jstuever
Copy link
Contributor

@jstuever jstuever commented Apr 16, 2019

Adds a new subcommand to assists with debugging a failed bootstrap by collecting logs from the cluster. Currently, it outputs the shell commands to run in order to gather these logs. It can later be extended to gather the logs itself. It is called automatically from the create cluster subcommand on a failed bootstrap.

https://jira.coreos.com/browse/CORS-1050

@openshift-ci-robot openshift-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Apr 16, 2019
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would expect that we copy the bundle to rootOpts.dir like all other files for installer?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would leave it to the end user to decide where they want to copy the file, if not pwd.

@jstuever jstuever force-pushed the cors1050 branch 2 times, most recently from 13bc608 to 0e2c80e Compare April 17, 2019 19:22
@jstuever jstuever changed the title [WIP] Output commands to assist debugging when bootstrap fails. Output commands to assist debugging when bootstrap fails. Apr 17, 2019
@openshift-ci-robot openshift-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 17, 2019
@abhinavdahiya
Copy link
Contributor

LGTM
ping @wking do you mind taking another look
can you squash the commits (i'm okay with just one) after ^^ and then we can move ahead with this.

…failure

This change adds a new subcommand to assists with debugging a failed
bootstrap by collecting logs from the cluster. Currently, it outputs the
shell commands to run in order to gather these logs. It can later be
extended to gather the logs itself. It is called automatically from the
create cluster subcommand on a failed bootstrap.
@jstuever jstuever changed the title Output commands to assist debugging when bootstrap fails. New gather subcommand to assist debugging bootstrap failures. Apr 18, 2019
@jstuever
Copy link
Contributor Author

@abhinavdahiya Squashed, just waiting on review labels now.

@abhinavdahiya
Copy link
Contributor

LGTM
ping @wking do you mind taking another look
can you squash the commits (i'm okay with just one) after ^^ and then we can move ahead with this.
@abhinavdahiya Squashed, just waiting on review labels now.

/lgtm

we can fix nits from @wking as we iterate on this.

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Apr 18, 2019
@openshift-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: abhinavdahiya, jstuever

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 18, 2019
@openshift-merge-robot openshift-merge-robot merged commit 7458eda into openshift:master Apr 18, 2019
@@ -1,7 +1,6 @@
#!/usr/bin/env bash
set -eo pipefail
Copy link
Member

@wking wking Apr 18, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This completely removes error checking, no? What happens if, for example, mkdir -p "${ARTIFACTS}/bootstrap/journals" fails and we continue to launch all the commands that attempt to fill it in?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should try to make sure this script never fails and gathers as much as it can.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also the bootstrap node is very controlled env so things like mkdir failing, we'll probably catch in our CI ;)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also the bootstrap node is very controlled env so things like mkdir failing, we'll probably catch in our CI ;)

If we look and notice that things are missing. I'm unlikely to be that observant unless quite a lot is missing ;).

@jstuever jstuever deleted the cors1050 branch April 23, 2019 13:50
wking added a commit to wking/openshift-installer that referenced this pull request Jun 5, 2020
…e fails

Folks might wish to wait longer, possibly after trying to manually
recover some cluster component.  Personally I'd rather drop the
install-complete timeout entirely and have callers supply their own
timeout like:

  $ timeout 1h openshift-install create cluster

but Stephen Benjamin feels that the current installer output is not
sufficiently clear to allow users to make informed decisions about
whether waiting longer or not makes sense.  Potentially product
improvements like alerting on stuck-in-Provisioned compute machines
and installer logging of firing alerts would help in this space.  But
until we can drop the timeout, pointing folks at the wait-for command
makes that safety valve more discoverable.

The "Use the following command..." language is originally from
07aa0e0 (cmd: add gather bootstrap subcommand for gathering logs on
bootstrap failure, 2019-04-12, openshift#1627), so I'm just rolling forward
with that approach instead of porting it to use argv[0] or something
vs. it's current assumption that the installer command will be
"openshift-install".
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants