@wking
Member

@wking wking commented Jun 5, 2020

Folks might wish to wait longer, possibly after trying to manually recover some cluster component. Personally I'd rather drop the install-complete timeout entirely and have callers supply their own timeout like:

$ timeout 1h openshift-install create cluster

but @stbenjam feels that the current installer output is not sufficiently clear to allow users to make informed decisions about whether waiting longer or not makes sense. Potentially product improvements like alerting on stuck-in-Provisioned compute machines and installer logging of firing alerts would help in this space. But until we can drop the timeout, pointing folks at the wait-for command makes that safety valve more discoverable.

The "Use the following command..." language is originally from 07aa0e0 (#1627), so I'm just rolling forward with that approach instead of porting it to use argv[0] or something, vs. its current assumption that the installer command will be openshift-install.
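For reference, the argv[0] alternative mentioned above might look roughly like the following. It is a sketch, not what this PR does: `waitForHint` and `currentHint` are hypothetical helpers, and only the `wait-for install-complete --help` suffix comes from the change itself.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// waitForHint builds the suggested command from the path the program was
// invoked with, instead of assuming the binary is named openshift-install.
func waitForHint(argv0 string) string {
	return fmt.Sprintf("%s wait-for install-complete --help", filepath.Base(argv0))
}

// currentHint applies waitForHint to this process's own argv[0].
func currentHint() string {
	return waitForHint(os.Args[0])
}

func main() {
	fmt.Println("Use the following command if you want to wait longer for install completion:")
	fmt.Println(currentHint())
}
```

Using only the base name keeps the hint short, at the cost of being wrong when the binary is not on the user's PATH; printing the full argv[0] would be the other obvious choice.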

…e fails

Folks might wish to wait longer, possibly after trying to manually
recover some cluster component.  Personally I'd rather drop the
install-complete timeout entirely and have callers supply their own
timeout like:

  $ timeout 1h openshift-install create cluster

but Stephen Benjamin feels that the current installer output is not
sufficiently clear to allow users to make informed decisions about
whether waiting longer or not makes sense.  Potentially product
improvements like alerting on stuck-in-Provisioned compute machines
and installer logging of firing alerts would help in this space.  But
until we can drop the timeout, pointing folks at the wait-for command
makes that safety valve more discoverable.

The "Use the following command..." language is originally from
07aa0e0 (cmd: add gather bootstrap subcommand for gathering logs on
bootstrap failure, 2019-04-12, openshift#1627), so I'm just rolling forward
with that approach instead of porting it to use argv[0] or something
vs. its current assumption that the installer command will be
"openshift-install".
@openshift-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please assign wking
You can assign the PR to them by writing /assign @wking in a comment when ready.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

logrus.Error("Attempted to gather ClusterOperator status after installation failure: ", err2)
}
logrus.Info("Use the following command if you want to wait longer for install completion:")
logrus.Info("openshift-install wait-for install-complete --help")
Member

Either here or in the help (which only says "Wait until the cluster is ready"), what do you think of explaining why someone might want to wait longer?

Member Author

I think that decision is up to the user, and depends on their feelings vs. the ClusterOperator and ClusterVersion conditions we report immediately before and after these new lines. What would you add?

@stbenjam
Member

stbenjam commented Jun 5, 2020

@stbenjam feels that the current installer output is not sufficiently clear to allow users to make informed decisions about whether waiting longer or not makes sense.

To clarify somewhat: I do think the installer should always give up at some point, I just prefer it make an informed decision rather than base it only on overall time.

This seems fine to me for now.

@openshift-ci-robot
Contributor

@wking: The following tests failed, say /retest to rerun all failed tests:

| Test name | Commit | Details | Rerun command |
| --- | --- | --- | --- |
| ci/prow/e2e-ovirt | ff06a80 | link | /test e2e-ovirt |
| ci/prow/e2e-aws-scaleup-rhel7 | ff06a80 | link | /test e2e-aws-scaleup-rhel7 |
| ci/prow/e2e-openstack | ff06a80 | link | /test e2e-openstack |
| ci/prow/e2e-aws-workers-rhel7 | ff06a80 | link | /test e2e-aws-workers-rhel7 |

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@jstuever
Contributor

/uncc

@openshift-ci-robot openshift-ci-robot removed the request for review from jstuever October 12, 2020 16:41
@wking
Member Author

wking commented Nov 7, 2020

Obsoleted by #4259.

/close

@openshift-ci-robot
Contributor

@wking: Closed this PR.


@wking wking deleted the mention-wait-for-on-install-complete-timeouts branch November 7, 2020 03:29