-
Notifications
You must be signed in to change notification settings - Fork 1.5k
installer-create: Provide user friendly error messages during failures #4800
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
installer-create: Provide user friendly error messages during failures #4800
Conversation
staebler
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's try to make this a little cleaner. Let's put the added text into a new log event rather than shoving it into the existing fatal event.
One approach is to have the waitForBootstrapComplete function return a custom-typed error from which the caller can obtain the additional text.
882f004 to
331cdbe
Compare
cmd/openshift-install/create.go
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Swap the order of these.
| logrus.Error(errInfo.GetLogMessage()) | |
| logrus.Fatal("Bootstrap failed to complete: ", err) | |
| logrus.Fatal("Bootstrap failed to complete: ", err) | |
| logrus.Error(errInfo.GetLogMessage()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doesn't a call to the Fatal function exit the program?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LOL. It sure does. Make the last one the fatal one. Or add a shorter final fatal.
| logrus.Error(errInfo.GetLogMessage()) | |
| logrus.Fatal("Bootstrap failed to complete: ", err) | |
| logrus.Error("Bootstrap failed to complete: ", err) | |
| logrus.Error(errInfo.GetLogMessage()) | |
| logrus.Fatal("Bootstrap failed to complete") |
pkg/types/installconfig.go
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't put these new types in the pkg/types package. The package is for user-facing types. You can keep the types as private type in the cmd/openshift-install package.
pkg/types/installconfig.go
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use Unwrap so that the ClusterCreateError can be treated as a wrapped error.
| GetError() error | |
| Unwrap() error |
pkg/types/installconfig.go
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The name of this function is not very descriptive. Maybe something like CauseDetail?
pkg/types/installconfig.go
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we need a custom type for each kind of error. Make ClusterCreateError a struct and store the cause detail in a string in the struct. You can still keep the NewXXXError functions, which will populate the cause details.
pkg/types/installconfig.go
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's say "creating the control plane" to keep it consistent with the cause detail in APIError.
| "the control plane operators to successfully run the control plane." | |
| "the control plane operators from creating the control plane." |
cmd/openshift-install/create.go
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are missing a nil check here before calling GetError. But really, you don't need to store errInfo separately.
| errInfo := waitForBootstrapComplete(ctx, config) | |
| err = errInfo.GetError() | |
| if err != nil { | |
| if err := waitForBootstrapComplete(ctx, config); err != nil { |
331cdbe to
27273c7
Compare
cmd/openshift-install/create.go
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| errorInfo error | |
| wrappedError error |
cmd/openshift-install/create.go
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| func (ce clusterCreateError) unwrap() error { | |
| func (ce clusterCreateError) Unwrap() error { |
cmd/openshift-install/create.go
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Define the Error function, too, so that clusterCreateError implements error.
cmd/openshift-install/create.go
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we are returning a pointer, then the receiver for the unwrap and causeDetail functions should be pointers, too.
cmd/openshift-install/create.go
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use err instead of errInfo. It is OK to hide the err declared in the outer scope. I don't know what errInfo means.
27273c7 to
0d7f8d9
Compare
staebler
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rna-afk This looks good. Could you show me the last 20 lines or so of the installer log for the two scenarios germane to this PR?
- the temporary control plane does not come up
- the bootstrapping does not complete afterwards
|
Seems to work, here are the two error messages |
Wonderful. Thank you. Can you capitalize the start of the log message and add sentences? |
0d7f8d9 to
bf729cd
Compare
staebler
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/lgtm
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: staebler The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
staebler
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/lgtm cancel
cmd/openshift-install/create.go
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops. Left this one in from your testing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah sorry my bad
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No worries. It happens to everyone.
Refines the error messages that are provided during cluster installation if the API or the bootstrap fails to come up within the given time to provide more info to the user about what would have likely caused the error. This would provide a suggestion to the user about where to go look for the specific log file for the error.
bf729cd to
cad45bd
Compare
staebler
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/lgtm
|
/retest Please review the full test history for this PR and help us cut down flakes. |
4 similar comments
|
/retest Please review the full test history for this PR and help us cut down flakes. |
|
/retest Please review the full test history for this PR and help us cut down flakes. |
|
/retest Please review the full test history for this PR and help us cut down flakes. |
|
/retest Please review the full test history for this PR and help us cut down flakes. |
|
@rna-afk: The following tests failed, say
Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
|
These changes have no effect on upgrades. They only effect failed installations. |
|
@staebler: Overrode contexts on behalf of staebler: ci/prow/e2e-aws-upgrade DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
… gather logs Adding a conditional that we overlooked in fdb04a7 (cmd: diagnose problems downloading release image, 2021-03-13, openshift#4751). This will avoid distraction like [1]: level=error msg=Attempted to gather debug logs after installation failure: bootstrap host address and at least one control plane host address must be provided ... level=error msg=Attempted to analyze the debug logs after installation failure: could not open the gather bundle: open : no such file or directory where the first line usefully explains that we failed to gather the log bundle, while the second line uselessly adds that without a log bundle, there can be no log bundle analysis. == Order of operations When they landed in cad45bd (installer-create: Provide user friendly error messages during failures, 2021-03-29, openshift#4800), the installer-client's bootstrap error logs were after the gather attempt, and right before the fatal "Bootstrap failed to complete" bail-out. fdb04a7 (cmd: diagnose problems downloading release image, 2021-03-13, openshift#4751) landed later, adding an attempt to analyze the gathered logs. At this point, the installer-client's bootstrap error logging sat in between the gather attempt and the gather-analysis attempt, which seems like uneccessary context switching. Matthew suggested [2]: 1. Perform bootstrap wait and store error. 2. Perform bootstrap gather, and print any error. 3. Print failing console operators. 4. Print error from bootstrap wait. 5. If no error from bootstrap gather, perform analysis of gather bundle. 6. If no error from bootstrap gather, print out location of gather bundle. 7. Print generic message that bootstrapping failed. so that's what I've gone with here. In order to get step 6 that late, I had to pull it out of logGatherBootstrap (and now that that function no longer logs on success, I renamed it to gatherBootstrap). Now the newGatherBootstrapCmd function includes its own explicit logging of the successful gather path, and we can place the create-cluster logging down at 6 where Matthew wants it. == Error scoping and naming There's also no need to pick a unique name for is block-scoped variable. We can use the usual 'err' without clobbering the local 'err' that's scoped to the waitForBootstrapComplete block. We only need a specific name for gatherError because we need it to feed the deferred conditionals for 5 and 6. [1]: https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.10-e2e-ovirt-ovn/1486127081728249856#1:build-log.txt%3A50 [2]: openshift#5582 (comment)
Refines the error messages that are provided during cluster installation
if the API or the bootstrap fails to come up within the given time
to provide more info to the user about what would have likely
caused the error. This would provide a suggestion to the user
about where to go look for the specific log file for the error.