@rna-afk rna-afk commented Mar 29, 2021

Refines the error messages shown during cluster installation when the
API or the bootstrap fails to come up within the given time, so that
they tell the user what most likely caused the failure. The messages
also suggest where the user should look for the log file relevant to
the error.

@staebler staebler left a comment

Let's try to make this a little cleaner. Let's put the added text into a new log event rather than shoving it into the existing fatal event.

One approach is to have the waitForBootstrapComplete function return a custom-typed error from which the caller can obtain the additional text.

@rna-afk rna-afk force-pushed the api_bootstrap_error_message_cleanup branch 2 times, most recently from 882f004 to 331cdbe Compare March 30, 2021 14:57
Comment on lines 124 to 125
Contributor

Swap the order of these.

Suggested change
- logrus.Error(errInfo.GetLogMessage())
- logrus.Fatal("Bootstrap failed to complete: ", err)
+ logrus.Fatal("Bootstrap failed to complete: ", err)
+ logrus.Error(errInfo.GetLogMessage())

Contributor Author

Doesn't a call to the Fatal function exit the program?

Contributor

LOL. It sure does. Make the last one the fatal one. Or add a shorter final fatal.

Suggested change
- logrus.Error(errInfo.GetLogMessage())
- logrus.Fatal("Bootstrap failed to complete: ", err)
+ logrus.Error("Bootstrap failed to complete: ", err)
+ logrus.Error(errInfo.GetLogMessage())
+ logrus.Fatal("Bootstrap failed to complete")
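
The ordering point can be illustrated with a small sketch. It uses the standard library's `log` package as a stand-in for logrus, and `logFunc` and `reportBootstrapFailure` are hypothetical names chosen for this example, not code from the PR. Because a fatal logger exits the process, anything logged after it never appears, so the detailed messages go first and the short fatal message goes last.

```go
package main

import (
	"errors"
	"log"
)

// logFunc matches the shape of log.Print and log.Fatal (logrus.Error and
// logrus.Fatal take the same variadic arguments).
type logFunc func(args ...interface{})

// reportBootstrapFailure is a hypothetical helper. It emits the detailed
// messages at error level first and calls the fatal logger last, because a
// fatal logger terminates the process and nothing after it would run.
func reportBootstrapFailure(err error, causeDetail string, logError, logFatal logFunc) {
	logError("Bootstrap failed to complete: ", err)
	logError(causeDetail)
	logFatal("Bootstrap failed to complete")
}

func main() {
	err := errors.New("timed out waiting for the condition")
	// A real command would pass a fatal logger such as log.Fatal as the last
	// argument. log.Print is used here so that the sketch returns normally.
	reportBootstrapFailure(err, "Failed to wait for bootstrapping to complete.", log.Print, log.Print)
}
```

Passing the loggers in as parameters is only a device to make the ordering easy to observe; the PR calls logrus directly.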

Contributor

Don't put these new types in the pkg/types package. The package is for user-facing types. You can keep the types as private types in the cmd/openshift-install package.

Contributor

Use Unwrap so that the ClusterCreateError can be treated as a wrapped error.

Suggested change
- GetError() error
+ Unwrap() error
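
A minimal sketch of why the method name matters, assuming a simplified wrapper type. `createError`, `errTimeout`, and `isTimeout` are hypothetical names for illustration only. The standard `errors.Is` and `errors.As` functions follow a method named `Unwrap`, so a wrapper that exposes its inner error under that name can be inspected like any other wrapped error.

```go
package main

import (
	"errors"
	"fmt"
)

// errTimeout is a hypothetical sentinel standing in for the underlying failure.
var errTimeout = errors.New("timed out waiting for the condition")

// createError is a simplified stand-in for the ClusterCreateError under review.
type createError struct {
	wrappedError error
}

func (ce *createError) Error() string { return ce.wrappedError.Error() }

// Unwrap is the method name that errors.Is and errors.As look for. A method
// named GetError would not be followed by them.
func (ce *createError) Unwrap() error { return ce.wrappedError }

// isTimeout reports whether err is errTimeout or wraps it at any depth.
func isTimeout(err error) bool {
	return errors.Is(err, errTimeout)
}

func main() {
	var err error = &createError{wrappedError: errTimeout}
	fmt.Println(isTimeout(err))

	// errors.As walks the same chain to recover the concrete wrapper type,
	// even when the wrapper is itself wrapped with %w.
	var ce *createError
	fmt.Println(errors.As(fmt.Errorf("context: %w", err), &ce))
}
```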

Contributor

The name of this function is not very descriptive. Maybe something like CauseDetail?

Contributor

I don't think we need a custom type for each kind of error. Make ClusterCreateError a struct and store the cause detail in a string in the struct. You can still keep the NewXXXError functions, which will populate the cause details.
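
One way to read this suggestion is sketched below: a single struct carries the wrapped error plus a cause-detail string, and the constructor functions differ only in the string they fill in. The constructor names `newAPIError` and `newBootstrapError` are assumptions following the NewXXXError pattern, and the detail strings are adapted from the installer output quoted later in this thread, not copied from the merged code.

```go
package main

import (
	"errors"
	"fmt"
)

// clusterCreateError is one struct for every failure kind. The kind-specific
// text lives in causeDetail instead of in a separate type per kind.
type clusterCreateError struct {
	wrappedError error
	causeDetail  string
}

// Pointer receivers, matching the constructors below, which return pointers.
func (ce *clusterCreateError) Error() string       { return ce.wrappedError.Error() }
func (ce *clusterCreateError) CauseDetail() string { return ce.causeDetail }

// newAPIError (hypothetical name) records that waiting for the API failed.
func newAPIError(err error) *clusterCreateError {
	return &clusterCreateError{
		wrappedError: err,
		causeDetail: "Failed waiting for Kubernetes API. This error usually happens when there is a " +
			"problem on the bootstrap host that prevents creating a temporary control plane.",
	}
}

// newBootstrapError (hypothetical name) records that bootstrapping did not complete.
func newBootstrapError(err error) *clusterCreateError {
	return &clusterCreateError{
		wrappedError: err,
		causeDetail: "Failed to wait for bootstrapping to complete. This error usually happens when " +
			"there is a problem with control plane hosts that prevents the control plane operators " +
			"from creating the control plane.",
	}
}

func main() {
	err := newBootstrapError(errors.New("timed out waiting for the condition"))
	fmt.Println("ERROR Bootstrap failed to complete:", err)
	fmt.Println("ERROR", err.CauseDetail())
}
```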

Contributor

Let's say "creating the control plane" to keep it consistent with the cause detail in APIError.

Suggested change
- "the control plane operators to successfully run the control plane."
+ "the control plane operators from creating the control plane."

Comment on lines 115 to 117
Contributor

You are missing a nil check here before calling GetError. But really, you don't need to store errInfo separately.

Suggested change
- errInfo := waitForBootstrapComplete(ctx, config)
- err = errInfo.GetError()
- if err != nil {
+ if err := waitForBootstrapComplete(ctx, config); err != nil {

@rna-afk rna-afk force-pushed the api_bootstrap_error_message_cleanup branch from 331cdbe to 27273c7 Compare March 30, 2021 19:15
Contributor

Suggested change
- errorInfo error
+ wrappedError error

Contributor

Suggested change
- func (ce clusterCreateError) unwrap() error {
+ func (ce clusterCreateError) Unwrap() error {

Contributor

Define the Error function, too, so that clusterCreateError implements error.

Contributor

If we are returning a pointer, then the receiver for the unwrap and causeDetail functions should be pointers, too.

Contributor

Use err instead of errInfo. It is OK to hide the err declared in the outer scope. I don't know what errInfo means.
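
The scoping behaviour this comment relies on can be shown with a short sketch. `step` and `run` are hypothetical names used only for illustration. An `err` declared in the init clause of an `if` statement exists only inside that statement, so it hides the outer `err` without overwriting it.

```go
package main

import (
	"errors"
	"fmt"
)

// step is a hypothetical operation that fails on request.
func step(fail bool) error {
	if fail {
		return errors.New("step failed")
	}
	return nil
}

// run returns the outer err and whether the inner, block-scoped err was non-nil.
func run() (error, bool) {
	err := step(false) // outer err: nil
	sawInner := false
	if err := step(true); err != nil {
		// This err is a new variable scoped to the if statement. It hides
		// the outer err here but does not modify it.
		sawInner = true
	}
	return err, sawInner
}

func main() {
	outer, sawInner := run()
	fmt.Println("outer err:", outer)
	fmt.Println("inner err was non-nil:", sawInner)
}
```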

@rna-afk rna-afk force-pushed the api_bootstrap_error_message_cleanup branch from 27273c7 to 0d7f8d9 Compare March 30, 2021 21:12
@staebler staebler left a comment

@rna-afk This looks good. Could you show me the last 20 lines or so of the installer log for the two scenarios germane to this PR?

  1. the temporary control plane does not come up
  2. the bootstrapping does not complete afterwards

rna-afk commented Mar 30, 2021

Seems to work, here are the two error messages

[anarayan@ip-10-0-5-26 ~]$ ./openshift-install create cluster --dir=test2
INFO Consuming Install Config from target directory 
INFO Obtaining RHCOS image file from 'https://releases-art-rhcos.svc.ci.openshift.org/art/storage/releases/rhcos-4.8/48.83.202103221318-0/x86_64/rhcos-48.83.202103221318-0-vmware.x86_64.ova?sha256=' 
INFO Creating infrastructure resources...         
INFO Waiting up to 1m0s for the Kubernetes API at https://api.anarayan.vmc.devcluster.openshift.com:6443... 
ERROR Attempted to gather ClusterOperator status after installation failure: listing ClusterOperator objects: Get "https://api.anarayan.vmc.devcluster.openshift.com:6443/apis/config.openshift.io/v1/clusteroperators": dial tcp 172.31.250.130:6443: connect: no route to host 
ERROR failed to lookup master.0 ipv4 address: Post "https://vcenter.sddc-44-236-21-251.vmwarevmc.com/sdk": context deadline exceeded 
INFO Pulling debug logs from the bootstrap machine 
INFO Bootstrap gather logs captured here "/home/anarayan/test2/log-bundle-20210330232242.tar.gz" 
ERROR Bootstrap failed to complete: Get "https://api.anarayan.vmc.devcluster.openshift.com:6443/version?timeout=32s": dial tcp 172.31.250.130:6443: connect: no route to host 
ERROR failed waiting for Kubernetes API, this error usually happens when there is a problem on the bootstrap host that prevents creating a temporary control plane 
FATAL Bootstrap failed to complete   
[anarayan@ip-10-0-5-26 ~]$ ./openshift-install create cluster --dir=test2
INFO Consuming Install Config from target directory 
INFO Obtaining RHCOS image file from 'https://releases-art-rhcos.svc.ci.openshift.org/art/storage/releases/rhcos-4.8/48.83.202103221318-0/x86_64/rhcos-48.83.202103221318-0-vmware.x86_64.ova?sha256=' 
INFO The file was found in cache: /home/anarayan/.cache/openshift-installer/image_cache/rhcos-48.83.202103221318-0-vmware.x86_64.ova. Reusing... 
INFO Creating infrastructure resources...         
INFO Waiting up to 20m0s for the Kubernetes API at https://api.anarayan.vmc.devcluster.openshift.com:6443... 
INFO API v1.20.0-1099+93f62caeaf322e-dirty up     
INFO Waiting up to 1m0s for bootstrapping to complete... 
INFO Pulling debug logs from the bootstrap machine 
INFO Bootstrap gather logs captured here "/home/anarayan/test2/log-bundle-20210330234013.tar.gz" 
ERROR Bootstrap failed to complete: timed out waiting for the condition 
ERROR failed to wait for bootstrapping to complete, this error usually happens when there is a problem with control plane hosts that prevents the control plane operators from creating the control plane. 
FATAL Bootstrap failed to complete 

@staebler

Seems to work, here are the two error messages

Wonderful. Thank you. Can you capitalize the start of the log message and split it into sentences?
For example:

Failed waiting for Kubernetes API. This error usually happens when there is a problem on the bootstrap host that prevents creating a temporary control plane.

@rna-afk rna-afk force-pushed the api_bootstrap_error_message_cleanup branch from 0d7f8d9 to bf729cd Compare March 30, 2021 23:51
@staebler staebler left a comment

/lgtm

@openshift-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: staebler

@openshift-ci-robot openshift-ci-robot added lgtm Indicates that a PR is ready to be merged. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Mar 31, 2021
@staebler staebler left a comment

/lgtm cancel

Contributor

Oops. Left this one in from your testing.

@rna-afk rna-afk Mar 31, 2021

Ah sorry my bad

Contributor

No worries. It happens to everyone.

@openshift-ci-robot openshift-ci-robot removed the lgtm Indicates that a PR is ready to be merged. label Mar 31, 2021
@rna-afk rna-afk force-pushed the api_bootstrap_error_message_cleanup branch from bf729cd to cad45bd Compare March 31, 2021 01:11
@staebler staebler left a comment

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Mar 31, 2021
@openshift-bot

/retest

Please review the full test history for this PR and help us cut down flakes.

openshift-ci bot commented Mar 31, 2021

@rna-afk: The following tests failed, say /retest to rerun all failed tests:

Test name                      Commit   Details  Rerun command
ci/prow/e2e-aws-fips           cad45bd  link     /test e2e-aws-fips
ci/prow/e2e-crc                cad45bd  link     /test e2e-crc
ci/prow/e2e-aws-workers-rhel7  cad45bd  link     /test e2e-aws-workers-rhel7

@staebler

These changes have no effect on upgrades. They only affect failed installations.
/override ci/prow/e2e-aws-upgrade
/skip

@openshift-ci-robot

@staebler: Overrode contexts on behalf of staebler: ci/prow/e2e-aws-upgrade

@openshift-merge-robot openshift-merge-robot merged commit 8c9ffa4 into openshift:master Mar 31, 2021
wking added a commit to wking/openshift-installer that referenced this pull request Feb 4, 2022
… gather logs

Adding a conditional that we overlooked in fdb04a7 (cmd: diagnose
problems downloading release image, 2021-03-13, openshift#4751).  This will
avoid distraction like [1]:

  level=error msg=Attempted to gather debug logs after installation failure: bootstrap host address and at least one control plane host address must be provided
  ...
  level=error msg=Attempted to analyze the debug logs after installation failure: could not open the gather bundle: open : no such file or directory

where the first line usefully explains that we failed to gather the
log bundle, while the second line uselessly adds that without a log
bundle, there can be no log bundle analysis.

== Order of operations

When they landed in cad45bd (installer-create: Provide user
friendly error messages during failures, 2021-03-29, openshift#4800), the
installer-client's bootstrap error logs were after the gather attempt,
and right before the fatal "Bootstrap failed to complete" bail-out.

fdb04a7 (cmd: diagnose problems downloading release image,
2021-03-13, openshift#4751) landed later, adding an attempt to analyze the
gathered logs.  At this point, the installer-client's bootstrap error
logging sat in between the gather attempt and the gather-analysis
attempt, which seems like unnecessary context switching.

Matthew suggested [2]:

1. Perform bootstrap wait and store error.
2. Perform bootstrap gather, and print any error.
3. Print failing console operators.
4. Print error from bootstrap wait.
5. If no error from bootstrap gather, perform analysis of gather bundle.
6. If no error from bootstrap gather, print out location of gather bundle.
7. Print generic message that bootstrapping failed.

so that's what I've gone with here.
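
The seven steps above can be sketched as follows. This is an illustration under stated assumptions, not the installer's code: the `steps` struct, its function fields, `handleBootstrap`, and `errExample` are hypothetical, and most of the message strings are invented placeholders. Only the ordering and the "skip 5 and 6 when the gather failed" conditionals are taken from the list.

```go
package main

import (
	"errors"
	"fmt"
)

// errExample is a hypothetical failure used to drive the sketch.
var errExample = errors.New("timed out waiting for the condition")

// steps holds hypothetical stand-ins for the installer operations. The real
// functions have different names and signatures.
type steps struct {
	waitForBootstrap func() error
	gatherBootstrap  func() (bundlePath string, err error)
	failingOperators func() []string
	analyzeBundle    func(bundlePath string) error
}

// handleBootstrap follows the seven-step order and returns, in order, the
// messages it would log. A real command would log them and exit at step 7.
func handleBootstrap(s steps) []string {
	var out []string

	// 1. Perform bootstrap wait and store error.
	waitErr := s.waitForBootstrap()
	if waitErr == nil {
		return out
	}

	// 2. Perform bootstrap gather, and print any error. The gather error gets
	// its own name because steps 5 and 6 depend on it later.
	bundle, gatherErr := s.gatherBootstrap()
	if gatherErr != nil {
		out = append(out, fmt.Sprintf("ERROR gather failed: %v", gatherErr))
	}

	// 3. Print failing console operators.
	for _, op := range s.failingOperators() {
		out = append(out, "ERROR operator failing: "+op)
	}

	// 4. Print error from bootstrap wait.
	out = append(out, fmt.Sprintf("ERROR Bootstrap failed to complete: %v", waitErr))

	if gatherErr == nil {
		// 5. Analyze the bundle only if the gather produced one.
		if err := s.analyzeBundle(bundle); err != nil {
			out = append(out, fmt.Sprintf("ERROR analysis failed: %v", err))
		}
		// 6. Print the bundle location only if the gather produced one.
		out = append(out, "INFO Bootstrap gather logs captured here "+bundle)
	}

	// 7. Print generic message that bootstrapping failed.
	out = append(out, "FATAL Bootstrap failed to complete")
	return out
}

func main() {
	s := steps{
		waitForBootstrap: func() error { return errExample },
		gatherBootstrap:  func() (string, error) { return "log-bundle.tar.gz", nil },
		failingOperators: func() []string { return []string{"example-operator"} },
		analyzeBundle:    func(string) error { return nil },
	}
	for _, line := range handleBootstrap(s) {
		fmt.Println(line)
	}
}
```

Returning the messages instead of logging them is only a convenience for the sketch; it keeps the ordering visible without exiting the process.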

In order to get step 6 that late, I had to pull it out of
logGatherBootstrap (and now that that function no longer logs on
success, I renamed it to gatherBootstrap).  Now the
newGatherBootstrapCmd function includes its own explicit logging of
the successful gather path, and we can place the create-cluster
logging down at 6 where Matthew wants it.

== Error scoping and naming

There's also no need to pick a unique name for this block-scoped
variable.  We can use the usual 'err' without clobbering the local
'err' that's scoped to the waitForBootstrapComplete block.

We only need a specific name for gatherError because we need it to
feed the deferred conditionals for 5 and 6.

[1]: https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.10-e2e-ovirt-ovn/1486127081728249856#1:build-log.txt%3A50
[2]: openshift#5582 (comment)