Conversation

@wking wking commented Jan 26, 2022

Adding a conditional that we overlooked in fdb04a7 (#4751). This will avoid distractions like:

level=error msg=Attempted to gather debug logs after installation failure: bootstrap host address and at least one control plane host address must be provided
...
level=error msg=Attempted to analyze the debug logs after installation failure: could not open the gather bundle: open : no such file or directory

where the first line usefully explains that we failed to gather the log bundle, while the second line uselessly adds that without a log bundle, there can be no log bundle analysis.
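
The shape of the fix, in a minimal self-contained sketch (gatherBootstrap and analyzeGatherBundle are hypothetical stand-ins; the real wiring in cmd/openshift-install/create.go differs), is to skip the analysis attempt entirely whenever the gather itself failed:

package main

import "github.com/sirupsen/logrus"

// Hypothetical stand-ins for the installer's helpers.
func gatherBootstrap() (string, error)      { return "", nil }
func analyzeGatherBundle(path string) error { return nil }

func main() {
    bundlePath, err := gatherBootstrap()
    if err != nil {
        logrus.Error("Attempted to gather debug logs after installation failure: ", err)
        return // no bundle, so there is nothing to analyze
    }
    if err := analyzeGatherBundle(bundlePath); err != nil {
        logrus.Error("Attempted to analyze the debug logs after installation failure: ", err)
    }
}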

In a separate commit, I'm shifting #4800's:

logrus.Error("Bootstrap failed to complete: ", err.Unwrap())
logrus.Error(err.Error())

to move them out from between the gather and gather-analysis commands. There are a few options for where to move them to, and I explain my motivation for my choice in f6db6b84d030, but I have no problem with folks telling me they disagree and where I should put them instead, as long as it's not "right between the two gather steps" ;)

@openshift-ci openshift-ci bot requested review from jhixson74 and rna-afk January 26, 2022 20:10
wking commented Jan 26, 2022

e2e-aws conveniently failed to bootstrap:

level=info msg=Waiting up to 30m0s (until 9:14PM) for bootstrapping to complete...
level=info msg=Cluster operator baremetal Disabled is True with UnsupportedPlatform: Nothing to do on this Platform
level=info msg=Cluster operator cloud-controller-manager TrustedCABundleControllerControllerAvailable is True with AsExpected: Trusted CA Bundle Controller works as expected
level=info msg=Cluster operator cloud-controller-manager TrustedCABundleControllerControllerDegraded is False with AsExpected: Trusted CA Bundle Controller works as expected
level=info msg=Cluster operator cloud-controller-manager CloudConfigControllerAvailable is True with AsExpected: Cloud Config Controller works as expected
level=info msg=Cluster operator cloud-controller-manager CloudConfigControllerDegraded is False with AsExpected: Cloud Config Controller works as expected
level=info msg=Cluster operator etcd RecentBackup is Unknown with ControllerStarted: 
level=error msg=Cluster operator etcd Degraded is True with ClusterMemberController_Error::DefragController_Error::EtcdMembers_UnhealthyMembers::StaticPods_Error: ClusterMemberControllerDegraded: unhealthy members found during reconciling members
level=error msg=DefragControllerDegraded: cluster is unhealthy: 2 of 3 members are available, ip-10-0-150-63.ec2.internal is unhealthy
level=error msg=EtcdMembersDegraded: 2 of 3 members are available, ip-10-0-150-63.ec2.internal is unhealthy
level=error msg=StaticPodsDegraded: pod/etcd-ip-10-0-143-150.ec2.internal container "etcd-health-monitor" is waiting: CrashLoopBackOff: back-off 5m0s restarting failed container=etcd-health-monitor pod=etcd-ip-10-0-143-150.ec2.internal_openshift-etcd(5f5683fe817facb50c9b529918518822)
level=error msg=StaticPodsDegraded: pod/etcd-ip-10-0-150-63.ec2.internal container "etcd" started at 2022-01-26 20:49:21 +0000 UTC is still not ready
level=info msg=Cluster operator etcd Progressing is True with NodeInstaller: NodeInstallerProgressing: 2 nodes are at revision 0; 1 nodes are at revision 2; 0 nodes have achieved new revision 6
level=info msg=Cluster operator insights Disabled is False with AsExpected: 
level=info msg=Cluster operator insights SCANotAvailable is True with Forbidden: Failed to pull SCA certs from https://api.openshift.com/api/accounts_mgmt/v1/certificates: OCM API https://api.openshift.com/api/accounts_mgmt/v1/certificates returned HTTP 403: {"id":"11","kind":"Error","href":"/api/accounts_mgmt/v1/errors/11","code":"ACCT-MGMT-11","reason":"Account with ID 1CzKMWTzDbeN7jOMmO5Ar1Y3Ai1 denied access to perform create on Certificate with HTTP call POST /api/accounts_mgmt/v1/certificates","operation_id":"4de6e9e5-68c2-408e-9f65-59bc1853ec20"}
level=info msg=Cluster operator network ManagementStateDegraded is False with : 
level=error msg=Bootstrap failed to complete: timed out waiting for the condition
level=error msg=Failed to wait for bootstrapping to complete. This error usually happens when there is a problem with control plane hosts that prevents the control plane operators from creating the control plane.
level=info msg=Pulling debug logs from the bootstrap machine
level=info msg=Bootstrap gather logs captured here "/tmp/installer/log-bundle-20220126211454.tar.gz"
level=fatal msg=Bootstrap failed to complete

So that's the order I'm aiming for, although in this run the analysis of the gathered logs does not diagnose the root cause, which is likely the etcd member issue (rhbz#2040533).

@staebler left a comment

The second commit is non-controversial and looks good to me.

I am not sure about the first commit. I want anything that tells the user in words what the problem may have been with their installation to be as close to the end as possible. If BZ reports are any indication, the end user frequently does not look beyond the last few lines.

wking commented Feb 3, 2022

I want anything that tells the user in words what the problem may have been with their installation to be as close to the end as possible.

It's hard to know which section will be the most helpful. One possibility would be to attempt the gather, also attempt the ClusterOperator collection, and feed both into an analyzer that could prefer the one that was most useful. For example, say the bootstrap machine is pretty dead, and both resources fail to gather. The failing SSH is likely more informative, because SSH has a shallower stack on top of the OS vs. the bootstrap Kube-API server. But if the user had a misconfigured SSH agent, then the Kube-API server error is probably more informative. Do we want something that makes decisions at that level of granularity (distinguishing SSH errors between "failed to connect" and "failed to auth")? Do we just want to guess some order that seems like it will be close enough often enough? I'm completely happy to just set things up in whatever order you like, if you have a preference.
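
To make the "prefer the most useful failure" idea concrete, here is a purely hypothetical sketch; none of these names exist in the installer, and isLocalSSHConfigError stands in for whatever logic would distinguish "failed to auth" from "failed to connect":

package main

import (
    "errors"

    "github.com/sirupsen/logrus"
)

// isLocalSSHConfigError is a made-up predicate; it would return true for
// failures on the user's side, like a misconfigured SSH agent.
func isLocalSSHConfigError(err error) bool { return false }

func preferDiagnostic(sshGatherErr, clusterOperatorErr error) error {
    // A gather failure usually sits closer to the root cause, unless the
    // failure was local to the user's SSH setup.
    if sshGatherErr != nil && !isLocalSSHConfigError(sshGatherErr) {
        return sshGatherErr
    }
    if clusterOperatorErr != nil {
        return clusterOperatorErr
    }
    return sshGatherErr
}

func main() {
    err := preferDiagnostic(errors.New("dial tcp: connect: connection refused"), nil)
    logrus.Error("most useful failure: ", err)
}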

@wking wking force-pushed the only-analyze-on-successful-gather branch 3 times, most recently from 4ed34f6 to a26d649 on February 3, 2022 21:18
@wking wking changed the title from "cmd/openshift-install/create: Do not attempt analyis when we fail to gather logs" to "cmd/openshift-install/create: Do not attempt analysis when we fail to gather logs" on Feb 3, 2022
@wking wking force-pushed the only-analyze-on-successful-gather branch from a26d649 to 6f19fad on February 3, 2022 22:17
staebler commented Feb 4, 2022

The failing SSH is likely more informative, because SSH has a shallower stack on top of the OS vs. the bootstrap Kube-API server. But if the user had a misconfigured SSH agent, then the Kube-API server error is probably more informative.

My experience has been that the cause of a failing SSH is much more likely to be a misconfigured SSH agent than a bootstrap node that is not accessible. But that experience is biased towards cloud platforms, where a failing bootstrap VM is rare.

It appears that there is a lot of restructuring and deeper analysis that could happen here. But without that, I don't see much value in moving the output of the bootstrap gather error to after the bootstrap failure error.

What do you think of the following?

  1. Perform bootstrap wait and store error.
  2. Perform bootstrap gather, and print any error.
  3. Print failing console operators.
  4. Print error from bootstrap wait.
  5. If no error from bootstrap gather, perform analysis of gather bundle.
  6. If no error from bootstrap gather, print out location of gather bundle.
  7. Print generic message that bootstrapping failed.
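
Sketched in Go, that ordering might look roughly like the following; every helper here is an illustrative stand-in, not the installer's actual API:

package main

import "github.com/sirupsen/logrus"

// Illustrative stand-ins only.
func waitForBootstrapComplete() error       { return nil }
func gatherBootstrap() (string, error)      { return "", nil }
func logClusterOperatorConditions()         {}
func analyzeGatherBundle(path string) error { return nil }

func main() {
    waitErr := waitForBootstrapComplete() // 1. wait and store the error
    if waitErr == nil {
        return
    }
    bundlePath, gatherErr := gatherBootstrap() // 2. gather, printing any error
    if gatherErr != nil {
        logrus.Error("Attempted to gather debug logs after installation failure: ", gatherErr)
    }
    logClusterOperatorConditions()                          // 3. failing operators
    logrus.Error("Bootstrap failed to complete: ", waitErr) // 4. the wait error
    if gatherErr == nil {
        if err := analyzeGatherBundle(bundlePath); err != nil { // 5. analysis
            logrus.Error("Attempted to analyze the debug logs after installation failure: ", err)
        }
        logrus.Infof("Bootstrap gather logs captured here %q", bundlePath) // 6. bundle location
    }
    logrus.Fatal("Bootstrap failed to complete") // 7. generic failure message
}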

@wking wking force-pushed the only-analyze-on-successful-gather branch from 6f19fad to 028ffd6 on February 4, 2022 04:22
wking commented Feb 4, 2022

What do you think of the following...

Works for me. Updated to match with 6f19fad53 -> aa8de21e6. I needed to touch some more places to get your step 6 in that location, and trying to break that down into discrete pivots no longer felt right to me, so it's all one big commit now, with an extended commit message discussion :p.

@wking wking force-pushed the only-analyze-on-successful-gather branch from 028ffd6 to aa8de21 on February 4, 2022 04:26
cmd/openshift-install/create: Do not attempt analysis when we fail to gather logs

Adding a conditional that we overlooked in fdb04a7 (cmd: diagnose
problems downloading release image, 2021-03-13, openshift#4751).  This will
avoid distractions like [1]:

  level=error msg=Attempted to gather debug logs after installation failure: bootstrap host address and at least one control plane host address must be provided
  ...
  level=error msg=Attempted to analyze the debug logs after installation failure: could not open the gather bundle: open : no such file or directory

where the first line usefully explains that we failed to gather the
log bundle, while the second line uselessly adds that without a log
bundle, there can be no log bundle analysis.

== Order of operations

When they landed in cad45bd (installer-create: Provide user
friendly error messages during failures, 2021-03-29, openshift#4800), the
installer-client's bootstrap error logs were after the gather attempt,
and right before the fatal "Bootstrap failed to complete" bail-out.

fdb04a7 (cmd: diagnose problems downloading release image,
2021-03-13, openshift#4751) landed later, adding an attempt to analyze the
gathered logs.  At this point, the installer-client's bootstrap error
logging sat in between the gather attempt and the gather-analysis
attempt, which seems like unnecessary context switching.

Matthew suggested [2]:

1. Perform bootstrap wait and store error.
2. Perform bootstrap gather, and print any error.
3. Print failing console operators.
4. Print error from bootstrap wait.
5. If no error from bootstrap gather, perform analysis of gather bundle.
6. If no error from bootstrap gather, print out location of gather bundle.
7. Print generic message that bootstrapping failed.

so that's what I've gone with here.

In order to get step 6 that late, I had to pull it out of
logGatherBootstrap (and now that that function no longer logs on
success, I renamed it to gatherBootstrap).  Now the
newGatherBootstrapCmd function includes its own explicit logging of
the successful gather path, and we can place the create-cluster
logging down at 6 where Matthew wants it.

== Error scoping and naming

There's also no need to pick a unique name for this block-scoped
variable.  We can use the usual 'err' without clobbering the local
'err' that's scoped to the waitForBootstrapComplete block.

We only need a specific name for gatherError because we need it to
feed the deferred conditionals for 5 and 6.

[1]: https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.10-e2e-ovirt-ovn/1486127081728249856#1:build-log.txt%3A50
[2]: openshift#5582 (comment)
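
As a tiny standalone illustration of the Go scoping rule the commit message leans on (entirely separate from the installer's code):

package main

import "github.com/sirupsen/logrus"

func doSomething() error     { return nil }
func doSomethingElse() error { return nil }

func main() {
    if err := doSomething(); err != nil { // 'err' is scoped to this if-block
        if err := doSomethingElse(); err != nil { // a fresh inner 'err'
            logrus.Error(err) // logs the inner error
        }
        logrus.Error(err) // still the outer error: shadowed, not clobbered
    }
}
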
@wking wking force-pushed the only-analyze-on-successful-gather branch from aa8de21 to 08e8013 on February 4, 2022 05:54
@staebler left a comment

Looks good to me. I want to check the real output from a failed bootstrapping, then I'll give this a lgtm.
/approve

openshift-ci bot commented Feb 4, 2022

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: staebler


@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 4, 2022
staebler commented Feb 4, 2022

Here is the output from an ovirt install that failed to bootstrap. There is a known issue with running the bootstrap gather on ovirt, so this output demonstrates the experience when the bootstrap gather fails.

https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_installer/5582/pull-ci-openshift-installer-master-e2e-ovirt/1489477515750674432

level=info msg=Waiting up to 30m0s (until 6:58AM) for bootstrapping to complete...
level=error msg=Attempted to gather debug logs after installation failure: bootstrap host address and at least one control plane host address must be provided
level=error msg=Bootstrap failed to complete: timed out waiting for the condition
level=error msg=Failed to wait for bootstrapping to complete. This error usually happens when there is a problem with control plane hosts that prevents the control plane operators from creating the control plane.
level=fatal msg=Bootstrap failed to complete 

It's a little wordy with the last three messages all saying that the bootstrap failed, but I can live with that until we can do a more in-depth and thorough consolidation.

staebler commented Feb 4, 2022

Here is an output from an installation that failed to connect to the temporary control plane.

https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_installer/5582/pull-ci-openshift-installer-master-okd-e2e-aws/1489477520658010112

level=info msg=Waiting up to 20m0s (until 6:44AM) for the Kubernetes API at https://api.ci-op-xdwi90fr-8b4ed.origin-ci-int-aws.dev.rhcloud.com:6443/...
level=info msg=Pulling debug logs from the bootstrap machine
level=error msg=Attempted to gather ClusterOperator status after installation failure: listing ClusterOperator objects: Get "https://api.ci-op-xdwi90fr-8b4ed.origin-ci-int-aws.dev.rhcloud.com:6443/apis/config.openshift.io/v1/clusteroperators": dial tcp 52.86.144.153:6443: connect: connection refused
level=error msg=Bootstrap failed to complete: Get "https://api.ci-op-xdwi90fr-8b4ed.origin-ci-int-aws.dev.rhcloud.com:6443/version": dial tcp 52.86.144.153:6443: connect: connection refused
level=error msg=Failed waiting for Kubernetes API. This error usually happens when there is a problem on the bootstrap host that prevents creating a temporary control plane.
level=info msg=Bootstrap gather logs captured here "/tmp/installer/log-bundle-20220204064451.tar.gz"
level=fatal msg=Bootstrap failed to complete

The ordering looks good to me.
/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Feb 4, 2022
@openshift-bot commented

/retest-required

Please review the full test history for this PR and help us cut down flakes.

openshift-ci bot commented Feb 6, 2022

@wking: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name | Commit | Required | Rerun command
ci/prow/e2e-alibaba | 38c1aec033ed7c1696051f23c05d2f278782f1fd | true | /test e2e-alibaba
ci/prow/e2e-ibmcloud | 08e8013 | false | /test e2e-ibmcloud
ci/prow/okd-e2e-aws-upgrade | 08e8013 | false | /test okd-e2e-aws-upgrade
ci/prow/e2e-aws-workers-rhel7 | 08e8013 | false | /test e2e-aws-workers-rhel7
ci/prow/okd-e2e-aws | 08e8013 | false | /test okd-e2e-aws
ci/prow/e2e-libvirt | 08e8013 | false | /test e2e-libvirt
ci/prow/e2e-aws-workers-rhel8 | 08e8013 | false | /test e2e-aws-workers-rhel8
ci/prow/e2e-metal-ipi-ovn-ipv6 | 08e8013 | false | /test e2e-metal-ipi-ovn-ipv6
ci/prow/e2e-azure-upi | 08e8013 | false | /test e2e-azure-upi
ci/prow/e2e-ovirt | 08e8013 | false | /test e2e-ovirt
ci/prow/e2e-crc | 08e8013 | false | /test e2e-crc



@openshift-merge-robot openshift-merge-robot merged commit 3f318d7 into openshift:master Feb 6, 2022
@wking wking deleted the only-analyze-on-successful-gather branch February 7, 2022 17:52