Conversation

@wking wking commented Jan 26, 2022

Adding a conditional that we overlooked in fdb04a7 (#4751). This will avoid distractions like:

level=error msg=Attempted to gather debug logs after installation failure: bootstrap host address and at least one control plane host address must be provided
...
level=error msg=Attempted to analyze the debug logs after installation failure: could not open the gather bundle: open : no such file or directory

where the first line usefully explains that we failed to gather the log bundle, while the second line uselessly adds that without a log bundle, there can be no log bundle analysis.
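
The shape of the fix, in a minimal self-contained sketch (gatherBootstrap and analyzeGatherBundle are hypothetical stand-ins; the real wiring in cmd/openshift-install/create.go differs), is to skip the analysis attempt entirely whenever the gather itself failed:

package main

import "github.com/sirupsen/logrus"

// Hypothetical stand-ins for the installer's helpers.
func gatherBootstrap() (string, error)      { return "", nil }
func analyzeGatherBundle(path string) error { return nil }

func main() {
    bundlePath, err := gatherBootstrap()
    if err != nil {
        logrus.Error("Attempted to gather debug logs after installation failure: ", err)
        return // no bundle, so there is nothing to analyze
    }
    if err := analyzeGatherBundle(bundlePath); err != nil {
        logrus.Error("Attempted to analyze the debug logs after installation failure: ", err)
    }
}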

In a separate commit, I'm shifting #4800's:

logrus.Error("Bootstrap failed to complete: ", err.Unwrap())
logrus.Error(err.Error())

to move them out from between the gather and gather-analysis commands. There are a few options for where to move them to, and I explain my motivation for my choice in f6db6b84d030, but I have no problem with folks telling me they disagree and where I should put them instead, as long as it's not "right between the two gather steps" ;)

@openshift-ci openshift-ci bot requested review from jhixson74 and rna-afk January 26, 2022 20:10
wking commented Jan 26, 2022

e2e-aws conveniently failed to bootstrap:

level=info msg=Waiting up to 30m0s (until 9:14PM) for bootstrapping to complete...
level=info msg=Cluster operator baremetal Disabled is True with UnsupportedPlatform: Nothing to do on this Platform
level=info msg=Cluster operator cloud-controller-manager TrustedCABundleControllerControllerAvailable is True with AsExpected: Trusted CA Bundle Controller works as expected
level=info msg=Cluster operator cloud-controller-manager TrustedCABundleControllerControllerDegraded is False with AsExpected: Trusted CA Bundle Controller works as expected
level=info msg=Cluster operator cloud-controller-manager CloudConfigControllerAvailable is True with AsExpected: Cloud Config Controller works as expected
level=info msg=Cluster operator cloud-controller-manager CloudConfigControllerDegraded is False with AsExpected: Cloud Config Controller works as expected
level=info msg=Cluster operator etcd RecentBackup is Unknown with ControllerStarted: 
level=error msg=Cluster operator etcd Degraded is True with ClusterMemberController_Error::DefragController_Error::EtcdMembers_UnhealthyMembers::StaticPods_Error: ClusterMemberControllerDegraded: unhealthy members found during reconciling members
level=error msg=DefragControllerDegraded: cluster is unhealthy: 2 of 3 members are available, ip-10-0-150-63.ec2.internal is unhealthy
level=error msg=EtcdMembersDegraded: 2 of 3 members are available, ip-10-0-150-63.ec2.internal is unhealthy
level=error msg=StaticPodsDegraded: pod/etcd-ip-10-0-143-150.ec2.internal container "etcd-health-monitor" is waiting: CrashLoopBackOff: back-off 5m0s restarting failed container=etcd-health-monitor pod=etcd-ip-10-0-143-150.ec2.internal_openshift-etcd(5f5683fe817facb50c9b529918518822)
level=error msg=StaticPodsDegraded: pod/etcd-ip-10-0-150-63.ec2.internal container "etcd" started at 2022-01-26 20:49:21 +0000 UTC is still not ready
level=info msg=Cluster operator etcd Progressing is True with NodeInstaller: NodeInstallerProgressing: 2 nodes are at revision 0; 1 nodes are at revision 2; 0 nodes have achieved new revision 6
level=info msg=Cluster operator insights Disabled is False with AsExpected: 
level=info msg=Cluster operator insights SCANotAvailable is True with Forbidden: Failed to pull SCA certs from https://api.openshift.com/api/accounts_mgmt/v1/certificates: OCM API https://api.openshift.com/api/accounts_mgmt/v1/certificates returned HTTP 403: {"id":"11","kind":"Error","href":"/api/accounts_mgmt/v1/errors/11","code":"ACCT-MGMT-11","reason":"Account with ID 1CzKMWTzDbeN7jOMmO5Ar1Y3Ai1 denied access to perform create on Certificate with HTTP call POST /api/accounts_mgmt/v1/certificates","operation_id":"4de6e9e5-68c2-408e-9f65-59bc1853ec20"}
level=info msg=Cluster operator network ManagementStateDegraded is False with : 
level=error msg=Bootstrap failed to complete: timed out waiting for the condition
level=error msg=Failed to wait for bootstrapping to complete. This error usually happens when there is a problem with control plane hosts that prevents the control plane operators from creating the control plane.
level=info msg=Pulling debug logs from the bootstrap machine
level=info msg=Bootstrap gather logs captured here "/tmp/installer/log-bundle-20220126211454.tar.gz"
level=fatal msg=Bootstrap failed to complete

So that's the order I'm aiming for, although in this run the analysis of the gathered logs does not diagnose the root cause, which is likely the etcd member issue (rhbz#2040533).

@staebler left a comment

The second commit is non-controversial and looks good to me.

I am not sure about the first commit. I want anything that tells the user in words what the problem may have been with their installation to be as close to the end as possible. If BZ reports are any indication, the end user frequently does not look beyond the last few lines.

wking commented Feb 3, 2022

I want anything that tells the user in words what the problem may have been with their installation to be as close to the end as possible.

It's hard to know which section will be the most helpful. One possibility would be to attempt the gather, also attempt the ClusterOperator collection, and feed both into an analyzer that could prefer the one that was most useful. For example, say the bootstrap machine is pretty dead, and both resources fail to gather. The failing SSH is likely more informative, because SSH has a shallower stack on top of the OS vs. the bootstrap Kube-API server. But if the user had a misconfigured SSH agent, then the Kube-API server error is probably more informative. Do we want something that makes decisions at that level of granularity (distinguishing SSH errors between "failed to connect" and "failed to auth")? Do we just want to guess some order that seems like it will be close enough often enough? I'm completely happy to just set things up in whatever order you like, if you have a preference.
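
To make the "prefer the most useful failure" idea concrete, here is a purely hypothetical sketch; none of these names exist in the installer, and isLocalSSHConfigError stands in for whatever logic would distinguish "failed to auth" from "failed to connect":

package main

import (
    "errors"

    "github.com/sirupsen/logrus"
)

// isLocalSSHConfigError is a made-up predicate; it would return true for
// failures on the user's side, like a misconfigured SSH agent.
func isLocalSSHConfigError(err error) bool { return false }

func preferDiagnostic(sshGatherErr, clusterOperatorErr error) error {
    // A gather failure usually sits closer to the root cause, unless the
    // failure was local to the user's SSH setup.
    if sshGatherErr != nil && !isLocalSSHConfigError(sshGatherErr) {
        return sshGatherErr
    }
    if clusterOperatorErr != nil {
        return clusterOperatorErr
    }
    return sshGatherErr
}

func main() {
    err := preferDiagnostic(errors.New("dial tcp: connect: connection refused"), nil)
    logrus.Error("most useful failure: ", err)
}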

@wking wking force-pushed the only-analyze-on-successful-gather branch 3 times, most recently from 4ed34f6 to a26d649 on February 3, 2022 21:18
@wking wking changed the title from "cmd/openshift-install/create: Do not attempt analyis when we fail to gather logs" to "cmd/openshift-install/create: Do not attempt analysis when we fail to gather logs" on Feb 3, 2022
@wking wking force-pushed the only-analyze-on-successful-gather branch from a26d649 to 6f19fad on February 3, 2022 22:17
staebler commented Feb 4, 2022

The failing SSH is likely more informative, because SSH has a shallower stack on top of the OS vs. the bootstrap Kube-API server. But if the user had a misconfigured SSH agent, then the Kube-API server error is probably more informative.

My experience has been that the cause of a failing SSH is much more likely to be a misconfigured SSH agent than a bootstrap node that is not accessible. But that experience is biased towards cloud platforms, where a failing bootstrap VM is rare.

It appears that there is a lot of restructuring and deeper analysis that could happen here. But without that, I don't see much value in moving the output of the bootstrap gather error to after the bootstrap failure error.

What do you think of the following?

  1. Perform bootstrap wait and store error.
  2. Perform bootstrap gather, and print any error.
  3. Print failing console operators.
  4. Print error from bootstrap wait.
  5. If no error from bootstrap gather, perform analysis of gather bundle.
  6. If no error from bootstrap gather, print out location of gather bundle.
  7. Print generic message that bootstrapping failed.
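
Sketched in Go, that ordering might look roughly like the following; every helper here is an illustrative stand-in, not the installer's actual API:

package main

import "github.com/sirupsen/logrus"

// Illustrative stand-ins only.
func waitForBootstrapComplete() error       { return nil }
func gatherBootstrap() (string, error)      { return "", nil }
func logClusterOperatorConditions()         {}
func analyzeGatherBundle(path string) error { return nil }

func main() {
    waitErr := waitForBootstrapComplete() // 1. wait and store the error
    if waitErr == nil {
        return
    }
    bundlePath, gatherErr := gatherBootstrap() // 2. gather, printing any error
    if gatherErr != nil {
        logrus.Error("Attempted to gather debug logs after installation failure: ", gatherErr)
    }
    logClusterOperatorConditions()                          // 3. failing operators
    logrus.Error("Bootstrap failed to complete: ", waitErr) // 4. the wait error
    if gatherErr == nil {
        if err := analyzeGatherBundle(bundlePath); err != nil { // 5. analysis
            logrus.Error("Attempted to analyze the debug logs after installation failure: ", err)
        }
        logrus.Infof("Bootstrap gather logs captured here %q", bundlePath) // 6. bundle location
    }
    logrus.Fatal("Bootstrap failed to complete") // 7. generic failure message
}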

@wking wking force-pushed the only-analyze-on-successful-gather branch from 6f19fad to 028ffd6 on February 4, 2022 04:22
wking commented Feb 4, 2022

What do you think of the following...

Works for me. Updated to match with 6f19fad53 -> aa8de21e6. I needed to touch some more places to get your step 6 in that location, and trying to break that down into discrete pivots no longer felt right to me, so it's all one big commit now, with an extended commit message discussion :p.

@wking wking force-pushed the only-analyze-on-successful-gather branch from 028ffd6 to aa8de21 on February 4, 2022 04:26
cmd/openshift-install/create: Do not attempt analysis when we fail to gather logs

Adding a conditional that we overlooked in fdb04a7 (cmd: diagnose
problems downloading release image, 2021-03-13, openshift#4751).  This will
avoid distractions like [1]:

  level=error msg=Attempted to gather debug logs after installation failure: bootstrap host address and at least one control plane host address must be provided
  ...
  level=error msg=Attempted to analyze the debug logs after installation failure: could not open the gather bundle: open : no such file or directory

where the first line usefully explains that we failed to gather the
log bundle, while the second line uselessly adds that without a log
bundle, there can be no log bundle analysis.

== Order of operations

When they landed in cad45bd (installer-create: Provide user
friendly error messages during failures, 2021-03-29, openshift#4800), the
installer-client's bootstrap error logs were after the gather attempt,
and right before the fatal "Bootstrap failed to complete" bail-out.

fdb04a7 (cmd: diagnose problems downloading release image,
2021-03-13, openshift#4751) landed later, adding an attempt to analyze the
gathered logs.  At this point, the installer-client's bootstrap error
logging sat in between the gather attempt and the gather-analysis
attempt, which seems like unnecessary context switching.

Matthew suggested [2]:

1. Perform bootstrap wait and store error.
2. Perform bootstrap gather, and print any error.
3. Print failing console operators.
4. Print error from bootstrap wait.
5. If no error from bootstrap gather, perform analysis of gather bundle.
6. If no error from bootstrap gather, print out location of gather bundle.
7. Print generic message that bootstrapping failed.

so that's what I've gone with here.

In order to get step 6 that late, I had to pull it out of
logGatherBootstrap (and now that that function no longer logs on
success, I renamed it to gatherBootstrap).  Now the
newGatherBootstrapCmd function includes its own explicit logging of
the successful gather path, and we can place the create-cluster
logging down at 6 where Matthew wants it.

== Error scoping and naming

There's also no need to pick a unique name for this block-scoped
variable.  We can use the usual 'err' without clobbering the local
'err' that's scoped to the waitForBootstrapComplete block.

We only need a specific name for gatherError because we need it to
feed the deferred conditionals for 5 and 6.

[1]: https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.10-e2e-ovirt-ovn/1486127081728249856#1:build-log.txt%3A50
[2]: openshift#5582 (comment)
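
As a tiny standalone illustration of the Go scoping rule the commit message leans on (entirely separate from the installer's code):

package main

import "github.com/sirupsen/logrus"

func doSomething() error     { return nil }
func doSomethingElse() error { return nil }

func main() {
    if err := doSomething(); err != nil { // 'err' is scoped to this if-block
        if err := doSomethingElse(); err != nil { // a fresh inner 'err'
            logrus.Error(err) // logs the inner error
        }
        logrus.Error(err) // still the outer error: shadowed, not clobbered
    }
}
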
@wking wking force-pushed the only-analyze-on-successful-gather branch from aa8de21 to 08e8013 on February 4, 2022 05:54
@staebler left a comment

Looks good to me. I want to check the real output from a failed bootstrapping, then I'll give this a lgtm.
/approve

openshift-ci bot commented Feb 4, 2022

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: staebler


@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 4, 2022
staebler commented Feb 4, 2022

Here is the output from an ovirt install that failed to bootstrap. There is a known issue with running the bootstrap gather on ovirt, so this output demonstrates the experience when the bootstrap gather fails.

https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_installer/5582/pull-ci-openshift-installer-master-e2e-ovirt/1489477515750674432

level=info msg=Waiting up to 30m0s (until 6:58AM) for bootstrapping to complete...
level=error msg=Attempted to gather debug logs after installation failure: bootstrap host address and at least one control plane host address must be provided
level=error msg=Bootstrap failed to complete: timed out waiting for the condition
level=error msg=Failed to wait for bootstrapping to complete. This error usually happens when there is a problem with control plane hosts that prevents the control plane operators from creating the control plane.
level=fatal msg=Bootstrap failed to complete 

It's a little wordy with the last three messages all saying that the bootstrap failed, but I can live with that until we can do a more in-depth and thorough consolidation.

staebler commented Feb 4, 2022

Here is an output from an installation that failed to connect to the temporary control plane.

https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_installer/5582/pull-ci-openshift-installer-master-okd-e2e-aws/1489477520658010112

level=info msg=Waiting up to 20m0s (until 6:44AM) for the Kubernetes API at https://api.ci-op-xdwi90fr-8b4ed.origin-ci-int-aws.dev.rhcloud.com:6443/...
level=info msg=Pulling debug logs from the bootstrap machine
level=error msg=Attempted to gather ClusterOperator status after installation failure: listing ClusterOperator objects: Get "https://api.ci-op-xdwi90fr-8b4ed.origin-ci-int-aws.dev.rhcloud.com:6443/apis/config.openshift.io/v1/clusteroperators": dial tcp 52.86.144.153:6443: connect: connection refused
level=error msg=Bootstrap failed to complete: Get "https://api.ci-op-xdwi90fr-8b4ed.origin-ci-int-aws.dev.rhcloud.com:6443/version": dial tcp 52.86.144.153:6443: connect: connection refused
level=error msg=Failed waiting for Kubernetes API. This error usually happens when there is a problem on the bootstrap host that prevents creating a temporary control plane.
level=info msg=Bootstrap gather logs captured here "/tmp/installer/log-bundle-20220204064451.tar.gz"
level=fatal msg=Bootstrap failed to complete

The ordering looks good to me.
/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Feb 4, 2022
@openshift-bot commented

/retest-required

Please review the full test history for this PR and help us cut down flakes.

openshift-ci bot commented Feb 6, 2022

@wking: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name | Commit | Required | Rerun command
ci/prow/e2e-alibaba | 38c1aec033ed7c1696051f23c05d2f278782f1fd | true | /test e2e-alibaba
ci/prow/e2e-ibmcloud | 08e8013 | false | /test e2e-ibmcloud
ci/prow/okd-e2e-aws-upgrade | 08e8013 | false | /test okd-e2e-aws-upgrade
ci/prow/e2e-aws-workers-rhel7 | 08e8013 | false | /test e2e-aws-workers-rhel7
ci/prow/okd-e2e-aws | 08e8013 | false | /test okd-e2e-aws
ci/prow/e2e-libvirt | 08e8013 | false | /test e2e-libvirt
ci/prow/e2e-aws-workers-rhel8 | 08e8013 | false | /test e2e-aws-workers-rhel8
ci/prow/e2e-metal-ipi-ovn-ipv6 | 08e8013 | false | /test e2e-metal-ipi-ovn-ipv6
ci/prow/e2e-azure-upi | 08e8013 | false | /test e2e-azure-upi
ci/prow/e2e-ovirt | 08e8013 | false | /test e2e-ovirt
ci/prow/e2e-crc | 08e8013 | false | /test e2e-crc



@openshift-merge-robot openshift-merge-robot merged commit 3f318d7 into openshift:master Feb 6, 2022
@wking wking deleted the only-analyze-on-successful-gather branch February 7, 2022 17:52