-
Notifications
You must be signed in to change notification settings - Fork 1.5k
cmd/openshift-install/create: Do not attempt analysis when we fail to gather logs #5582
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
cmd/openshift-install/create: Do not attempt analysis when we fail to gather logs #5582
Conversation
|
e2e-aws conveniently failed to bootstrap: So that's the order I'm aiming for, although in this run the gathered analysis does not diagnose the root cause, which is likely the etcd member issue (rhbz#2040533). |
staebler
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The second commit is non-controversial and looks good to me.
I am not sure about the first commit. I want anything that tells the user in words what they problem may have been with their installation to be as close to the end as possible. If BZ reports are any indication, the end user frequently does not look beyond the last few lines.
It's hard to know which section will be the most helpful. One possibility would be to attempt the gather, also attempt the ClusterOperator collection, and feed both into an analyzer that could prefer the one that was most useful. For example, say the bootstrap machine is pretty dead, and both resources fail to gather. The failing SSH is likely more informative, because SSH has a shallower stack on top of the OS vs. the bootstrap Kube-API server. But if the user had a misconfigured SSH agent, then the Kube-API server error is probably more informative. Do we want something that makes decisions at that level of granularity (distinguishing SSH errors between "failed to connect" and "failed to auth")? Do we just want to guess some order that seems like it will be close enough often enough? I'm completely happy to just set things up in whatever order you like, if you have a preference. |
4ed34f6 to
a26d649
Compare
a26d649 to
6f19fad
Compare
My experience has been that the cause of a failing SSH is much more likely to be a misconfigured SSH agent than a bootstrap node that is not accessible. But that experience is biased towards cloud platforms, where a failing bootstrap VM is rare. It appears that there is a lot of restructuring that can happen here and deeper analysis. But without that, I don't see much value in moving the output of the bootstrap gather error to after the bootstrap failure error. What do you think of the following?
|
6f19fad to
028ffd6
Compare
Works for me. Updated to match with 6f19fad53 -> aa8de21e6. I needed to touch some more places to get your step 6 in that location, and trying to break that down into discrete pivots no longer felt right to me, so it's all one big commit now, with an extended commit message discussion :p. |
028ffd6 to
aa8de21
Compare
… gather logs Adding a conditional that we overlooked in fdb04a7 (cmd: diagnose problems downloading release image, 2021-03-13, openshift#4751). This will avoid distraction like [1]: level=error msg=Attempted to gather debug logs after installation failure: bootstrap host address and at least one control plane host address must be provided ... level=error msg=Attempted to analyze the debug logs after installation failure: could not open the gather bundle: open : no such file or directory where the first line usefully explains that we failed to gather the log bundle, while the second line uselessly adds that without a log bundle, there can be no log bundle analysis. == Order of operations When they landed in cad45bd (installer-create: Provide user friendly error messages during failures, 2021-03-29, openshift#4800), the installer-client's bootstrap error logs were after the gather attempt, and right before the fatal "Bootstrap failed to complete" bail-out. fdb04a7 (cmd: diagnose problems downloading release image, 2021-03-13, openshift#4751) landed later, adding an attempt to analyze the gathered logs. At this point, the installer-client's bootstrap error logging sat in between the gather attempt and the gather-analysis attempt, which seems like uneccessary context switching. Matthew suggested [2]: 1. Perform bootstrap wait and store error. 2. Perform bootstrap gather, and print any error. 3. Print failing console operators. 4. Print error from bootstrap wait. 5. If no error from bootstrap gather, perform analysis of gather bundle. 6. If no error from bootstrap gather, print out location of gather bundle. 7. Print generic message that bootstrapping failed. so that's what I've gone with here. In order to get step 6 that late, I had to pull it out of logGatherBootstrap (and now that that function no longer logs on success, I renamed it to gatherBootstrap). Now the newGatherBootstrapCmd function includes its own explicit logging of the successful gather path, and we can place the create-cluster logging down at 6 where Matthew wants it. == Error scoping and naming There's also no need to pick a unique name for is block-scoped variable. We can use the usual 'err' without clobbering the local 'err' that's scoped to the waitForBootstrapComplete block. We only need a specific name for gatherError because we need it to feed the deferred conditionals for 5 and 6. [1]: https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.10-e2e-ovirt-ovn/1486127081728249856#1:build-log.txt%3A50 [2]: openshift#5582 (comment)
aa8de21 to
08e8013
Compare
staebler
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me. I want to check the real output from a failed bootstrapping, then I'll give this a lgtm.
/approve
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: staebler The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
Here is the output from an ovirt install that failed to bootstrap. There is a known issue with running the bootstrap gather on ovirt, so this output demonstrates the experience when the bootstrap gather fails. It's a little wordy with the last three messages all saying that the bootstrap failed, but I can live with that until we can do a more in-depth and thorough consolidation. |
|
Here is an output from an installation that failed to connect to the temporary control plane. The ordering looks good to me. |
|
/retest-required Please review the full test history for this PR and help us cut down flakes. |
4 similar comments
|
/retest-required Please review the full test history for this PR and help us cut down flakes. |
|
/retest-required Please review the full test history for this PR and help us cut down flakes. |
|
/retest-required Please review the full test history for this PR and help us cut down flakes. |
|
/retest-required Please review the full test history for this PR and help us cut down flakes. |
|
/retest-required Please review the full test history for this PR and help us cut down flakes. |
22 similar comments
|
/retest-required Please review the full test history for this PR and help us cut down flakes. |
|
/retest-required Please review the full test history for this PR and help us cut down flakes. |
|
/retest-required Please review the full test history for this PR and help us cut down flakes. |
|
/retest-required Please review the full test history for this PR and help us cut down flakes. |
|
/retest-required Please review the full test history for this PR and help us cut down flakes. |
|
/retest-required Please review the full test history for this PR and help us cut down flakes. |
|
/retest-required Please review the full test history for this PR and help us cut down flakes. |
|
/retest-required Please review the full test history for this PR and help us cut down flakes. |
|
/retest-required Please review the full test history for this PR and help us cut down flakes. |
|
/retest-required Please review the full test history for this PR and help us cut down flakes. |
|
/retest-required Please review the full test history for this PR and help us cut down flakes. |
|
/retest-required Please review the full test history for this PR and help us cut down flakes. |
|
/retest-required Please review the full test history for this PR and help us cut down flakes. |
|
/retest-required Please review the full test history for this PR and help us cut down flakes. |
|
/retest-required Please review the full test history for this PR and help us cut down flakes. |
|
/retest-required Please review the full test history for this PR and help us cut down flakes. |
|
/retest-required Please review the full test history for this PR and help us cut down flakes. |
|
/retest-required Please review the full test history for this PR and help us cut down flakes. |
|
/retest-required Please review the full test history for this PR and help us cut down flakes. |
|
/retest-required Please review the full test history for this PR and help us cut down flakes. |
|
/retest-required Please review the full test history for this PR and help us cut down flakes. |
|
/retest-required Please review the full test history for this PR and help us cut down flakes. |
|
@wking: The following tests failed, say
Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
|
/retest-required Please review the full test history for this PR and help us cut down flakes. |
1 similar comment
|
/retest-required Please review the full test history for this PR and help us cut down flakes. |
Adding a conditional that we overlooked in fdb04a7 (#4751). This will avoid distractions like:
where the first line usefully explains that we failed to gather the log bundle, while the second line uselessly adds that without a log bundle, there can be no log bundle analysis.
In a separate commit, I'm shifting #4800's:
to move them out from between the gather and gather-analysis commands. There are a few options for where to move them too, and I explain my motivation for my choice in f6db6b84d030, but I have no problem with folks telling me they disagree and where I should put them instead, as long as it's not "right between the two gather steps" ;)