Preserve bootstrap node on single-node IPI builds #20592
Conversation
@osherdp: GitHub didn't allow me to request PR reviews from the following users: osherdp. Note that only openshift members and repo collaborators can review this PR, and authors cannot review their own PRs.
Force-pushed: 2f019ad → 9243fdd
Review thread on ci-operator/config/openshift/origin/openshift-origin-master.yaml (outdated, resolved)
Review thread on ...ator/step-registry/openshift/e2e/aws/single-node/openshift-e2e-aws-single-node-workflow.yaml (outdated, resolved)
Force-pushed: 585e258 → db7642e
ignore
it's just to trigger testing of the relevant job
Force-pushed: db7642e → d1cf01a
Review thread on ci-operator/step-registry/ipi/install/install/ipi-install-install-commands.sh (outdated, resolved)
Force-pushed: d1cf01a → b5565cd
Force-pushed: b5565cd → 8d9aabf
I'm ok preserving the bootstrap host for two weeks while we diagnose a particular problem. I'd like to have the revert of this PR opened shortly after this merges and have it lgtm'd, approved, and held until August 14. If by Aug 7 or so, you realize that you aren't going to be able to resolve the problem in time, you should start focusing on how you will debug the problem in the field and use that solution in CI.

Sounds good @deads2k

/lgtm
bootstrap information if not already gathered
Force-pushed: 8d9aabf → c421970
The rehearsals need to show green before we should merge a change like this.
@osherdp: The following tests failed.
Be sure to keep an eye on it after merge. Also, please open the revert as soon as this merges and link it back. /approve

sure thing @deads2k! I'll do that

/unhold

/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: deads2k, eranco74, osherdp
…ALL_PRESERVE_BOOTSTRAP to empty

c421970 (Preserve bootstrap node on single-node installations, 2021-07-26, openshift#20592) started setting the environment variable for all calls. It defaulted to 'false', apparently assuming that meant "keep on deleting the bootstrap resources". But the installer actually treats any non-empty value as "please preserve" [1]. This should avoid situations like [2,3], where the 'false' default led the installer to say [4,5]:

    time="2021-08-05T21:44:40Z" level=warning msg="OPENSHIFT_INSTALL_PRESERVE_BOOTSTRAP is set, not destroying bootstrap resources. Warning: this should only be used for debugging purposes, and poses a risk to cluster stability."

which broke ingress on [4]:

    level=error msg=Cluster operator ingress Degraded is True with IngressDegraded: The "default" ingress controller reports Degraded=True: DegradedConditions: One or more other status conditions indicate a degraded state: LoadBalancerReady=False (SyncLoadBalancerFailed: The service-controller component is reporting SyncLoadBalancerFailed events like: Error syncing load balancer: failed to ensure load balancer: EnsureBackendPoolDeleted: failed to parse the VMAS ID : getAvailabilitySetNameByID: failed to parse the VMAS ID

which makes everything that is ingress-dependent (auth, console, ...) sad.

[1]: https://github.com/openshift/installer/blob/6d778f911e79afad8ba2ff4301eda5b5cf4d8e9e/cmd/openshift-install/create.go#L133
[2]: https://bugzilla.redhat.com/show_bug.cgi?id=1949267#c3
[3]: https://bugzilla.redhat.com/show_bug.cgi?id=1990916
[4]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.9-e2e-azure/1423392049742221312
[5]: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.9-e2e-azure/1423392049742221312/artifacts/e2e-azure/ipi-install-install/artifacts/.openshift_install.log
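The root cause is easy to reproduce in a few lines of shell. The installer's actual check is in Go, but this hypothetical analogue (the `should_preserve` function is illustrative, not from the installer) shows why defaulting the variable to 'false' still preserved the bootstrap resources: the check is on non-emptiness, not on the boolean value.

```shell
#!/bin/bash
# Hypothetical analogue of the installer's check: it preserves bootstrap
# resources whenever the variable is non-empty, regardless of its value.
should_preserve() {
    [[ -n "${OPENSHIFT_INSTALL_PRESERVE_BOOTSTRAP:-}" ]]
}

# Bug: 'false' is a non-empty string, so it still counts as "set".
OPENSHIFT_INSTALL_PRESERVE_BOOTSTRAP=false
should_preserve && echo "preserved ('false' is non-empty)"

# Fix: default to the empty string so the check fails.
OPENSHIFT_INSTALL_PRESERVE_BOOTSTRAP=
should_preserve || echo "destroyed (empty means default behaviour)"
```

This is why the follow-up fix switches the default from 'false' to empty rather than touching the installer itself.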
Turns out this rehearsal failure was caused by this PR, and basically broke all installer-provisioned Azure jobs :p. #20978 is up with a fix.
We sometimes fail to complete installation of single-node IPI clusters.
The failure happens after the node finishes bootstrapping, but for some reason the API endpoint is not accessible when we try to gather must-gather information.
This change preserves the bootstrap node after installation (instead of destroying it once bootstrapping completes) and runs `openshift-install gather` post-installation. This only applies to single-node workflows.

Failure as an example:
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.9-e2e-aws-single-node/1417275249031909376
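The described change amounts to something like the following sketch of the install step. This is a hedged illustration, not the actual step-registry script: the function names are invented, and only the environment variable and the `openshift-install` subcommands come from the PR itself.

```shell
#!/bin/bash
# Illustrative sketch of wiring bootstrap preservation into an install step.

set_preserve_bootstrap() {
    # $1: workflow-provided value. Export only when non-empty, and unset
    # otherwise, because the installer treats ANY non-empty value of
    # OPENSHIFT_INSTALL_PRESERVE_BOOTSTRAP as "preserve".
    if [[ -n "${1:-}" ]]; then
        export OPENSHIFT_INSTALL_PRESERVE_BOOTSTRAP="$1"
    else
        unset OPENSHIFT_INSTALL_PRESERVE_BOOTSTRAP
    fi
}

run_single_node_install() {
    # Run the installer, then gather bootstrap logs post-install. The
    # gather only works here because the bootstrap host was preserved
    # instead of being destroyed after bootstrapping completed.
    local dir="$1"
    openshift-install --dir="${dir}" create cluster
    if [[ -n "${OPENSHIFT_INSTALL_PRESERVE_BOOTSTRAP:-}" ]]; then
        openshift-install --dir="${dir}" gather bootstrap
    fi
}
```

Keeping the export conditional (rather than always setting a 'false' default) is exactly the pitfall the follow-up fix in #20978 addresses.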