@osherdp
Contributor

@osherdp osherdp commented Jul 26, 2021

We sometimes fail to complete the installation of single-node IPI clusters.
The failure happens after the node finishes bootstrapping, but for some reason the API endpoint is not accessible when we try to collect must-gather information.
This change preserves the bootstrap node after installation (instead of destroying it once bootstrapping completes) and runs openshift-install gather after installation. It applies only to single-node workflows.

Failure as an example:
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.9-e2e-aws-single-node/1417275249031909376
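
The flow described above could be sketched roughly as follows. This is a hypothetical illustration, not the contents of ipi-install-install-commands.sh: the function name and the INSTALLER_BIN override are assumptions made for the example, and it assumes the gather step is `openshift-install gather bootstrap`. Only OPENSHIFT_INSTALL_PRESERVE_BOOTSTRAP comes from this thread.

```shell
# Hypothetical sketch: collect bootstrap logs after installation, but only
# when the bootstrap node was preserved. The function name and INSTALLER_BIN
# are illustrative assumptions, not names from the actual step script.
gather_bootstrap_if_preserved() {
  dir="$1"
  installer="${INSTALLER_BIN:-openshift-install}"

  # Any non-empty value means the bootstrap resources were kept around.
  if [ -n "${OPENSHIFT_INSTALL_PRESERVE_BOOTSTRAP:-}" ]; then
    "$installer" gather bootstrap --dir "$dir"
  fi
}

# Usage (assumed): run after "create cluster" has finished, pointing at the
# same asset directory that the install used.
# gather_bootstrap_if_preserved "/path/to/install-dir"
```

In this sketch the single-node workflows would be the only ones that set the variable, so other workflows skip the gather step entirely.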

@openshift-ci
Contributor

openshift-ci bot commented Jul 26, 2021

@osherdp: GitHub didn't allow me to request PR reviews from the following users: osherdp.

Note that only openshift members and repo collaborators can review this PR, and authors cannot review their own PRs.

Details

In response to this:

/cc @osherdp
/hold

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jul 26, 2021
@osherdp osherdp force-pushed the feature/add-installer-gather-artifact branch 6 times, most recently from 2f019ad to 9243fdd Compare July 27, 2021 09:23
@osherdp osherdp force-pushed the feature/add-installer-gather-artifact branch 2 times, most recently from 585e258 to db7642e Compare July 27, 2021 12:30
Comment on lines 212 to 213
Contributor Author

ignore
it's just to trigger testing of the relevant job

Comment on lines 218 to 219
Contributor Author

ignore
it's just to trigger testing of the relevant job

@osherdp osherdp changed the title WIP Preserve bootstrap node on single-node IPI builds Jul 27, 2021
@osherdp osherdp force-pushed the feature/add-installer-gather-artifact branch from db7642e to d1cf01a Compare July 27, 2021 12:37
@osherdp osherdp force-pushed the feature/add-installer-gather-artifact branch from d1cf01a to b5565cd Compare July 27, 2021 15:41
@osherdp
Contributor Author

osherdp commented Jul 27, 2021

/cc @wking @staebler @eranco74

@openshift-ci openshift-ci bot requested review from eranco74, staebler and wking July 27, 2021 16:05
@osherdp osherdp force-pushed the feature/add-installer-gather-artifact branch from b5565cd to 8d9aabf Compare July 27, 2021 16:59
@osherdp
Contributor Author

osherdp commented Jul 29, 2021

also
/cc @deads2k

@openshift-ci openshift-ci bot requested a review from deads2k July 29, 2021 11:43
@deads2k
Contributor

deads2k commented Jul 29, 2021

I'm ok preserving the bootstrap host for two weeks while we diagnose a particular problem. I'd like to have the revert of this PR opened shortly after this merges and have it lgtm'd, approved, and held until August 14.

If by Aug 7 or so, you realize that you aren't going to be able to resolve the problem in time, you should start focusing on how you will debug the problem in the field and use that solution in CI.

@osherdp
Contributor Author

osherdp commented Jul 29, 2021

Sounds good @deads2k

@eranco74
Contributor

eranco74 commented Aug 1, 2021

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Aug 1, 2021
bootstrap information if not already gathered
@osherdp osherdp force-pushed the feature/add-installer-gather-artifact branch from 8d9aabf to c421970 Compare August 2, 2021 12:17
@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Aug 2, 2021
@deads2k
Contributor

deads2k commented Aug 2, 2021

The rehearsals need to be green before we merge a change like this.

@openshift-ci
Contributor

openshift-ci bot commented Aug 2, 2021

@osherdp: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Rerun command
ci/rehearse/openshift/cluster-network-operator/release-4.9/e2e-azure-ovn-dualstack 60ff1d7a8fa9a8c570a07fc2ea0918788734c2d9 link /test pj-rehearse
ci/rehearse/operator-framework/operator-marketplace/release-4.9/e2e-aws-serial 943541d3aebe33cec647342f970debb1f8af14e6 link /test pj-rehearse
ci/rehearse/openshift/azure-disk-csi-driver/release-4.9/e2e-azure-csi-migration 943541d3aebe33cec647342f970debb1f8af14e6 link /test pj-rehearse
ci/rehearse/openshift/cluster-etcd-operator/release-4.9/e2e-gcp-disruptive-ovn d1cf01a78ef0f0420a7c6977cfc1b88fdaecce36 link /test pj-rehearse
ci/rehearse/openshift/sandboxed-containers-operator/master/sandboxed-containers-operator-e2e d1cf01a78ef0f0420a7c6977cfc1b88fdaecce36 link /test pj-rehearse
ci/rehearse/openshift/cluster-cloud-controller-manager-operator/release-4.9/e2e-azure-ccm d1cf01a78ef0f0420a7c6977cfc1b88fdaecce36 link /test pj-rehearse
ci/rehearse/periodic-ci-openshift-release-master-nightly-4.9-e2e-azure-fips d1cf01a78ef0f0420a7c6977cfc1b88fdaecce36 link /test pj-rehearse
ci/rehearse/openshift/cluster-cloud-controller-manager-operator/release-4.9/e2e-aws-ccm-install d1cf01a78ef0f0420a7c6977cfc1b88fdaecce36 link /test pj-rehearse
ci/rehearse/openshift/windows-machine-config-operator/release-4.9/azure-e2e-operator d1cf01a78ef0f0420a7c6977cfc1b88fdaecce36 link /test pj-rehearse
ci/rehearse/openshift/cluster-cloud-controller-manager-operator/release-4.9/e2e-aws-ccm d1cf01a78ef0f0420a7c6977cfc1b88fdaecce36 link /test pj-rehearse
ci/rehearse/periodic-ci-openshift-release-master-nightly-4.9-e2e-aws-proxy 8d9aabfb3f9fcc8943a15eddcd8f05506f2dfa6f link /test pj-rehearse
ci/rehearse/redhat-developer/jenkins-operator/main/e2e 8d9aabfb3f9fcc8943a15eddcd8f05506f2dfa6f link /test pj-rehearse
ci/rehearse/periodic-ci-openshift-release-master-nightly-4.7-e2e-ovirt 8d9aabfb3f9fcc8943a15eddcd8f05506f2dfa6f link /test pj-rehearse
ci/rehearse/periodic-ci-openshift-release-master-nightly-4.8-e2e-ovirt-csi 8d9aabfb3f9fcc8943a15eddcd8f05506f2dfa6f link /test pj-rehearse
ci/rehearse/periodic-ci-openshift-release-master-ci-4.9-e2e-azure-serial 8d9aabfb3f9fcc8943a15eddcd8f05506f2dfa6f link /test pj-rehearse
ci/rehearse/periodic-ci-openshift-release-master-nightly-4.5-e2e-ovirt 8d9aabfb3f9fcc8943a15eddcd8f05506f2dfa6f link /test pj-rehearse
ci/rehearse/periodic-ci-openshift-release-master-nightly-4.9-e2e-azure-csi 8d9aabfb3f9fcc8943a15eddcd8f05506f2dfa6f link /test pj-rehearse
ci/rehearse/openshift/machine-config-operator/master/e2e-gcp-single-node 8d9aabfb3f9fcc8943a15eddcd8f05506f2dfa6f link /test pj-rehearse
ci/rehearse/periodic-ci-openshift-release-master-ci-4.9-e2e-azure-cilium 8d9aabfb3f9fcc8943a15eddcd8f05506f2dfa6f link /test pj-rehearse
ci/rehearse/periodic-ci-openshift-release-master-nightly-4.9-e2e-aws-workers-rhel7 8d9aabfb3f9fcc8943a15eddcd8f05506f2dfa6f link /test pj-rehearse
ci/rehearse/openshift/router/release-4.9/e2e-agnostic 8d9aabfb3f9fcc8943a15eddcd8f05506f2dfa6f link /test pj-rehearse
ci/rehearse/periodic-ci-openshift-release-master-nightly-4.9-e2e-aws-single-node 8d9aabfb3f9fcc8943a15eddcd8f05506f2dfa6f link /test pj-rehearse
ci/rehearse/periodic-ci-openshift-release-master-nightly-4.9-e2e-aws-ovn-local-gateway 8d9aabfb3f9fcc8943a15eddcd8f05506f2dfa6f link /test pj-rehearse
ci/rehearse/openshift/ovn-kubernetes/release-4.9/e2e-azure-ovn 8d9aabfb3f9fcc8943a15eddcd8f05506f2dfa6f link /test pj-rehearse
ci/rehearse/openshift/ovn-kubernetes/release-4.9/e2e-ovn-hybrid-step-registry 8d9aabfb3f9fcc8943a15eddcd8f05506f2dfa6f link /test pj-rehearse
ci/rehearse/periodic-ci-openshift-release-master-nightly-4.9-e2e-aws-csi-migration 8d9aabfb3f9fcc8943a15eddcd8f05506f2dfa6f link /test pj-rehearse
ci/rehearse/openshift/ovn-kubernetes/release-4.9/okd-e2e-gcp-ovn 8d9aabfb3f9fcc8943a15eddcd8f05506f2dfa6f link /test pj-rehearse
ci/rehearse/openshift/origin/release-4.1/e2e-aws-image-ecosystem c421970 link /test pj-rehearse
ci/rehearse/openshift/origin/release-4.1/e2e-aws-builds c421970 link /test pj-rehearse
ci/rehearse/openshift/installer/release-4.9/e2e-aws-shared-vpc c421970 link /test pj-rehearse
ci/rehearse/openshift/machine-config-operator/release-4.5/e2e-ovirt c421970 link /test pj-rehearse
ci/rehearse/openshift/cluster-logging-operator/tech-preview/e2e-operator c421970 link /test pj-rehearse
ci/rehearse/openshift/origin/release-4.2/e2e-cmd c421970 link /test pj-rehearse
ci/rehearse/openshift/origin/release-4.9/e2e-aws c421970 link /test pj-rehearse
ci/rehearse/openshift/builder/release-4.9/e2e-aws-cgroupsv2 c421970 link /test pj-rehearse
ci/rehearse/openshift/cloud-credential-operator/release-4.9/e2e-aws-manual-oidc c421970 link /test pj-rehearse
ci/rehearse/openshift/sdn/release-4.9/e2e-aws-multitenant c421970 link /test pj-rehearse
ci/rehearse/openshift/machine-config-operator/release-4.9/e2e-gcp-single-node c421970 link /test pj-rehearse
ci/rehearse/openshift/kubernetes/release-4.9/k8s-e2e-gcp c421970 link /test pj-rehearse
ci/rehearse/openshift/installer/release-4.9/e2e-azure-shared-vpc c421970 link /test pj-rehearse
ci/rehearse/openshift/machine-config-operator/release-4.9/e2e-aws-techpreview-featuregate c421970 link /test pj-rehearse
ci/rehearse/openshift/installer/release-4.9/e2e-azure-resourcegroup c421970 link /test pj-rehearse
ci/rehearse/openshift/installer/release-4.9/e2e-gcp-shared-vpc c421970 link /test pj-rehearse
ci/rehearse/openshift/origin/release-4.9/e2e-gcp-disruptive c421970 link /test pj-rehearse
ci/rehearse/openshift/gcp-pd-csi-driver-operator/release-4.9/e2e-gcp-csi-migration c421970 link /test pj-rehearse
ci/rehearse/openshift/origin/release-4.9/e2e-aws-disruptive c421970 link /test pj-rehearse
ci/rehearse/openshift/installer/release-4.9/e2e-aws-upgrade c421970 link /test pj-rehearse
ci/rehearse/openshift/csi-driver-nfs/release-4.9/e2e-openstack-csi c421970 link /test pj-rehearse
ci/rehearse/openshift/cluster-logging-operator/master/e2e-gcp c421970 link /test pj-rehearse
ci/rehearse/openshift/installer/release-4.9/e2e-gcp-upgrade c421970 link /test pj-rehearse
ci/prow/pj-rehearse c421970 link /test pj-rehearse

Full PR test history. Your PR dashboard.


@osherdp
Contributor Author

osherdp commented Aug 3, 2021

@deads2k @eranco74 I went over all the failures; none of them are related to my changes.
They are either failing conformance tests, image build problems, or unrelated setup failures.

@deads2k
Contributor

deads2k commented Aug 5, 2021

Be sure to keep an eye on it after merge. Also, please open the revert as soon as this merges and link it back.

/approve

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 5, 2021
@osherdp
Contributor Author

osherdp commented Aug 5, 2021

sure thing @deads2k! I'll do that
and thank you!

@osherdp
Contributor Author

osherdp commented Aug 5, 2021

/unhold

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 5, 2021
@eranco74
Contributor

eranco74 commented Aug 5, 2021

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Aug 5, 2021
@openshift-ci
Contributor

openshift-ci bot commented Aug 5, 2021

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: deads2k, eranco74, osherdp

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot merged commit c0e8f31 into openshift:master Aug 5, 2021
@openshift-ci
Contributor

openshift-ci bot commented Aug 5, 2021

@osherdp: Updated the step-registry configmap in namespace ci at cluster app.ci using the following files:

  • key ipi-install-install-commands.sh using file ci-operator/step-registry/ipi/install/install/ipi-install-install-commands.sh
  • key ipi-install-install-ref.yaml using file ci-operator/step-registry/ipi/install/install/ipi-install-install-ref.yaml
  • key openshift-e2e-aws-single-node-workflow.yaml using file ci-operator/step-registry/openshift/e2e/aws/single-node/openshift-e2e-aws-single-node-workflow.yaml
  • key openshift-e2e-gcp-single-node-workflow.yaml using file ci-operator/step-registry/openshift/e2e/gcp/single-node/openshift-e2e-gcp-single-node-workflow.yaml

wking added a commit to wking/openshift-release that referenced this pull request Aug 7, 2021
…ALL_PRESERVE_BOOTSTRAP to empty

c421970 (Preserve bootstrap node on single-node installations,
2021-07-26, openshift#20592) started setting the environment variable for all
calls.  It defaulted to 'false', apparently assuming that that meant
"keep on deleting the bootstrap resources".  But the installer
actually treats any non-empty value as "please preserve" [1].

This should avoid situations like [2,3], where the 'false' default
led the installer to say [4,5]:

  time="2021-08-05T21:44:40Z" level=warning msg="OPENSHIFT_INSTALL_PRESERVE_BOOTSTRAP is set, not destroying bootstrap resources. Warning: this should only be used for debugging purposes, and poses a risk to cluster stability."

which broke ingress on [4]:

    level=error msg=Cluster operator ingress Degraded is True with IngressDegraded: The "default" ingress controller reports Degraded=True: DegradedConditions: One or more other status conditions indicate a degraded state: LoadBalancerReady=False (SyncLoadBalancerFailed: The service-controller component is reporting SyncLoadBalancerFailed events like: Error syncing load balancer: failed to ensure load balancer: EnsureBackendPoolDeleted: failed to parse the VMAS ID : getAvailabilitySetNameByID: failed to parse the VMAS ID

which makes everything that's ingress-dependent (auth, console, ...)
sad.

[1]: https://github.com/openshift/installer/blob/6d778f911e79afad8ba2ff4301eda5b5cf4d8e9e/cmd/openshift-install/create.go#L133
[2]: https://bugzilla.redhat.com/show_bug.cgi?id=1949267#c3
[3]: https://bugzilla.redhat.com/show_bug.cgi?id=1990916
[4]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.9-e2e-azure/1423392049742221312
[5]: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.9-e2e-azure/1423392049742221312/artifacts/e2e-azure/ipi-install-install/artifacts/.openshift_install.log
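
The mismatch described in the commit message above can be illustrated with a small sketch. It is not the installer's code (the installer is written in Go, see [1]); the two shell functions are hypothetical stand-ins that contrast the non-empty check the commit message describes with the boolean reading that a 'false' default apparently assumed.

```shell
# Hypothetical stand-in for the behavior described in the commit message:
# any non-empty value counts as "please preserve".
installer_preserves_bootstrap() {
  [ -n "${1:-}" ]
}

# Hypothetical stand-in for what a 'false' default apparently assumed:
# only an explicit "true" means preserve.
assumed_boolean_preserve() {
  [ "${1:-}" = "true" ]
}

# Compare both readings for the values a workflow might set.
for value in "" "false" "true"; do
  if installer_preserves_bootstrap "$value"; then actual="preserve"; else actual="destroy"; fi
  if assumed_boolean_preserve "$value"; then assumed="preserve"; else assumed="destroy"; fi
  printf "value='%s' installer=%s assumed=%s\n" "$value" "$actual" "$assumed"
done
```

The two readings disagree only for 'false', which is why changing the default to an empty value restores the usual "destroy the bootstrap resources" behavior.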
@wking
Member

wking commented Aug 7, 2021

Turns out this rehearsal failure was caused by this PR, and basically broke all installer-provisioned Azure jobs :p. #20978 is up with a fix.

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged.

5 participants