Skip to content

Conversation

@sadasu
Copy link
Contributor

@sadasu sadasu commented Apr 14, 2022

During cluster bring up, when nodes are unable to resolve api-int, the cluster fails to come up and the messages in the logs, several times do not hint at any connectivity issues making it difficult to detect and resolve any infrastructure setup issues.

These changes contain:

  1. Checks for the API and API-INT URLs to see if they are resolvable from the bootstrap node and logging helpful messages.
  2. Records the state of the api-url and api-int-url services
  3. Addition of validation of the API and API-INT URLs to the analyze output

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 14, 2022
@sadasu sadasu force-pushed the add-install-failure-check branch 2 times, most recently from e8e03e9 to 891fe87 Compare April 15, 2022 16:58
@sadasu sadasu changed the title WIP: Check for api-int resolution when cluster install fails Check for api-int resolution when cluster install fails Apr 15, 2022
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 15, 2022
@sadasu
Copy link
Contributor Author

sadasu commented Apr 18, 2022

/retest

Copy link
Contributor

@patrickdillon patrickdillon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is unlikely that api-int is resolvable from the installer host, so this check would need to be run on the bootstrap host.

I think it would be a good idea to enumerate the failure cases we are trying to improve upon.

@jcpowermac
Copy link
Contributor

It is unlikely that api-int is resolvable from the installer host, so this check would need to be run on the bootstrap host.

right, its not. the coredns static pod running on the nodes configures the A record for api-int

see:
https://github.com/openshift/machine-config-operator/blob/master/templates/common/on-prem/files/coredns-corefile.yaml#L32-L39

and
https://github.com/openshift/machine-config-operator/blob/master/manifests/on-prem/coredns-corefile.tmpl#L27-L35

@patrickdillon
Copy link
Contributor

patrickdillon commented Apr 20, 2022

right, its not.

The installer would be able to resolve api-int in the edge case where it is installing to an existing vpc (e.g. aws, azure, gcp) and the host is running within that vpc, right?

Like I said, this is an edge case and does not affect the outcome for this PR, but trying to add context for @sadasu about why I said "unlikely." Also trying to edify myself in case I'm mistaken.

@jcpowermac
Copy link
Contributor

right, its not.

The installer would be able to resolve api-int in the edge case where it is installing to an existing vpc (e.g. aws, azure, gcp) and the host is running within that vpc, right?

Like I said, this is an edge case and does not affect the outcome for this PR, but trying to add context for @sadasu about why I said "unlikely." Also trying to edify myself in case I'm mistaken.

Sorry yep, I was thinking onprem ;-)

@sadasu sadasu force-pushed the add-install-failure-check branch from 891fe87 to 24bcc65 Compare April 29, 2022 22:08
@sadasu
Copy link
Contributor Author

sadasu commented Apr 29, 2022

@patrickdillon @jcpowermac thanks for the helpful comments. Hopefully the current location triggers this verification at the correct point on the bootstrap node on an install failure.

Couple of questions:

  1. Is *.apps expected to be resolvable at this point? Should these changes include a test to see if that URL is resolvable too?
  2. Should this check also include a ping to the API or API-INT IP address?

@sadasu sadasu force-pushed the add-install-failure-check branch 2 times, most recently from 5828d56 to 230e2e0 Compare April 29, 2022 22:27
@jstuever
Copy link
Contributor

jstuever commented May 3, 2022

/cc

@openshift-ci openshift-ci bot requested a review from jstuever May 3, 2022 17:47
Copy link
Contributor

@jstuever jstuever left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like a solid approach. I do have questions about how the new shell script is being referenced from bootkube.sh. I expect this is missing a source line at the top. Otherwise, looks good.

@sadasu sadasu force-pushed the add-install-failure-check branch 6 times, most recently from 2050020 to 64d16cd Compare May 4, 2022 19:05
Copy link
Contributor

@patrickdillon patrickdillon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good. Left some nits and ideas. Personally I would like to consider some different terminology than "DNS config."

@sadasu sadasu force-pushed the add-install-failure-check branch 4 times, most recently from 65307c8 to 84928d0 Compare May 5, 2022 22:05
@sadasu
Copy link
Contributor Author

sadasu commented Oct 8, 2022

/retest-required

1 similar comment
@sadasu
Copy link
Contributor Author

sadasu commented Oct 10, 2022

/retest-required

@sadasu sadasu force-pushed the add-install-failure-check branch 2 times, most recently from 472c869 to a8569df Compare October 10, 2022 17:06
@r4f4
Copy link
Contributor

r4f4 commented Oct 10, 2022

/test e2e-alibaba

Copy link
Contributor

@r4f4 r4f4 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Except for some nitpicking on the indentation of some lines, it seems to be working as intended:

level=error msg=Failed to wait for bootstrapping to complete. This error usually happens when there is a problem with control plane hosts that prevents the control plane operators from creating the control plane.
level=error msg=The bootstrap machine is unable to resolve API and/or API-Int Server URLs
level=info msg=Successfully resolved API_INT_URL api-int.ci-op-97x1j7tm-920ba.alicloud-dev.devcluster.openshift.com
level=info msg=Unable to reach API_INT_URL's https endpoint at https://10.0.101.111:6443/version
level=info msg=It might be too early for the https://10.0.101.111:6443/version to be available. 

https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_installer/5816/pull-ci-openshift-installer-master-e2e-alibaba/1579546984006553600

On the bootstrap node, verify if the API URL and API-INT URLS are
resolvable when cluster bringup fails.
@sadasu sadasu force-pushed the add-install-failure-check branch from a8569df to 6bfb2e6 Compare October 10, 2022 23:25
@r4f4
Copy link
Contributor

r4f4 commented Oct 11, 2022

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Oct 11, 2022
@openshift-ci-robot
Copy link
Contributor

/retest-required

Remaining retests: 0 against base HEAD 54d95a3 and 2 for PR HEAD 6bfb2e6 in total

@sadasu
Copy link
Contributor Author

sadasu commented Oct 11, 2022

/retest-required

@openshift-ci-robot
Copy link
Contributor

/retest-required

Remaining retests: 0 against base HEAD 3872845 and 1 for PR HEAD 6bfb2e6 in total

@openshift-ci-robot
Copy link
Contributor

/retest-required

Remaining retests: 0 against base HEAD f957eda and 0 for PR HEAD 6bfb2e6 in total

@sadasu
Copy link
Contributor Author

sadasu commented Oct 12, 2022

/skip
There is enough proof with the passing CIs that this fix works and doesn't break anything else.

@jstuever
Copy link
Contributor

/cc @patrickdillon
/uncc

@openshift-ci openshift-ci bot requested review from patrickdillon and removed request for jstuever October 12, 2022 16:07
@openshift-ci-robot
Copy link
Contributor

/hold

Revision 6bfb2e6 was retested 3 times: holding

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 12, 2022
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Oct 12, 2022

@sadasu: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-crc b4de9f26fa2e7fc90a6babcb91f81caab819434c link false /test e2e-crc
ci/prow/e2e-gcp-upi b4de9f26fa2e7fc90a6babcb91f81caab819434c link true /test e2e-gcp-upi
ci/prow/e2e-aws-upi b4de9f26fa2e7fc90a6babcb91f81caab819434c link true /test e2e-aws-upi
ci/prow/e2e-azure-upi b4de9f26fa2e7fc90a6babcb91f81caab819434c link true /test e2e-azure-upi
ci/prow/okd-e2e-aws 0ca8925920bd26a79bc17f1a7607004f2d8bd21b link false /test okd-e2e-aws
ci/prow/e2e-azure b93e83b5e7693c4013815c0721afbb48b3405b25 link true /test e2e-azure
ci/prow/e2e-vsphere b93e83b5e7693c4013815c0721afbb48b3405b25 link true /test e2e-vsphere
ci/prow/e2e-ibmcloud b93e83b5e7693c4013815c0721afbb48b3405b25 link false /test e2e-ibmcloud
ci/prow/e2e-aws b93e83b5e7693c4013815c0721afbb48b3405b25 link true /test e2e-aws
ci/prow/e2e-gcp b93e83b5e7693c4013815c0721afbb48b3405b25 link true /test e2e-gcp
ci/prow/okd-e2e-gcp-ovn-upgrade f72166fe50e34c5912dcaa9d496050e927176167 link false /test okd-e2e-gcp-ovn-upgrade
ci/prow/okd-scos-e2e-aws-ovn 6bfb2e6 link false /test okd-scos-e2e-aws-ovn
ci/prow/okd-scos-e2e-aws-upgrade 6bfb2e6 link false /test okd-scos-e2e-aws-upgrade
ci/prow/okd-e2e-aws-upgrade 6bfb2e6 link false /test okd-e2e-aws-upgrade
ci/prow/e2e-openstack 6bfb2e6 link false /test e2e-openstack
ci/prow/e2e-ibmcloud-ovn 6bfb2e6 link false /test e2e-ibmcloud-ovn
ci/prow/e2e-alibaba a8569dfc04922c2191113dad7f4c99eec9c36a5c link false /test e2e-alibaba
ci/prow/e2e-libvirt 6bfb2e6 link false /test e2e-libvirt
ci/prow/e2e-agent-mce 6bfb2e6 link false /test e2e-agent-mce

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@sadasu
Copy link
Contributor Author

sadasu commented Oct 12, 2022

/test e2e-gcp-ovn

@sadasu
Copy link
Contributor Author

sadasu commented Oct 13, 2022

/hold cancel

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 13, 2022
@openshift-ci-robot
Copy link
Contributor

/retest-required

Remaining retests: 0 against base HEAD 36ebaee and 2 for PR HEAD 6bfb2e6 in total

@openshift-merge-robot openshift-merge-robot merged commit 3d0aac7 into openshift:master Oct 13, 2022
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Oct 13, 2022

@sadasu: All pull requests linked via external trackers have merged:

Bugzilla bug 2072202 has been moved to the MODIFIED state.

Details

In response to this:

Bug 2072202: Check for api and api-int resolution during cluster install

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. bugzilla/severity-medium Referenced Bugzilla bug's severity is medium for the branch this PR is targeting. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. lgtm Indicates that a PR is ready to be merged.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

10 participants