CORS-3437: infra/capi: add provisioning timeout #8307

Merged
openshift-merge-bot[bot] merged 2 commits into openshift:master from patrickdillon:capi-timeout
Apr 27, 2024
@patrickdillon
Contributor

Implements basic safeguard so that provisioning does not spin indefinitely. A timeout (currently 15m) is set for each provisioning stage.

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Apr 24, 2024
@openshift-ci-robot
Contributor

openshift-ci-robot commented Apr 24, 2024

@patrickdillon: This pull request references CORS-3437 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.16.0" version, but no target version was set.

@openshift-ci openshift-ci bot requested review from andfasano and bfournie April 24, 2024 03:36
@patrickdillon
Contributor Author

/test altinfra-e2e-aws-ovn altinfra-e2e-azure-capi-ovn altinfra-e2e-nutanix-capi-ovn altinfra-e2e-vsphere-capi-ovn altinfra-e2e-openstack-capi-ovn altinfra-e2e-gcp-capi-ovn

Comment on lines 114 to 116
Contributor Author

I'm going to remove these timeouts from the hooks and keep the timeout caps only on provisioning with the CAPI system. We don't need to handle hook failures with timeouts; we can leave that up to the hook implementors. For CAPI provisioning, on the other hand, we need a timeout to prevent endless spinning.

Adds a 15m timeout to the infrastructure provisioning and machine
provisioning stages of CAPI, so that the controllers do not
spin indefinitely in the case of a failure. 15m is an arbitrary
value; the timeout should balance giving the resources ample
time to provision against making users wait too long if
something goes wrong.
@patrickdillon
Contributor Author

/test altinfra-e2e-aws-ovn altinfra-e2e-azure-capi-ovn altinfra-e2e-nutanix-capi-ovn altinfra-e2e-vsphere-capi-ovn altinfra-e2e-openstack-capi-ovn altinfra-e2e-gcp-capi-ovn

@r4f4
Contributor

r4f4 commented Apr 24, 2024

/cc

@openshift-ci openshift-ci bot requested a review from r4f4 April 24, 2024 20:11
@r4f4
Contributor

r4f4 commented Apr 24, 2024

Nice!

time="2024-04-24T19:56:53Z" level=debug msg="Time elapsed per stage:"
time="2024-04-24T19:56:53Z" level=debug msg="  Infrastructure PreProvisioning: 3s"
time="2024-04-24T19:56:53Z" level=debug msg="     Infrastructure Provisioning: 5m29s"
time="2024-04-24T19:56:53Z" level=debug msg="InfrastructureReady Provisioning: 31s"
time="2024-04-24T19:56:53Z" level=debug msg="            Machine Provisioning: 25s"
time="2024-04-24T19:56:53Z" level=debug msg="              Bootstrap Complete: 11m38s"
time="2024-04-24T19:56:53Z" level=debug msg="                             API: 3m36s"
time="2024-04-24T19:56:53Z" level=debug msg="               Bootstrap Destroy: 2m1s"
time="2024-04-24T19:56:53Z" level=debug msg="     Cluster Operators Available: 16m1s"
time="2024-04-24T19:56:53Z" level=debug msg="        Cluster Operators Stable: 1m37s"
time="2024-04-24T19:56:53Z" level=info msg="Time elapsed: 37m59s"

Contributor

@r4f4 r4f4 left a comment

How do you feel about adding

               untilTime := time.Now().Add(timeout)
               timezone, _ := untilTime.Zone()
               logrus.Infof("Waiting up to %v (until %v %s) for infrastructure to provision...", timeout, untilTime.Format(time.Kitchen), timezone)

Too much information?

@patrickdillon
Contributor Author

/test altinfra-e2e-aws-ovn

@patrickdillon
Contributor Author

All comments incorporated.

Contributor

@r4f4 r4f4 left a comment

LGTM. Will wait for the linting fix before tagging.

@r4f4
Contributor

r4f4 commented Apr 26, 2024

Looking good:

time="2024-04-25T20:44:52Z" level=info msg="Waiting up to 15m0s (until 8:59PM UTC) for infrastructure to become ready..."
[...]
time="2024-04-25T20:50:14Z" level=info msg="Creating private Hosted Zone"
[...]
time="2024-04-25T20:50:45Z" level=info msg="Waiting up to 15m0s (until 9:05PM UTC) for machines to provision..."
[...]
time="2024-04-25T20:51:10Z" level=debug msg="Machine ci-op-fnm19l75-2d061-p5kqn-master-0 is ready. Phase: Provisioned"
time="2024-04-25T20:51:10Z" level=debug msg="Machine ci-op-fnm19l75-2d061-p5kqn-master-1 is ready. Phase: Provisioned"
time="2024-04-25T20:51:10Z" level=debug msg="Machine ci-op-fnm19l75-2d061-p5kqn-master-2 is ready. Phase: Provisioned"
time="2024-04-25T20:51:10Z" level=info msg="Cluster API resources have been created. Waiting for cluster to become ready..."
[...]
time="2024-04-25T21:18:38Z" level=debug msg="Time elapsed per stage:"
time="2024-04-25T21:18:38Z" level=debug msg="     Infrastructure Provisioning: 5m29s"
time="2024-04-25T21:18:38Z" level=debug msg="InfrastructureReady Provisioning: 31s"
time="2024-04-25T21:18:38Z" level=debug msg="            Machine Provisioning: 26s"
time="2024-04-25T21:18:38Z" level=debug msg="              Bootstrap Complete: 12m20s"
time="2024-04-25T21:18:38Z" level=debug msg="                             API: 5m41s"
time="2024-04-25T21:18:38Z" level=debug msg="               Bootstrap Destroy: 2m1s"
time="2024-04-25T21:18:38Z" level=debug msg="     Cluster Operators Available: 9m8s"
time="2024-04-25T21:18:38Z" level=debug msg="        Cluster Operators Stable: 3m58s"
time="2024-04-25T21:18:38Z" level=info msg="Time elapsed: 34m4s"

Adds timers to each stage of CAPI infrastructure provisioning. These
times will be logged at install complete, and can be used as a guide
if we need to change the provisioning timeouts.
@patrickdillon
Contributor Author

Fixed the linter error and reworked the stage names a little: "InfrastructureReady" would not be obvious to users.

/test altinfra-e2e-aws-ovn altinfra-e2e-azure-capi-ovn altinfra-e2e-nutanix-capi-ovn altinfra-e2e-vsphere-capi-ovn altinfra-e2e-openstack-capi-ovn altinfra-e2e-gcp-capi-ovn

Contributor

@r4f4 r4f4 left a comment

/approve
/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Apr 27, 2024
@openshift-ci
Contributor

openshift-ci bot commented Apr 27, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: r4f4

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 27, 2024
@openshift-merge-bot openshift-merge-bot bot merged commit 9c8cfd4 into openshift:master Apr 27, 2024
@openshift-ci
Contributor

openshift-ci bot commented Apr 27, 2024

@patrickdillon: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/altinfra-e2e-gcp-capi-ovn | 4e2c8f6 | link | false | /test altinfra-e2e-gcp-capi-ovn |
| ci/prow/okd-e2e-aws-ovn-upgrade | 4e2c8f6 | link | false | /test okd-e2e-aws-ovn-upgrade |


@openshift-bot
Contributor

[ART PR BUILD NOTIFIER]

This PR has been included in build ose-installer-altinfra-container-v4.16.0-202404291018.p0.g9c8cfd4.assembly.stream.el9 for distgit ose-installer-altinfra.
All builds following this will include this PR.
