
Conversation

@Fedosin
Contributor

@Fedosin Fedosin commented Mar 26, 2020

This commit implements a conditional exponential backoff: we increase the waiting time only on Conflict (409) errors;
otherwise we retry immediately.

@openshift-ci-robot
Contributor

@Fedosin: This pull request references Bugzilla bug 1816096, which is invalid:

  • expected the bug to target the "4.5.0" release, but it targets "4.4.0" instead

Comment /bugzilla refresh to re-evaluate validity if changes to the Bugzilla bug are made, or edit the title of this pull request to link to a different bug.


In response to this:

Bug 1816096: Set lower retries number for some functions

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot openshift-ci-robot added the bugzilla/invalid-bug Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting. label Mar 26, 2020
@Fedosin
Contributor Author

Fedosin commented Mar 26, 2020

/bugzilla refresh

@openshift-ci-robot openshift-ci-robot added bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. and removed bugzilla/invalid-bug Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting. labels Mar 26, 2020
@openshift-ci-robot
Contributor

@Fedosin: This pull request references Bugzilla bug 1816096, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.5.0) matches configured target release for branch (4.5.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

In response to this:

/bugzilla refresh


@Fedosin Fedosin changed the title Bug 1816096: Set lower retries number for some functions Bug 1816096: OpenStack: Set lower retries number for some functions Mar 26, 2020
@pierreprinetti
Member

If we need 2 days for deleting a resource, it might make sense to just call that a failure.

My proposal: limit the total execution time to something SREs expect (e.g. 2 hours?) and make sure we print an explicit, actionable error. Wdyt?

@Fedosin
Contributor Author

Fedosin commented Mar 26, 2020

If we need 2 days for deleting a resource, it might make sense to just call that a failure.

Some people call it OpenStack :)

My proposal: limit the total execution time to something SREs expect (e.g. 2 hours?) and make sure we print an explicit, actionable error. Wdyt?

2 hours is definitely not enough. You can look at the requirements https://github.com/openshift/installer/blob/master/docs/user/openstack/kuryr.md#requirements-when-enabling-kuryr and see that we create up to 1000 ports in Neutron, 100 networks, and 100 security groups. So I can imagine a situation where deletion may take a day or two.

The problem is that Neutron doesn't support bulk deletion, so we have to send one request at a time, wait until the port is deleted, then repeat. On an overloaded cloud, such requests can take a very long time.
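Since Neutron has no bulk-delete API, the destroyer has to loop over ports one at a time, as in this hypothetical sketch (`deletePortsSequentially` and `deleteOne` are illustrative names, not the installer's real helpers):

```go
package main

import "fmt"

// deletePortsSequentially issues one delete call per port, because Neutron
// has no bulk-delete API. deleteOne is a stand-in for DELETE /v2.0/ports/{id}
// and returns true once the port is gone. Ports that could not be deleted
// are returned so the caller can retry them later.
func deletePortsSequentially(portIDs []string, deleteOne func(id string) bool) []string {
	var remaining []string
	for _, id := range portIDs {
		if !deleteOne(id) {
			remaining = append(remaining, id)
		}
	}
	return remaining
}

func main() {
	left := deletePortsSequentially([]string{"p1", "p2", "p3"}, func(id string) bool {
		return id != "p2" // pretend p2 is still held by a load balancer
	})
	fmt.Println("undeleted:", left)
}
```

With ~1000 ports, each round trip adds up, which is why the total time is unbounded on a slow cloud.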

@Fedosin Fedosin changed the title Bug 1816096: OpenStack: Set lower retries number for some functions Bug 1816096: Remove exponential sleeps when we destroy the cluster Mar 27, 2020
@openshift-ci-robot
Contributor

@Fedosin: This pull request references Bugzilla bug 1816096, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.5.0) matches configured target release for branch (4.5.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

In response to this:

Bug 1816096: Remove exponential sleeps when we destroy the cluster


@Fedosin
Contributor Author

Fedosin commented Mar 27, 2020

/hold

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 27, 2020
@mandre
Member

mandre commented Mar 27, 2020

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Mar 27, 2020
@openshift-ci-robot openshift-ci-robot removed the lgtm Indicates that a PR is ready to be merged. label May 24, 2020
@Fedosin
Contributor Author

Fedosin commented May 25, 2020

/retest

3 similar comments
@Fedosin
Contributor Author

Fedosin commented May 25, 2020

/retest

@Fedosin
Contributor Author

Fedosin commented May 25, 2020

/retest

@Fedosin
Contributor Author

Fedosin commented May 25, 2020

/retest

@Fedosin Fedosin changed the title Bug 1816096: Remove exponential sleeps when we destroy the cluster Bug 1816096: OpenStack: Add conditional exponential backoff for destroy module May 25, 2020
@openshift-ci-robot openshift-ci-robot added the bugzilla/severity-medium Referenced Bugzilla bug's severity is medium for the branch this PR is targeting. label May 25, 2020
@openshift-ci-robot
Contributor

@Fedosin: This pull request references Bugzilla bug 1816096, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.5.0) matches configured target release for branch (4.5.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

In response to this:

Bug 1816096: OpenStack: Add conditional exponential backoff for destroy module


@pierreprinetti
Member

/retest
/bugzilla refresh

@openshift-ci-robot
Contributor

@pierreprinetti: This pull request references Bugzilla bug 1816096, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.6.0) matches configured target release for branch (4.6.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

In response to this:

/retest
/bugzilla refresh


@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 4, 2020
@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Nov 4, 2020
@pierreprinetti
Member

/label platform/openstack
/retest

@openshift-ci-robot openshift-ci-robot added platform/openstack approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Nov 4, 2020
Contributor

I suggest that you use wait.ExponentialBackoff rather than rolling your own exponential backoff.

	// Assumes the surrounding destroy code: dFunction, opts, filter, logger,
	// deleteFuncName, the step counters, and gerr409 (presumably a gophercloud
	// 409 Conflict error) are all defined there.
	backoff := wait.Backoff{
		Duration: 15 * time.Second,
		Factor:   1.3,
		Steps:    25,
	}
	wait.ExponentialBackoff(backoff, func() (bool, error) {
		for ; regularStepsCounter < maxRegularSteps; regularStepsCounter++ {
			finished, err := dFunction(opts, filter, logger)
			if finished {
				return true, err
			}

			// On a Conflict, add exponential sleeping time; otherwise retry immediately.
			if errors.As(err, &gerr409) {
				logger.Debugf("Retry %v with exponential backoff because of the error: %v", deleteFuncName, err)
				return false, nil
			}

			logger.Debugf("Retry %v because of the error: %v", deleteFuncName, err)
		}
		return false, wait.ErrWaitTimeout
	})

Contributor Author

Looks good! It slightly modifies the logic: with this we will have 10 regular attempts for each Conflict retry. But I believe it is mostly the same.

Contributor

Yes, the way that you wrote it, you will have 10 regular attempts for each Conflict retry. If you want to keep the old behavior, then keep the stepsCounter variable out of the condition function.

@openshift-ci-robot
Contributor

New changes are detected. LGTM label has been removed.

@openshift-ci-robot openshift-ci-robot removed the lgtm Indicates that a PR is ready to be merged. label Nov 4, 2020
@Fedosin Fedosin changed the title Bug 1816096: OpenStack: Add conditional exponential backoff for destroy module Bug 1816096: Add conditional exponential backoff for destroy module Nov 4, 2020
@openshift-ci-robot openshift-ci-robot added bugzilla/severity-low Referenced Bugzilla bug's severity is low for the branch this PR is targeting. and removed bugzilla/severity-medium Referenced Bugzilla bug's severity is medium for the branch this PR is targeting. labels Nov 4, 2020
@openshift-ci-robot
Contributor

@Fedosin: This pull request references Bugzilla bug 1816096, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.7.0) matches configured target release for branch (4.7.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

In response to this:

Bug 1816096: Add conditional exponential backoff for destroy module


@Fedosin
Contributor Author

Fedosin commented Nov 4, 2020

/hold

This commit implements a conditional exponential backoff where we
increment waiting time only in case of Conflict (409) errors,
otherwise we can retry immediately.

@staebler staebler left a comment
Contributor

After looking at this again, I am not convinced that this is the correct behavior, or even that the BZ is valid. The correct behavior of the destroyers is to run until either (1) everything is deleted, (2) there is an unrecoverable error such as bad credentials, or (3) the user cancels. It is not generally appropriate to give up on deleting a resource after a period of time.

/hold

@openshift-merge-robot
Contributor

@Fedosin: The following tests failed, say /retest to rerun all failed tests:

Test name              Commit   Rerun command
ci/prow/e2e-crc        ce23114  /test e2e-crc
ci/prow/e2e-ovirt      ce23114  /test e2e-ovirt
ci/prow/e2e-metal-ipi  ce23114  /test e2e-metal-ipi

Full PR test history. Your PR dashboard.



@pierreprinetti
Member

@staebler

It is not generally appropriate to give up on deleting a resource after a period of time

For what it's worth, I agree with this statement. One can always set a timeout on the command as needed. However, I think that the timeout behaviour should be consistent with the install counterpart. Do we want to enforce timeouts on both, or do we want to hang until {unrecoverable_error,success} for both and let the user wrap the command in a timeout call?

@pierreprinetti
Member

In case we go for eliminating the hardcoded timeout, I'd recommend to catch SIGTERM (timeout(1)'s default signal) and log a clearly actionable list of undeleted resources before exiting.
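That idea could look roughly like the following; this is a hedged sketch where `destroyUntilSignal` and `deleteOne` are hypothetical names, and the real installer's destroy loop is more involved:

```go
package main

import (
	"fmt"
	"os"
	"os/signal"
	"syscall"
)

// destroyUntilSignal deletes resources one at a time, but if SIGTERM
// (timeout(1)'s default signal) arrives it stops and returns the list of
// undeleted resources so the caller can log an actionable summary.
func destroyUntilSignal(resources []string, deleteOne func(id string) bool) []string {
	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGTERM)
	defer signal.Stop(sigs)

	for len(resources) > 0 {
		select {
		case <-sigs:
			return resources // caller logs these as actionable leftovers
		default:
		}
		if deleteOne(resources[0]) {
			resources = resources[1:]
		}
	}
	return nil
}

func main() {
	left := destroyUntilSignal([]string{"sg-1", "net-1"}, func(string) bool { return true })
	fmt.Println("undeleted:", left)
}
```

The point is that the exit path on SIGTERM prints exactly which resources survived, so a user who wrapped the command in timeout(1) knows what to clean up.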

@staebler
Contributor

staebler commented Nov 5, 2020

In case we go for eliminating the hardcoded timeout, I'd recommend to catch SIGTERM (timeout(1)'s default signal) and log a clearly actionable list of undeleted resources before exiting.

There is outstanding work to gather undeleted resources for all of the platforms. See #4270 as an example of what has been done for AWS. The additional idea of logging those undeleted resources after catching SIGTERM is a good one.

@staebler
Contributor

staebler commented Nov 5, 2020

@staebler

It is not generally appropriate to give up on deleting a resource after a period of time

For what it's worth, I agree with this statement. One can always set a timeout on the command as needed. However, I think that the timeout behaviour should be consistent with the install counterpart. Do we want to enforce timeouts on both, or do we want to hang until {unrecoverable_error,success} for both and let the user wrap the command in a timeout call?

I disagree that the behavior should be consistent. For installation, if some action does not complete within a prescribed time, then that is an indication that the cluster will not install successfully. There is no use in having that action run indefinitely. For destruction, the expectation is that the installer will delete all of the resources that are found to be part of the cluster. Giving up should be a user decision, not an installer decision. While attempting to delete the resources, the installer should be logging which resources it is having difficulty deleting.

@Fedosin
Contributor Author

Fedosin commented Nov 5, 2020

@staebler The thing is there is no prescribed time for OpenStack. On some deployments deletion may take days (!), and we don't want to fail too early. We even had a bug for this: #2899

@staebler
Contributor

staebler commented Nov 5, 2020

@staebler The thing is there is no prescribed time for OpenStack. On some deployments deletion may take days (!), and we don't want to fail too early. We even had a bug for this: #2899

That is my point. There should not be a prescribed time for any of the destroy actions--not just for OpenStack. For #2899, the solution should not have been to increase the number of steps but rather to remove the cap on the number of steps.

@pierreprinetti
Member

/approve cancel

@staebler I'll leave it to you; I believe this is a problem of consistency rather than of functionality at this point.

@openshift-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please assign fedosin after the PR has been reviewed.
You can assign the PR to them by writing /assign @fedosin in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot removed the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 19, 2020
@Fedosin Fedosin closed this Feb 24, 2021
@openshift-ci-robot
Contributor

@Fedosin: This pull request references Bugzilla bug 1816096. The bug has been updated to no longer refer to the pull request using the external bug tracker. All external bug links have been closed. The bug has been moved to the NEW state.


In response to this:

Bug 1816096: Add conditional exponential backoff for destroy module

