Bug 1816096: Add conditional exponential backoff for destroy module #3350
Conversation
@Fedosin: This pull request references Bugzilla bug 1816096, which is invalid.
/bugzilla refresh
@Fedosin: This pull request references Bugzilla bug 1816096, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker. 3 validations were run on this bug.
If we need 2 days for deleting a resource, it might make sense to just call that a failure. My proposal: limit the total execution time to something SREs expect (e.g. 2 hours?) and make sure we print an explicit, actionable error. Wdyt?
Some people call it OpenStack :)
2 hours is definitely not enough. You can look at the requirements https://github.com/openshift/installer/blob/master/docs/user/openstack/kuryr.md#requirements-when-enabling-kuryr and see that we create up to 1000 ports in Neutron, 100 networks and 100 security groups, so I can imagine a situation where deletion takes a day or two. The problem is that Neutron doesn't support bulk deletion, so we have to send one request at a time, wait until the port is deleted, then repeat. On an overloaded cloud, such requests can take a very long time.
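For illustration, a minimal sketch of the one-request-per-port deletion this implies, assuming gophercloud's networking v2 ports package; the helper name and error handling are illustrative, not installer code:

import (
	"github.com/gophercloud/gophercloud"
	"github.com/gophercloud/gophercloud/openstack/networking/v2/ports"
)

// deletePortsOneByOne sends a separate DELETE request per port, because
// Neutron has no bulk-delete API, and returns the IDs it could not delete
// so the caller can retry them later.
func deletePortsOneByOne(client *gophercloud.ServiceClient, portIDs []string) []string {
	var remaining []string
	for _, id := range portIDs {
		if err := ports.Delete(client, id).ExtractErr(); err != nil {
			// On an overloaded cloud each request can be slow or fail.
			remaining = append(remaining, id)
		}
	}
	return remaining
}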
@Fedosin: This pull request references Bugzilla bug 1816096, which is valid. 3 validations were run on this bug.
/hold |
/lgtm |
/retest |
/retest |
/retest |
/retest |
@Fedosin: This pull request references Bugzilla bug 1816096, which is valid. 3 validations were run on this bug.
/retest |
@pierreprinetti: This pull request references Bugzilla bug 1816096, which is valid. 3 validations were run on this bug.
/label platform/openstack |
pkg/destroy/openstack/openstack.go
Outdated
I suggest that you use wait.ExponentialBackoff rather than roll your own exponential backoff.
backoff := wait.Backoff{
	Duration: 15 * time.Second,
	Factor:   1.3,
	Steps:    25,
}
wait.ExponentialBackoff(backoff, func() (bool, error) {
	for ; regularStepsCounter < maxRegularSteps; regularStepsCounter++ {
		finished, err := dFunction(opts, filter, logger)
		if finished {
			return true, err
		}
		// If we have a Conflict, add exponential sleeping time, otherwise retry immediately.
		if errors.As(err, &gerr409) {
			logger.Debugf("Retry %v with exponential backoff because of the error: %v.", deleteFuncName, err)
			return false, nil
		}
		logger.Debugf("Retry %v because of the error: %v", deleteFuncName, err)
	}
	return false, wait.ErrWaitTimeout
})
Looks good! It slightly modifies the logic: with this we will have 10 regular attempts for each Conflict retry. But I believe it is mostly the same.
Yes, the way that you wrote it you will have 10 regular attempts for each conflict retry. If you want to keep the old behavior, then keep the stepsCounter variable out of the condition function.
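For reference, a sketch of keeping the counter outside the condition function, so the regular-retry budget is shared across all backoff steps rather than reset on every Conflict; the names mirror the snippet above and the surrounding declarations are assumed:

// Declared once, outside the closure: the cap on regular retries is global.
regularStepsCounter := 0
backoff := wait.Backoff{Duration: 15 * time.Second, Factor: 1.3, Steps: 25}
wait.ExponentialBackoff(backoff, func() (bool, error) {
	for ; regularStepsCounter < maxRegularSteps; regularStepsCounter++ {
		finished, err := dFunction(opts, filter, logger)
		if finished {
			return true, err
		}
		if errors.As(err, &gerr409) {
			// Conflict: let wait.ExponentialBackoff sleep before the next try.
			return false, nil
		}
		// Any other error: retry immediately within this backoff step.
	}
	return false, wait.ErrWaitTimeout
})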
Force-pushed from 9973e62 to 4cccada
New changes are detected. LGTM label has been removed.
@Fedosin: This pull request references Bugzilla bug 1816096, which is valid. 3 validations were run on this bug.
/hold |
This commit implements a conditional exponential backoff where we increment waiting time only in case of Conflict (409) errors, otherwise we can retry immediately.
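A minimal, self-contained sketch of that idea (not the PR's actual code): the delay grows only after a 409 Conflict, while every other error is retried immediately. The gophercloud error type and the callback signature are assumptions:

import (
	"errors"
	"fmt"
	"time"

	"github.com/gophercloud/gophercloud"
)

// deleteWithConditionalBackoff retries deleteFunc until it reports completion
// or the attempt budget is exhausted.
func deleteWithConditionalBackoff(deleteFunc func() (bool, error), maxAttempts int) error {
	delay := 15 * time.Second
	for attempt := 0; attempt < maxAttempts; attempt++ {
		finished, err := deleteFunc()
		if finished {
			return err
		}
		var conflict gophercloud.ErrDefault409
		if errors.As(err, &conflict) {
			// Conflict: the resource is still in use, so back off exponentially.
			time.Sleep(delay)
			delay = time.Duration(float64(delay) * 1.3)
			continue
		}
		// Any other error: retry immediately.
	}
	return fmt.Errorf("gave up after %d attempts", maxAttempts)
}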
Force-pushed from 4cccada to ce23114
staebler left a comment:
After looking at this again, I am not convinced that this is the correct behavior, or even that the BZ is valid. The correct behavior of the destroyers is to run until either (1) everything is deleted, (2) there is an unrecoverable error such as bad credentials, or (3) the user cancels. It is not generally appropriate to give up on deleting a resource after a period of time.
/hold
@Fedosin: The following tests failed.
For what it's worth, I agree with this statement. One can always set a
In case we go for eliminating the hardcoded timeout, I'd recommend catching SIGTERM and logging the resources that have not been deleted yet.
There is outstanding work to gather undeleted resources for all of the platforms. See #4270 as an example of what has been done for AWS. The additional idea of logging those undeleted resources after catching SIGTERM is a good one.
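A rough sketch of what catching SIGTERM and reporting leftover resources could look like in Go; the destroy loop and resource names here are placeholders, not installer code:

package main

import (
	"context"
	"fmt"
	"os"
	"os/signal"
	"syscall"
	"time"
)

// destroyLoop stands in for the real destroy logic: it keeps working until
// everything is gone or the context is cancelled, and returns what is left.
func destroyLoop(ctx context.Context, resources []string) []string {
	remaining := resources
	for len(remaining) > 0 {
		select {
		case <-ctx.Done():
			return remaining
		case <-time.After(time.Second):
			remaining = remaining[1:] // pretend one resource was deleted
		}
	}
	return nil
}

func main() {
	// Cancel the destroy context when the user sends SIGINT or SIGTERM.
	ctx, stop := signal.NotifyContext(context.Background(), os.Interrupt, syscall.SIGTERM)
	defer stop()

	leftover := destroyLoop(ctx, []string{"port-1", "port-2", "network-1"})
	for _, r := range leftover {
		// Make the undeleted resources visible before exiting.
		fmt.Printf("not deleted: %s\n", r)
	}
}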
I disagree that the behavior should be consistent. For installation, if some action does not complete within a prescribed time, then that is an indication that the cluster will not install successfully. There is no use in having that action run indefinitely. For destruction, the expectation is that the installer will delete all of the resources that are found to be part of the cluster. Giving up should be a user decision, not an installer decision. While attempting to delete the resources, the installer should be logging which resources it is having difficulty deleting.
That is my point. There should not be a timeout at all.
/approve cancel
@staebler I'll leave it to you; I believe that this is a problem of consistency rather than functionality at this point.
[APPROVALNOTIFIER] This PR is NOT APPROVED. Needs approval from an approver in each of these files; approvers can indicate their approval by writing /approve in a comment. The full list of commands accepted by this bot can be found here.
@Fedosin: This pull request references Bugzilla bug 1816096. The bug has been updated to no longer refer to the pull request using the external bug tracker. All external bug links have been closed. The bug has been moved to the NEW state.