
Conversation

@Fedosin
Contributor

@Fedosin Fedosin commented Mar 26, 2020

This commit implements a conditional exponential backoff: we increase the waiting time only on Conflict (409) errors;
otherwise we retry immediately.

@openshift-ci-robot
Contributor

@Fedosin: This pull request references Bugzilla bug 1816096, which is invalid:

  • expected the bug to target the "4.5.0" release, but it targets "4.4.0" instead

Comment /bugzilla refresh to re-evaluate validity if changes to the Bugzilla bug are made, or edit the title of this pull request to link to a different bug.


In response to this:

Bug 1816096: Set lower retries number for some functions

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot openshift-ci-robot added the bugzilla/invalid-bug Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting. label Mar 26, 2020
@Fedosin
Contributor Author

Fedosin commented Mar 26, 2020

/bugzilla refresh

@openshift-ci-robot openshift-ci-robot added bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. and removed bugzilla/invalid-bug Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting. labels Mar 26, 2020
@openshift-ci-robot
Contributor

@Fedosin: This pull request references Bugzilla bug 1816096, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.5.0) matches configured target release for branch (4.5.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

In response to this:

/bugzilla refresh


@Fedosin Fedosin changed the title Bug 1816096: Set lower retries number for some functions Bug 1816096: OpenStack: Set lower retries number for some functions Mar 26, 2020
@pierreprinetti
Member

If we need 2 days for deleting a resource, it might make sense to just call that a failure.

My proposal: limit the total execution time to something SREs expect (e.g. 2 hours?) and make sure we print an explicit, actionable error. Wdyt?

@Fedosin
Contributor Author

Fedosin commented Mar 26, 2020

If we need 2 days for deleting a resource, it might make sense to just call that a failure.

Some people call it OpenStack :)

My proposal: limit the total execution time to something SREs expect (e.g. 2 hours?) and make sure we print an explicit, actionable error. Wdyt?

2 hours is definitely not enough. You can look at the requirements https://github.com/openshift/installer/blob/master/docs/user/openstack/kuryr.md#requirements-when-enabling-kuryr and see that we create up to 1000 ports in Neutron, 100 networks, and 100 security groups. So I can imagine a situation where deletion may take a day or two.

The problem is that Neutron doesn't support bulk deletion, so we have to send one request at a time, wait until the port is deleted, then repeat. On an overloaded cloud, such requests can take a very long time.
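Since Neutron has no bulk-delete API, the destroyer has to loop over ports one at a time, as in this hypothetical sketch (`deletePortsSequentially` and `deleteOne` are illustrative names, not the installer's real helpers):

```go
package main

import "fmt"

// deletePortsSequentially issues one delete call per port, because Neutron
// has no bulk-delete API. deleteOne is a stand-in for DELETE /v2.0/ports/{id}
// and returns true once the port is gone. Ports that could not be deleted
// are returned so the caller can retry them later.
func deletePortsSequentially(portIDs []string, deleteOne func(id string) bool) []string {
	var remaining []string
	for _, id := range portIDs {
		if !deleteOne(id) {
			remaining = append(remaining, id)
		}
	}
	return remaining
}

func main() {
	left := deletePortsSequentially([]string{"p1", "p2", "p3"}, func(id string) bool {
		return id != "p2" // pretend p2 is still held by a load balancer
	})
	fmt.Println("undeleted:", left)
}
```

With ~1000 ports, each round trip adds up, which is why the total time is unbounded on a slow cloud.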

@Fedosin Fedosin changed the title Bug 1816096: OpenStack: Set lower retries number for some functions Bug 1816096: Remove exponential sleeps when we destroy the cluster Mar 27, 2020
@openshift-ci-robot
Contributor

@Fedosin: This pull request references Bugzilla bug 1816096, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.5.0) matches configured target release for branch (4.5.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

In response to this:

Bug 1816096: Remove exponential sleeps when we destroy the cluster


@Fedosin
Contributor Author

Fedosin commented Mar 27, 2020

/hold

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 27, 2020
@mandre
Member

mandre commented Mar 27, 2020

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Mar 27, 2020
@openshift-ci-robot openshift-ci-robot removed the lgtm Indicates that a PR is ready to be merged. label May 24, 2020
@Fedosin
Contributor Author

Fedosin commented May 25, 2020

/retest

3 similar comments
@Fedosin
Contributor Author

Fedosin commented May 25, 2020

/retest

@Fedosin
Contributor Author

Fedosin commented May 25, 2020

/retest

@Fedosin
Contributor Author

Fedosin commented May 25, 2020

/retest

@Fedosin Fedosin changed the title Bug 1816096: Remove exponential sleeps when we destroy the cluster Bug 1816096: OpenStack: Add conditional exponential backoff for destroy module May 25, 2020
@openshift-ci-robot openshift-ci-robot added the bugzilla/severity-medium Referenced Bugzilla bug's severity is medium for the branch this PR is targeting. label May 25, 2020
@openshift-ci-robot
Contributor

@Fedosin: This pull request references Bugzilla bug 1816096, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.5.0) matches configured target release for branch (4.5.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

In response to this:

Bug 1816096: OpenStack: Add conditional exponential backoff for destroy module


@pierreprinetti
Member

/retest
/bugzilla refresh

@openshift-ci-robot
Contributor

@pierreprinetti: This pull request references Bugzilla bug 1816096, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.6.0) matches configured target release for branch (4.6.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

In response to this:

/retest
/bugzilla refresh


@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 4, 2020
@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Nov 4, 2020
@pierreprinetti
Member

/label platform/openstack
/retest

@openshift-ci-robot openshift-ci-robot added platform/openstack approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Nov 4, 2020
Contributor

I suggest that you use wait.ExponentialBackoff rather than rolling your own exponential backoff.

	// Assumes the surrounding destroy code: dFunction, opts, filter, logger,
	// deleteFuncName, the step counters, and gerr409 (presumably a gophercloud
	// 409 Conflict error) are all defined there.
	backoff := wait.Backoff{
		Duration: 15 * time.Second,
		Factor:   1.3,
		Steps:    25,
	}
	wait.ExponentialBackoff(backoff, func() (bool, error) {
		for ; regularStepsCounter < maxRegularSteps; regularStepsCounter++ {
			finished, err := dFunction(opts, filter, logger)
			if finished {
				return true, err
			}

			// On a Conflict, add exponential sleeping time; otherwise retry immediately.
			if errors.As(err, &gerr409) {
				logger.Debugf("Retry %v with exponential backoff because of the error: %v", deleteFuncName, err)
				return false, nil
			}

			logger.Debugf("Retry %v because of the error: %v", deleteFuncName, err)
		}
		return false, wait.ErrWaitTimeout
	})

Contributor Author

Looks good! It slightly modifies the logic: with this we will have 10 regular attempts for each Conflict retry. But I believe it is mostly the same.

Contributor

Yes, the way that you wrote it, you will have 10 regular attempts for each Conflict retry. If you want to keep the old behavior, then keep the stepsCounter variable out of the condition function.

@openshift-ci-robot
Contributor

New changes are detected. LGTM label has been removed.

@openshift-ci-robot openshift-ci-robot removed the lgtm Indicates that a PR is ready to be merged. label Nov 4, 2020
@Fedosin Fedosin changed the title Bug 1816096: OpenStack: Add conditional exponential backoff for destroy module Bug 1816096: Add conditional exponential backoff for destroy module Nov 4, 2020
@openshift-ci-robot openshift-ci-robot added bugzilla/severity-low Referenced Bugzilla bug's severity is low for the branch this PR is targeting. and removed bugzilla/severity-medium Referenced Bugzilla bug's severity is medium for the branch this PR is targeting. labels Nov 4, 2020
@openshift-ci-robot
Contributor

@Fedosin: This pull request references Bugzilla bug 1816096, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.7.0) matches configured target release for branch (4.7.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

In response to this:

Bug 1816096: Add conditional exponential backoff for destroy module


@Fedosin
Contributor Author

Fedosin commented Nov 4, 2020

/hold

This commit implements a conditional exponential backoff where we
increment waiting time only in case of Conflict (409) errors,
otherwise we can retry immediately.

@staebler staebler left a comment
Contributor

After looking at this again, I am not convinced that this is the correct behavior, or even that the BZ is valid. The correct behavior of the destroyers is to run until either (1) everything is deleted, (2) there is an unrecoverable error such as bad credentials, or (3) the user cancels. It is not generally appropriate to give up on deleting a resource after a period of time.

/hold

@openshift-merge-robot
Contributor

@Fedosin: The following tests failed, say /retest to rerun all failed tests:

Test name              Commit   Rerun command
ci/prow/e2e-crc        ce23114  /test e2e-crc
ci/prow/e2e-ovirt      ce23114  /test e2e-ovirt
ci/prow/e2e-metal-ipi  ce23114  /test e2e-metal-ipi

Full PR test history. Your PR dashboard.



@pierreprinetti
Member

@staebler

It is not generally appropriate to give up on deleting a resource after a period of time

For what it's worth, I agree with this statement. One can always set a timeout on the command as needed. However, I think that the timeout behaviour should be consistent with the install counterpart. Do we want to enforce timeouts on both, or do we want to hang until {unrecoverable_error,success} for both and let the user wrap the command in a timeout call?

@pierreprinetti
Member

In case we go for eliminating the hardcoded timeout, I'd recommend to catch SIGTERM (timeout(1)'s default signal) and log a clearly actionable list of undeleted resources before exiting.
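That idea could look roughly like the following; this is a hedged sketch where `destroyUntilSignal` and `deleteOne` are hypothetical names, and the real installer's destroy loop is more involved:

```go
package main

import (
	"fmt"
	"os"
	"os/signal"
	"syscall"
)

// destroyUntilSignal deletes resources one at a time, but if SIGTERM
// (timeout(1)'s default signal) arrives it stops and returns the list of
// undeleted resources so the caller can log an actionable summary.
func destroyUntilSignal(resources []string, deleteOne func(id string) bool) []string {
	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGTERM)
	defer signal.Stop(sigs)

	for len(resources) > 0 {
		select {
		case <-sigs:
			return resources // caller logs these as actionable leftovers
		default:
		}
		if deleteOne(resources[0]) {
			resources = resources[1:]
		}
	}
	return nil
}

func main() {
	left := destroyUntilSignal([]string{"sg-1", "net-1"}, func(string) bool { return true })
	fmt.Println("undeleted:", left)
}
```

The point is that the exit path on SIGTERM prints exactly which resources survived, so a user who wrapped the command in timeout(1) knows what to clean up.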

@staebler
Contributor

staebler commented Nov 5, 2020

In case we go for eliminating the hardcoded timeout, I'd recommend to catch SIGTERM (timeout(1)'s default signal) and log a clearly actionable list of undeleted resources before exiting.

There is outstanding work to gather undeleted resources for all of the platforms. See #4270 as an example of what has been done for AWS. The additional idea of logging those undeleted resources after catching SIGTERM is a good one.

@staebler
Contributor

staebler commented Nov 5, 2020

@staebler

It is not generally appropriate to give up on deleting a resource after a period of time

For what it's worth, I agree with this statement. One can always set a timeout on the command as needed. However, I think that the timeout behaviour should be consistent with the install counterpart. Do we want to enforce timeouts on both, or do we want to hang until {unrecoverable_error,success} for both and let the user wrap the command in a timeout call?

I disagree that the behavior should be consistent. For installation, if some action does not complete within a prescribed time, then that is an indication that the cluster will not install successfully. There is no use in having that action run indefinitely. For destruction, the expectation is that the installer will delete all of the resources that are found to be part of the cluster. Giving up should be a user decision, not an installer decision. While attempting to delete the resources, the installer should be logging which resources it is having difficulty deleting.

@Fedosin
Contributor Author

Fedosin commented Nov 5, 2020

@staebler The thing is there is no prescribed time for OpenStack. On some deployments deletion may take days (!), and we don't want to fail too early. We even had a bug for this: #2899

@staebler
Contributor

staebler commented Nov 5, 2020

@staebler The thing is there is no prescribed time for OpenStack. On some deployments deletion may take days (!), and we don't want to fail too early. We even had a bug for this: #2899

That is my point. There should not be a prescribed time for any of the destroy actions--not just for OpenStack. For #2899, the solution should not have been to increase the number of steps but rather to remove the cap on the number of steps.

@pierreprinetti
Member

/approve cancel

@staebler I'll leave it to you; I believe this is a problem of consistency rather than of functionality at this point.

@openshift-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please assign fedosin after the PR has been reviewed.
You can assign the PR to them by writing /assign @fedosin in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot removed the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 19, 2020
@Fedosin Fedosin closed this Feb 24, 2021
@openshift-ci-robot
Contributor

@Fedosin: This pull request references Bugzilla bug 1816096. The bug has been updated to no longer refer to the pull request using the external bug tracker. All external bug links have been closed. The bug has been moved to the NEW state.


In response to this:

Bug 1816096: Add conditional exponential backoff for destroy module

