Always retry provisioning operations on failure #584

Closed
honza wants to merge 1 commit into metal3-io:master from honza:keep-retrying

Conversation

@honza (Member) commented Jul 10, 2020

No description provided.

@metal3-io-bot (Contributor)

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: honza
To complete the pull request process, please assign dhellmann
You can assign the PR to them by writing /assign @dhellmann in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@metal3-io-bot metal3-io-bot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Jul 10, 2020
@dhellmann (Member)

/test-integration

@dhellmann (Member)

/cc @maelk @zaneb

@metal3-io-bot metal3-io-bot requested review from maelk and zaneb July 10, 2020 17:45
@dhellmann (Member)

This change looks OK. It's going to be a big behavior change, but should result in fewer hosts getting "stuck" in bad states.

@maelk (Member) commented Jul 13, 2020

I am a bit concerned that if the input provided by the user is incorrect, we will loop continuously, trying to deploy. That might hide the failure (since the node would not sit in an error state, but would keep changing states between provisioning and error, probably spending most of its time in the provisioning state). Would we have a way to break the infinite loop, so we don't use too many resources when we have already failed multiple times, and would the last error still appear on the node?

@dhellmann (Member)

Looping is a bit of a concern. I think that's probably a better situation than what we have now, where the host enters an error state and the user can't get it out without something relatively drastic.

Maybe we could keep the error message, but clear the error state? And clear the error message when a host enters the provisioned state?

@zaneb (Member) commented Jul 17, 2020

I think it could be problematic to do the retry within the Provisioning state, for the reasons identified already above.
My assumption was that what we want to do is continuously loop from Provisioning->Deprovisioning->Ready->Provisioning as long as provisioning is requested - preferably with a longish (perhaps geometrically increasing) delay between retries. Theoretically this could be accomplished by simply returning a result scheduling a requeue in actionFailure, but we should audit all of the places actionFailure is returned to make sure there aren't any that should actually be regarded as permanent failures (i.e. no need to reconcile until the user updates the resource, because the current configuration can never succeed).
I think somebody mentioned that it would be good to keep a failure count in the status as well (and that would be necessary to implement increasing backoff delays).
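
For illustration, a minimal sketch (not the actual baremetal-operator code) of retrying via a requeue with an increasing delay: the reconcile.Result type is the real controller-runtime one, but the helper name, the error-count argument, and the one-minute base delay are assumptions.

```go
// Hypothetical helper: retry by scheduling a requeue instead of leaving
// the host parked in an error state.
package main

import (
	"fmt"
	"time"

	"sigs.k8s.io/controller-runtime/pkg/reconcile"
)

// requeueAfterFailure doubles the delay with each recorded failure,
// capped at maxDelay, and returns a result asking controller-runtime
// to reconcile the host again after that delay.
func requeueAfterFailure(errorCount int, maxDelay time.Duration) reconcile.Result {
	if errorCount > 10 { // beyond ~10 doublings we are past any sane cap anyway
		return reconcile.Result{RequeueAfter: maxDelay}
	}
	delay := time.Minute << uint(errorCount) // 1m, 2m, 4m, ...
	if delay > maxDelay {
		delay = maxDelay
	}
	return reconcile.Result{RequeueAfter: delay}
}

func main() {
	for failures := 0; failures < 6; failures++ {
		fmt.Println(failures, requeueAfterFailure(failures, 2*time.Hour).RequeueAfter)
	}
}
```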

@n1r1 (Member) commented Jul 21, 2020

FWIW, today I encountered something that might be related to this.
I had a cluster where all bmh CRs were in the registration error state; I had to open some ports in the firewall to make it work.
After doing so, I expected there to be some retry. I even tried changing the CR to trigger reconciliation, but it stayed stuck in the same state.
The only way to make it retry was to change the secret (even though it was already correct) - see this code.

@stbenjam (Member)

> I am a bit concerned that if the input provided by the user is incorrect, we will loop continuously, trying to deploy. That might hide the failure (since the node would not sit in an error state, but would keep changing states between provisioning and error, probably spending most of its time in the provisioning state). Would we have a way to break the infinite loop, so we don't use too many resources when we have already failed multiple times, and would the last error still appear on the node?

The solution there is probably exponential backoff. The Kubernetes thing to do is to retry: it's hard to know whether failures are temporary or not, and we should constantly be trying to get to our desired state, even if we do so less frequently after multiple failures.

@honza changed the title from "ironic: always retry provisioning operations on failure" to "Always retry provisioning operations on failure" on Jul 25, 2020
@honza (Member, Author) commented Jul 25, 2020

I tried my hand at the actionFailed idea, including error counting and exponential backoff. I sprinkled the error counting in somewhat liberally because I wasn't 100% sure where it was necessary.

@maelk (Member) left a comment

What about resetting the error count whenever the deployment (or the ongoing operation) succeeds? It might be confusing to have a successfully deployed host with an error count of X. Also, the count would be kept across deployments: after the first deployment it might show 10 failures, and after the second it would show 15, even though the second deployment only failed 5 times. So I think the error count needs to be reset at some point.

@zaneb (Member) left a comment

> What about resetting the error count whenever the deployment (or the ongoing operation) succeeds?

+1.
I think we actually need different counts for different types of error, and to clear them when that error is resolved.
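
A small sketch of what per-error-type counters with a reset could look like; the field and method names here are hypothetical, not the real BareMetalHost status fields.

```go
// Hypothetical status bookkeeping: one counter per error type, cleared
// when that kind of error stops occurring, so a successful deploy never
// carries stale counts from an earlier, unrelated failure.
package main

import "fmt"

type errorType string

const (
	registrationError errorType = "registration error"
	provisioningError errorType = "provisioning error"
)

type hostStatus struct {
	ErrorCounts map[errorType]int
}

// recordFailure bumps the counter for the error type that just occurred.
func (s *hostStatus) recordFailure(t errorType) {
	if s.ErrorCounts == nil {
		s.ErrorCounts = map[errorType]int{}
	}
	s.ErrorCounts[t]++
}

// clearError is called when an operation of the given type succeeds,
// resetting only that counter and leaving unrelated ones untouched.
func (s *hostStatus) clearError(t errorType) {
	delete(s.ErrorCounts, t)
}

func main() {
	s := &hostStatus{}
	s.recordFailure(provisioningError)
	s.recordFailure(provisioningError)
	s.clearError(provisioningError)
	fmt.Println(len(s.ErrorCounts)) // 0
}
```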

info.log.Info("response from validate", "provResult", provResult)

if provResult.ErrorMessage != "" {
info.host.IncrementErrorCount()
Member

Any reason not to put this inside recordActionFailure?
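
For illustration, a rough sketch of that suggestion with hypothetical types (the real recordActionFailure signature in the operator may differ): increment the counter inside the single failure-recording helper rather than at each call site.

```go
// Hypothetical sketch: the error count is bumped in one place,
// recordActionFailure, instead of wherever an ErrorMessage shows up.
package main

import "fmt"

type host struct {
	errorCount   int
	errorMessage string
}

func (h *host) IncrementErrorCount() { h.errorCount++ }

type actionResult struct {
	errorMessage string
}

// recordActionFailure stores the error on the host and, per the review
// suggestion, also increments the error count as part of recording it.
func recordActionFailure(h *host, message string) actionResult {
	h.errorMessage = message
	h.IncrementErrorCount()
	return actionResult{errorMessage: message}
}

func main() {
	h := &host{}
	recordActionFailure(h, "image provisioning failed")
	fmt.Println(h.errorCount, h.errorMessage)
}
```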

"sigs.k8s.io/controller-runtime/pkg/reconcile"
)

const maxBackOff = time.Hour * 24
Member

I feel like this could probably go as short as an hour or two.


func calculateBackoff(errorCount int, max time.Duration) time.Duration {
backOff := math.Exp2(float64(errorCount))
backOffDuration := time.Second * time.Duration(backOff)
Member

2s is a very short back-off to start with for an operation as long as e.g. provisioning. The user may not even have time to notice that it has failed. Maybe s/Second/Minute/ here?

if backOffDuration.Milliseconds() > max.Milliseconds() {
return max
}
return backOffDuration
Member

Having fixed delays (even exponentially increasing ones) is prone to causing thundering herd problems. We should at least add some jitter on top. (We could go as far as to implement exponential backoff in the CSMA sense, where we wait for a random interval between 0 and backOffDuration, but resource contention isn't our primary reason for backing off here so my instinct is that that would be overkill.)
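
A minimal sketch of adding jitter on top of the exponential backoff, assuming the same errorCount/max inputs as calculateBackoff above and a one-minute base per the earlier review comment; the 10% jitter fraction is an arbitrary choice for illustration.

```go
// Hypothetical variant of calculateBackoff that adds up to 10% random
// jitter so hosts that failed together do not all retry at the same time.
package main

import (
	"fmt"
	"math"
	"math/rand"
	"time"
)

func calculateBackoffWithJitter(errorCount int, max time.Duration) time.Duration {
	backOff := max
	if errorCount < 20 { // past ~20 doublings we are beyond any sane cap anyway
		backOff = time.Duration(math.Exp2(float64(errorCount))) * time.Minute
		if backOff > max {
			backOff = max
		}
	}
	// Add up to 10% extra delay, chosen at random, to spread retries out.
	if tenth := int64(backOff / 10); tenth > 0 {
		backOff += time.Duration(rand.Int63n(tenth))
	}
	return backOff
}

func main() {
	for i := 0; i < 5; i++ {
		fmt.Println(i, calculateBackoffWithJitter(i, 2*time.Hour))
	}
}
```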

Member

We have a separate need to throttle the number of things we ask ironic to do at once anyway, so maybe we can solve the herd problem that way to avoid complicating the logic here?

@dhellmann (Member) left a comment

I agree with @maelk's comment about resetting the counter when the error is cleared.

@dhellmann (Member)

/close

We are going to use #610 instead

@metal3-io-bot (Contributor)

@dhellmann: Closed this PR.

In response to this:

/close

We are going to use #610 instead

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
