Always retry provisioning operations on failure #584

Closed
honza wants to merge 1 commit into metal3-io:master from honza:keep-retrying

Conversation

@honza (Member) commented Jul 10, 2020

No description provided.

@metal3-io-bot (Contributor)

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: honza
To complete the pull request process, please assign dhellmann
You can assign the PR to them by writing /assign @dhellmann in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@metal3-io-bot metal3-io-bot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Jul 10, 2020
@dhellmann (Member)

/test-integration

@dhellmann (Member)

/cc @maelk @zaneb

@metal3-io-bot metal3-io-bot requested review from maelk and zaneb July 10, 2020 17:45
@dhellmann (Member)

This change looks OK. It's going to be a big behavior change, but should result in fewer hosts getting "stuck" in bad states.

@maelk (Member) commented Jul 13, 2020

I am a bit concerned that if the input provided by the user is incorrect, we will loop continuously, trying to deploy. That might hide the failure (since the node would not sit in an error state, but would keep changing states between provisioning and error, probably spending most of its time in the provisioning state). Would we have a way to break the infinite loop, so we don't use too many resources when we have already failed multiple times, and would the last error still appear on the node?

@dhellmann (Member)

Looping is a bit of a concern. I think that's probably a better situation than what we have now, where the host enters an error state and the user can't get it out without something relatively drastic.

Maybe we could keep the error message, but clear the error state? And clear the error message when a host enters the provisioned state?

@zaneb (Member) commented Jul 17, 2020

I think it could be problematic to do the retry within the Provisioning state, for the reasons identified already above.
My assumption was that what we want to do is continuously loop from Provisioning->Deprovisioning->Ready->Provisioning as long as provisioning is requested - preferably with a longish (perhaps geometrically increasing) delay between retries. Theoretically this could be accomplished by simply returning a result scheduling a requeue in actionFailure, but we should audit all of the places actionFailure is returned to make sure there aren't any that should actually be regarded as permanent failures (i.e. no need to reconcile until the user updates the resource, because the current configuration can never succeed).
I think somebody mentioned that it would be good to keep a failure count in the status as well (and that would be necessary to implement increasing backoff delays).
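
For illustration, a minimal sketch (not the actual baremetal-operator code) of retrying via a requeue with an increasing delay: the reconcile.Result type is the real controller-runtime one, but the helper name, the error-count argument, and the one-minute base delay are assumptions.

```go
// Hypothetical helper: retry by scheduling a requeue instead of leaving
// the host parked in an error state.
package main

import (
	"fmt"
	"time"

	"sigs.k8s.io/controller-runtime/pkg/reconcile"
)

// requeueAfterFailure doubles the delay with each recorded failure,
// capped at maxDelay, and returns a result asking controller-runtime
// to reconcile the host again after that delay.
func requeueAfterFailure(errorCount int, maxDelay time.Duration) reconcile.Result {
	if errorCount > 10 { // beyond ~10 doublings we are past any sane cap anyway
		return reconcile.Result{RequeueAfter: maxDelay}
	}
	delay := time.Minute << uint(errorCount) // 1m, 2m, 4m, ...
	if delay > maxDelay {
		delay = maxDelay
	}
	return reconcile.Result{RequeueAfter: delay}
}

func main() {
	for failures := 0; failures < 6; failures++ {
		fmt.Println(failures, requeueAfterFailure(failures, 2*time.Hour).RequeueAfter)
	}
}
```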

@n1r1 (Member) commented Jul 21, 2020

FWIW, today I encountered something that might be related to this.
I had a cluster where all bmh CRs were in the registration error state; I had to open some ports in the firewall to make it work.
After doing so, I expected there to be some retry. I even tried changing the CR to trigger reconciliation, but it stayed stuck in the same state.
The only way to make it retry was to change the secret (even though it was already correct) - see this code.

@stbenjam (Member)

> I am a bit concerned that if the input provided by the user is incorrect, we will loop continuously, trying to deploy. That might hide the failure (since the node would not sit in an error state, but would keep changing states between provisioning and error, probably spending most of its time in the provisioning state). Would we have a way to break the infinite loop, so we don't use too many resources when we have already failed multiple times, and would the last error still appear on the node?

The solution there is probably exponential backoff. The Kubernetes thing to do is to retry: it's hard to know whether failures are temporary or not, and we should constantly be trying to get to our desired state, even if we do so less frequently after multiple failures.

@honza changed the title from "ironic: always retry provisioning operations on failure" to "Always retry provisioning operations on failure" on Jul 25, 2020
@honza (Member, Author) commented Jul 25, 2020

I tried my hand at the actionFailed idea, including error counting and exponential backoff. I sprinkled the error counting in somewhat liberally because I wasn't 100% sure where it was necessary.

@maelk (Member) left a comment

What about resetting the error count whenever the deployment (or the ongoing operation) succeeds? It might be confusing to have a successfully deployed host with an error count of X. Also, the count would be kept across deployments: after the first deployment it might show 10 failures, and after the second it would show 15, even though the second deployment only failed 5 times. So I think the error count needs to be reset at some point.

@zaneb (Member) left a comment

> What about resetting the error count whenever the deployment (or the ongoing operation) succeeds?

+1.
I think we actually need different counts for different types of error, and to clear them when that error is resolved.
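
A small sketch of what per-error-type counters with a reset could look like; the field and method names here are hypothetical, not the real BareMetalHost status fields.

```go
// Hypothetical status bookkeeping: one counter per error type, cleared
// when that kind of error stops occurring, so a successful deploy never
// carries stale counts from an earlier, unrelated failure.
package main

import "fmt"

type errorType string

const (
	registrationError errorType = "registration error"
	provisioningError errorType = "provisioning error"
)

type hostStatus struct {
	ErrorCounts map[errorType]int
}

// recordFailure bumps the counter for the error type that just occurred.
func (s *hostStatus) recordFailure(t errorType) {
	if s.ErrorCounts == nil {
		s.ErrorCounts = map[errorType]int{}
	}
	s.ErrorCounts[t]++
}

// clearError is called when an operation of the given type succeeds,
// resetting only that counter and leaving unrelated ones untouched.
func (s *hostStatus) clearError(t errorType) {
	delete(s.ErrorCounts, t)
}

func main() {
	s := &hostStatus{}
	s.recordFailure(provisioningError)
	s.recordFailure(provisioningError)
	s.clearError(provisioningError)
	fmt.Println(len(s.ErrorCounts)) // 0
}
```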

info.log.Info("response from validate", "provResult", provResult)

if provResult.ErrorMessage != "" {
info.host.IncrementErrorCount()
Member

Any reason not to put this inside recordActionFailure?
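
For illustration, a rough sketch of that suggestion with hypothetical types (the real recordActionFailure signature in the operator may differ): increment the counter inside the single failure-recording helper rather than at each call site.

```go
// Hypothetical sketch: the error count is bumped in one place,
// recordActionFailure, instead of wherever an ErrorMessage shows up.
package main

import "fmt"

type host struct {
	errorCount   int
	errorMessage string
}

func (h *host) IncrementErrorCount() { h.errorCount++ }

type actionResult struct {
	errorMessage string
}

// recordActionFailure stores the error on the host and, per the review
// suggestion, also increments the error count as part of recording it.
func recordActionFailure(h *host, message string) actionResult {
	h.errorMessage = message
	h.IncrementErrorCount()
	return actionResult{errorMessage: message}
}

func main() {
	h := &host{}
	recordActionFailure(h, "image provisioning failed")
	fmt.Println(h.errorCount, h.errorMessage)
}
```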

"sigs.k8s.io/controller-runtime/pkg/reconcile"
)

const maxBackOff = time.Hour * 24
Member

I feel like this could probably go as short as an hour or two.


func calculateBackoff(errorCount int, max time.Duration) time.Duration {
backOff := math.Exp2(float64(errorCount))
backOffDuration := time.Second * time.Duration(backOff)
Member

2s is a very short back-off to start with for an operation as long as e.g. provisioning. The user may not even have time to notice that it has failed. Maybe s/Second/Minute/ here?

if backOffDuration.Milliseconds() > max.Milliseconds() {
return max
}
return backOffDuration
Member

Having fixed delays (even exponentially increasing ones) is prone to causing thundering herd problems. We should at least add some jitter on top. (We could go as far as to implement exponential backoff in the CSMA sense, where we wait for a random interval between 0 and backOffDuration, but resource contention isn't our primary reason for backing off here so my instinct is that that would be overkill.)
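
A minimal sketch of adding jitter on top of the exponential backoff, assuming the same errorCount/max inputs as calculateBackoff above and a one-minute base per the earlier review comment; the 10% jitter fraction is an arbitrary choice for illustration.

```go
// Hypothetical variant of calculateBackoff that adds up to 10% random
// jitter so hosts that failed together do not all retry at the same time.
package main

import (
	"fmt"
	"math"
	"math/rand"
	"time"
)

func calculateBackoffWithJitter(errorCount int, max time.Duration) time.Duration {
	backOff := max
	if errorCount < 20 { // past ~20 doublings we are beyond any sane cap anyway
		backOff = time.Duration(math.Exp2(float64(errorCount))) * time.Minute
		if backOff > max {
			backOff = max
		}
	}
	// Add up to 10% extra delay, chosen at random, to spread retries out.
	if tenth := int64(backOff / 10); tenth > 0 {
		backOff += time.Duration(rand.Int63n(tenth))
	}
	return backOff
}

func main() {
	for i := 0; i < 5; i++ {
		fmt.Println(i, calculateBackoffWithJitter(i, 2*time.Hour))
	}
}
```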

Member

We have a separate need to throttle the number of things we ask ironic to do at once anyway, so maybe we can solve the herd problem that way to avoid complicating the logic here?

@dhellmann (Member) left a comment

I agree with @maelk's comment about resetting the counter when the error is cleared.

@dhellmann (Member)

/close

We are going to use #610 instead

@metal3-io-bot (Contributor)

@dhellmann: Closed this PR.

In response to this:

/close

We are going to use #610 instead

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
