Always retry provisioning operations on failure #584
honza wants to merge 1 commit into metal3-io:master
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: honza. The full list of commands accepted by this bot can be found here. Needs approval from an approver in each of these files. Approvers can indicate their approval by writing `/approve` in a comment.
/test-integration
This change looks OK. It's going to be a big behavior change, but should result in fewer hosts getting "stuck" in bad states.
I am a bit concerned that if the input provided by the user is incorrect, we will be continuously looping, trying to deploy. That might hide the failure (since the node would not sit in the error state, but would cycle between provisioning and error, probably spending most of its time in the provisioning state). Would we have a way to break an infinite loop to avoid using too many resources when we have already failed multiple times, and would the last error still appear on the node?
Looping is a bit of a concern. I think that's probably a better situation than what we have now, where the host enters an error state and the user can't get it out without something relatively drastic. Maybe we could keep the error message, but clear the error state? And clear the error message when a host enters the provisioned state?
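A minimal sketch of that idea in Go, using invented, simplified types and field names rather than the operator's actual `BareMetalHost` API:

```go
package main

import "fmt"

// HostStatus is a hypothetical, simplified stand-in for the host's status;
// the real resource has more fields and different names.
type HostStatus struct {
	OperationalStatus string
	ErrorMessage      string
}

// retryKeepingMessage clears the error state so the state machine can retry,
// but leaves the last error message in place for the user to see.
func retryKeepingMessage(h *HostStatus) {
	h.OperationalStatus = "OK"
}

// onProvisioned wipes the stale message once the host actually provisions.
func onProvisioned(h *HostStatus) {
	h.ErrorMessage = ""
}

func main() {
	h := &HostStatus{OperationalStatus: "error", ErrorMessage: "ipmi timeout"}
	retryKeepingMessage(h)
	fmt.Printf("%+v\n", *h) // {OperationalStatus:OK ErrorMessage:ipmi timeout}
	onProvisioned(h)
	fmt.Printf("%+v\n", *h) // {OperationalStatus:OK ErrorMessage:}
}
```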
I think it could be problematic to do the retry within the Provisioning state, for the reasons identified already above.
FWIW today I encountered something that might be related to this.
The solution there is probably exponential backoff, but the Kubernetes thing to do is to retry; it's hard to know whether failures are temporary or not, and we should constantly be trying to reach our desired state, even if we try less frequently after multiple failures, I guess.
I tried my hand at the
maelk
left a comment
What about resetting the error count whenever the deployment (or the ongoing operation) succeeds? It might be confusing to have a host successfully deployed with an error count of X. Also, the count would be kept across several deployments, meaning that after the first deployment it might show 10 failures, and after the second it would show 15, while the second deployment only really failed 5 times. So I think a reset of the error count at some point is necessary.
zaneb
left a comment
What about resetting the error count whenever the deployment (or the ongoing operation) succeeds?
+1.
I think we actually need different counts for different types of error, and to clear them when that error is resolved.
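As a rough illustration of per-type counts that reset when the corresponding operation succeeds (the names and types below are invented for the sketch, not taken from the operator):

```go
package main

import "fmt"

// ErrorType distinguishes the kinds of failures that get their own counter,
// as suggested above. The values are illustrative.
type ErrorType string

const (
	RegistrationError ErrorType = "registration"
	ProvisioningError ErrorType = "provisioning"
)

// HostStatus keeps an independent retry count per error type.
type HostStatus struct {
	ErrorCounts map[ErrorType]int
}

func (h *HostStatus) IncrementErrorCount(t ErrorType) {
	if h.ErrorCounts == nil {
		h.ErrorCounts = map[ErrorType]int{}
	}
	h.ErrorCounts[t]++
}

// ClearErrorCount resets the counter once the operation of that type
// succeeds, so a host that eventually deploys shows zero failures.
func (h *HostStatus) ClearErrorCount(t ErrorType) {
	delete(h.ErrorCounts, t)
}

func main() {
	h := &HostStatus{}
	h.IncrementErrorCount(ProvisioningError)
	h.IncrementErrorCount(ProvisioningError)
	fmt.Println(h.ErrorCounts[ProvisioningError]) // 2
	h.ClearErrorCount(ProvisioningError)          // deployment succeeded
	fmt.Println(h.ErrorCounts[ProvisioningError]) // 0
}
```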
	info.log.Info("response from validate", "provResult", provResult)

	if provResult.ErrorMessage != "" {
		info.host.IncrementErrorCount()
Any reason not to put this inside recordActionFailure?
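For illustration, a hedged sketch of what folding the increment into `recordActionFailure` could look like; the stub types and the signature here are stand-ins, not the controller's real ones:

```go
package main

import "fmt"

// Stub types standing in for the operator's real ones; names are assumptions.
type host struct {
	errorCount   int
	errorMessage string
}

func (h *host) IncrementErrorCount() { h.errorCount++ }

type reconcileInfo struct{ host *host }

type actionResult interface{ isActionResult() }
type actionFailed struct{}

func (actionFailed) isActionResult() {}

// recordActionFailure bumps the error count in one place, so callers such as
// the provisioning handler no longer need to increment it themselves.
func (info *reconcileInfo) recordActionFailure(message string) actionResult {
	info.host.IncrementErrorCount()
	info.host.errorMessage = message
	return actionFailed{}
}

func main() {
	info := &reconcileInfo{host: &host{}}
	info.recordActionFailure("validate failed")
	fmt.Println(info.host.errorCount, info.host.errorMessage) // 1 validate failed
}
```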
| "sigs.k8s.io/controller-runtime/pkg/reconcile" | ||
| ) | ||
|
|
||
| const maxBackOff = time.Hour * 24 |
I feel like this could probably go as short as an hour or two.
func calculateBackoff(errorCount int, max time.Duration) time.Duration {
	backOff := math.Exp2(float64(errorCount))
	backOffDuration := time.Second * time.Duration(backOff)
2s is a very short back-off to start with for an operation as long as e.g. provisioning. The user may not even have time to notice that it has failed. Maybe s/Second/Minute/ here?
	if backOffDuration.Milliseconds() > max.Milliseconds() {
		return max
	}
	return backOffDuration
Having fixed delays (even exponentially increasing ones) is prone to causing thundering herd problems. We should at least add some jitter on top. (We could go as far as to implement exponential backoff in the CSMA sense, where we wait for a random interval between 0 and backOffDuration, but resource contention isn't our primary reason for backing off here so my instinct is that that would be overkill.)
We have a separate need to throttle the number of things we ask ironic to do at once anyway, so maybe we can solve the herd problem that way to avoid complicating the logic here?
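Pulling these review threads together, here is a hedged sketch (not the PR's actual change) that starts the back-off in minutes, caps it at a couple of hours, and adds jitter so hosts that failed together don't all retry at the same instant; the names, the cap, and the roughly ±10% spread are arbitrary choices for the example:

```go
package main

import (
	"fmt"
	"math"
	"math/rand"
	"time"
)

// A couple of hours, per the review comment, rather than the original 24h.
const maxBackOff = 2 * time.Hour

// calculateBackoffWithJitter doubles the delay with each failure, caps it at
// max, and then spreads the result over roughly +/-10% to avoid a thundering
// herd of retries.
func calculateBackoffWithJitter(errorCount int, max time.Duration) time.Duration {
	// Clamp the exponent so the conversion below cannot overflow.
	if errorCount > 20 {
		errorCount = 20
	}
	backOff := time.Duration(math.Exp2(float64(errorCount))) * time.Minute
	if backOff > max {
		backOff = max
	}
	jitter := time.Duration(rand.Int63n(int64(backOff)/5)) - backOff/10
	return backOff + jitter
}

func main() {
	for _, n := range []int{1, 3, 6, 12} {
		fmt.Printf("after %2d failures: %v\n", n, calculateBackoffWithJitter(n, maxBackOff))
	}
}
```

Whether jitter belongs in this helper or in a separate throttle on the requests sent to Ironic, as suggested above, is a design choice this sketch doesn't settle.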
/close

We are going to use #610 instead.
@dhellmann: Closed this PR.

In response to this: /close We are going to use #610 instead
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
No description provided.