Do retries with backoff in ValidateManagementAccess, Inspect, and Deprovision #749
Conversation
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: zaneb. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files.
Approvers can indicate their approval by writing /approve in a comment.
/test-integration
Force-pushed from f892927 to b1cc1f5.
/test-integration
@zaneb thanks a lot for the patch. I will test it against the scenario I was simulating with the vbmc problem and report back here.
Hey @furkatgofurov7, could you please share some additional details about the tested scenario? I'd also like to try reproducing it if possible.
Hi @andfasano! Sure thing, I explained it briefly in the issue itself, but I will add detailed steps for reproducing it as well.
```diff
 	}
-	provResult, err := prov.ValidateManagementAccess(credsChanged)
+	provResult, err := prov.ValidateManagementAccess(credsChanged, info.host.Status.ErrorType == metal3v1alpha1.RegistrationError)
```
Minor: what about pushing this check directly into the related provisioner method (ValidateManagementAccess in this case)? It would help keep such logic within the provisioner code, and at the same time minimize the impact on the interface.
The controller (not the provisioner) owns setting the errors and knowing the types, so conceptually I don't feel like this belongs in the provisioner.
The current implementation seems pretty fixed for every case, i.e. Adopt always checks for RegistrationError.
```go
	} else {
		hsm.NextState = metal3v1alpha1.StateInspecting
	}
	hsm.Host.Status.ErrorCount = 0
```
Probably it could make sense to bind the error count reset to actionComplete (maybe with a utility method like recordActionComplete); otherwise my feeling is that it could easily be missed the next time new code handling a completed action is added.
It was nice having it be magic, but there's no current place to hang it - actionComplete doesn't work for the steady states.
The fact that we can't move the code into the state machine for the steady states indicates a design problem, and maybe once that is resolved we'll have a convenient place to tie it.
Agree that there's probably something to be reviewed in the steady states, as in the current implementation the ErrorCount is cleared:
1. When completing an action and moving from/to:
   - Registering -> Inspecting/Ext. Prov.
   - Inspecting -> Match Profile
   - Provisioning -> Provisioned
   - Deprovisioning -> Deleting/Ready
2. Due to the power management within the steady states (Ext. Prov./Provisioned) and Ready

But at least for point 1, having a utility function that sets the actionComplete result and resets the ErrorCount could help in reducing the scattering; a sketch follows below.
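For illustration, a minimal sketch of the kind of helper suggested here, using simplified, hypothetical stand-in types (recordActionComplete is an illustrative name, not actual baremetal-operator code):

```go
package main

import "fmt"

// Simplified, hypothetical stand-ins for the real state machine types.
type ProvisioningState string

type hostStatus struct{ ErrorCount int }
type host struct{ Status hostStatus }

type hostStateMachine struct {
	NextState ProvisioningState
	Host      *host
}

// recordActionComplete sets the next state and resets the error count in
// one place, so the reset cannot be missed when new completed-action
// handling is added.
func (hsm *hostStateMachine) recordActionComplete(next ProvisioningState) {
	hsm.NextState = next
	hsm.Host.Status.ErrorCount = 0
}

func main() {
	hsm := &hostStateMachine{Host: &host{Status: hostStatus{ErrorCount: 2}}}
	hsm.recordActionComplete("inspecting")
	fmt.Println(hsm.NextState, hsm.Host.Status.ErrorCount) // inspecting 0
}
```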
Nice work! I have tested the patch locally against the scenario described in the issue, and it works flawlessly!
furkatgofurov7 left a comment: Looks good to me.
/cc @dhellmann
Also looks good to me.
This needs a rebase though. /cc @zaneb
@furkatgofurov7: GitHub didn't allow me to request PR reviews from the following users: zaneb. Note that only metal3-io members and repo collaborators can review this PR, and authors cannot review their own PRs.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
@zaneb @furkatgofurov7 I think it could also be useful in this scope to land #757, as it contains a fix for ErrorCount.
When a call to the provisioner succeeds rather than returning an error message, that's a good sign, and a reason not to have an error message set in the Host object. But it doesn't guarantee success: if the previous failure came at the end of some long-running operation (rather than outright rejection at the beginning), it could yet fail in exactly the same way.

Clearing the ErrorType as soon as we start an operation allows us to use that field to determine whether to force the provisioner to retry in the presence of an error. (It will only be set if the last thing we did was record an error, therefore if we see it then we are at the beginning of a new retry.)

One known issue with this is that if a request to change state in Ironic results in a 409 Conflict response, this just sets the Dirty flag and is indistinguishable from success when viewed outside the Ironic provisioner. If such a conflict occurs, we will effectively skip a retry attempt, increment the ErrorCount again, and sleep for even longer before the next attempt.

To ensure that we actually do exponential backoff between retries, leave the ErrorCount unchanged until the whole operation actually succeeds (Dirty is no longer true). This is now decoupled from the ClearError() method that clears the ErrorType and ErrorMessage. As much as possible, do the clearing of ErrorCount in the host state machine. The exception is the steady states where we only do power management - the actionResult types currently lack enough detail to distinguish, when viewed from the state machine, when the count should be cleared.

In power management states (Ready/Available, Provisioned, Externally Provisioned), count the number of errors of any type since the power state was last successfully changed. Otherwise, the ErrorCount is cleared when an operation successfully completes. Successful re-registration or adoption is never sufficient to clear the error count, except in the registration state.
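To make the backoff behaviour concrete, here is a minimal, hypothetical sketch of ErrorCount-driven exponential backoff (calculateBackoff, baseDelay, and maxBackoffCount are illustrative names, not the actual baremetal-operator code):

```go
package main

import (
	"fmt"
	"time"
)

const (
	baseDelay       = 10 * time.Second // illustrative base requeue delay
	maxBackoffCount = 10               // cap the exponent so delays stay bounded
)

// calculateBackoff doubles the requeue delay for each consecutive error.
// This is why ErrorCount must be preserved until the whole operation
// succeeds: resetting it early would restart the backoff from the base.
func calculateBackoff(errorCount int) time.Duration {
	if errorCount > maxBackoffCount {
		errorCount = maxBackoffCount
	}
	return time.Duration(1<<uint(errorCount)) * baseDelay // baseDelay * 2^errorCount
}

func main() {
	for count := 1; count <= 5; count++ {
		fmt.Printf("ErrorCount=%d -> retry after %s\n", count, calculateBackoff(count))
	}
}
```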
Force-pushed from b1cc1f5 to c8c033f.
Once we see an error in the Node, it returns to the 'enroll' [sic] state and we don't have a way of determining whether we have seen and saved that error or not. Previously we always assumed we had not, and didn't retry the validation unless the credentials had changed. Add a force flag to indicate that this is a new attempt and should start again. Fixes metal3-io#739
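For illustration, a minimal sketch of the force flag semantics under simplified, hypothetical types (none of these names come from the actual provisioner code):

```go
package main

import "fmt"

// node is a simplified, hypothetical stand-in for an Ironic node as seen
// by the provisioner.
type node struct {
	provisionState string
	lastError      string
}

// shouldRetryRegistration sketches the decision described above: when the
// node is back in 'enroll' with an error recorded, only re-run validation
// if the credentials changed or the controller forces a fresh attempt.
func shouldRetryRegistration(n node, credsChanged, force bool) bool {
	if n.provisionState == "enroll" && n.lastError != "" {
		return credsChanged || force
	}
	return true
}

func main() {
	n := node{provisionState: "enroll", lastError: "BMC connection failed"}
	fmt.Println(shouldRetryRegistration(n, false, false)) // false: error already handled, wait
	fmt.Println(shouldRetryRegistration(n, false, true))  // true: forced new attempt after backoff
}
```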
If inspection fails at some point before it is actually started in ironic-inspector, we were just repeatedly retrying it instead of setting an error. Instead, set an error and retry only after a backoff.
Error is the state that the Node goes into if deleting (i.e. deprovisioning) it fails. The only valid action from this state is to try deleting again, so do that rather than attempt to go straight to manageable.
If deprovisioning fails, we were just repeatedly retrying it instead of setting an error. Instead, set an error and retry only after a backoff. If we are deprovisioning because of a deletion, give up after 3 retries (since this may indicate that the host simply doesn't exist any more) and allow the Host to be deleted.
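A minimal sketch of the give-up rule during deletion, again with hypothetical names (the 3-retry limit is the one stated above):

```go
package main

import "fmt"

// maxDeleteRetries mirrors the limit described above for hosts being deleted.
const maxDeleteRetries = 3

// shouldGiveUpDeprovisioning returns true once a deleting host has failed
// deprovisioning enough times, allowing the finalizer to be removed so the
// Host object can disappear.
func shouldGiveUpDeprovisioning(errorCount int, deleting bool) bool {
	return deleting && errorCount >= maxDeleteRetries
}

func main() {
	fmt.Println(shouldGiveUpDeprovisioning(2, true))  // false: retry with backoff
	fmt.Println(shouldGiveUpDeprovisioning(3, true))  // true: allow deletion
	fmt.Println(shouldGiveUpDeprovisioning(5, false)) // false: not being deleted
}
```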
Force-pushed from c8c033f to 7c3daef.
Rebased.
LGTM, leaving it up to @andfasano to give a final lgtm.
/lgtm
Thanks @furkatgofurov7; even if a bit late, I was fine with the changes.
As discussed in #739, in order to decide whether an error we see from Ironic is one that we have seen and dealt with before (in which case we should retry the present operation) or not (in which case we should deal with the error), we need to store some sort of state.
This PR uses a differential between the ErrorType/ErrorMessage and the ErrorCount to encode this state. The actual error is cleared whenever we successfully start or continue a new operation, but the error count is preserved until the operation is fully complete. This allows us to both determine when an error is new (when no error is currently recorded in the Status) and yet still do exponential backoff when multiple consecutive errors occur.
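A minimal sketch of that differential, using simplified stand-in types rather than the actual Host status API:

```go
package main

import "fmt"

// hostStatus mirrors, in simplified form, the fields discussed above.
type hostStatus struct {
	ErrorType    string
	ErrorMessage string
	ErrorCount   int
}

// recordError saves the failure and bumps the count used for backoff.
func (s *hostStatus) recordError(errType, msg string) {
	s.ErrorType = errType
	s.ErrorMessage = msg
	s.ErrorCount++
}

// clearError runs when an operation successfully starts or continues; it
// clears the visible error but deliberately preserves ErrorCount so that
// consecutive failures still back off exponentially.
func (s *hostStatus) clearError() {
	s.ErrorType = ""
	s.ErrorMessage = ""
}

// clearErrorCount runs only once the whole operation completes (Dirty is
// no longer true).
func (s *hostStatus) clearErrorCount() { s.ErrorCount = 0 }

func main() {
	var s hostStatus
	s.recordError("registration error", "BMC unreachable")
	s.clearError() // a new attempt begins: error gone, count preserved
	fmt.Printf("new error? %v, count=%d\n", s.ErrorType != "", s.ErrorCount)
}
```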
This fixes issues with registering, adopting, inspecting, and deprovisioning. The issues were different in each case due to a patchwork of implementations, which are now more consistent.
Some deprovisioning errors that were previously unrecoverable can now be retried. On deletion of the Host, we will retry deprovisioning up to 3 times before giving up and allowing the Host to disappear (previously there was no way to delete it in some states, other than manually removing the finalizer).
One known issue with this is that if a request to change state in Ironic results in a 409 Conflict response, this just sets the Dirty flag and is indistinguishable from success when viewed outside the Ironic provisioner. If such a conflict occurs, we will effectively skip a retry attempt, increment the ErrorCount again, and sleep for even longer before the next attempt.