
Do retries with backoff in ValidateManagementAccess, Inspect, and Deprovision#749

Merged
metal3-io-bot merged 6 commits into metal3-io:master from zaneb:retries-with-backoff
Jan 13, 2021

Conversation

@zaneb
Member

@zaneb commented Dec 11, 2020

As discussed in #739, in order to decide whether an error we see from Ironic is one that we have already seen and dealt with (in which case we should retry the current operation) or a new one (in which case we should handle the error), we need to store some sort of state.

This PR encodes that state as a differential between the ErrorType/ErrorMessage and the ErrorCount. The actual error is cleared whenever we successfully start or continue an operation, but the error count is preserved until the operation is fully complete. This allows us both to determine when an error is new (when no error is currently recorded in the Status) and still to do exponential backoff when multiple consecutive errors occur.

This fixes issues with registering, adopting, inspecting, and deprovisioning. The issues were different in each case due to a patchwork of implementations, which are now more consistent.

Some deprovisioning errors that were previously unrecoverable can now be retried. On deletion of the Host, we will retry deprovisioning up to 3 times before giving up and allowing the Host to disappear (previously there was no way to delete it in some states, other than manually removing the finalizer).

One known issue with this is that if a request to change state in Ironic results in a 409 Conflict response, this just sets the Dirty flag and is indistinguishable from success when viewed outside the Ironic provisioner. If such a conflict occurs, we will effectively skip a retry attempt, increment the ErrorCount again, and sleep for even longer before the next attempt.
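For illustration, here is a minimal Go sketch of the pattern described above. The HostStatus shape is simplified, and the helper names (clearError, recordError, errorIsNew) are hypothetical stand-ins rather than the exact functions in this PR:

// HostStatus holds only the fields the retry logic relies on (simplified).
type HostStatus struct {
	ErrorType    string
	ErrorMessage string
	ErrorCount   int
}

// clearError runs whenever an operation successfully starts or continues.
// It wipes the recorded error but deliberately leaves ErrorCount alone
// until the whole operation is fully complete.
func clearError(status *HostStatus) {
	status.ErrorType = ""
	status.ErrorMessage = ""
}

// recordError stores a newly observed error and bumps the count that
// drives the exponential backoff between retries.
func recordError(status *HostStatus, errType, message string) {
	status.ErrorType = errType
	status.ErrorMessage = message
	status.ErrorCount++
}

// An error reported by the provisioner is new exactly when no error is
// currently recorded in the Status; if one is recorded, the last thing
// we did was save it, so this pass is the start of a retry.
func errorIsNew(status HostStatus) bool {
	return status.ErrorType == ""
}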

@metal3-io-bot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: zaneb

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@metal3-io-bot added the approved (Indicates a PR has been approved by an approver from all required OWNERS files.) and size/L (Denotes a PR that changes 100-499 lines, ignoring generated files.) labels on Dec 11, 2020
@zaneb
Member Author

zaneb commented Dec 11, 2020

/test-integration

@zaneb force-pushed the retries-with-backoff branch from f892927 to b1cc1f5 on December 11, 2020 at 14:42
@zaneb
Member Author

zaneb commented Dec 11, 2020

/test-integration

@furkatgofurov7
Member

@zaneb thanks a lot for the patch. I will test it with the scenario I was simulating with the vbmc problem and report back here.

@andfasano
Member

@zaneb thanks a lot for the patch. I will test it with the scenario I was simulating with the vbmc problem and report back here.

Hey @furkatgofurov7, could you please share some additional details about the tested scenario? I'd also like to try reproducing it, if possible.

@furkatgofurov7
Member

Hi @andfasano! Sure thing. I explained it briefly in the issue itself, but I will add detailed steps for reproducing it as well.

  }

- provResult, err := prov.ValidateManagementAccess(credsChanged)
+ provResult, err := prov.ValidateManagementAccess(credsChanged, info.host.Status.ErrorType == metal3v1alpha1.RegistrationError)
Member


Minor: what about pushing this check directly into the related provisioner method (ValidateManagementAccess in this case)? It would help keep such logic within the provisioner code, and at the same time would minimize the impact on the interface.

Member Author


The controller (not the provisioner) owns setting the errors and knowing the types, so conceptually I don't feel like this belongs in the provisioner.

Member


The current implementation seems pretty fixed for every case, i.e. Adopt always checks for RegistrationError.
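For illustration, a self-contained sketch of the trade-off being debated in this thread: controller-side versus provisioner-side placement of the "force" decision. All names here are hypothetical and simplified, not the merged code:

type ErrorType string

const RegistrationError ErrorType = "registration error"

type Provisioner struct {
	currentError ErrorType // mirror of the host's Status.ErrorType
}

// Controller-side (what this PR does): the caller decides whether to force.
func (p *Provisioner) ValidateManagementAccess(credsChanged, force bool) error {
	if credsChanged || force {
		// re-attempt registration from scratch
	}
	return nil
}

// Provisioner-side (the reviewer's suggestion): the interface stays
// narrower, but the provisioner now has to know about the controller's
// error taxonomy, which is the coupling zaneb objects to above.
func (p *Provisioner) validateWithProvisionerSideCheck(credsChanged bool) error {
	force := p.currentError == RegistrationError
	return p.ValidateManagementAccess(credsChanged, force)
}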

  } else {
      hsm.NextState = metal3v1alpha1.StateInspecting
  }
+ hsm.Host.Status.ErrorCount = 0
Member


It could probably make sense to bind the error count reset to actionComplete (maybe with a utility method like recordActionComplete); otherwise my feeling is that it could easily be missed the next time new code handling a completed action is added.

Member Author


It was nice having it be magic, but there's currently no place to hang it: actionComplete doesn't work for the steady states.
The fact that we can't move the code into the state machine for the steady states indicates a design problem; maybe once that is resolved we'll have a convenient place to tie it in.

Member


Agree that there's probably something to be reviewed in the steady states. In the current implementation the ErrorCount is cleared:

1. When completing an action and moving from/to:
   • Registering -> Inspecting/Ext.Prov.
   • Inspecting -> Match Profile
   • Provisioning -> Provisioned
   • Deprovisioning -> Deleting/Ready
2. Due to the power management within the steady states (Ext.Prov./Provisioned) and Ready

But at least for point 1, having a utility function that sets actionComplete and resets the ErrorCount could help reduce the scattering. A sketch of such a helper follows this comment.
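A minimal sketch of the utility being proposed here; recordActionComplete is the reviewer's suggested name, and this is illustrative rather than merged code:

// recordActionComplete marks the current action finished and resets the
// error count in one place, so a state that completes an action cannot
// forget the reset.
func (hsm *hostStateMachine) recordActionComplete(nextState metal3v1alpha1.ProvisioningState) {
	hsm.NextState = nextState
	hsm.Host.Status.ErrorCount = 0
}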

@furkatgofurov7
Member

@zaneb thanks a lot for the patch. I will test it with the scenario I was simulating with the vbmc problem and report back here.

Nice work! I have tested the patch locally against the scenario described in the issue, and it works flawlessly!

Member

@furkatgofurov7 left a comment


Looks good to me.

/cc @dhellmann

@andfasano
Member

Looks good to me too.

@furkatgofurov7
Member

This needs a rebase though.

/cc @zaneb

@metal3-io-bot
Contributor

@furkatgofurov7: GitHub didn't allow me to request PR reviews from the following users: zaneb.

Note that only metal3-io members and repo collaborators can review this PR, and authors cannot review their own PRs.


In response to this:

This needs a rebase though.

/cc @zaneb

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@andfasano
Member

@zaneb @furkatgofurov7 I think it could also be useful in this scope to land #757, as it contains a fix for ErrorCount.

When a call to the provisioner succeeds rather than returning an error
message, that's a good sign, and a reason not to have an error message
set in the Host object. But it doesn't guarantee success: if the
previous failure came at the end of some long-running operation (rather
than an outright rejection at the beginning), it could yet fail in
exactly the same way.

Clearing the ErrorType as soon as we start an operation allows us to use
that field to determine whether to force the provisioner to retry in the
presence of an error. (It will only be set if the last thing we did was
record an error, therefore if we see it then we are at the beginning of
a new retry.)

One known issue with this is that if a request to change state in Ironic
results in a 409 Conflict response, this just sets the Dirty flag and is
indistinguishable from success when viewed outside the Ironic
provisioner. If such a conflict occurs, we will effectively skip a retry
attempt, increment the ErrorCount again, and sleep for even longer
before the next attempt.

To ensure that we actually do exponential backoff between retries, leave
the ErrorCount unchanged until the whole operation actually succeeds
(Dirty is no longer true). This is now decoupled from the ClearError()
method that clears the ErrorType and ErrorMessage.
As much as possible, do the clearing of ErrorCount in the host state
machine. The exception is the steady states where we only do power
management: the actionResult types currently lack enough detail for the
state machine to distinguish when the count should be cleared.

In power management states (Ready/Available, Provisioned, Externally
Provisioned), count the number of errors of any type since the power
state was last successfully changed.

Otherwise, the ErrorCount is cleared when an operation successfully
completes. Successful re-registration or adoption is never sufficient to
clear the error count, except in the registration state.
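To make the backoff concrete, a small hedged sketch of the delay computation this design enables; the base delay and cap below are assumptions for illustration, not values taken from this PR:

import (
	"math"
	"time"
)

// maxBackoffDelay caps the exponential growth (assumed value).
const maxBackoffDelay = 10 * time.Minute

// backoffDelay grows exponentially with the number of consecutive
// errors recorded in the Status. A 409 Conflict that increments
// ErrorCount an extra time therefore just lengthens the next sleep.
func backoffDelay(errorCount int) time.Duration {
	seconds := math.Exp2(float64(errorCount))
	if seconds > maxBackoffDelay.Seconds() {
		return maxBackoffDelay
	}
	return time.Duration(seconds * float64(time.Second))
}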
@zaneb force-pushed the retries-with-backoff branch from b1cc1f5 to c8c033f on January 11, 2021 at 21:18
zaneb added 5 commits on January 11, 2021 at 16:41
Once we see an error in the Node, it returns to the 'enroll' [sic] state
and we don't have a way of determining whether we have seen and saved
that error or not. Previously we always assumed we had not, and didn't
retry the validation unless the credentials had changed.

Add a force flag to indicate that this is a new attempt and should start
again.

Fixes metal3-io#739
If inspection fails at some point before it is actually started in
ironic-inspector, we were just repeatedly retrying it instead of setting
an error. Instead, set an error and retry only after a backoff.
Error is the state that the Node goes into if deleting (i.e.
deprovisioning) it fails. The only valid action from this state is to
try deleting again, so do that rather than attempt to go straight to
manageable.
If deprovisioning fails, we were just repeatedly retrying it instead of
setting an error. Instead, set an error and retry only after a backoff.

If we are deprovisioning because of a deletion, give up after 3 retries
(since this may indicate that the host simply doesn't exist any more)
and allow the Host to be deleted.
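A minimal sketch of that give-up rule; the constant and helper names here are hypothetical, not taken from the merged code:

// On deletion we retry deprovisioning with backoff, but only so many
// times: a host that repeatedly fails to deprovision may simply not
// exist any more, and the finalizer should not trap it forever.
const maxDeleteRetries = 3

func shouldGiveUpOnDelete(deleting bool, errorCount int) bool {
	return deleting && errorCount > maxDeleteRetries
}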
@zaneb force-pushed the retries-with-backoff branch from c8c033f to 7c3daef on January 11, 2021 at 21:41
@zaneb
Member Author

zaneb commented Jan 11, 2021

Rebased.
/test-integration

@furkatgofurov7
Member

LGTM, leaving it up to @andfasano to give a final lgtm

@maelk
Member

maelk commented Jan 13, 2021

/lgtm

@metal3-io-bot added the lgtm (Indicates that a PR is ready to be merged.) label on Jan 13, 2021
@metal3-io-bot merged commit 9d5e8d3 into metal3-io:master on Jan 13, 2021
@andfasano
Member

Thanks @furkatgofurov7; even if late, I was fine with the changes.

