
Do retries with backoff in ValidateManagementAccess, Inspect, and Deprovision#749

Merged
metal3-io-bot merged 6 commits into metal3-io:master from zaneb:retries-with-backoff
Jan 13, 2021

Conversation

@zaneb
Member

@zaneb commented Dec 11, 2020

As discussed in #739, in order to decide whether an error we see from Ironic is one that we have already seen and dealt with (in which case we should retry the current operation) or a new one (in which case we should handle the error), we need to store some sort of state.

This PR encodes that state as a differential between the ErrorType/ErrorMessage and the ErrorCount. The actual error is cleared whenever we successfully start or continue an operation, but the error count is preserved until the operation is fully complete. This allows us both to determine when an error is new (when no error is currently recorded in the Status) and still to do exponential backoff when multiple consecutive errors occur.

This fixes issues with registering, adopting, inspecting, and deprovisioning. The issues were different in each case due to a patchwork of implementations, which are now more consistent.

Some deprovisioning errors that were previously unrecoverable can now be retried. On deletion of the Host, we will retry deprovisioning up to 3 times before giving up and allowing the Host to disappear (previously there was no way to delete it in some states, other than manually removing the finalizer).

One known issue with this is that if a request to change state in Ironic results in a 409 Conflict response, this just sets the Dirty flag and is indistinguishable from success when viewed outside the Ironic provisioner. If such a conflict occurs, we will effectively skip a retry attempt, increment the ErrorCount again, and sleep for even longer before the next attempt.
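For illustration, here is a minimal Go sketch of the pattern described above. The HostStatus shape is simplified, and the helper names (clearError, recordError, errorIsNew) are hypothetical stand-ins rather than the exact functions in this PR:

// HostStatus holds only the fields the retry logic relies on (simplified).
type HostStatus struct {
	ErrorType    string
	ErrorMessage string
	ErrorCount   int
}

// clearError runs whenever an operation successfully starts or continues.
// It wipes the recorded error but deliberately leaves ErrorCount alone
// until the whole operation is fully complete.
func clearError(status *HostStatus) {
	status.ErrorType = ""
	status.ErrorMessage = ""
}

// recordError stores a newly observed error and bumps the count that
// drives the exponential backoff between retries.
func recordError(status *HostStatus, errType, message string) {
	status.ErrorType = errType
	status.ErrorMessage = message
	status.ErrorCount++
}

// An error reported by the provisioner is new exactly when no error is
// currently recorded in the Status; if one is recorded, the last thing
// we did was save it, so this pass is the start of a retry.
func errorIsNew(status HostStatus) bool {
	return status.ErrorType == ""
}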

@metal3-io-bot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: zaneb

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@metal3-io-bot added the approved (Indicates a PR has been approved by an approver from all required OWNERS files.) and size/L (Denotes a PR that changes 100-499 lines, ignoring generated files.) labels on Dec 11, 2020
@zaneb
Member Author

zaneb commented Dec 11, 2020

/test-integration

@zaneb force-pushed the retries-with-backoff branch from f892927 to b1cc1f5 on December 11, 2020 at 14:42
@zaneb
Member Author

zaneb commented Dec 11, 2020

/test-integration

@furkatgofurov7
Member

@zaneb thanks a lot for the patch. I will test it with the scenario I was simulating with the vbmc problem and report back here.

@andfasano
Member

@zaneb thanks a lot for the patch. I will test it with the scenario I was simulating with the vbmc problem and report back here.

Hey @furkatgofurov7, could you please share some additional details about the tested scenario? I'd also like to try reproducing it, if possible.

@furkatgofurov7
Member

Hi @andfasano! Sure thing. I explained it briefly in the issue itself, but I will add detailed steps for reproducing it as well.

  }

- provResult, err := prov.ValidateManagementAccess(credsChanged)
+ provResult, err := prov.ValidateManagementAccess(credsChanged, info.host.Status.ErrorType == metal3v1alpha1.RegistrationError)
Member


Minor: what about pushing this check directly into the related provisioner method (ValidateManagementAccess in this case)? It would help keep such logic within the provisioner code, and at the same time would minimize the impact on the interface.

Member Author


The controller (not the provisioner) owns setting the errors and knowing the types, so conceptually I don't feel like this belongs in the provisioner.

Member


The current implementation seems pretty fixed for every case, i.e. Adopt always checks for RegistrationError.
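For illustration, a self-contained sketch of the trade-off being debated in this thread: controller-side versus provisioner-side placement of the "force" decision. All names here are hypothetical and simplified, not the merged code:

type ErrorType string

const RegistrationError ErrorType = "registration error"

type Provisioner struct {
	currentError ErrorType // mirror of the host's Status.ErrorType
}

// Controller-side (what this PR does): the caller decides whether to force.
func (p *Provisioner) ValidateManagementAccess(credsChanged, force bool) error {
	if credsChanged || force {
		// re-attempt registration from scratch
	}
	return nil
}

// Provisioner-side (the reviewer's suggestion): the interface stays
// narrower, but the provisioner now has to know about the controller's
// error taxonomy, which is the coupling zaneb objects to above.
func (p *Provisioner) validateWithProvisionerSideCheck(credsChanged bool) error {
	force := p.currentError == RegistrationError
	return p.ValidateManagementAccess(credsChanged, force)
}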

  } else {
      hsm.NextState = metal3v1alpha1.StateInspecting
  }
+ hsm.Host.Status.ErrorCount = 0
Member


It could probably make sense to bind the error count reset to actionComplete (maybe with a utility method like recordActionComplete); otherwise my feeling is that it could easily be missed the next time new code handling a completed action is added.

Member Author


It was nice having it be magic, but there's currently no place to hang it: actionComplete doesn't work for the steady states.
The fact that we can't move the code into the state machine for the steady states indicates a design problem; maybe once that is resolved we'll have a convenient place to tie it in.

Member


Agree that there's probably something to be reviewed in the steady states. In the current implementation the ErrorCount is cleared:

1. When completing an action and moving from/to:
   • Registering -> Inspecting/Ext.Prov.
   • Inspecting -> Match Profile
   • Provisioning -> Provisioned
   • Deprovisioning -> Deleting/Ready
2. Due to the power management within the steady states (Ext.Prov./Provisioned) and Ready

But at least for point 1, having a utility function that sets actionComplete and resets the ErrorCount could help reduce the scattering. A sketch of such a helper follows this comment.
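A minimal sketch of the utility being proposed here; recordActionComplete is the reviewer's suggested name, and this is illustrative rather than merged code:

// recordActionComplete marks the current action finished and resets the
// error count in one place, so a state that completes an action cannot
// forget the reset.
func (hsm *hostStateMachine) recordActionComplete(nextState metal3v1alpha1.ProvisioningState) {
	hsm.NextState = nextState
	hsm.Host.Status.ErrorCount = 0
}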

@furkatgofurov7
Member

@zaneb thanks a lot for the patch. I will test it with the scenario I was simulating with the vbmc problem and report back here.

Nice work! I have tested the patch locally against the scenario described in the issue, and it works flawlessly!

Member

@furkatgofurov7 left a comment


Looks good to me.

/cc @dhellmann

@andfasano
Member

Looks good to me too.

@furkatgofurov7
Member

This needs a rebase though.

/cc @zaneb

@metal3-io-bot
Contributor

@furkatgofurov7: GitHub didn't allow me to request PR reviews from the following users: zaneb.

Note that only metal3-io members and repo collaborators can review this PR, and authors cannot review their own PRs.


In response to this:

This needs a rebase though.

/cc @zaneb

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@andfasano
Member

@zaneb @furkatgofurov7 I think it could also be useful in this scope to land #757, as it contains a fix for ErrorCount.

When a call to the provisioner succeeds rather than returning an error
message, that's a good sign, and a reason not to have an error message
set in the Host object. But it doesn't guarantee success: if the
previous failure came at the end of some long-running operation (rather
than an outright rejection at the beginning), it could yet fail in
exactly the same way.

Clearing the ErrorType as soon as we start an operation allows us to use
that field to determine whether to force the provisioner to retry in the
presence of an error. (It will only be set if the last thing we did was
record an error, therefore if we see it then we are at the beginning of
a new retry.)

One known issue with this is that if a request to change state in Ironic
results in a 409 Conflict response, this just sets the Dirty flag and is
indistinguishable from success when viewed outside the Ironic
provisioner. If such a conflict occurs, we will effectively skip a retry
attempt, increment the ErrorCount again, and sleep for even longer
before the next attempt.

To ensure that we actually do exponential backoff between retries, leave
the ErrorCount unchanged until the whole operation actually succeeds
(Dirty is no longer true). This is now decoupled from the ClearError()
method that clears the ErrorType and ErrorMessage.
As much as possible, do the clearing of ErrorCount in the host state
machine. The exception is the steady states where we only do power
management: the actionResult types currently lack enough detail for the
state machine to distinguish when the count should be cleared.

In power management states (Ready/Available, Provisioned, Externally
Provisioned), count the number of errors of any type since the power
state was last successfully changed.

Otherwise, the ErrorCount is cleared when an operation successfully
completes. Successful re-registration or adoption is never sufficient to
clear the error count, except in the registration state.
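To make the backoff concrete, a small hedged sketch of the delay computation this design enables; the base delay and cap below are assumptions for illustration, not values taken from this PR:

import (
	"math"
	"time"
)

// maxBackoffDelay caps the exponential growth (assumed value).
const maxBackoffDelay = 10 * time.Minute

// backoffDelay grows exponentially with the number of consecutive
// errors recorded in the Status. A 409 Conflict that increments
// ErrorCount an extra time therefore just lengthens the next sleep.
func backoffDelay(errorCount int) time.Duration {
	seconds := math.Exp2(float64(errorCount))
	if seconds > maxBackoffDelay.Seconds() {
		return maxBackoffDelay
	}
	return time.Duration(seconds * float64(time.Second))
}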
@zaneb force-pushed the retries-with-backoff branch from b1cc1f5 to c8c033f on January 11, 2021 at 21:18
zaneb added 5 commits on January 11, 2021 at 16:41
Once we see an error in the Node, it returns to the 'enroll' [sic] state
and we don't have a way of determining whether we have seen and saved
that error or not. Previously we always assumed we had not, and didn't
retry the validation unless the credentials had changed.

Add a force flag to indicate that this is a new attempt and should start
again.

Fixes metal3-io#739
If inspection fails at some point before it is actually started in
ironic-inspector, we were just repeatedly retrying it instead of setting
an error. Instead, set an error and retry only after a backoff.
Error is the state that the Node goes into if deleting (i.e.
deprovisioning) it fails. The only valid action from this state is to
try deleting again, so do that rather than attempt to go straight to
manageable.
If deprovisioning fails, we were just repeatedly retrying it instead of
setting an error. Instead, set an error and retry only after a backoff.

If we are deprovisioning because of a deletion, give up after 3 retries
(since this may indicate that the host simply doesn't exist any more)
and allow the Host to be deleted.
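A minimal sketch of that give-up rule; the constant and helper names here are hypothetical, not taken from the merged code:

// On deletion we retry deprovisioning with backoff, but only so many
// times: a host that repeatedly fails to deprovision may simply not
// exist any more, and the finalizer should not trap it forever.
const maxDeleteRetries = 3

func shouldGiveUpOnDelete(deleting bool, errorCount int) bool {
	return deleting && errorCount > maxDeleteRetries
}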
@zaneb force-pushed the retries-with-backoff branch from c8c033f to 7c3daef on January 11, 2021 at 21:41
@zaneb
Member Author

zaneb commented Jan 11, 2021

Rebased.
/test-integration

@furkatgofurov7
Member

LGTM, leaving it up to @andfasano to give a final lgtm

@maelk
Member

maelk commented Jan 13, 2021

/lgtm

@metal3-io-bot added the lgtm (Indicates that a PR is ready to be merged.) label on Jan 13, 2021
@metal3-io-bot merged commit 9d5e8d3 into metal3-io:master on Jan 13, 2021
@andfasano
Member

Thanks @furkatgofurov7; even if late, I was fine with the changes.

