-
Notifications
You must be signed in to change notification settings - Fork 144
Add proposal for retry interface #171
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,92 @@ | ||
| <!-- | ||
| This work is licensed under a Creative Commons Attribution 3.0 | ||
| Unported License. | ||
|
|
||
| http://creativecommons.org/licenses/by/3.0/legalcode | ||
| --> | ||
|
|
||
| # Title | ||
|
|
||
| Retry from error state | ||
|
|
||
| ## Status | ||
|
|
||
| provisional | ||
|
|
||
| ## Summary | ||
|
|
||
| This proposal aims to introduce a new declarative API | ||
| to forcibly retry the reconciliation of a waiting BareMetalHost resource | ||
| (being in error state) when a fix is immediately available, so that it | ||
| could be possible to reduce the waiting time required | ||
|
|
||
| ## Motivation | ||
|
|
||
| The current BMO error retry mechanism implements an | ||
| exponential backoff with jittering: this means that when a BMH | ||
| goes into an error state its reconciliation loop gap increases | ||
| as long as the error conditions persists (based on BMH `Status.ErrorCount`). | ||
|
|
||
| In some cases it's necessary a human (or more generally an external) | ||
| intervention to fix the issue. If the fix involves amending the | ||
| BMH resource then an immediate reconciliation is triggered and the | ||
| new state is evaluated again. | ||
| Instead, if the fix does not generate a relevant event (for example, | ||
| a firmware update on the host or a temporary connectivity loss to the | ||
| BMC network) captured by BMO, then it's necessary | ||
| to wait for the next scheduled reconciliation loop for a new | ||
| state evaluation, and the waiting time could be longer for higher | ||
| values of `Status.ErrorCount`. | ||
|
|
||
| ### Goals | ||
|
|
||
| * A declarative API to force retrying the evaluation of a BMH currently | ||
| waiting in error state | ||
|
|
||
| ### Non-Goals | ||
|
|
||
| * Clear or remove the current BMH error state | ||
|
|
||
| ## Proposal | ||
|
|
||
| ## Design Details | ||
|
|
||
| A new annotation `retry.metal3.io` will be used to trigger a new | ||
| reconciliation loop, and to reset the `Status.ErrorCount` field to 1. In this | ||
| way, the BMH resource will restart reconciliation with a narrowed loops | ||
|
|
||
| ### Implementation Details/Notes/Constraints | ||
|
|
||
| The annotation must be consumed as soon as possible within the BMO reconcile | ||
| loop, and it should be removed as soon as the `Status.ErrorCount` is set to 1. | ||
| If the BMH resource is not in an error state (`Status.ErrorCount` already | ||
| set to zero) then the annotation must be removed and the event properly logged. | ||
|
|
||
| ### Risks and Mitigations | ||
|
|
||
| - | ||
|
|
||
| ### Work Items | ||
|
|
||
| * Modify the BMO reconcile loop to handle the new annotation | ||
| * Unit tests for the above points | ||
|
|
||
| ### Test Plan | ||
|
|
||
| Verify through unit tests that a reconcile loop remove the | ||
| `retry.metal3.io` annotation and sets the ErrorCount field to 1 (if the BMH | ||
| was in error) | ||
|
|
||
| ### Upgrade / Downgrade Strategy | ||
|
|
||
| None | ||
|
|
||
| ## Drawbacks | ||
|
|
||
| None | ||
|
|
||
| ## Alternatives | ||
|
|
||
| Reducing the backoff gaps (or upper limit) could mitigate somehow as the | ||
| waiting time will be reduced | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Why not reduce the upper bound for the backoff?
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Another alternative: build a CLI util that resets the error count without making it part of the BMH API.
Member
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
I think it's listed as a an alternative.
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yes, it is listed as an alternative, but there is no reason given about why it was rejected.
Member
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Maybe I misunderstood the goal of the |
||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm still not clear why we need a special API for this. Any edit to the host, including adding a temporary label, would trigger reconciliation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess the difference this makes is that if it fails again for a different (perhaps more easily recoverable) reason then the time to retry is shorter.
That said, I must confess to being somewhat sceptical of the value also.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, the key point of the proposal it's about resetting the ErrorCount to 1 - otherwise any edit will be fine fo retriggering the reconciliation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we want to artificially reset the error count? Just to shorten retry intervals if there's another failure?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes (or another set of failures)