File: `design/baremetal-operator/retry-interface.md` (92 additions, 0 deletions)
<!--
This work is licensed under a Creative Commons Attribution 3.0
Unported License.

http://creativecommons.org/licenses/by/3.0/legalcode
-->

# Retry from error state

## Status

provisional

## Summary

This proposal introduces a new declarative API
to forcibly retry the reconciliation of a BareMetalHost resource
that is waiting in an error state, so that when a fix is
immediately available the required waiting time can be reduced.
> **Member:** I'm still not clear why we need a special API for this. Any edit to the host, including adding a temporary label, would trigger reconciliation.

> **Member:** I guess the difference this makes is that if it fails again for a different (perhaps more easily recoverable) reason, then the time to retry is shorter. That said, I must confess to being somewhat sceptical of the value as well.

> **Member Author:** Yes, the key point of the proposal is resetting the `ErrorCount` to 1 - otherwise any edit would be fine for retriggering the reconciliation.

> **Member:** Why do we want to artificially reset the error count? Just to shorten retry intervals if there's another failure?

> **Member Author:** Yes (or another set of failures).


## Motivation

The current BMO error retry mechanism implements
exponential backoff with jitter: this means that when a BMH
goes into an error state, the gap between its reconciliation loops increases
as long as the error condition persists (based on the BMH `Status.ErrorCount`).

In some cases a human (or, more generally, an external)
intervention is necessary to fix the issue. If the fix involves amending the
BMH resource, then an immediate reconciliation is triggered and the
new state is evaluated again.
If instead the fix does not generate a relevant event captured by BMO
(for example, a firmware update on the host or a temporary connectivity
loss on the BMC network), then it is necessary
to wait for the next scheduled reconciliation loop for a new
state evaluation, and the waiting time can be long for higher
values of `Status.ErrorCount`.

### Goals

* A declarative API to force retrying the evaluation of a BMH currently
  waiting in an error state

### Non-Goals

* Clearing or removing the current BMH error state

## Proposal

## Design Details

A new annotation, `retry.metal3.io`, will be used to trigger a new
reconciliation loop and to reset the `Status.ErrorCount` field to 1. In this
way, the BMH resource will resume reconciliation with shortened loop intervals.

### Implementation Details/Notes/Constraints

The annotation must be consumed as soon as possible within the BMO reconcile
loop, and it should be removed as soon as `Status.ErrorCount` is set to 1.
If the BMH resource is not in an error state (`Status.ErrorCount` already
set to zero), then the annotation must be removed and the event properly logged.
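
A minimal sketch of that consumption logic follows. The `hostSnapshot` type and the function name are hypothetical stand-ins chosen for this sketch; the actual controller would operate on the full BareMetalHost CRD and its existing reconcile plumbing.

```go
package main

import "fmt"

// retryAnnotation is the annotation name proposed in this document.
const retryAnnotation = "retry.metal3.io"

// hostSnapshot is a hypothetical stand-in for the BareMetalHost fields the
// reconcile loop would look at; the real controller works on the full CRD.
type hostSnapshot struct {
	Annotations map[string]string
	ErrorCount  int
}

// consumeRetryAnnotation applies the proposed behaviour: the annotation is
// always removed, and ErrorCount is reset to 1 only when the host is in an
// error state. It reports whether a retry was actually applied.
func consumeRetryAnnotation(h *hostSnapshot) bool {
	if _, ok := h.Annotations[retryAnnotation]; !ok {
		return false
	}
	delete(h.Annotations, retryAnnotation)
	if h.ErrorCount == 0 {
		// Not in an error state: drop the annotation and log the event.
		fmt.Println("retry annotation ignored: host not in error state")
		return false
	}
	h.ErrorCount = 1
	return true
}

func main() {
	h := &hostSnapshot{
		Annotations: map[string]string{retryAnnotation: ""},
		ErrorCount:  7,
	}
	fmt.Println("applied:", consumeRetryAnnotation(h), "errorCount:", h.ErrorCount)
	// prints: applied: true errorCount: 1
}
```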

### Risks and Mitigations

-

### Work Items

* Modify the BMO reconcile loop to handle the new annotation
* Unit tests for the above points

### Test Plan

Verify through unit tests that the reconcile loop removes the
`retry.metal3.io` annotation and sets the `ErrorCount` field to 1 (if the BMH
was in an error state).

### Upgrade / Downgrade Strategy

None

## Drawbacks

None

## Alternatives

Reducing the backoff gaps (or their upper limit) could mitigate the issue
somewhat, as the waiting time would be reduced.
> **Member:** Why not reduce the upper bound for the backoff?


> **Member:** Another alternative: build a CLI util that resets the error count without making it part of the BMH API.


> **Member Author:**
> > Why not reduce the upper bound for the backoff?
>
> I think it's listed as an alternative. `maxBackOffCount` could be reduced to cap the max waiting time after the host has reached that number of (failed) iterations. Current gap values can be checked here


> **Member:** Yes, it is listed as an alternative, but there is no reason given about why it was rejected.


> **Member Author:** Maybe I misunderstood the goal of the Alternatives section, but my intention was to list another approach so that it could be discussed. Personally I think it's still a potentially viable alternative, as it does not require explicit user intervention.