Always retry provisioning operations on failure (continue)#610

Merged
metal3-io-bot merged 1 commit into metal3-io:master from andfasano:keep-retrying-ext
Sep 17, 2020

@andfasano
Member

@andfasano andfasano commented Jul 30, 2020

This PR replaces the one started by @honza in PR #584.

The reconciliation loop now retries the operation, with a backoff, whenever an action failure is detected. The backoff calculation has been reviewed and jitter has been added, and the error count is reset to zero when the action succeeds.
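
For readers unfamiliar with the pattern, here is a minimal sketch of an exponential backoff with jitter driven by an error count. It is an illustration only, not the code in this PR: the function name `backoffWithJitter`, the `unit` parameter, the cap `maxBackoffExponent` and the choice of subtracting up to half of the base delay are all assumptions made for the example.

```go
package main

import (
	"math"
	"math/rand"
	"time"
)

// maxBackoffExponent caps the exponent so the delay stops growing.
// The value 10 is a hypothetical choice for this sketch.
const maxBackoffExponent = 10

// backoffWithJitter returns a delay between half of and the full
// exponential delay unit * 2^errorCount. The random reduction spreads
// out retries from hosts that failed at the same moment.
// The name and signature are hypothetical.
func backoffWithJitter(errorCount int, unit time.Duration) time.Duration {
	if errorCount < 0 {
		errorCount = 0
	}
	if errorCount > maxBackoffExponent {
		errorCount = maxBackoffExponent
	}
	base := math.Exp2(float64(errorCount))
	// rand.Float64 is in [0, 1), so up to half of base is subtracted.
	jittered := base - rand.Float64()*base*0.5
	return time.Duration(float64(unit) * jittered)
}
```

With `unit` set to `time.Second`, a third consecutive failure would give a delay somewhere between 4 and 8 seconds under these assumptions.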

@metal3-io-bot metal3-io-bot added the do-not-merge/work-in-progress and size/M labels Jul 30, 2020
@andfasano andfasano force-pushed the keep-retrying-ext branch 2 times, most recently from f8fa6dc to e7e1dd7 Compare July 31, 2020 14:25
@andfasano andfasano changed the title [WIP] Always retry provisioning operations on failure (continue) Always retry provisioning operations on failure (continue) Jul 31, 2020
@metal3-io-bot metal3-io-bot removed the do-not-merge/work-in-progress label Jul 31, 2020
@andfasano
Member Author

/test-integration

Comment thread pkg/controller/baremetalhost/host_state_machine.go Outdated
Comment thread pkg/apis/metal3/v1alpha1/baremetalhost_types.go Outdated
Comment thread pkg/apis/metal3/v1alpha1/baremetalhost_types.go Outdated
Comment thread pkg/controller/baremetalhost/baremetalhost_controller.go Outdated
Comment thread pkg/controller/baremetalhost/action_result.go Outdated
Comment thread pkg/controller/baremetalhost/action_result.go Outdated
return backOffDuration
}

func (r actionFailed) Result() (result reconcile.Result, err error) {
Member

FWIW it would be fine to change the signature of Result() to pass the error count if it were more convenient for us to not calculate it until later.
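
A rough sketch of what that suggestion could look like, assuming the count is supplied by the caller rather than stored on the result. The `Result` struct here is a local stand-in for controller-runtime's `reconcile.Result` (only its `Requeue` and `RequeueAfter` fields), and `delayFor` is a hypothetical placeholder for the real backoff calculation:

```go
package main

import "time"

// Result is a local stand-in for controller-runtime's reconcile.Result,
// reduced to the two fields this sketch needs.
type Result struct {
	Requeue      bool
	RequeueAfter time.Duration
}

// actionFailed reuses the type name from the hunk above. In this sketch
// it does not hold the error count itself.
type actionFailed struct{}

// Result receives the error count as an argument (hypothetical
// signature), so the delay is only computed when the result is consumed.
func (r actionFailed) Result(errorCount int) (Result, error) {
	return Result{Requeue: true, RequeueAfter: delayFor(errorCount)}, nil
}

// delayFor is a hypothetical placeholder: plain doubling per failure,
// capped at 2^10 seconds, without jitter.
func delayFor(errorCount int) time.Duration {
	if errorCount < 0 {
		errorCount = 0
	}
	if errorCount > 10 {
		errorCount = 10
	}
	return time.Second << uint(errorCount)
}
```

The trade-off is that every implementation of the interface would need the extra parameter, even those that ignore it.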

Comment thread pkg/apis/metal3/v1alpha1/baremetalhost_types.go Outdated
Comment thread pkg/controller/baremetalhost/baremetalhost_controller.go Outdated
Comment thread pkg/controller/baremetalhost/host_state_machine.go Outdated
@metal3-io-bot metal3-io-bot added the size/L label and removed the size/M label Aug 6, 2020
Comment thread pkg/apis/metal3/v1alpha1/baremetalhost_types.go Outdated
Comment thread pkg/apis/metal3/v1alpha1/baremetalhost_types.go Outdated
@andfasano andfasano force-pushed the keep-retrying-ext branch 2 times, most recently from 7e4bdfc to d5bea5a Compare September 3, 2020 09:13
@andfasano
Member Author

/test-integration

Comment thread pkg/controller/baremetalhost/baremetalhost_controller.go
Member

@dhellmann dhellmann left a comment

/approve

Comment thread pkg/controller/baremetalhost/baremetalhost_controller.go
@metal3-io-bot metal3-io-bot added the approved label Sep 9, 2020
@maelk
Member

maelk commented Sep 9, 2020

Would you mind squashing the commits, please?

@andfasano
Member Author

/test-integration

@andfasano
Member Author

/test govet

@andfasano
Member Author

/test unit

@andfasano
Member Author

/test-integration

@dhellmann
Member

/approve

@dhellmann dhellmann requested a review from zaneb September 15, 2020 17:39
Member

@zaneb zaneb left a comment

Refactoring in a separate patch is good, but please squash stuff like "Fix imports".
/approve

Comment thread pkg/apis/metal3/v1alpha1/baremetalhost_types.go Outdated
Comment thread pkg/controller/baremetalhost/baremetalhost_controller.go
@andfasano
Member Author

/test-integration

Improve the reconciliation loop whenever an action failure (or
credential error) is detected by applying a retry pattern with
exponential backoff and jitter to avoid overloading the service.
The currently affected statuses are: Deprovisioning, Externally
Provisioned, Inspecting, Provisioned, Provisioning and Registering.
A new `ErrorCount` field has been added to the `BareMetalHostStatus`
to support this behavior: it is incremented every time an action
failure is recorded, and it is cleared when the action completes
successfully.
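
The `ErrorCount` lifecycle described in the commit message can be sketched as follows. This is an illustration under assumptions, not the merged code: `hostStatus` is a stand-in for `BareMetalHostStatus` reduced to the one field, and `recordActionOutcome` is a hypothetical helper name.

```go
package main

// hostStatus is a stand-in for BareMetalHostStatus, reduced to the
// single field discussed in the commit message.
type hostStatus struct {
	ErrorCount int
}

// recordActionOutcome is a hypothetical helper: it increments the
// counter on each recorded failure and clears it on success.
func recordActionOutcome(s *hostStatus, failed bool) {
	if failed {
		s.ErrorCount++
		return
	}
	s.ErrorCount = 0
}
```

Because the counter is cleared on success, a host that recovers would start again from the shortest delay on its next failure.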
@andfasano
Member Author

andfasano commented Sep 16, 2020

/test-integration

@dhellmann
Member

/approve

@metal3-io-bot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andfasano, dhellmann, zaneb

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@zaneb
Member

zaneb commented Sep 17, 2020

/lgtm

@metal3-io-bot metal3-io-bot added the lgtm label Sep 17, 2020
@maelk
Member

maelk commented Sep 17, 2020

/test-integration

@metal3-io-bot metal3-io-bot merged commit aa20f8a into metal3-io:master Sep 17, 2020
honza pushed a commit to honza/baremetal-operator that referenced this pull request Sep 23, 2020
Always retry provisioning operations on failure (continue)
honza pushed a commit to honza/baremetal-operator that referenced this pull request Sep 29, 2020
Always retry provisioning operations on failure (continue)
honza pushed a commit to honza/baremetal-operator that referenced this pull request Sep 29, 2020
Always retry provisioning operations on failure (continue)