Skip to content

Force retry when adoption fails#762

Merged
metal3-io-bot merged 1 commit intometal3-io:masterfrom
stbenjam:fix-adoption
Jan 20, 2021
Merged

Force retry when adoption fails#762
metal3-io-bot merged 1 commit intometal3-io:masterfrom
stbenjam:fix-adoption

Conversation

@stbenjam
Copy link
Copy Markdown
Member

Currently when adoption fails, we use RegistrationError but this gets
cleared any time registration succeeds. We need a separate signal for
adoption failure, which allows us to force retry the adoption again.

Fixes #697

@metal3-io-bot metal3-io-bot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. approved Indicates a PR has been approved by an approver from all required OWNERS files. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Jan 13, 2021
Comment thread apis/metal3.io/v1alpha1/baremetalhost_types.go Outdated
Comment thread controllers/metal3.io/host_state_machine.go Outdated
@zaneb
Copy link
Copy Markdown
Member

zaneb commented Jan 13, 2021

I've been thinking about alternative names that could avoid exposing to the user the implementation detail that after re-registration in the provisioned state, we call Ironic's adopt API.
The best I've come up with is "provisioned registration error" which is admittedly terrible. Any other ideas?

Comment thread controllers/metal3.io/baremetalhost_controller.go
@stbenjam stbenjam force-pushed the fix-adoption branch 3 times, most recently from 5a7e7bd to 7031d68 Compare January 14, 2021 19:19
@metal3-io-bot metal3-io-bot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Jan 14, 2021
Comment thread controllers/metal3.io/host_state_machine_test.go Outdated
@stbenjam stbenjam changed the title [wip] Force retry when adoption fails Force retry when adoption fails Jan 15, 2021
@metal3-io-bot metal3-io-bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 15, 2021
@stbenjam stbenjam force-pushed the fix-adoption branch 2 times, most recently from b8cf13e to 8eed652 Compare January 18, 2021 15:53
@stbenjam
Copy link
Copy Markdown
Member Author

/test-integration

@stbenjam stbenjam requested review from andfasano and zaneb January 18, 2021 16:16
@zaneb
Copy link
Copy Markdown
Member

zaneb commented Jan 18, 2021

This LGTM.
@dhellmann do you have any opinion on the name of this error?

Copy link
Copy Markdown
Member

@andfasano andfasano left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It also LGTM

@dhellmann
Copy link
Copy Markdown
Member

This LGTM.
@dhellmann do you have any opinion on the name of this error?

The only time this error would come up is if the pod is rescheduled and we have to restore Ironic's state? That feels like we're exposing an internal error to the user. Can they even do anything about it?

I haven't followed the discussion, but is there a technical reason to not use RegistrationError? Maybe we have to be able to tell the 2 errors apart inside the BMO code to trigger the retry in the right way?

@stbenjam
Copy link
Copy Markdown
Member Author

This LGTM.
@dhellmann do you have any opinion on the name of this error?

The only time this error would come up is if the pod is rescheduled and we have to restore Ironic's state? That feels like we're exposing an internal error to the user. Can they even do anything about it?

I haven't followed the discussion, but is there a technical reason to not use RegistrationError? Maybe we have to be able to tell the 2 errors apart inside the BMO code to trigger the retry in the right way?

Yes. Registration previously cleared out all error messages, I've refactored it to specifically look for RegistrationError instead, and introduced AdoptionError when there's a failure to adopt. Currently if adoption fails we never retry because the registration state clears all errors out.

@dhellmann
Copy link
Copy Markdown
Member

Do we have a corresponding state to indicate that a host is being adopted? I'm trying to understand if this new error condition exposes a concept to the user that we haven't previously expected them to know about. If not, then I might just call it something like InternalError. That's pretty vague, but doesn't leave the user wondering why a host is showing an error when they didn't do anything to it.

Since we don't really want to expose the error to the user, it would be even better if we could use ironic's status directly, but I don't think we use that pattern anywhere else.

@zaneb
Copy link
Copy Markdown
Member

zaneb commented Jan 18, 2021

Yeah, so it's kind of a philosophical question.
Adoption is not mentioned anywhere else in the public API, so this is a new and potentially confusing piece of information to users. (If 'adoption' means anything to users, it's probably the thing that the ExternallyProvisioned flag does.)

Ideally the fact that Ironic keeps a bunch of state in a relational database that occasionally gets deleted and needs to be regenerated should be something that users never need to know about. However, if something goes wrong with that part, then we need a way of recording it. Arguably we do already expose this to the user in the sense that a registration error can occur at any time, not just in the Registration state. But having an adoption error seems worse because of the complete meaninglessness of the name outside of the implementation detail that we use in Ironic to accomplish it. Nevertheless we must have some way of distinguishing errors in adoption from errors in registration to fix the bug.

My thought process behind provisioned registration error was basically that adoption is the second half of the registration process that only occurs when the Host is already provisioned. I think that's a productive line of enquiry even though that particular name is pretty so-so.

@andfasano
Copy link
Copy Markdown
Member

The idea of InternalError looks appealing, as it could be of some use in future - at the same time it risks to be abused being so generic, whereas the provisioned registration error is definitely more granular and strictly scoped for a very specific kind of error. Anyhow, by looking at the current Provisioner design, where many actions could depend on ErrorType (the force bool), I'd say that we're moving into a direction closer to having more granular errors (maybe in future we could introduce a SubErrorType?) rather than generic.
The name provisioned registration error doesn't sound too bad to me (and I don't have better alternatives), so I think we could proceed in that direction.

@stbenjam
Copy link
Copy Markdown
Member Author

Thanks all, I've updated the message to the proposed name. Please take another look

/test-integration

@zaneb
Copy link
Copy Markdown
Member

zaneb commented Jan 19, 2021

/approve
Will leave this open until tomorrow in case somebody else has a better idea for the name.

@metal3-io-bot
Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andfasano, stbenjam, zaneb

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@zaneb zaneb requested a review from furkatgofurov7 January 19, 2021 21:14
@furkatgofurov7
Copy link
Copy Markdown
Member

This also LGTM.
I do not have either better alternative for the name, but provisioned registration error sounds reasonable.

@furkatgofurov7
Copy link
Copy Markdown
Member

This needs a rebase though.

/cc @stbenjam

@metal3-io-bot
Copy link
Copy Markdown
Contributor

@furkatgofurov7: GitHub didn't allow me to request PR reviews from the following users: stbenjam.

Note that only metal3-io members and repo collaborators can review this PR, and authors cannot review their own PRs.

Details

In response to this:

This needs a rebase though.

/cc @stbenjam

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Currently when adoption fails, we use RegistrationError but this gets
cleared any time registration succeeds. We need a separate signal for
adoption failure, which allows us to force retry the adoption again.
This adds a new AdoptionError, and ensures that if registration succeeds
we don't clear an AdoptionError.

Co-authored-by: Andrea Fasano <afasano@redhat.com>
@stbenjam
Copy link
Copy Markdown
Member Author

Thanks, rebased!

/test-integration

@zaneb
Copy link
Copy Markdown
Member

zaneb commented Jan 20, 2021

/lgtm

@metal3-io-bot metal3-io-bot added the lgtm Indicates that a PR is ready to be merged. label Jan 20, 2021
@metal3-io-bot metal3-io-bot merged commit 625678a into metal3-io:master Jan 20, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Retry when node in "Adopt Failed" state

6 participants