Force retry when adoption fails #762
|
I've been thinking about alternative names that could avoid exposing to the user the implementation detail that after re-registration in the provisioned state, we call Ironic's adopt API. |
|
/test-integration |
|
This LGTM. |
The only time this error would come up is if the pod is rescheduled and we have to restore Ironic's state? That feels like we're exposing an internal error to the user. Can they even do anything about it? I haven't followed the discussion, but is there a technical reason to not use |
Yes. Registration previously cleared out all error messages; I've refactored it to look specifically for RegistrationError instead, and introduced AdoptionError for when adoption fails. Currently, if adoption fails we never retry, because the registration state clears all errors out. |
|
Do we have a corresponding state to indicate that a host is being adopted? I'm trying to understand if this new error condition exposes a concept to the user that we haven't previously expected them to know about. If not, then I might just call it something like Since we don't really want to expose the error to the user, it would be even better if we could use ironic's status directly, but I don't think we use that pattern anywhere else. |
|
Yeah, so it's kind of a philosophical question. Ideally the fact that Ironic keeps a bunch of state in a relational database that occasionally gets deleted and needs to be regenerated should be something that users never need to know about. However, if something goes wrong with that part, then we need a way of recording it. Arguably we do already expose this to the user in the sense that a registration error can occur at any time, not just in the Registration state. But having an adoption error seems worse because of the complete meaninglessness of the name outside of the implementation detail that we use in Ironic to accomplish it. Nevertheless we must have some way of distinguishing errors in adoption from errors in registration to fix the bug. My thought process behind |
|
The idea of |
|
Thanks all, I've updated the message to the proposed name. Please take another look /test-integration |
|
/approve |
|
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: andfasano, stbenjam, zaneb |
|
This also LGTM. |
|
This needs a rebase though. /cc @stbenjam |
|
@furkatgofurov7: GitHub didn't allow me to request PR reviews from the following users: stbenjam. Note that only metal3-io members and repo collaborators can review this PR, and authors cannot review their own PRs. |
Currently when adoption fails, we use RegistrationError but this gets cleared any time registration succeeds. We need a separate signal for adoption failure, which allows us to force retry the adoption again. This adds a new AdoptionError, and ensures that if registration succeeds we don't clear an AdoptionError.

Co-authored-by: Andrea Fasano <afasano@redhat.com>
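
For readers skimming the thread, the behaviour described in the commit message can be sketched roughly as below. This is an illustration only, not the code from this PR: the `HostStatus` type, the string values of the constants, and the helpers `ClearRegistrationError` and `NeedsAdoptionRetry` are hypothetical names made up for the example, and the error name follows the commit message even though the review discussion suggests it may have been renamed before merging.

```go
package main

import "fmt"

// ErrorType is a hypothetical stand-in for the error category
// recorded in a host's status. It is not the operator's real type.
type ErrorType string

const (
	NoError           ErrorType = ""
	RegistrationError ErrorType = "registration error"
	AdoptionError     ErrorType = "adoption error"
)

// HostStatus is a hypothetical, stripped-down status record.
type HostStatus struct {
	ErrorType    ErrorType
	ErrorMessage string
}

// ClearRegistrationError clears the recorded error only when it is a
// RegistrationError. Any other error, including an AdoptionError,
// is left untouched. It reports whether the status was changed.
func ClearRegistrationError(s *HostStatus) bool {
	if s.ErrorType != RegistrationError {
		return false
	}
	s.ErrorType = NoError
	s.ErrorMessage = ""
	return true
}

// NeedsAdoptionRetry reports whether an earlier adoption failure is
// still recorded, which is the signal to attempt adoption again.
func NeedsAdoptionRetry(s HostStatus) bool {
	return s.ErrorType == AdoptionError
}

func main() {
	// Assumed scenario: adoption failed earlier, then registration succeeds.
	s := HostStatus{ErrorType: AdoptionError, ErrorMessage: "adoption failed"}

	changed := ClearRegistrationError(&s)
	fmt.Println("status changed by successful registration:", changed)
	fmt.Println("adoption should be retried:", NeedsAdoptionRetry(s))
}
```

The point of the sketch is the guard in `ClearRegistrationError`: a successful registration no longer wipes every error, so a recorded adoption failure survives and can trigger another attempt.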
|
Thanks, rebased! /test-integration |
|
/lgtm |
Currently when adoption fails, we use RegistrationError but this gets
cleared any time registration succeeds. We need a separate signal for
adoption failure, which allows us to force retry the adoption again.
Fixes #697