Handle DeployFail and CleanFail state in deprovisioning #289

longkb wants to merge 1 commit into metal3-io:master

Conversation
Hey @longkb, thanks for the PR, but doesn't this assume that we're happy with cleaning failing? It looks like your code puts nodes that have failed cleaning back into 'manage' mode, which should make them available. Is that your intention?
Yes, I would like to bring the Ironic node from `clean failed` back to `manageable`.
```diff
-	case nodes.Error, nodes.CleanFail:
+	case nodes.Error, nodes.DeployFail:
 		if !ironicNode.Maintenance {
```
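For context, a rough sketch (my reading, not the repository's actual code) of how the deprovisioning switch might look with this change applied, reusing the helper names that appear in the snippet later in this thread; the maintenance handling inside the case is an assumption:

```go
// Sketch only: dispatch on the node's current provision state while
// deprovisioning. Helper names follow the snippet quoted below.
switch nodes.ProvisionState(ironicNode.ProvisionState) {
case nodes.Error, nodes.DeployFail:
	if !ironicNode.Maintenance {
		// Hypothetical step: put the node into maintenance so the
		// delete request is accepted after a failed deploy.
		p.setMaintenanceFlag(ironicNode, true)
	}
	return p.changeNodeProvisionState(
		ironicNode,
		nodes.ProvisionStateOpts{Target: nodes.TargetDeleted},
	)
}
```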
I'm not sure why this was done; it doesn't seem to be needed in either case. Actually, automated cleaning will time out and fail if maintenance is on.
IMO, we need to take care of not only automated cleaning, but also manual cleaning (for RAID and BIOS configuration).
Right, but I'm not sure how that is related to my comment: any in-band clean steps will time out and fail if the node is in maintenance.
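For reference, a minimal sketch of clearing the flag with gophercloud's baremetal client, assuming `client` is a `*gophercloud.ServiceClient` for Ironic and that `UnsetMaintenance` is the right call here:

```go
import "github.com/gophercloud/gophercloud/openstack/baremetal/v1/nodes"

// Clear maintenance so in-band clean steps can actually run instead of
// timing out; ironicNode.UUID identifies the node in Ironic.
if err := nodes.UnsetMaintenance(client, ironicNode.UUID).ExtractErr(); err != nil {
	return err
}
```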
```diff
 			nodes.ProvisionStateOpts{Target: nodes.TargetDeleted},
 		)
+	case nodes.CleanFail:
```
CleanFail always results in maintenance mode; we have to unset it before we can do manage.
I tried to do the manage verb from clean failed without entering maintenance mode, and it worked smoothly.
Right, but you'll be left with maintenance mode on.
Ah ha, I have checked the workflow, and you were right. So should I unset the maintenance flag after jumping from clean failed to manage?
```go
case nodes.CleanFail:
	result, err = p.changeNodeProvisionState(
		ironicNode,
		nodes.ProvisionStateOpts{Target: nodes.TargetManage},
	)
	if ironicNode.Maintenance {
		p.log.Info("unset host maintenance flag to make host ready again")
		p.setMaintenanceFlag(ironicNode, false)
	}
	return result, err
```
I suggest unsetting maintenance before taking any actions.
Thanks for your review. The maintenance flag is now unset before executing the manage verb.
It's currently done after doing manage, which is not correct. Essentially, no actions should be taken while in maintenance.
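In code, a sketch of the suggested ordering, reusing the names from the snippet above: unset maintenance first, then move the node to manageable.

```go
case nodes.CleanFail:
	// CleanFail leaves the node in maintenance; clear the flag before
	// taking any provision-state action.
	if ironicNode.Maintenance {
		p.log.Info("unset host maintenance flag to make host ready again")
		p.setMaintenanceFlag(ironicNode, false)
	}
	// Only then move the node back to manageable.
	result, err = p.changeNodeProvisionState(
		ironicNode,
		nodes.ProvisionStateOpts{Target: nodes.TargetManage},
	)
	return result, err
```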
If we ever enable full cleaning, this will risk leaving the user's data on disks. Since we don't use full cleaning, it's probably fine.
My use case is an Ironic node that entered clean failed during provisioning; then we need to turn it back to manageable with the manage verb, not the deleted verb as shown in the deprovisioning code.
Force-pushed from 7f16c45 to e3d60b9
Can one of the admins verify this patch?
add to whitelist
/test-integration
/test-centos-integration
@zaneb can you take a look at this one?
This seems consistent with the state diagram, although clearly @dtantsur is the expert here.
This is somewhat concerning. From an upstream project perspective, it seems likely that eventually there will be people who want to use metal3 with full cleaning as well as those who want to use it without. It would be nice if there were some way of at least documenting that this is something we'd need to fix. Maybe open an issue for it now?
dtantsur left a comment:
I think maintenance still needs sorting out.
Force-pushed from e3d60b9 to 01bf8de
@longkb What's the status of this PR? Thanks
This PR is still waiting for review now :) I just would like to rebase this commit on the master branch :)
Force-pushed from 01bf8de to b965e6f
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: longkb. The full list of commands accepted by this bot can be found here. Needs approval from an approver in each of these files. Approvers can indicate their approval by writing `/approve` in a comment.
The last 2 comments from Dmitry were requesting changes, so this is waiting on you as I understand it.
Currently, `DeployFail` is not handled during deprovisioning. Besides, we must set the target to `manage`, not `deleted`, to handle `CleanFail`. This PR aims to fix these bugs.

Signed-off-by: Kim Bao Long <longkb@vn.fujitsu.com>
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /lifecycle stale
/remove-lifecycle stale
This seems important to finish
@stbenjam @zaneb I'd like to figure out the path forward for this. I can take over this patch if the original contributor is not available, but I need to understand what we want in the event of a failure. Is it fine to retry cleaning and deployment forever? Can we somehow limit the number of retries?
I originally assumed we would want to put the host into some sort of error state and wait to retry until the host was modified somehow. We're seeing failure causes not related to input through the API, though, so I think we want to change our approach and retry. We should still report an error status so that it is counted via prometheus and so we generate an event for entering that state, but then on the next reconciliation attempt we should try to recover and repeat the operation. We should do that indefinitely, except when the host is being deleted. If the host is being deleted we should attempt to deprovision a limited number of times (once?) and then go ahead and clean up ironic and allow the CR to be removed.
Ideally (and that's what TripleO does) we should retry a small number of times, then report an error. What we have now is an attempt to execute
Won't it look to a user like the installation is hanging?
Controllers are supposed to keep trying to reconcile until they finish their work. We have a few error states where we know there is no point in continuing to try, mostly tied to bad credentials. Other cases, where something goes wrong and we can't tell what it is, should be retried. I only called out deprovisioning during delete as an exception because we don't want to put the system into a state where the user can't recover at all. There is no way to reverse a delete operation, so we should allow it to proceed without leaving garbage data in the ironic database.
The original description in the ticket is talking about deprovisioning, which shouldn't happen during an installation. It may appear that deprovisioning is hanging, but that's accurate if we're getting an error.
Oh, I didn't notice that. Well, then retrying makes more sense, and this patch can be merged after my comments are addressed. @longkb, are you still working on this patch?
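A hedged sketch of the bounded-retry-on-delete idea discussed above; the limit, counter, and function names are hypothetical and not part of this PR:

```go
// Illustrative only: bound the number of deprovision attempts while a
// host is being deleted, then give up so the Ironic node can be cleaned
// up and the CR removed instead of blocking forever. The attempt count
// would have to be persisted somewhere (e.g. a status field).
const maxDeleteRetries = 1

func deprovisionOnDelete(attempts int, deprovision, cleanupIronic func() error) error {
	if attempts >= maxDeleteRetries {
		return cleanupIronic()
	}
	return deprovision()
}
```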
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /lifecycle stale
/remove-lifecycle stale
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /lifecycle stale
/remove-lifecycle stale
@zaneb: Closed this PR. In response to this: Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Currently, `DeployFail` is not handled in the event of deprovisioning. Besides, we have to set the target to `manage`, not `deleted`, to handle `CleanFail` (#318). This PR aims to fix these bugs.

[1] https://docs.openstack.org/ironic/pike/_images/states.svg

Signed-off-by: Kim Bao Long <longkb@vn.fujitsu.com>