
Handle DeployFail and CleanFail state in deprovisioning #289

Closed
longkb wants to merge 1 commit into metal3-io:master from longkb:handle_deployfail_state_in_deprovisioning

Conversation

@longkb
Contributor

@longkb longkb commented Aug 27, 2019

Currently, DeployFail is not handled during deprovisioning. In addition, we have to set the target to manage, not deleted, to handle CleanFail (#318). This PR aims to fix these bugs.

[1] https://docs.openstack.org/ironic/pike/_images/states.svg

Signed-off-by: Kim Bao Long longkb@vn.fujitsu.com
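
To make the intent concrete, here is a minimal sketch, assuming the deprovisioning switch and the changeNodeProvisionState helper that appear in the diff hunks quoted later in this thread; it illustrates the proposal, not the code that was eventually merged:

	case nodes.DeployFail:
		// A node stuck in "deploy failed" can be sent straight to "deleted",
		// which kicks off cleaning as part of deprovisioning.
		return p.changeNodeProvisionState(
			ironicNode,
			nodes.ProvisionStateOpts{Target: nodes.TargetDeleted},
		)

	case nodes.CleanFail:
		// A node in "clean failed" cannot go to "deleted"; it has to be moved
		// back to "manageable" with the manage verb instead.
		return p.changeNodeProvisionState(
			ironicNode,
			nodes.ProvisionStateOpts{Target: nodes.TargetManage},
		)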

@rdoxenham
Contributor

Hey @longkb, thanks for the PR, but doesn't this assume that we're happy with cleaning failing? It looks like your code puts nodes that have failed cleaning back into 'manage' mode, which should make them available. Is that your intention?

@longkb
Contributor Author

longkb commented Aug 27, 2019

Hey @longkb, thanks for the PR, but doesn't this assume that we're happy with cleaning failing? It looks like your code puts nodes that have failed cleaning back into 'manage' mode, which should make them available. Is that your intention?

Yes, I would like to bring the Ironic node from clean fail to manage during deprovisioning. This state transition is the same as the one for inspect fail [1].

[1] https://github.com/metal3-io/baremetal-operator/blob/master/pkg/provisioner/ironic/ironic.go#L1153


- case nodes.Error, nodes.CleanFail:
+ case nodes.Error, nodes.DeployFail:
  if !ironicNode.Maintenance {
Member


I'm not sure why this was done; it doesn't seem to be needed in either case. Actually, automated cleaning will time out and fail if maintenance is on.

Contributor Author


IMO, we need to take care of not only automated cleaning but also manual cleaning (for RAID and BIOS configuration).

Member


Right, but I'm not sure how it is related to my comment: any in-band clean steps will time out and fail if the node is in maintenance.

Member


This still needs addressing.

nodes.ProvisionStateOpts{Target: nodes.TargetDeleted},
)

case nodes.CleanFail:
Member


CleanFail always results in maintenance mode; we have to unset it before we can do manage.

Contributor Author


I tried the manage verb from clean fail without entering maintenance mode, and it worked smoothly.

Member


Right, but you'll be left with maintenance mode on.

Contributor Author


Ah, I have checked the workflow, and you were right. So should I unset the maintenance flag after jumping from clean fail to manage?

	case nodes.CleanFail:
		result, err = p.changeNodeProvisionState(
			ironicNode,
			nodes.ProvisionStateOpts{Target: nodes.TargetManage},
		)
		if ironicNode.Maintenance {
			p.log.Info("unset host maintenance flag to make host ready again")
			p.setMaintenanceFlag(ironicNode, false)
		}
		return result, err

Member


I suggest unsetting maintenance before taking any actions.

Contributor Author


Thanks for your review. The maintenance flag is now unset before executing the manage verb.

Member


It's currently done after doing manage, which is not correct. Essentially, no actions should be taken while in maintenance.
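
Put differently, one possible shape of the ordering being requested, as a sketch only and assuming the same surrounding Deprovision code and helpers quoted above: clear maintenance first, and only then ask Ironic to manage the node.

	case nodes.CleanFail:
		// Clean failed leaves the node in maintenance mode; clear that flag
		// before taking any other action on the node.
		if ironicNode.Maintenance {
			p.log.Info("unset host maintenance flag to make host ready again")
			p.setMaintenanceFlag(ironicNode, false)
		}
		// Only with maintenance off, move the node back to manageable.
		return p.changeNodeProvisionState(
			ironicNode,
			nodes.ProvisionStateOpts{Target: nodes.TargetManage},
		)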

@dtantsur
Member

Yes, I would like to bring the Ironic node from clean fail to manage during deprovisioning

If we ever enable full cleaning, this ^^ will risk leaving user's data on disks. Since we don't use full cleaning, it's probably fine.

@longkb
Contributor Author

longkb commented Aug 27, 2019

If we ever enable full cleaning, this ^^ will risk leaving user's data on disks. Since we don't use full cleaning, it's probably fine.

My use case is an Ironic node that entered clean fail during provisioning; we then need to move it back to manageable with the manage verb, not the deleted verb that the deprovisioning code currently uses.

@longkb longkb force-pushed the handle_deployfail_state_in_deprovisioning branch from 7f16c45 to e3d60b9 on August 28, 2019 at 04:06
@nordixinfra

Can one of the admins verify this patch?

@russellb
Member

add to whitelist

@russellb
Member

/test-integration

@russellb
Member

/test-centos-integration


@russellb
Member

@zaneb can you take a look at this one?

@zaneb
Member

zaneb commented Sep 25, 2019

This seems consistent with the state diagram, although clearly @dtantsur is the expert here.

Yes, I would like to bring the Ironic node from clean fail to manage during deprovisioning

If we ever enable full cleaning, this ^^ will risk leaving user's data on disks. Since we don't use full cleaning, it's probably fine.

This is somewhat concerning. From an upstream project perspective, it seems likely that eventually there will be people who want to use metal3 with full cleaning as well as those who want to use it without. It would be nice if there were some way of at least documenting that this is something we'd need to fix. Maybe open an issue for it now?

Member

@dtantsur dtantsur left a comment


I think maintenance still needs sorting out.

nodes.ProvisionStateOpts{Target: nodes.TargetDeleted},
)

case nodes.CleanFail:
Member


I suggest unsetting maintenance before taking any actions.

@longkb longkb force-pushed the handle_deployfail_state_in_deprovisioning branch from e3d60b9 to 01bf8de on October 9, 2019 at 01:18
@metal3-io-bot metal3-io-bot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Oct 9, 2019

- case nodes.Error, nodes.CleanFail:
+ case nodes.Error, nodes.DeployFail:
  if !ironicNode.Maintenance {
Member


This still needs addressing.

@stbenjam
Member

@longkb What's the status of this PR? Thanks

@longkb
Contributor Author

longkb commented Jan 16, 2020

@longkb What's the status of this PR? Thanks

This PR is still waiting for review :) I would just like to rebase this commit on the master branch :)

@longkb longkb force-pushed the handle_deployfail_state_in_deprovisioning branch from 01bf8de to b965e6f on January 16, 2020 at 01:20
@metal3-io-bot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: longkb
To complete the pull request process, please assign stbenjam
You can assign the PR to them by writing /assign @stbenjam in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@zaneb
Member

zaneb commented Jan 16, 2020

The last 2 comments from Dmitry were requesting changes, so this is waiting on you as I understand it.

Currently, `DeployFail` is not handled during deprovisioning. In addition, we must set the target to `manage`, not `deleted`, to handle `CleanFail`. This PR aims to fix these bugs.

Signed-off-by: Kim Bao Long <longkb@vn.fujitsu.com>
@metal3-io-bot
Contributor

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues will close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@metal3-io-bot metal3-io-bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 15, 2020
@dtantsur
Member

/remove-lifecycle stale

This seems important to finish.

@metal3-io-bot metal3-io-bot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 23, 2020
@dtantsur
Member

@stbenjam @zaneb I'd like to figure out the path forward for this. I can take over this patch if the original contributor is not available, but I need to understand what we want in the event of a failure.

Is it fine to retry cleaning and deployment forever? Can we somehow limit the number of retries?

@dhellmann
Member

@stbenjam @zaneb I'd like to figure out the path forward for this. I can take over this patch if the original contributor is not available, but I need to understand what we want in the event of a failure.

Is it fine to retry cleaning and deployment forever? Can we somehow limit the number of retries?

I originally assumed we would want to put the host into some sort of error state and wait to retry until the host was modified somehow. We're seeing failure causes not related to input through the API, though, so I think we want to change our approach and retry.

We should still report an error status so that it is counted via prometheus and so we generate an event for entering that state, but then on the next reconciliation attempt we should try to recover and repeat the operation. We should do that indefinitely, except when the host is being deleted. If the host is being deleted we should attempt to deprovision a limited number of times (once?) and then go ahead and clean up ironic and allow the CR to be removed.
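
As a rough illustration of that policy (the identifiers below are hypothetical and not taken from the operator code): retry indefinitely during normal reconciliation, but cap deprovisioning attempts while the host is being deleted so the CR can still be removed.

	// Hypothetical helper sketching the retry policy described above; none of
	// these names exist in the baremetal-operator itself.
	const maxDeleteDeprovisionAttempts = 1

	func shouldRetryDeprovision(beingDeleted bool, attempts int) bool {
		if !beingDeleted {
			// Outside of deletion, keep reconciling and retrying indefinitely,
			// while still reporting an error status and emitting an event.
			return true
		}
		// During deletion, give up after a limited number of attempts and
		// proceed to clean up the Ironic node so the CR can be removed.
		return attempts < maxDeleteDeprovisionAttempts
	}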

@dtantsur
Member

I originally assumed we would want to put the host into some sort of error state and wait to retry until the host was modified somehow. We're seeing failure causes not related to input through the API, though, so I think we want to change our approach and retry.

Ideally (and that's what TripleO does) we should retry a small number of times, then report an error. What we have now is an attempt to execute manage on a node in deploy fail, which is doomed to fail.

We should do that indefinitely, except when the host is being deleted

Won't it look like the installation is hanging to the user?

@dhellmann
Member

I originally assumed we would want to put the host into some sort of error state and wait to retry until the host was modified somehow. We're seeing failure causes not related to input through the API, though, so I think we want to change our approach and retry.

Ideally (and that's what TripleO does) we should retry a small number of times, then report an error. What we have now is an attempt to execute manage on a node in deploy fail, which is doomed to fail.

Controllers are supposed to keep trying to reconcile until they finish their work. We have a few error states where we know there is no point in continuing to try, mostly tied to bad credentials. Other cases, where something goes wrong and we can't tell what it is should be retried. I only called out deprovisioning during delete as an exception because we don't want to put the system into a state where the user can't recover at all. There is no way to reverse a delete operation, so we should allow it to proceed without leaving garbage data in the ironic database.

We should do that indefinitely, except when the host is being deleted

Won't it look like the installation is hanging to the user?

The original description in the ticket is talking about deprovisioning, which shouldn't happen during an installation. It may appear that deprovisioning is hanging, but that's accurate if we're getting an error.

@dtantsur
Member

dtantsur commented May 4, 2020

The original description in the ticket is talking about deprovisioning

Oh, I didn't notice that. Well, then retrying makes more sense, and this patch can be merged after my comments are addressed.

@longkb, do you still work on this patch?

@metal3-io-bot
Contributor

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues will close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@metal3-io-bot metal3-io-bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 2, 2020
@dtantsur
Member

dtantsur commented Aug 2, 2020

/remove-lifecycle stale

@metal3-io-bot metal3-io-bot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 2, 2020
@metal3-io-bot
Contributor

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues will close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@metal3-io-bot metal3-io-bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 31, 2020
@zaneb
Member

zaneb commented Nov 2, 2020

/remove-lifecycle stale

@metal3-io-bot metal3-io-bot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 2, 2020
@zaneb
Member

zaneb commented Jan 15, 2021

Resolved by #745 and #716, respectively.
/close

@metal3-io-bot
Contributor

@zaneb: Closed this PR.


In response to this:

Resolved by #745 and #716, respectively.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.


Labels

size/S Denotes a PR that changes 10-29 lines, ignoring generated files.
