
Handle DeployFail and CleanFail state in deprovisioning #289

Closed
longkb wants to merge 1 commit into metal3-io:master from longkb:handle_deployfail_state_in_deprovisioning

Conversation

@longkb
Contributor

@longkb longkb commented Aug 27, 2019

Currently, DeployFail is not handled during deprovisioning. In addition, we have to set the target to manage, not deleted, to handle CleanFail (#318). This PR aims to fix these bugs.

[1] https://docs.openstack.org/ironic/pike/_images/states.svg

Signed-off-by: Kim Bao Long longkb@vn.fujitsu.com
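
To make the intent concrete, here is a minimal sketch, assuming the deprovisioning switch and the changeNodeProvisionState helper that appear in the diff hunks quoted later in this thread; it illustrates the proposal, not the code that was eventually merged:

	case nodes.DeployFail:
		// A node stuck in "deploy failed" can be sent straight to "deleted",
		// which kicks off cleaning as part of deprovisioning.
		return p.changeNodeProvisionState(
			ironicNode,
			nodes.ProvisionStateOpts{Target: nodes.TargetDeleted},
		)

	case nodes.CleanFail:
		// A node in "clean failed" cannot go to "deleted"; it has to be moved
		// back to "manageable" with the manage verb instead.
		return p.changeNodeProvisionState(
			ironicNode,
			nodes.ProvisionStateOpts{Target: nodes.TargetManage},
		)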

@rdoxenham
Contributor

Hey @longkb, thanks for the PR, but doesn't this assume that we're happy with cleaning failing? It looks like your code puts nodes that have failed cleaning back into 'manage' mode, which should make them available. Is that your intention?

@longkb
Contributor Author

longkb commented Aug 27, 2019

Hey @longkb, thanks for the PR, but doesn't this assume that we're happy with cleaning failing? It looks like your code puts nodes that have failed cleaning back into 'manage' mode, which should make them available. Is that your intention?

Yes, I would like to bring the Ironic node from clean fail to manage during deprovisioning. This state transition is the same as the one for inspect fail [1].

[1] https://github.com/metal3-io/baremetal-operator/blob/master/pkg/provisioner/ironic/ironic.go#L1153


- case nodes.Error, nodes.CleanFail:
+ case nodes.Error, nodes.DeployFail:
  if !ironicNode.Maintenance {
Member


I'm not sure why this was done; it doesn't seem to be needed in either case. Actually, automated cleaning will time out and fail if maintenance is on.

Contributor Author


IMO, we need to take care of not only automated cleaning but also manual cleaning (for RAID and BIOS configuration).

Member


Right, but I'm not sure how it is related to my comment: any in-band clean steps will time out and fail if the node is in maintenance.

Member


This still needs addressing.

nodes.ProvisionStateOpts{Target: nodes.TargetDeleted},
)

case nodes.CleanFail:
Member


CleanFail always results in maintenance mode; we have to unset it before we can do manage.

Contributor Author


I tried the manage verb from clean fail without entering maintenance mode, and it worked smoothly.

Member


Right, but you'll be left with maintenance mode on.

Contributor Author


Ah, I have checked the workflow, and you were right. So should I unset the maintenance flag after jumping from clean fail to manage?

	case nodes.CleanFail:
		result, err = p.changeNodeProvisionState(
			ironicNode,
			nodes.ProvisionStateOpts{Target: nodes.TargetManage},
		)
		if ironicNode.Maintenance {
			p.log.Info("unset host maintenance flag to make host ready again")
			p.setMaintenanceFlag(ironicNode, false)
		}
		return result, err

Member


I suggest unsetting maintenance before taking any actions.

Contributor Author


Thanks for your review. The maintenance flag is now unset before executing the manage verb.

Member


It's currently done after doing manage, which is not correct. Essentially, no actions should be taken while in maintenance.
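
Put differently, one possible shape of the ordering being requested, as a sketch only and assuming the same surrounding Deprovision code and helpers quoted above: clear maintenance first, and only then ask Ironic to manage the node.

	case nodes.CleanFail:
		// Clean failed leaves the node in maintenance mode; clear that flag
		// before taking any other action on the node.
		if ironicNode.Maintenance {
			p.log.Info("unset host maintenance flag to make host ready again")
			p.setMaintenanceFlag(ironicNode, false)
		}
		// Only with maintenance off, move the node back to manageable.
		return p.changeNodeProvisionState(
			ironicNode,
			nodes.ProvisionStateOpts{Target: nodes.TargetManage},
		)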

@dtantsur
Member

Yes, I would like to bring the Ironic node from clean fail to manage during deprovisioning

If we ever enable full cleaning, this ^^ will risk leaving user's data on disks. Since we don't use full cleaning, it's probably fine.

@longkb
Contributor Author

longkb commented Aug 27, 2019

If we ever enable full cleaning, this ^^ will risk leaving user's data on disks. Since we don't use full cleaning, it's probably fine.

My use case is an Ironic node that entered clean fail during provisioning; we then need to move it back to manageable with the manage verb, not the deleted verb that the deprovisioning code currently uses.

@longkb longkb force-pushed the handle_deployfail_state_in_deprovisioning branch from 7f16c45 to e3d60b9 on August 28, 2019 at 04:06
@nordixinfra

Can one of the admins verify this patch?

@russellb
Member

add to whitelist

@russellb
Member

/test-integration

@russellb
Member

/test-centos-integration


@russellb
Member

@zaneb can you take a look at this one?

@zaneb
Member

zaneb commented Sep 25, 2019

This seems consistent with the state diagram, although clearly @dtantsur is the expert here.

Yes, I would like to bring the Ironic node from clean fail to manage during deprovisioning

If we ever enable full cleaning, this ^^ will risk leaving user's data on disks. Since we don't use full cleaning, it's probably fine.

This is somewhat concerning. From an upstream project perspective, it seems likely that eventually there will be people who want to use metal3 with full cleaning as well as those who want to use it without. It would be nice if there were some way of at least documenting that this is something we'd need to fix. Maybe open an issue for it now?

Member

@dtantsur dtantsur left a comment


I think maintenance still needs sorting out.

nodes.ProvisionStateOpts{Target: nodes.TargetDeleted},
)

case nodes.CleanFail:
Member


I suggest unsetting maintenance before taking any actions.

@longkb longkb force-pushed the handle_deployfail_state_in_deprovisioning branch from e3d60b9 to 01bf8de on October 9, 2019 at 01:18
@metal3-io-bot metal3-io-bot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Oct 9, 2019

- case nodes.Error, nodes.CleanFail:
+ case nodes.Error, nodes.DeployFail:
  if !ironicNode.Maintenance {
Member


This still needs addressing.

@stbenjam
Member

@longkb What's the status of this PR? Thanks

@longkb
Contributor Author

longkb commented Jan 16, 2020

@longkb What's the status of this PR? Thanks

This PR is still waiting for review :) I would just like to rebase this commit on the master branch :)

@longkb longkb force-pushed the handle_deployfail_state_in_deprovisioning branch from 01bf8de to b965e6f on January 16, 2020 at 01:20
@metal3-io-bot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: longkb
To complete the pull request process, please assign stbenjam
You can assign the PR to them by writing /assign @stbenjam in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@zaneb
Member

zaneb commented Jan 16, 2020

The last 2 comments from Dmitry were requesting changes, so this is waiting on you as I understand it.

Currently, `DeployFail` is not handled during deprovisioning. In addition, we must set the target to `manage`, not `deleted`, to handle `CleanFail`. This PR aims to fix these bugs.

Signed-off-by: Kim Bao Long <longkb@vn.fujitsu.com>
@metal3-io-bot
Contributor

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues will close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@metal3-io-bot metal3-io-bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 15, 2020
@dtantsur
Member

/remove-lifecycle stale

This seems important to finish.

@metal3-io-bot metal3-io-bot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 23, 2020
@dtantsur
Member

@stbenjam @zaneb I'd like to figure out the path forward for this. I can take over this patch if the original contributor is not available, but I need to understand what we want in the event of a failure.

Is it fine to retry cleaning and deployment forever? Can we somehow limit the number of retries?

@dhellmann
Member

@stbenjam @zaneb I'd like to figure out the path forward for this. I can take over this patch if the original contributor is not available, but I need to understand what we want in the event of a failure.

Is it fine to retry cleaning and deployment forever? Can we somehow limit the number of retries?

I originally assumed we would want to put the host into some sort of error state and wait to retry until the host was modified somehow. We're seeing failure causes not related to input through the API, though, so I think we want to change our approach and retry.

We should still report an error status so that it is counted via prometheus and so we generate an event for entering that state, but then on the next reconciliation attempt we should try to recover and repeat the operation. We should do that indefinitely, except when the host is being deleted. If the host is being deleted we should attempt to deprovision a limited number of times (once?) and then go ahead and clean up ironic and allow the CR to be removed.
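
As a rough illustration of that policy (the identifiers below are hypothetical and not taken from the operator code): retry indefinitely during normal reconciliation, but cap deprovisioning attempts while the host is being deleted so the CR can still be removed.

	// Hypothetical helper sketching the retry policy described above; none of
	// these names exist in the baremetal-operator itself.
	const maxDeleteDeprovisionAttempts = 1

	func shouldRetryDeprovision(beingDeleted bool, attempts int) bool {
		if !beingDeleted {
			// Outside of deletion, keep reconciling and retrying indefinitely,
			// while still reporting an error status and emitting an event.
			return true
		}
		// During deletion, give up after a limited number of attempts and
		// proceed to clean up the Ironic node so the CR can be removed.
		return attempts < maxDeleteDeprovisionAttempts
	}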

@dtantsur
Member

I originally assumed we would want to put the host into some sort of error state and wait to retry until the host was modified somehow. We're seeing failure causes not related to input through the API, though, so I think we want to change our approach and retry.

Ideally (and that's what TripleO does) we should retry a small number of times, then report an error. What we have now is an attempt to execute manage on a node in deploy fail, which is doomed to fail.

We should do that indefinitely, except when the host is being deleted

Won't it look like the installation is hanging to the user?

@dhellmann
Member

I originally assumed we would want to put the host into some sort of error state and wait to retry until the host was modified somehow. We're seeing failure causes not related to input through the API, though, so I think we want to change our approach and retry.

Ideally (and that's what TripleO does) we should retry a small number of times, then report an error. What we have now is an attempt to execute manage on a node in deploy fail, which is doomed to fail.

Controllers are supposed to keep trying to reconcile until they finish their work. We have a few error states where we know there is no point in continuing to try, mostly tied to bad credentials. Other cases, where something goes wrong and we can't tell what it is should be retried. I only called out deprovisioning during delete as an exception because we don't want to put the system into a state where the user can't recover at all. There is no way to reverse a delete operation, so we should allow it to proceed without leaving garbage data in the ironic database.

We should do that indefinitely, except when the host is being deleted

Won't it look like the installation is hanging to the user?

The original description in the ticket is talking about deprovisioning, which shouldn't happen during an installation. It may appear that deprovisioning is hanging, but that's accurate if we're getting an error.

@dtantsur
Member

dtantsur commented May 4, 2020

The original description in the ticket is talking about deprovisioning

Oh, I didn't notice that. Well, then retrying makes more sense, and this patch can be merged after my comments are addressed.

@longkb, do you still work on this patch?

@metal3-io-bot
Contributor

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues will close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@metal3-io-bot metal3-io-bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 2, 2020
@dtantsur
Member

dtantsur commented Aug 2, 2020

/remove-lifecycle stale

@metal3-io-bot metal3-io-bot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 2, 2020
@metal3-io-bot
Contributor

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues will close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@metal3-io-bot metal3-io-bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 31, 2020
@zaneb
Member

zaneb commented Nov 2, 2020

/remove-lifecycle stale

@metal3-io-bot metal3-io-bot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 2, 2020
@zaneb
Member

zaneb commented Jan 15, 2021

Resolved by #745 and #716, respectively.
/close

@metal3-io-bot
Contributor

@zaneb: Closed this PR.


In response to this:

Resolved by #745 and #716, respectively.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.


Labels

size/S Denotes a PR that changes 10-29 lines, ignoring generated files.
