🐛 Power off nodes upon deletion #1176
|
/test-centos-integration-main |
|
/lgtm |
|
Heads up: the CentOS failure is unrelated to these changes; we are facing issues with CI. xref: https://kubernetes.slack.com/archives/CHD49TLE7/p1666273231196429 |
|
What is the progress here? |
72efc2a to
8b41131
Compare
|
/test-centos-integration-main |
|
/lgtm |
|
/lgtm cancel
One issue inline, otherwise looking good. |
8b41131 to
2f4593e
Compare
2f4593e to
277a166
Compare
zaneb
left a comment
Refamiliarising myself with how all of this works 😅
277a166 to
77fdf84
Compare
|
|
||
| if err != nil { | ||
| if info.host.Status.ErrorCount < maxPowerOffRetryCount { | ||
| return actionError{errors.Wrap(err, "failed to power off")} |
I'm not seeing anything addressing my earlier comment about an infinite error loop in cases where the node is missing from ironic and we don't have credentials to re-register it, which we handle in deprovisioning here and also now need to handle in deleting.
I wonder if this would all be made simpler by putting the new code into a separate actionPowerOffBeforeDeleting() method, so that the state machine code can easily distinguish between errors coming from the power off vs. the delete.
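To illustrate the separation suggested above, here is a minimal sketch in which a dedicated power-off action returns its own error, so the state handler can tell a power-off failure apart from a delete failure and skip ahead when the node is unregistered and cannot be re-registered. The types `hostInfo` and `nextStep`, the sentinel `errNeedsRegistration`, and the function signatures are simplified stand-ins assumed for this example; they are not the operator's real definitions.

```go
package main

import (
	"errors"
	"fmt"
)

// errNeedsRegistration is a hypothetical sentinel standing in for
// "the node is missing from ironic and must be registered again".
var errNeedsRegistration = errors.New("host not registered in provisioner")

// hostInfo is a hypothetical, trimmed-down view of what the handler knows.
type hostInfo struct {
	HasBMCCredentials bool
}

// nextStep is a hypothetical summary of what the state machine does next.
type nextStep string

const (
	nextDelete nextStep = "delete"
	nextRetry  nextStep = "retry"
)

// actionPowerOffBeforeDeleting only powers off, so any error it returns
// is known to come from the power-off attempt and not from the delete.
func actionPowerOffBeforeDeleting(powerOff func() error) error {
	if err := powerOff(); err != nil {
		return fmt.Errorf("failed to power off: %w", err)
	}
	return nil
}

// handlePoweringOffBeforeDelete decides the next step from the action result.
func handlePoweringOffBeforeDelete(info hostInfo, powerOff func() error) nextStep {
	err := actionPowerOffBeforeDeleting(powerOff)
	switch {
	case err == nil:
		return nextDelete
	case errors.Is(err, errNeedsRegistration) && !info.HasBMCCredentials:
		// Retrying cannot succeed without credentials, so skip to the
		// delete instead of looping on the same error forever.
		return nextDelete
	default:
		return nextRetry
	}
}

func main() {
	missing := func() error { return errNeedsRegistration }
	fmt.Println(handlePoweringOffBeforeDelete(hostInfo{HasBMCCredentials: false}, missing))
	fmt.Println(handlePoweringOffBeforeDelete(hostInfo{HasBMCCredentials: true}, missing))
}
```

With credentials available the handler keeps retrying, since re-registration may still succeed; without them it proceeds to the delete.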
| } | ||
| } | ||
|
|
||
| info.host.Status.ErrorCount = 0 |
As mentioned in the last part of https://github.com/metal3-io/baremetal-operator/pull/1176/files#r1053444986, iff the error count wasn't already 0 we'll want to return actionUpdate wrapping actionContinue on line 533.
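A rough sketch of that idea: only report an update when clearing the counter actually changes the status. The `actionResult`, `actionContinue` and `actionUpdate` types below are simplified stand-ins that reuse the names from the comment, and `resetErrorCount` is a hypothetical helper, so treat this as an assumption about the shape rather than the operator's code.

```go
package main

import "fmt"

// actionResult is a simplified stand-in: Dirty reports whether the host
// status changed and therefore needs to be written back.
type actionResult interface {
	Dirty() bool
}

// actionContinue means "move on, nothing to save".
type actionContinue struct{}

func (actionContinue) Dirty() bool { return false }

// actionUpdate wraps actionContinue and marks the status as changed.
type actionUpdate struct {
	actionContinue
}

func (actionUpdate) Dirty() bool { return true }

// resetErrorCount clears the counter and returns actionUpdate only if
// the counter was non-zero, i.e. only if something really changed.
func resetErrorCount(errorCount *int) actionResult {
	if *errorCount == 0 {
		return actionContinue{}
	}
	*errorCount = 0
	return actionUpdate{}
}

func main() {
	count := 2
	first := resetErrorCount(&count)
	fmt.Println(first.Dirty(), count) // true 0

	second := resetErrorCount(&count)
	fmt.Println(second.Dirty(), count) // false 0
}
```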
| } | ||
| return result | ||
| } | ||
| } |
We should probably also set Status.PoweredOn to false so that we are reporting what we've actually done.
77fdf84 to
b61cb84
Compare
|
This might be overkill but the suggested changes led me to create a new step in the state machine. When a delete is requested, instead of When we clear |
b61cb84 to
1630c13
Compare
|
/test-centos-integration-main |
|
/test-centos-e2e-integration-main |
1630c13 to
2f38157
Compare
|
|
||
| func (hsm *hostStateMachine) handlePoweringOffBeforeDelete(info *reconcileInfo) actionResult { | ||
| actResult := hsm.Reconciler.actionPowerOffBeforeDeleting(hsm.Provisioner, info) | ||
| skipToDelete := func() actionResult { |
nit: most of this function is repeating handleDeprovisioning. It would be great to refactor them.
|
/test-centos-e2e-integration-main |
|
/assign @zaneb |
We introduce a new step in the state machine where the node goes through a power-off stage before it is deleted. We attempt to power it off 3 times before giving up and proceeding to the delete.
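The flow described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: the `host` struct, the `step` function and the state values are hypothetical, and the exact point at which the error count is incremented and compared is assumed. Only the name `maxPowerOffRetryCount` and the limit of 3 attempts come from the diff and the description.

```go
package main

import (
	"errors"
	"fmt"
)

// maxPowerOffRetryCount reuses the name from the diff; 3 is the number
// of attempts given in the PR description.
const maxPowerOffRetryCount = 3

// state is a hypothetical stand-in for the host provisioning state.
type state string

const (
	statePoweringOffBeforeDelete state = "powering off before delete"
	stateDeleting                state = "deleting"
)

// host is a hypothetical subset of the host status fields.
type host struct {
	State      state
	ErrorCount int
	PoweredOn  bool
}

// errUnreachable is a made-up failure used by the example below.
var errUnreachable = errors.New("BMC unreachable")

// step runs one reconcile pass of the power-off-before-delete stage.
func step(h *host, powerOff func() error) {
	if h.State != statePoweringOffBeforeDelete {
		return
	}
	if err := powerOff(); err != nil {
		h.ErrorCount++
		if h.ErrorCount < maxPowerOffRetryCount {
			return // stay in this state and retry on the next pass
		}
		// Give up on powering off and proceed to the delete anyway.
		h.ErrorCount = 0
		h.State = stateDeleting
		return
	}
	// Report what was actually done before moving on.
	h.PoweredOn = false
	h.ErrorCount = 0
	h.State = stateDeleting
}

func main() {
	h := &host{State: statePoweringOffBeforeDelete, PoweredOn: true}
	attempts := 0
	for h.State == statePoweringOffBeforeDelete {
		step(h, func() error { attempts++; return errUnreachable })
	}
	fmt.Println(attempts, h.State) // 3 deleting
}
```

When the power off succeeds on the first pass, the host moves straight to deleting with `PoweredOn` set to false.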
2f38157 to
6f65d8e
Compare
|
/lgtm |
|
/test-centos-e2e-integration-main |
|
/approve |
|
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: dtantsur |
This is a continuation of #816 which in turn tries to fix #410.
Co-authored-by: Sandhya Dasu <sadasu@redhat.com> @sadasu