Attempt to hard power off node before it is deleted #816
sadasu wants to merge 3 commits into metal3-io:master from
|
/assign @andfasano |
|
/test-integration |
        expectedPowerState: "",
        expectedError:      "failed to remove host",
    },
}
A couple of additional test cases are needed to cover the situation where p.hardPowerOff() returns an error.
Using https://github.com/metal3-io/baremetal-operator/blob/master/pkg/provisioner/ironic/testserver/ironic.go#L227, I wasn't able to mock the outcome of hardPowerOff(), which is called from within Delete().
That's a really interesting case, and it's worth some additional explanation.
Adding the mocking function to the test case should be OK, i.e.:
    ironic: testserver.NewIronic(t).Node(
        nodes.Node{
            UUID:           nodeUUID,
            ProvisionState: "active",
            Maintenance:    true,
            PowerState:     powerOn,
        },
    ).WithNodeStatesPowerUpdate(nodeUUID, http.StatusConflict).Delete(nodeUUID),
In this case the expected error is returned; however, this type assertion https://github.com/sadasu/baremetal-operator/blob/4d0f31405af55c03516533cd6e624aa6c32afaa2/pkg/provisioner/ironic/ironic.go#L1624 is going to fail because the real error is wrapped (twice).
As per the recommended Go best practices for errors (https://blog.golang.org/go1.13-errors), the type check must be changed to use the new errors functions, like:
    var hostErr *HostLockedError
    if errors.As(err, &hostErr) {
        p.log.Info("could not power off host, busy")
        return retryAfterDelay(powerRequeueDelay)
    } else {
        return operationFailed("failed to power off host")
    }
But that's not yet sufficient, because the underlying changePower returns the error as a value rather than as a pointer: https://github.com/sadasu/baremetal-operator/blob/4d0f31405af55c03516533cd6e624aa6c32afaa2/pkg/provisioner/ironic/ironic.go#L1681
So that code must be changed as well, to:
    return result, &HostLockedError{Address: p.host.Spec.BMC.Address}
This means, of course, that other points in the code (e.g. PowerOn) will require a deeper review to make sure they handle the error properly.
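To illustrate the point about wrapping and pointer versus value errors, here is a minimal standalone sketch. The HostLockedError type below is a simplified stand-in written for this example (it is not copied from the provisioner), and wrapTwice and isHostLocked are hypothetical helpers; the sketch assumes the real type has a value receiver for Error(), which is what would allow it to be returned by value.

```go
package main

import (
	"errors"
	"fmt"
)

// HostLockedError is a simplified stand-in for the provisioner's error type.
// A value receiver is assumed, so both HostLockedError and *HostLockedError
// satisfy the error interface.
type HostLockedError struct {
	Address string
}

func (e HostLockedError) Error() string {
	return fmt.Sprintf("host %s is locked", e.Address)
}

// wrapTwice (hypothetical) mimics an error being wrapped by two callers.
func wrapTwice(err error) error {
	inner := fmt.Errorf("failed to change power state: %w", err)
	return fmt.Errorf("failed to power off host: %w", inner)
}

// isHostLocked (hypothetical) uses errors.As with a pointer target, as
// suggested in the review comment.
func isHostLocked(err error) bool {
	var hostErr *HostLockedError
	return errors.As(err, &hostErr)
}

func main() {
	byValue := wrapTwice(HostLockedError{Address: "ipmi://192.0.2.1"})
	byPointer := wrapTwice(&HostLockedError{Address: "ipmi://192.0.2.1"})

	// A plain type assertion only inspects the outermost error.
	_, ok := byPointer.(*HostLockedError)
	fmt.Println("type assertion on wrapped error:", ok)

	// errors.As walks the chain, but the target type must match what was returned.
	fmt.Println("errors.As, returned by value:  ", isHostLocked(byValue))
	fmt.Println("errors.As, returned by pointer:", isHostLocked(byPointer))
}
```

Only the last check succeeds, which is why both the errors.As change and the pointer return are needed together.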
I have attempted to do this in c941bc0. This commit might need more work. The test for this is here: https://github.com/metal3-io/baremetal-operator/pull/816/files#diff-969e3b93b7bf85d6166287117b76766bd03994870cb36c0f26dce54cf2c11a50R147.
        ),
        priorErrors:        2,
        expectedPowerState: "power off",
        expectedError:      "",
|
/test-integration |
zaneb left a comment
There's one other problem we'll have to address, which is that hardPowerOff() doesn't set (and Delete doesn't check for) an ErrorMessage when there is a LastError from ironic. So if all the ironic calls succeed but ironic cannot actually change the power state, then we will keep retrying forever.
Fun fact: this also means that we currently never report a power management error. I raised #828 to record this. Basically I don't think we will be able to complete this until that issue is fixed.
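A rough sketch of the kind of check being described, for illustration only. The node struct and the checkPowerOff helper are hypothetical and standalone; the field names mirror the power_state, target_power_state and last_error fields that ironic reports for a node, but the decision logic here is an assumption and not the provisioner's actual code.

```go
package main

import "fmt"

// node is a hypothetical, trimmed-down view of the ironic node fields
// relevant to a power change.
type node struct {
	PowerState       string
	TargetPowerState string
	LastError        string
}

// checkPowerOff (hypothetical) reports whether the host is off and, if the
// power change appears to have failed, an error message to surface instead
// of retrying silently.
func checkPowerOff(n node) (done bool, errorMessage string) {
	switch {
	case n.PowerState == "power off":
		return true, ""
	case n.TargetPowerState != "":
		// A change is still in progress; check again later.
		return false, ""
	case n.LastError != "":
		// The API calls succeeded, but the power state could not be changed.
		return false, fmt.Sprintf("failed to power off host: %s", n.LastError)
	default:
		return false, ""
	}
}

func main() {
	stuck := node{PowerState: "power on", LastError: "BMC unreachable"}
	done, msg := checkPowerOff(stuck)
	fmt.Println(done, msg)
}
```

With something along these lines, a caller such as Delete could stop retrying once an error message has been recorded, rather than looping forever.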
a7ac374 to 2649e51
|
/test-integration |
Let the controller monitor the number of retries and communicate behavior to provisioner via force flag.
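One possible reading of this commit message, sketched as a standalone example. Everything here is hypothetical: the maxPowerOffRetries threshold, the fakeProvisioner type and the shape of Delete(force) are assumptions made for illustration and are not taken from the PR.

```go
package main

import (
	"errors"
	"fmt"
)

// maxPowerOffRetries is a hypothetical threshold owned by the controller.
const maxPowerOffRetries = 3

// fakeProvisioner is a hypothetical provisioner whose power-off always fails.
type fakeProvisioner struct {
	deleted bool
}

// Delete tries to power the host off first; with force set, it skips the
// power-off step and removes the host anyway.
func (p *fakeProvisioner) Delete(force bool) error {
	if !force {
		return errors.New("failed to power off host")
	}
	p.deleted = true
	return nil
}

// deleteWithRetries plays the controller's role: it tracks the error count
// and sets the force flag once the threshold is reached. It returns the
// number of Delete calls that were made.
func deleteWithRetries(p *fakeProvisioner) int {
	errorCount := 0
	attempts := 0
	for !p.deleted {
		attempts++
		force := errorCount >= maxPowerOffRetries
		if err := p.Delete(force); err != nil {
			errorCount++
		}
	}
	return attempts
}

func main() {
	p := &fakeProvisioner{}
	fmt.Println("attempts:", deleteWithRetries(p), "deleted:", p.deleted)
}
```

The design choice this is meant to show is that the retry bookkeeping lives in the controller, while the provisioner only reacts to the flag it is given.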
|
/test-integration |
|
The baremetal host controller checks the host's error count before performing some, but not all, actions. A host's error count should inform whether the controller performs any action on the host at all. Does the controller have a higher threshold for the overall number of errors, above which it does not call any action on that host? And should this be checked before the action is attempted, rather than within the logic for each action? |
            PowerState: powerOn,
        },
    ).WithNodeStatesPower(nodeUUID, http.StatusConflict).WithNodeStatesPowerUpdate(nodeUUID, http.StatusConflict),
    expectedDirty: false,
These expected values are consistent with what is returned by transientError(). http.StatusConflict does not result in a HostLockedError.
|
/test-integration |
|
Hello! What is the status of this PR? Is it still blocked? |
|
@sadasu looks like the PR needs a rebase |
|
@sadasu: PR needs rebase. |
|
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /lifecycle stale |
|
@sadasu: The following tests failed, say |
|
Stale issues close after 30d of inactivity. Reopen the issue with /close |
|
@metal3-io-bot: Closed this PR. |
When a BareMetalHost is deleted, power it off before performing
the delete operation.
Fixes #410