Changing the default behaviour of the CAPBM to request hard reboot #138
openshift-merge-robot merged 2 commits into openshift:master
|
/lgtm |
|
Hmmm, actually I'm not sure this is okay. `baremetalhost.Annotations[requestPowerOffAnnotation] = ""` and also: `_, exists = baremetalhost.Annotations[requestPowerOffAnnotation]`. So I would feel much more comfortable having the value set in this line |
Thanks @n1r1 - made the changes, you're absolutely right here. |
|
thanks @rdoxenham should be easy by adding a simple check here or inside the previous |
Thanks - I put it into the previous |
ah sorry, I was pointing to the wrong line 🤦 You're right, I meant to add the check where we actually add the So maybe adding inside this if or after it, something like:
    if host.Annotations[requestPowerOffAnnotation] != `{"mode":"hard"}` {
        t.Log("...")
        t.Fail()
    } |
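For reference, a minimal sketch of that check as a standalone helper, so the missing-annotation case and the wrong-value case are reported separately. The helper name `checkHardRebootAnnotation` is hypothetical, and the value given to `requestPowerOffAnnotation` is an assumption; the real constant is defined in the CAPBM code and is not shown in this thread.

```go
package main

import "fmt"

// Assumed value for illustration only; the real constant lives in CAPBM.
const requestPowerOffAnnotation = "reboot.metal3.io/capbm-requested-power-off"

// expectedHardReboot is the annotation value quoted in the review comment.
const expectedHardReboot = `{"mode":"hard"}`

// checkHardRebootAnnotation is a hypothetical helper. It returns an error
// if the power-off annotation is missing or does not request a hard reboot.
func checkHardRebootAnnotation(annotations map[string]string) error {
	value, exists := annotations[requestPowerOffAnnotation]
	if !exists {
		return fmt.Errorf("annotation %q is missing", requestPowerOffAnnotation)
	}
	if value != expectedHardReboot {
		return fmt.Errorf("annotation %q = %q, want %q",
			requestPowerOffAnnotation, value, expectedHardReboot)
	}
	return nil
}

func main() {
	// Old behaviour: the annotation exists but carries an empty value.
	fmt.Println(checkHardRebootAnnotation(map[string]string{requestPowerOffAnnotation: ""}))
	// New behaviour: the annotation carries the hard-reboot mode.
	fmt.Println(checkHardRebootAnnotation(map[string]string{requestPowerOffAnnotation: expectedHardReboot}))
}
```

Inside a test, the returned error could be passed to `t.Error` instead of calling `t.Log` and `t.Fail` separately.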
Thanks @n1r1, added that test too. |
|
Thanks. |
@n1r1 sure, can do... forgive the ignorance, where's best to define the struct? |
It was already defined in BMO, so you don't need to re-define it here. `bmh "github.com/metal3-io/baremetal-operator/pkg/apis/metal3/v1alpha1"` You'll probably need to revendor, but I think this will have to wait until that PR merges and is backported to openshift/BMO; I'm not sure, though |
Oh sweet, I didn't realise you imported that. Perfect... I'll need to wait until it lands, yes. Thx! |
|
@n1r1 let me know if this meets your expectations. Thanks! |
The default reboot-interface behaviour is to attempt a soft power off and, if this fails, revert to a hard power off (PR openshift#294). For high-availability use cases we require the ability to power off a node immediately. This PR attempts to address that requirement and is part of a wider solution requiring the CAPBM to set the annotation that we have detailed and implemented in this commit. The baseline provisioner API changes have been provided in an earlier commit.
CAPBM PR: openshift/cluster-api-provider-baremetal#138
Also see: https://bugzilla.redhat.com/show_bug.cgi?id=1927678
|
I opened #140 to remove the CRD generation entirely, since I don't think we're using those files at all. |
In this commit we're pulling in the latest version of the BMO dependencies via the vendor module, allowing us to utilise newer functions and structs provided by recent PRs in the latest BMO code. This updates to v0.0.0-20210303141721-86a42dcb0150.
c521f9b to adcd9e0
|
/test e2e-metal-ipi |
|
lgtm, one small nit re the test adjustments, I think one of those landed in the hard-reboot commit, which was initially confusing when reviewing each commit individually. Not blocking on it, but may be worth fixing if you need to rebase or address any other comments. /lgtm |
This change adds an additional mode to the reboot annotation that forces all nodes sent for remediation, e.g. via a MachineHealthCheck, to be forcefully rebooted rather than defaulting to a soft reboot first, as is the case today. The primary driver behind this change is to enable quicker recovery of workloads, e.g. for high-availability use cases, and by defaulting to a forced hard reboot we can enable functionality very close to fencing. This change shouldn't impact any other non-remediation reboot requests, as the hard reboot functionality only takes place when the mode=hard annotation is applied to the node.
All of the work on the BMO can be found in the link below. Whilst we depend on that PR to have a complete solution, we don't have a hard dependency on the two merging together.
BMO PR: metal3-io/baremetal-operator#795
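To illustrate the behaviour described above, here is a rough sketch of how a consumer of the annotation might pick the reboot mode: an empty or unparseable value keeps the existing soft-first default, and only an explicit `{"mode":"hard"}` value selects a hard reboot. This is not the BMO implementation (see the linked BMO PR for that). The types below are local stand-ins assumed to mirror `bmh.RebootAnnotationArguments` and its modes, and `rebootModeFromAnnotation` is a hypothetical name.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Local stand-ins, assumed to mirror the BMO v1alpha1 types.
type RebootMode string

const (
	RebootModeHard RebootMode = "hard"
	RebootModeSoft RebootMode = "soft" // assumed name for the default mode
)

type RebootAnnotationArguments struct {
	Mode RebootMode `json:"mode"`
}

// rebootModeFromAnnotation is a hypothetical helper. It returns the hard
// mode only when the annotation value explicitly asks for it; anything
// else falls back to the soft-first default.
func rebootModeFromAnnotation(value string) RebootMode {
	if value == "" {
		return RebootModeSoft
	}
	var args RebootAnnotationArguments
	if err := json.Unmarshal([]byte(value), &args); err != nil {
		return RebootModeSoft
	}
	if args.Mode == RebootModeHard {
		return RebootModeHard
	}
	return RebootModeSoft
}

func main() {
	for _, v := range []string{"", `{"mode":"hard"}`, "not json"} {
		fmt.Printf("%q -> %s\n", v, rebootModeFromAnnotation(v))
	}
}
```

Falling back to the soft mode on malformed input keeps non-remediation reboot requests on their current path, which matches the intent stated in the description.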
|
/test e2e-metal-ipi-upgrade |
|
/lgtm |
|
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: hardys, n1r1, rdoxenham |
|
/cherry-pick release-4.7 |
|
@rdoxenham: new pull request created: #143 |
baremetalhost.Annotations[requestPowerOffAnnotation] = ""
// Issue a hard reboot for immediate remediation purposes
remediationRebootAnnotation := bmh.RebootAnnotationArguments{Mode: bmh.RebootModeHard}
remediationString, err := json.Marshal(remediationRebootAnnotation)
We're ignoring any error in marshalling here... it seems like we should at least log it.
@zaneb I don't disagree, but I figured it would be safe given that it's our own struct that we're marshalling, not data that we're getting from a client/user.
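A possible way to address the point about the ignored error, sketched under assumptions: the marshalling is wrapped in a small helper that returns the error, and the caller logs it and falls back to the empty annotation value (the previous default behaviour). The types are local stand-ins assumed to mirror the BMO ones, the annotation key value is assumed, and `hardRebootAnnotationValue` is a hypothetical name; this is not the code that was merged.

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
)

// Assumed value for illustration only; the real constant lives in CAPBM.
const requestPowerOffAnnotation = "reboot.metal3.io/capbm-requested-power-off"

// Local stand-ins, assumed to mirror the BMO v1alpha1 types.
type RebootMode string

const RebootModeHard RebootMode = "hard"

type RebootAnnotationArguments struct {
	Mode RebootMode `json:"mode"`
}

// hardRebootAnnotationValue is a hypothetical helper that builds the JSON
// annotation value and returns the marshalling error instead of dropping it.
func hardRebootAnnotationValue() (string, error) {
	b, err := json.Marshal(RebootAnnotationArguments{Mode: RebootModeHard})
	if err != nil {
		return "", fmt.Errorf("marshalling reboot annotation arguments: %w", err)
	}
	return string(b), nil
}

func main() {
	annotations := map[string]string{}

	value, err := hardRebootAnnotationValue()
	if err != nil {
		// Log and fall back to the empty value, i.e. the default reboot behaviour.
		log.Printf("unable to request hard reboot, using default mode: %v", err)
	}
	annotations[requestPowerOffAnnotation] = value

	fmt.Println(annotations[requestPowerOffAnnotation])
}
```

Marshalling a fixed struct like this is very unlikely to fail, as noted in the reply, so the log line mainly guards against future changes to the struct.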
case bmh.StateRegistering:
// This case will no longer need to be handled once the changes proposed
// in https://github.com/metal3-io/baremetal-operator/pull/388 are
// available in the baremetal-operator.
Why delete this comment? Without it we will be carrying this code forever.
@zaneb perhaps a misunderstanding on my part, but as we revendored with a new update of the BMO code, these states no longer existed, so it seemed pointless to keep the comment given that metal3-io/baremetal-operator#388 had landed in that code. I can follow up with another PR to reinstate the comment if you feel it necessary.
@rdoxenham The comment applied to all 3 states, including StateRegistering. So the code below is now redundant but there's no reminder to us that it can be deleted.