Changing the default behaviour of the CAPBM to request hard reboot by rdoxenham · Pull Request #138 · openshift/cluster-api-provider-baremetal

rdoxenham · 2021-02-15T22:24:00Z

This change adds an additional mode to the reboot annotation that
forces all nodes sent for remediation, e.g. via a MachineHealthCheck,
to be forcefully rebooted rather than defaulting to a soft reboot
first, as it is today. The primary driver behind this change is to
enable quicker recovery of workloads, e.g. for high-availability
use cases, and by defaulting to forced hard reboot we can enable
functionality very close to fencing. This change shouldn't impact
any other non-remediation reboot requests, as the hard reboot
functionality only takes place when the mode=hard annotation is
applied to the node.
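
A minimal sketch of how such an annotation value could be produced. The types below are local stand-ins that mirror the `RebootAnnotationArguments` and `RebootModeHard` names used later in this PR; the JSON field tag and the helper name `remediationAnnotationValue` are assumptions for illustration, not the BMO code itself.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// RebootMode and RebootAnnotationArguments are local stand-ins for the BMO
// types referenced in this PR; the `json:"mode"` tag is an assumption.
type RebootMode string

const (
	RebootModeHard RebootMode = "hard"
	RebootModeSoft RebootMode = "soft"
)

type RebootAnnotationArguments struct {
	Mode RebootMode `json:"mode"`
}

// remediationAnnotationValue (hypothetical helper) builds the JSON payload
// that would be stored as the value of the reboot annotation.
func remediationAnnotationValue(mode RebootMode) (string, error) {
	b, err := json.Marshal(RebootAnnotationArguments{Mode: mode})
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	v, err := remediationAnnotationValue(RebootModeHard)
	if err != nil {
		panic(err)
	}
	fmt.Println(v) // prints {"mode":"hard"}
}
```

Only remediation requests would pass the hard mode; other reboot requests would leave the value untouched.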

All of the work on the BMO can be found in the link below. Whilst
we depend on this PR to have a complete solution, we don't have a
hard dependency on them merging together.

BMO PR: metal3-io/baremetal-operator#795

n1r1 · 2021-02-16T07:41:08Z

/lgtm

n1r1 · 2021-02-16T07:47:37Z

Hmmm actually I'm not sure this is okay.
I know you've tested it and it worked but I still find it weird that the value is set as part of the annotation key.
Even if it works it could be misleading. For example we have the following line:

baremetalhost.Annotations[requestPowerOffAnnotation] = ""

and also:

_, exists = baremetalhost.Annotations[requestPowerOffAnnotation]

So I would feel much more comfortable to have the value set in this line
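
To illustrate the concern, here is a small sketch in which the annotation key stays fixed and the mode travels in the value, so key-only existence checks keep working whatever the value is. The key string is a placeholder and both helper names are hypothetical; neither is taken from the actuator.

```go
package main

import "fmt"

// requestPowerOffAnnotation is a placeholder key for illustration only;
// it is not the real constant value from the actuator.
const requestPowerOffAnnotation = "example.invalid/request-power-off"

// requestPowerOff (hypothetical) stores the reboot mode as the annotation
// value and leaves the key unchanged.
func requestPowerOff(annotations map[string]string, value string) map[string]string {
	if annotations == nil {
		annotations = map[string]string{}
	}
	annotations[requestPowerOffAnnotation] = value
	return annotations
}

// hasPowerOffRequest (hypothetical) looks only at the key, so it behaves
// the same for an empty value and for a JSON payload.
func hasPowerOffRequest(annotations map[string]string) bool {
	_, exists := annotations[requestPowerOffAnnotation]
	return exists
}

func main() {
	a := requestPowerOff(nil, `{"mode":"hard"}`)
	fmt.Println(hasPowerOffRequest(a), a[requestPowerOffAnnotation])
}
```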

rdoxenham · 2021-02-16T09:41:42Z

Thanks @n1r1 - made the changes, you're absolutely right here.

n1r1 · 2021-02-16T10:23:23Z

thanks @rdoxenham
WDYT about adding this to the tests as well?

should be easy by adding a simple check here or inside the previous if statement

rdoxenham · 2021-02-16T12:33:03Z

Thanks - I put it into the previous if statement. In reality, it assigns the poweredOffForRemediation prior to requestPowerOffAnnotation but I don't think it matters in this instance. Let me know.

pkg/cloud/baremetal/actuators/machine/actuator.go

n1r1 · 2021-02-16T19:31:01Z

ah sorry, I was pointing to the wrong line 🤦

You're right, I meant to add the check where we actually add the requestPowerOffAnnotation annotation.
I think it's worth checking that the reboot mode was set correctly, so that if someone changes this in the future, the test will fail.

So maybe adding inside this if or after it, something like:

if host.Annotations[requestPowerOffAnnotation] != `{"mode":"hard"}` {
     t.Log("...")
     t.Fail()
}

rdoxenham · 2021-02-17T14:29:48Z

Thanks @n1r1, added that test too.

n1r1 · 2021-02-17T18:40:57Z

Thanks.
Since you moved from string to a struct in BMO PR, I guess it makes sense to use it here as well instead of {"mode":"hard"}

rdoxenham · 2021-02-17T19:10:03Z

@n1r1 sure, can do... forgive the ignorance, where's best to define the struct?

n1r1 · 2021-02-17T19:57:27Z

It was already defined in BMO, so you don't need to re-define it here.
The actuator here already imports the relevant file:

bmh "github.com/metal3-io/baremetal-operator/pkg/apis/metal3/v1alpha1"

You'll probably need to revendor, but I think this will have to wait until that PR merges and is backported to openshift/BMO; I'm not sure, though.
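
One way the test check could avoid the hard-coded `{"mode":"hard"}` string is to decode the annotation back into the struct and compare the mode. This is only a sketch: the types are local stand-ins for the vendored `bmh` ones, and the key string and the `rebootModeOf` helper are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Local stand-ins for the bmh types named in this PR; the JSON tag is assumed.
type RebootMode string

const RebootModeHard RebootMode = "hard"

type RebootAnnotationArguments struct {
	Mode RebootMode `json:"mode"`
}

// Placeholder key for illustration only.
const requestPowerOffAnnotation = "example.invalid/request-power-off"

// rebootModeOf (hypothetical) decodes the annotation value and returns the
// requested mode, or an error if the annotation is missing or malformed.
func rebootModeOf(annotations map[string]string) (RebootMode, error) {
	raw, ok := annotations[requestPowerOffAnnotation]
	if !ok {
		return "", fmt.Errorf("annotation %q not set", requestPowerOffAnnotation)
	}
	var args RebootAnnotationArguments
	if err := json.Unmarshal([]byte(raw), &args); err != nil {
		return "", fmt.Errorf("decoding annotation value: %w", err)
	}
	return args.Mode, nil
}

func main() {
	annotations := map[string]string{requestPowerOffAnnotation: `{"mode":"hard"}`}
	mode, err := rebootModeOf(annotations)
	if err != nil {
		panic(err)
	}
	fmt.Println(mode == RebootModeHard) // prints true
}
```

In a test, the comparison against `RebootModeHard` would replace the string literal, so a change to the serialised form would not silently break the assertion.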

rdoxenham · 2021-02-17T20:29:28Z

Oh sweet, I didn't realise you imported that. Perfect... I'll need to wait until it lands, yes. Thx!

rdoxenham · 2021-03-01T17:01:28Z

@n1r1 let me know if this meets your expectations. Thanks!

pkg/cloud/baremetal/actuators/machine/actuator.go

pkg/cloud/baremetal/actuators/machine/actuator_test.go

vendor/github.com/metal3-io/baremetal-operator/pkg/apis/metal3/v1alpha1/baremetalhost_types.go

The default reboot-interface behaviour is to attempt a soft power off, and if this fails, revert to a hard power off (PR openshift#294). For high availability use-cases we require the ability to immediately power-off a node. This PR attempts to address that requirement and is part of a wider solution requiring the CAPBM to set the annotation that we have detailed and implemented in this commit. The baseline provisioner API changes have been provided in an earlier commit. CAPBM PR: openshift/cluster-api-provider-baremetal#138 Also see: https://bugzilla.redhat.com/show_bug.cgi?id=1927678
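
As a rough illustration of the behaviour this commit message describes, the sketch below picks a reboot mode from an annotation value and falls back to the soft-first default when the value is empty or unreadable. It is not the BMO implementation: the types are local stand-ins, and `rebootModeFromValue` and its fallback rules are assumptions.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Local stand-ins for the BMO types named in this PR; the JSON tag is assumed.
type RebootMode string

const (
	RebootModeHard RebootMode = "hard"
	RebootModeSoft RebootMode = "soft"
)

type RebootAnnotationArguments struct {
	Mode RebootMode `json:"mode"`
}

// rebootModeFromValue (hypothetical) returns hard only when the value
// explicitly asks for it; anything else keeps the soft-first default.
func rebootModeFromValue(value string) RebootMode {
	if value == "" {
		return RebootModeSoft
	}
	var args RebootAnnotationArguments
	if err := json.Unmarshal([]byte(value), &args); err != nil {
		return RebootModeSoft
	}
	if args.Mode == RebootModeHard {
		return RebootModeHard
	}
	return RebootModeSoft
}

func main() {
	for _, v := range []string{"", `{"mode":"hard"}`, `{"mode":"soft"}`, "not-json"} {
		fmt.Printf("%q -> %s\n", v, rebootModeFromValue(v))
	}
}
```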

dhellmann · 2021-03-03T21:44:04Z

I opened #140 to remove the CRD generation entirely, since I don't think we're using those files at all.

In this commit we're pulling in the latest version of the BMO dependencies via the vendor module, allowing us to utilise newer functions and structs provided by recent PRs in the latest BMO code. This updates to v0.0.0-20210303141721-86a42dcb0150.

rdoxenham · 2021-03-04T22:31:44Z

/test e2e-metal-ipi
/test e2e-metal-ipi-ovn-ipv6

pkg/cloud/baremetal/actuators/machine/actuator_test.go

hardys · 2021-03-08T15:55:42Z

lgtm, one small nit re the test adjustments, I think one of those landed in the hard-reboot commit, which was initially confusing when reviewing each commit individually. Not blocking on it, but may be worth fixing if you need to rebase or address any other comments.

/lgtm

rdoxenham · 2021-03-08T18:32:44Z

/test e2e-metal-ipi-upgrade

hardys · 2021-03-09T10:01:36Z

/lgtm

openshift-ci-robot · 2021-03-09T10:02:01Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: hardys, n1r1, rdoxenham

Needs approval from an approver in each of these files:

~~OWNERS~~ [hardys]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

rdoxenham · 2021-03-09T10:10:59Z

/cherry-pick release-4.7

openshift-cherrypick-robot · 2021-03-09T10:11:40Z

@rdoxenham: new pull request created: #143

zaneb · 2021-03-15T15:12:38Z

pkg/cloud/baremetal/actuators/machine/actuator.go

-	baremetalhost.Annotations[requestPowerOffAnnotation] = ""
+	// Issue a hard reboot for immediate remediation purposes
+	remediationRebootAnnotation := bmh.RebootAnnotationArguments{Mode: bmh.RebootModeHard}
+	remediationString, err := json.Marshal(remediationRebootAnnotation)


We're ignoring any error in marshalling here... it seems like we should at least log it.

@zaneb I don't disagree, but I figured it would be safe given that it's our own struct that we're marshalling, not data that we're getting from a client/user.
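
A sketch of what handling the marshalling error could look like, rather than discarding it. The types are local stand-ins for the `bmh` ones in the diff above, the key string is a placeholder, and `setRemediationAnnotation` is a hypothetical helper; the actuator's real logging may differ.

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
)

// Local stand-ins for the bmh types used in the diff; the JSON tag is assumed.
type RebootMode string

const RebootModeHard RebootMode = "hard"

type RebootAnnotationArguments struct {
	Mode RebootMode `json:"mode"`
}

// Placeholder key for illustration only.
const requestPowerOffAnnotation = "example.invalid/request-power-off"

// setRemediationAnnotation (hypothetical) marshals the hard-reboot arguments,
// logs and returns any marshalling error, and otherwise sets the annotation.
func setRemediationAnnotation(annotations map[string]string) (map[string]string, error) {
	value, err := json.Marshal(RebootAnnotationArguments{Mode: RebootModeHard})
	if err != nil {
		log.Printf("failed to marshal reboot annotation arguments: %v", err)
		return annotations, err
	}
	if annotations == nil {
		annotations = map[string]string{}
	}
	annotations[requestPowerOffAnnotation] = string(value)
	return annotations, nil
}

func main() {
	a, err := setRemediationAnnotation(nil)
	if err != nil {
		panic(err)
	}
	fmt.Println(a[requestPowerOffAnnotation])
}
```

For a struct with a single string field the error path should not trigger in practice, which matches the author's reasoning, but the check keeps the code safe if the struct grows later.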

zaneb · 2021-03-15T15:14:02Z

pkg/cloud/baremetal/actuators/machine/actuator.go

 	case bmh.StateRegistering:
-		// This case will no longer need to be handled once the changes proposed
-		// in https://github.com/metal3-io/baremetal-operator/pull/388 are
-		// available in the baremetal-operator.


Why delete this comment? Without it we will be carrying this code forever.

@zaneb perhaps a misunderstanding on my behalf, but as we revendored with a new update of the BMO code, these states no longer existed, and therefore seemed pointless to keep the comment given that in that code metal3-io/baremetal-operator#388 had landed. I can follow up with another PR to reinstate the comment if you feel it necessary.

@rdoxenham The comment applied to all 3 states, including StateRegistering. So the code below is now redundant but there's no reminder to us that it can be deleted.

openshift-ci-robot requested review from dhellmann and russellb February 15, 2021 22:24

openshift-ci-robot assigned n1r1 Feb 16, 2021

openshift-ci-robot added the lgtm label Feb 16, 2021

openshift-ci-robot removed the lgtm label Feb 16, 2021

rdoxenham force-pushed the master branch from f440597 to 6cc0479 February 16, 2021 09:39

rdoxenham force-pushed the master branch from 7d89dcd to 96aae2c February 16, 2021 12:32

dhellmann reviewed Feb 16, 2021

pkg/cloud/baremetal/actuators/machine/actuator.go (outdated)

pkg/cloud/baremetal/actuators/machine/actuator.go (outdated)

rdoxenham force-pushed the master branch from 66ea757 to 319a1cd February 16, 2021 15:31

rdoxenham force-pushed the master branch from c4a9c54 to f27c00d February 17, 2021 14:29

rdoxenham force-pushed the master branch from e2df9b7 to 4b5d670 March 1, 2021 15:15

rdoxenham force-pushed the master branch from 0b4e660 to 411cd2f March 1, 2021 17:03

dhellmann reviewed Mar 1, 2021

pkg/cloud/baremetal/actuators/machine/actuator.go (outdated)

pkg/cloud/baremetal/actuators/machine/actuator_test.go (outdated)

pkg/cloud/baremetal/actuators/machine/actuator_test.go (outdated)

rdoxenham force-pushed the master branch from 448a89b to c3cba6e March 1, 2021 21:17

n1r1 reviewed Mar 2, 2021

vendor/github.com/metal3-io/baremetal-operator/pkg/apis/metal3/v1alpha1/baremetalhost_types.go (outdated)

openshift-ci-robot mentioned this pull request Mar 3, 2021

Bug 1927678: Backporting BMO extensions to support different reboot modes openshift/baremetal-operator#128

Merged

rdoxenham force-pushed the master branch from ccb7511 to 825c34a March 3, 2021 15:48

rdoxenham force-pushed the master branch 3 times, most recently from c521f9b to adcd9e0 March 4, 2021 19:52

rdoxenham requested review from dhellmann and n1r1 March 8, 2021 14:28

hardys reviewed Mar 8, 2021

pkg/cloud/baremetal/actuators/machine/actuator_test.go

openshift-ci-robot assigned hardys Mar 8, 2021

openshift-ci-robot added the lgtm and approved labels Mar 8, 2021

rdoxenham force-pushed the master branch from 4ba9d76 to e14bea6 March 8, 2021 16:05

openshift-ci-robot removed the lgtm label Mar 8, 2021

openshift-ci-robot added the lgtm label Mar 9, 2021

openshift-merge-robot merged commit 8116e7f into openshift:master Mar 9, 2021

openshift-cherrypick-robot mentioned this pull request Mar 9, 2021

[release-4.7] Changing the default behaviour of the CAPBM to request hard reboot #143

Closed

rdoxenham mentioned this pull request Mar 9, 2021

Bug 1936844: [release-4.7] Changing the default behaviour of the CAPBM to request hard reboot #144

Merged

zaneb reviewed Mar 15, 2021

honza pushed a commit to honza/cluster-api-provider-baremetal that referenced this pull request Feb 7, 2022

Merge pull request openshift#138 from Nordix/fix/configmap

c5a73f9

🐛 Add missing namespace ironic configmap

Review comment authors, in order: zaneb Mar 15, 2021; rdoxenham Mar 15, 2021; zaneb Mar 15, 2021; rdoxenham Mar 15, 2021; zaneb Apr 5, 2021

8 participants