Skip to content

Changing the default behaviour of the CAPBM to request hard reboot#138

Merged
openshift-merge-robot merged 2 commits intoopenshift:masterfrom
rdoxenham:master
Mar 9, 2021
Merged

Changing the default behaviour of the CAPBM to request hard reboot#138
openshift-merge-robot merged 2 commits intoopenshift:masterfrom
rdoxenham:master

Conversation

@rdoxenham
Copy link

This change adds an additional mode to the reboot annotation that
forces all nodes sent for remediation, e.g. via a MachineHealthCheck,
to be forcefully rebooted rather than defaulting to a soft reboot
first, as it is today. The primary drive behind this change is to
enable quicker recovery of workloads, e.g. for high-availability
use cases, and by defaulting to forced hard reboot we can enable
functionality very close to fencing. This change shouldn't impact
any other non-remediation reboot requests, as the hard reboot
functionality only takes place when the mode=hard annotation is
applied to the node.

All of the work on the BMO can be found in the link below. Whilst
we depend on this PR to have a complete solution, we don't have a
hard dependency on them merging together.

BMO PR: metal3-io/baremetal-operator#795

@n1r1
Copy link

n1r1 commented Feb 16, 2021

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Feb 16, 2021
@n1r1
Copy link

n1r1 commented Feb 16, 2021

Hmmm actually I'm not sure this is okay.
I know you've tested it and it worked but I still find it weird that the value is set as part of the annotation key.
Even if it works it could be misleading. For example we have the following line:

baremetalhost.Annotations[requestPowerOffAnnotation] = ""

and also:

_, exists = baremetalhost.Annotations[requestPowerOffAnnotation]

So I would feel much more comfortable to have the value set in this line

@openshift-ci-robot openshift-ci-robot removed the lgtm Indicates that a PR is ready to be merged. label Feb 16, 2021
@rdoxenham
Copy link
Author

Hmmm actually I'm not sure this is okay.
I know you've tested it and it worked but I still find it weird that the value is set as part of the annotation key.
Even if it works it could be misleading. For example we have the following line:

baremetalhost.Annotations[requestPowerOffAnnotation] = ""

and also:

_, exists = baremetalhost.Annotations[requestPowerOffAnnotation]

So I would feel much more comfortable to have the value set in this line

Thanks @n1r1 - made the changes, you're absolutely right here.

@n1r1
Copy link

n1r1 commented Feb 16, 2021

thanks @rdoxenham
WDYT about adding this to the tests as well?

should be easy by adding a simple check here or inside the previous if statement

@rdoxenham
Copy link
Author

thanks @rdoxenham
WDYT about adding this to the tests as well?

should be easy by adding a simple check here or inside the previous if statement

Thanks - I put it into the previous if statement. In reality, it assigns the poweredOffForRemediation prior to requestPowerOffAnnotation but I don't think it matters in this instance. Let me know.

@n1r1
Copy link

n1r1 commented Feb 16, 2021

thanks @rdoxenham
WDYT about adding this to the tests as well?
should be easy by adding a simple check here or inside the previous if statement

Thanks - I put it into the previous if statement. In reality, it assigns the poweredOffForRemediation prior to requestPowerOffAnnotation but I don't think it matters in this instance. Let me know.

ah sorry, I was pointing to the wrong line 🤦

You're right, I meant to add the check where we actually add the requestPowerOffAnnotation annotation.
I think it worth checking that the reboot mode was set correctly, so if someone changes this in the future, test will fail.

So maybe adding inside this if or after it, something like:

if host.Annotations[requestPowerOffAnnotation] != `{"mode":"hard"}` {
     t.Log("...")
     t.Fail()
}

@rdoxenham
Copy link
Author

thanks @rdoxenham
WDYT about adding this to the tests as well?
should be easy by adding a simple check here or inside the previous if statement

Thanks - I put it into the previous if statement. In reality, it assigns the poweredOffForRemediation prior to requestPowerOffAnnotation but I don't think it matters in this instance. Let me know.

ah sorry, I was pointing to the wrong line

You're right, I meant to add the check where we actually add the requestPowerOffAnnotation annotation.
I think it worth checking that the reboot mode was set correctly, so if someone changes this in the future, test will fail.

So maybe adding inside this if or after it, something like:

if host.Annotations[requestPowerOffAnnotation] != `{"mode":"hard"}` {
     t.Log("...")
     t.Fail()
}

Thanks @n1r1, added that test too.

@n1r1
Copy link

n1r1 commented Feb 17, 2021

Thanks.
Since you moved from string to a struct in BMO PR, I guess it makes sense to use it here as well instead of {"mode":"hard"}

@rdoxenham
Copy link
Author

Thanks.
Since you moved from string to a struct in BMO PR, I guess it makes sense to use it here as well instead of {"mode":"hard"}

@n1r1 sure, can do... forgive the ignorance, where's best to define the struct?

@n1r1
Copy link

n1r1 commented Feb 17, 2021

Thanks.
Since you moved from string to a struct in BMO PR, I guess it makes sense to use it here as well instead of {"mode":"hard"}

@n1r1 sure, can do... forgive the ignorance, where's best to define the struct?

It was already defined in BMO, so you don't need to re-define it here.
The actuator here is already imports the relevant file:

bmh "github.com/metal3-io/baremetal-operator/pkg/apis/metal3/v1alpha1"

You'll probably need to revendor, but I think this will have to wait until that PR merges and backported to openshift/BMO, but I'm not sure

@rdoxenham
Copy link
Author

Thanks.
Since you moved from string to a struct in BMO PR, I guess it makes sense to use it here as well instead of {"mode":"hard"}

@n1r1 sure, can do... forgive the ignorance, where's best to define the struct?

It was already defined in BMO, so you don't need to re-define it here.
The actuator here is already imports the relevant file:

bmh "github.com/metal3-io/baremetal-operator/pkg/apis/metal3/v1alpha1"

You'll probably need to revendor, but I think this will have to wait until that PR merges and backported to openshift/BMO, but I'm not sure

Oh sweet, I didn't realise you imported that. Perfect... I'll need to wait until it lands, yes. Thx!

@rdoxenham
Copy link
Author

@n1r1 let me know if this meets your expectations. Thanks!

rdoxenham added a commit to rdoxenham/baremetal-operator that referenced this pull request Mar 2, 2021
The default reboot-interface behaviour is to attempt a soft power
off, and if this fails, revert to a hard power off (PR openshift#294). For high
availability use-cases we require the ability to immediately power-off
a node. This PR attempts to address that requirement and is part of a
wider solution requiring the CAPBM to set the annotation that we have
detailed and implemented in this commit. The baseline provisioner API
changes have been provided in an earlier commit.

CAPBM PR: openshift/cluster-api-provider-baremetal#138
Also see: https://bugzilla.redhat.com/show_bug.cgi?id=1927678
@dhellmann
Copy link

I opened #140 to remove the CRD generation entirely, since I don't think we're using those files at all.

In this commit we're pulling in the latest version of the BMO
dependencies via the vendor module, allowing us to utilise newer
functions and structs provided by recent PR's in the latest BMO
code. This updates to v0.0.0-20210303141721-86a42dcb0150.
@rdoxenham rdoxenham force-pushed the master branch 3 times, most recently from c521f9b to adcd9e0 Compare March 4, 2021 19:52
@rdoxenham
Copy link
Author

/test e2e-metal-ipi
/test e2e-metal-ipi-ovn-ipv6

rdoxenham added a commit to rdoxenham/baremetal-operator that referenced this pull request Mar 8, 2021
The default reboot-interface behaviour is to attempt a soft power
off, and if this fails, revert to a hard power off (PR openshift#294). For high
availability use-cases we require the ability to immediately power-off
a node. This PR attempts to address that requirement and is part of a
wider solution requiring the CAPBM to set the annotation that we have
detailed and implemented in this commit. The baseline provisioner API
changes have been provided in an earlier commit.

CAPBM PR: openshift/cluster-api-provider-baremetal#138
Also see: https://bugzilla.redhat.com/show_bug.cgi?id=1927678
@rdoxenham rdoxenham requested review from dhellmann and n1r1 March 8, 2021 14:28
@hardys
Copy link

hardys commented Mar 8, 2021

lgtm, one small nit re the test adjustments, I think one of those landed in the hard-reboot commit, which was initially confusing when reviewing each commit individually. Not blocking on it, but may be worth fixing if you need to rebase or address any other comments.

/lgtm

@openshift-ci-robot openshift-ci-robot added lgtm Indicates that a PR is ready to be merged. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Mar 8, 2021
This change adds an additional mode to the reboot annotation that
forces all nodes sent for remediation, e.g. via a MachineHealthCheck,
to be forcefully rebooted rather than defaulting to a soft reboot
first, as it is today. The primary drive behind this change is to
enable quicker recovery of workloads, e.g. for high-availability
use cases, and by defaulting to forced hard reboot we can enable
functionality very close to fencing. This change shouldn't impact
any other non-remediation reboot requests, as the hard reboot
functionality only takes place when the mode=hard annotation is
applied to the node.

All of the work on the BMO can be found in the link below. Whilst
we depend on this PR to have a complete solution, we don't have a
hard dependency on them merging together.

BMO PR: metal3-io/baremetal-operator#795
@openshift-ci-robot openshift-ci-robot removed the lgtm Indicates that a PR is ready to be merged. label Mar 8, 2021
@rdoxenham
Copy link
Author

/test e2e-metal-ipi-upgrade

@hardys
Copy link

hardys commented Mar 9, 2021

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Mar 9, 2021
@openshift-ci-robot
Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: hardys, n1r1, rdoxenham

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-robot openshift-merge-robot merged commit 8116e7f into openshift:master Mar 9, 2021
@rdoxenham
Copy link
Author

/cherry-pick release-4.7

@openshift-cherrypick-robot

@rdoxenham: new pull request created: #143

Details

In response to this:

/cherry-pick release-4.7

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

rdoxenham added a commit to rdoxenham/baremetal-operator that referenced this pull request Mar 9, 2021
The default reboot-interface behaviour is to attempt a soft power
off, and if this fails, revert to a hard power off (PR openshift#294). For high
availability use-cases we require the ability to immediately power-off
a node. This PR attempts to address that requirement and is part of a
wider solution requiring the CAPBM to set the annotation that we have
detailed and implemented in this commit. The baseline provisioner API
changes have been provided in an earlier commit.

CAPBM PR: openshift/cluster-api-provider-baremetal#138
Also see: https://bugzilla.redhat.com/show_bug.cgi?id=1927678
rdoxenham added a commit to rdoxenham/baremetal-operator that referenced this pull request Mar 9, 2021
The default reboot-interface behaviour is to attempt a soft power
off, and if this fails, revert to a hard power off (PR openshift#294). For high
availability use-cases we require the ability to immediately power-off
a node. This PR attempts to address that requirement and is part of a
wider solution requiring the CAPBM to set the annotation that we have
detailed and implemented in this commit. The baseline provisioner API
changes have been provided in an earlier commit.

CAPBM PR: openshift/cluster-api-provider-baremetal#138
Also see: https://bugzilla.redhat.com/show_bug.cgi?id=1927678
baremetalhost.Annotations[requestPowerOffAnnotation] = ""
// Issue a hard reboot for immediate remediation purposes
remediationRebootAnnotation := bmh.RebootAnnotationArguments{Mode: bmh.RebootModeHard}
remediationString, err := json.Marshal(remediationRebootAnnotation)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're ignoring any error in marshalling here... it seems like we should at least log it.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zaneb I don't disagree, but I figured it would be safe given that it's our own struct that we're marshalling, not data that we're getting from a client/user.

case bmh.StateRegistering:
// This case will no longer need to be handled once the changes proposed
// in https://github.com/metal3-io/baremetal-operator/pull/388 are
// available in the baremetal-operator.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why delete this comment? Without it we will be carrying this code forever.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zaneb perhaps a misunderstanding on my behalf, but as we revendored with a new update of the BMO code, these states no longer existed, and therefore seemed pointless to keep the comment given that in that code metal3-io/baremetal-operator#388 had landed. I can follow up with another PR to reinstate the comment if you feel it necessary.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rdoxenham The comment applied to all 3 states, including StateRegistering. So the code below is now redundant but there's no reminder to us that it can be deleted.

honza pushed a commit to honza/cluster-api-provider-baremetal that referenced this pull request Feb 7, 2022
🐛 Add missing namespace ironic configmap
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants