13 changes: 13 additions & 0 deletions modules/machine-health-checks-about.adoc
@@ -33,6 +33,19 @@ To limit the disruptive impact of machine deletions, the controller drains and d

To stop the check, remove the custom resource.

[id="machine-health-checks-bare-metal_{context}"]
== Machine health checks on bare metal

Machine deletion on a bare metal cluster triggers reprovisioning of the bare metal host.
Bare metal reprovisioning is usually a lengthy process, during which the cluster
is missing compute resources and applications might be interrupted.
To change the default remediation process from machine deletion to a host power cycle,
annotate the `MachineHealthCheck` resource with the
`machine.openshift.io/remediation-strategy: external-baremetal` annotation.

After you set the annotation, unhealthy machines are power-cycled by using
baseboard management controller (BMC) credentials.
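
As a minimal sketch, the annotation belongs under `metadata.annotations` of the `MachineHealthCheck` resource. The name `example` is a placeholder, and the `spec` stanza is omitted here for brevity:

[source,yaml]
----
apiVersion: machine.openshift.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: example # placeholder name
  namespace: openshift-machine-api
  annotations:
    # Switches remediation from machine deletion to a host power cycle.
    machine.openshift.io/remediation-strategy: external-baremetal
----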

[id="machine-health-checks-limitations_{context}"]
== Limitations when deploying machine health checks

44 changes: 42 additions & 2 deletions modules/machine-health-checks-resource.adoc
@@ -7,9 +7,49 @@
[id="machine-health-checks-resource_{context}"]
= Sample `MachineHealthCheck` resource

The `MachineHealthCheck` resource resembles the following YAML file:
The `MachineHealthCheck` resource resembles one of the following YAML files:

.`MachineHealthCheck`
.`MachineHealthCheck` for bare metal
[source,yaml]
----
apiVersion: machine.openshift.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: example <1>
  namespace: openshift-machine-api
  annotations:
    machine.openshift.io/remediation-strategy: external-baremetal <2>
spec:
  selector:
    matchLabels:
      machine.openshift.io/cluster-api-machine-role: <role> <3>
      machine.openshift.io/cluster-api-machine-type: <role> <3>
      machine.openshift.io/cluster-api-machineset: <cluster_name>-<label>-<zone> <4>
  unhealthyConditions:
  - type: "Ready"
    timeout: "300s" <5>
    status: "False"
  - type: "Ready"
    timeout: "300s" <5>
    status: "Unknown"
  maxUnhealthy: "40%" <6>
  nodeStartupTimeout: "10m" <7>
----

<1> Specify the name of the machine health check to deploy.
<2> For bare metal clusters, you must include the `machine.openshift.io/remediation-strategy: external-baremetal` annotation in the `annotations` section to enable power-cycle remediation. With this remediation strategy, unhealthy hosts are rebooted instead of removed from the cluster.
<3> Specify a label for the machine pool that you want to check.
<4> Specify the machine set to track in `<cluster_name>-<label>-<zone>` format. For example, `prod-node-us-east-1a`.
<5> Specify the timeout duration for a node condition. If a condition is met for the duration of the timeout, the machine is remediated. Long timeouts can result in long periods of downtime for a workload on an unhealthy machine.
<6> Specify the number of unhealthy machines allowed in the targeted pool. You can set this value as a percentage or an integer.
<7> Specify the timeout duration that a machine health check must wait for a node to join the cluster before a machine is determined to be unhealthy.

[NOTE]
====
The `matchLabels` are examples only; you must map your machine groups based on your specific needs.
====

.`MachineHealthCheck` for all other installation types
[source,yaml]
----
apiVersion: machine.openshift.io/v1beta1