Skip to content
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
30 changes: 25 additions & 5 deletions enhancements/machine-api/machine-health-checking.md
Original file line number Diff line number Diff line change
Expand Up @@ -96,24 +96,37 @@ For a node notFound or a failed machine, the machine is considerable unrecoverab
- The machine controller provider implementation deletes the cloud instance.
- The machine controller deletes the machine resource.

### Pausing

Some cluster operations, e.g. upgrades, result in temporarily unhealthy machines / nodes, which might trigger
unnecessary remediation. To allow cluster admins or new controllers to prevent this from happening without having to
delete and re-create machineHealthCheck objects, we will implement a `pause` feature on the machineHealthCheck resource.
This feature already exists on the upstream machineHealthCheck resource in the form of an annotation, which we want to
backport. Its key is `cluster.x-k8s.io/paused`. While this isn't consistent with existing downstream annotation keys, it
will make future alignment with Cluster API easier. Remediation will be paused as soon as this annotation exists. Its
value isn't checked but is expected to be empty.

### User Stories
- I want my worker machines to be remediated when the backed node has `ready=false` or `ready=Unknown` condition for more than 10m.
- I want remediation to temporary short-circuit if the 40% or more of the targets of this pool are unhealthy at the same time.
- I want to prevent remediation, without deleting the entire MHC configuration, while my cluster is upgrading its machines / nodes.

### Implementation Details

#### MachineHealthCheck CRD
- Enable watching a pool of machines (based on a label selector).
- Enable defining an unhealthy node criteria (based on a list of node conditions).
- Enable setting a threshold of unhealthy nodes. If the current number is at or above this threshold no further remediation will take place. This can be expressed as an int or as a percentage of the total targets in the pool.

E.g:
- I want my worker machines to be remediated when the backed node has `ready=false` or `ready=Unknown` condition for more than 10m.
- I want remediation to temporary short-circuit if the 40% or more of the targets of this pool are unhealthy at the same time.

- Enable pausing of remediation

```yaml
apiVersion: machine.openshift.io/v1beta1
kind: MachineHealthCheck
metadata:
name: example
namespace: openshift-machine-api
annotations:
cluster.x-k8s.io/paused: ""
spec:
selector:
matchLabels:
Expand All @@ -137,6 +150,7 @@ Watch:
- Watch machines and nodes with an event handler e.g controller runtime `EnqueueRequestsFromMapFunc` which returns machineHealthCheck resources.

Reconcile:
- Don't do anything when the pause annotation is set on the machineHealthCheck resource.
- Fetch all the machines in the pool and operate over machine/node targets. E.g:
```go
type target struct {
Expand Down Expand Up @@ -186,6 +200,12 @@ This feature will be tested for public clouds in the e2e machine API suite as th
### Graduation Criteria
An implementation of this feature is currently gated behind the `TechPreviewNoUpgrade` flag. This proposal wants to remove the gating flag and promote machine health check to a GA status with a beta API.

#### Dev Preview -> Tech Preview

#### Tech Preview -> GA

#### Removing a deprecated feature

### Upgrade / Downgrade Strategy

The machine health check controller lives in the machine-api-operator image so the upgrades will be driven by the CVO which will fetch the right image version as usual. See:
Expand Down