From 5df8563a5ed0f6f9638fb397ceb1ac157fbc7eff Mon Sep 17 00:00:00 2001
From: Marc Sluiter
Date: Thu, 8 Jul 2021 15:54:41 +0200
Subject: [PATCH 1/4] Propose to backport the `pause` feature for
 MachineHealthChecks

Signed-off-by: Marc Sluiter
---
 .../machine-api/machine-health-checking.md | 15 ++++++++++++++-
 1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/enhancements/machine-api/machine-health-checking.md b/enhancements/machine-api/machine-health-checking.md
index 002ffecac1..1b28f28ac0 100644
--- a/enhancements/machine-api/machine-health-checking.md
+++ b/enhancements/machine-api/machine-health-checking.md
@@ -96,17 +96,27 @@ For a node notFound or a failed machine, the machine is considerable unrecoverab
 - The machine controller provider implementation deletes the cloud instance.
 - The machine controller deletes the machine resource.
 
+### Pausing
+
+Some cluster operations, e.g. upgrades, result in temporarily unhealthy machines / nodes, which might trigger
+unnecessary remediation. To allow cluster admins or new controllers to prevent this from happening, we will implement a
+`pause` feature on the machineHealthCheck resource. This feature already exists on the upstream machineHealthCheck
+resource in the form of an annotation, which we want to backport. Its key is `cluster.x-k8s.io/paused`, and remediation
+will be paused as soon as this annotation exists. However, its value can be used for syncing between multiple
+parties which want to use the annotation.
+
 ### Implementation Details
 
 #### MachineHealthCheck CRD
 - Enable watching a pool of machines (based on a label selector).
 - Enable defining an unhealthy node criteria (based on a list of node conditions).
 - Enable setting a threshold of unhealthy nodes. If the current number is at or above this threshold no further remediation will take place. This can be expressed as an int or as a percentage of the total targets in the pool. 
+- Enable pausing of remediation E.g: - I want my worker machines to be remediated when the backed node has `ready=false` or `ready=Unknown` condition for more than 10m. - I want remediation to temporary short-circuit if the 40% or more of the targets of this pool are unhealthy at the same time. - +- I want no remediation to happen while my cluster is upgrading its machines / nodes. ```yaml apiVersion: machine.openshift.io/v1beta1 @@ -114,6 +124,8 @@ kind: MachineHealthCheck metadata: name: example namespace: openshift-machine-api + annotations: + cluster.x-k8s.io/paused: "clusterUpgrading" spec: selector: matchLabels: @@ -137,6 +149,7 @@ Watch: - Watch machines and nodes with an event handler e.g controller runtime `EnqueueRequestsFromMapFunc` which returns machineHealthCheck resources. Reconcile: +- Don't start remediation in case the pause annotation is set on the machineHealthCheck resource. - Fetch all the machines in the pool and operate over machine/node targets. E.g: ```go type target struct { From b75f9ade0a46f73a609714c24e34b3adbde5ccfb Mon Sep 17 00:00:00 2001 From: Marc Sluiter Date: Thu, 8 Jul 2021 16:16:24 +0200 Subject: [PATCH 2/4] Changes for matching updated template Signed-off-by: Marc Sluiter --- .../machine-api/machine-health-checking.md | 16 +++++++++++----- 1 file changed, 11 insertions(+), 5 deletions(-) diff --git a/enhancements/machine-api/machine-health-checking.md b/enhancements/machine-api/machine-health-checking.md index 1b28f28ac0..78e82f8df1 100644 --- a/enhancements/machine-api/machine-health-checking.md +++ b/enhancements/machine-api/machine-health-checking.md @@ -105,6 +105,11 @@ resource in the form of an annotation, which we want to backport. Its key is `cl will be paused as soon as this annotation exists. However, its value can be for used for syncing between multiple parties which want to use the annotation. 
+### User Stories +- I want my worker machines to be remediated when the backed node has `ready=false` or `ready=Unknown` condition for more than 10m. +- I want remediation to temporary short-circuit if the 40% or more of the targets of this pool are unhealthy at the same time. +- I want no remediation to happen while my cluster is upgrading its machines / nodes. + ### Implementation Details #### MachineHealthCheck CRD @@ -113,11 +118,6 @@ parties which want to use the annotation. - Enable setting a threshold of unhealthy nodes. If the current number is at or above this threshold no further remediation will take place. This can be expressed as an int or as a percentage of the total targets in the pool. - Enable pausing of remediation -E.g: -- I want my worker machines to be remediated when the backed node has `ready=false` or `ready=Unknown` condition for more than 10m. -- I want remediation to temporary short-circuit if the 40% or more of the targets of this pool are unhealthy at the same time. -- I want no remediation to happen while my cluster is upgrading its machines / nodes. - ```yaml apiVersion: machine.openshift.io/v1beta1 kind: MachineHealthCheck @@ -199,6 +199,12 @@ This feature will be tested for public clouds in the e2e machine API suite as th ### Graduation Criteria An implementation of this feature is currently gated behind the `TechPreviewNoUpgrade` flag. This proposal wants to remove the gating flag and promote machine health check to a GA status with a beta API. +#### Dev Preview -> Tech Preview + +#### Tech Preview -> GA + +#### Removing a deprecated feature + ### Upgrade / Downgrade Strategy The machine health check controller lives in the machine-api-operator image so the upgrades will be driven by the CVO which will fetch the right image version as usual. 
See:
From 85d3e045b5c7a34bf0ebfec6060715613a51e52e Mon Sep 17 00:00:00 2001
From: Marc Sluiter
Date: Tue, 13 Jul 2021 10:21:59 +0200
Subject: [PATCH 3/4] Update:

- removed suggested usage of the annotation value, to be aligned with Cluster API
- added remark of the annotation key (to be aligned with upstream)
- improved reasoning (no remediation without the need to delete MHCs)

Signed-off-by: Marc Sluiter
---
 .../machine-api/machine-health-checking.md | 15 ++++++++-------
 1 file changed, 8 insertions(+), 7 deletions(-)

diff --git a/enhancements/machine-api/machine-health-checking.md b/enhancements/machine-api/machine-health-checking.md
index 78e82f8df1..60f1000e01 100644
--- a/enhancements/machine-api/machine-health-checking.md
+++ b/enhancements/machine-api/machine-health-checking.md
@@ -99,16 +99,17 @@ For a node notFound or a failed machine, the machine is considerable unrecoverab
 ### Pausing
 
 Some cluster operations, e.g. upgrades, result in temporarily unhealthy machines / nodes, which might trigger
-unnecessary remediation. To allow cluster admins or new controllers to prevent this from happening, we will implement a
-`pause` feature on the machineHealthCheck resource. This feature already exists on the upstream machineHealthCheck
-resource in the form of an annotation, which we want to backport. Its key is `cluster.x-k8s.io/paused`, and remediation
-will be paused as soon as this annotation exists. However, its value can be used for syncing between multiple
-parties which want to use the annotation.
+unnecessary remediation. To allow cluster admins or new controllers to prevent this from happening without having to
+delete and re-create machineHealthCheck objects, we will implement a `pause` feature on the machineHealthCheck resource.
+This feature already exists on the upstream machineHealthCheck resource in the form of an annotation, which we want to
+backport. Its key is `cluster.x-k8s.io/paused`. 
While this isn't consistent with existing downstream annotation keys, it +will make future alignment with Cluster API easier. Remediation will be paused as soon as this annotation exists. Its +value isn't checked but is expected to be empty. ### User Stories - I want my worker machines to be remediated when the backed node has `ready=false` or `ready=Unknown` condition for more than 10m. - I want remediation to temporary short-circuit if the 40% or more of the targets of this pool are unhealthy at the same time. -- I want no remediation to happen while my cluster is upgrading its machines / nodes. +- I want to prevent remediation, without deleting the entire MHC configuration, while my cluster is upgrading its machines / nodes. ### Implementation Details @@ -125,7 +126,7 @@ metadata: name: example namespace: openshift-machine-api annotations: - cluster.x-k8s.io/paused: "clusterUpgrading" + cluster.x-k8s.io/paused: "" spec: selector: matchLabels: From b3ba277db045d9297d5ed47d289103f7349ff6a5 Mon Sep 17 00:00:00 2001 From: Marc Sluiter Date: Wed, 14 Jul 2021 17:33:24 +0200 Subject: [PATCH 4/4] Clarify that reconcilation won't do anything while MHC is paused in contrast to just not starting remediation Signed-off-by: Marc Sluiter --- enhancements/machine-api/machine-health-checking.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/enhancements/machine-api/machine-health-checking.md b/enhancements/machine-api/machine-health-checking.md index 60f1000e01..9e1a1c5fdc 100644 --- a/enhancements/machine-api/machine-health-checking.md +++ b/enhancements/machine-api/machine-health-checking.md @@ -150,7 +150,7 @@ Watch: - Watch machines and nodes with an event handler e.g controller runtime `EnqueueRequestsFromMapFunc` which returns machineHealthCheck resources. Reconcile: -- Don't start remediation in case the pause annotation is set on the machineHealthCheck resource. +- Don't do anything when the pause annotation is set on the machineHealthCheck resource. 
- Fetch all the machines in the pool and operate over machine/node targets. E.g: ```go type target struct {