From 941f0b143a1ddafe8edb01f9f819104b6462578e Mon Sep 17 00:00:00 2001
From: Marc Sluiter <msluiter@redhat.com>
Date: Wed, 14 Jul 2021 16:36:19 +0200
Subject: [PATCH 1/5] Added enhancement for new pause-mhc-on-upgrade controller

Signed-off-by: Marc Sluiter <msluiter@redhat.com>
---
 .../machine-health-check-pause-on-upgrade.md  | 175 ++++++++++++++++++
 1 file changed, 175 insertions(+)
 create mode 100644 enhancements/machine-api/machine-health-check-pause-on-upgrade.md

diff --git a/enhancements/machine-api/machine-health-check-pause-on-upgrade.md b/enhancements/machine-api/machine-health-check-pause-on-upgrade.md
new file mode 100644
index 0000000000..85b1206f3b
--- /dev/null
+++ b/enhancements/machine-api/machine-health-check-pause-on-upgrade.md
@@ -0,0 +1,175 @@
+---
+title: machine-health-check-pause-on-cluster-upgrade
+
+authors:
+  - @slintes
+
+reviewers:
+  - @JoelSpeed
+  - @michaelgugino
+  - @beekhof
+
+approvers:
+  - @JoelSpeed
+  - @michaelgugino
+  - @beekhof
+
+creation-date: 2021-07-14  
+last-updated: 2021-07-14  
+status: implementable  
+
+see-also:
+  - "enhancements/machine-api/machine-health-checking.md"
+---
+
+# Pause MachineHealthChecks During Cluster Upgrades
+
+## Release Signoff Checklist
+
+- [ ] Enhancement is `implementable`
+- [ ] Design details are appropriately documented from clear requirements
+- [ ] Test plan is defined
+- [ ] Operational readiness criteria is defined
+- [ ] Graduation criteria for dev preview, tech preview, GA
+- [ ] User-facing documentation is created in [openshift-docs](https://github.com/openshift/openshift-docs/)
+
+## Summary
+
+Automatically prevent unwanted remediation during cluster upgrades caused by temporarily unhealthy nodes.
+
+## Motivation
+
+During cluster upgrades nodes get temporarily unhealthy, which might trigger remediation when a machineHealthCheck
+resource is targeting those nodes, and they don't get healthy before the configured timeout. This remediation delays
+the cluster upgrade significantly especially on bare metal clusters, and causes needless interruption of customer
+workloads while they are transferred to other nodes.
+
+### Goals
+
+- Automatically pause all machineHealthChecks during cluster upgrades
+- Automatically unpause all machineHealthChecks when cluster upgrade finished
+
+### Non-Goals
+
+tbd
+
+## Proposal
+
+Introduce a new controller, which watches the clusterVersion resource, and pauses all machineHealthChecks when the
+clusterVersion status indicates an ongoing cluster upgrade. The controller will unpause the machineHealthChecks when the
+clusterVersion status indicates that the upgrade process finished.
+
+### User Stories
+
+- As a cluster admin I don't want unneeded reboots of nodes during cluster upgrades
+- As a cluster admin I don't want to manually pause machineHealthChecks during cluster upgrades
+- As a user I don't want my workloads to be transferred to new nodes more often than needed during cluster upgrades
+
+### Implementation Details/Notes/Constraints [optional]
+
+The clusterVersion object has several conditions in its status. Example:
+
+```yaml
+  status:
+    conditions:
+    - lastTransitionTime: "2021-07-05T20:53:02Z"
+      message: Done applying 4.9.0-0.nightly-2021-07-05-092650
+      status: "True"
+      type: Available
+    - lastTransitionTime: "2021-07-06T12:03:02Z"
+      status: "False"
+      type: Failing
+    - lastTransitionTime: "2021-07-05T20:53:02Z"
+      message: Cluster version is 4.9.0-0.nightly-2021-07-05-092650
+      status: "False"
+      type: Progressing
+    - lastTransitionTime: "2021-07-05T20:05:15Z"
+      message: 'Unable to retrieve available updates: currently reconciling cluster
+        version 4.9.0-0.nightly-2021-07-05-092650 not found in the "stable-4.8" channel'
+      reason: VersionNotFound
+      status: "False"
+      type: RetrievedUpdates
+    ...
+```
+
+The machineHealthCheck supports pausing by setting an annotation on it:
+
+```yaml
+apiVersion: machine.openshift.io/v1beta1
+kind: MachineHealthCheck
+metadata:
+  name: example
+  namespace: openshift-machine-api
+  annotations:
+    cluster.x-k8s.io/paused: ""
+  ...
+```
+
+The new controller will add the pause annotation on the machineHealthChecks when the clusterVersion's `Progressing`
+condition switches to `"True"`. When that condition switches back to `"False"`, the controller will remove the
+annotation.
+
+The new controller will be implemented in the `openshift/machine-api-operator` repository, and be deployed in the
+`machine-api-controllers` pod, both alongside the machine-healthcheck-controller.
+
+### Risks and Mitigations
+
+Nothing we are aware of today
+
+## Design Details
+
+### Open Questions [optional]
+
+1. Should admins be able to enable / disable this feature? (IMHO no, I don't see a usecase when this has disadvantages)
+
+2. Today there is no way to coordinate multiple users of the pause feature of machineHealthChecks. The value of the
+annotation is supposed to be empty. This means that the annotation might already be set by someone else when the new
+controller want to set it as well. That introduces the question what to do in this case after cluster upgrade: remove
+the annotation, or keep it? In the latter case: how to keep track whether the controller set the annotation or not (e.g.
+in another annotation?)
+
+### Test Plan
+
+TODO (need to check existing e2e tests, and if we extend them with a test case for this, and how much extra load to CI
+it would add)
+
+### Graduation Criteria
+
+TODO I have no good idea yet
+- if this is alpha, beta or stable (do we need this at all, it doesn't introduce a new API)?
+- if this is Dev Preview, Tech Preview or GA?
+
+#### Dev Preview -> Tech Preview
+
+#### Tech Preview -> GA
+
+#### Removing a deprecated feature
+
+### Upgrade / Downgrade Strategy
+
+The new controller lives in the machine-api-operator image so the upgrades will be driven by the CVO which will fetch
+the right image version as usual.
+The controller does not offer any API or configuration. The controller can be replaced with an older or newer
+version without side effects or risks.
+
+### Version Skew Strategy
+
+This controller has some expectation to other APIs, which are the clusterVersion status conditions, and the
+machineHealthCheck pause annotation. In case any of those changes their API (new condition name or annotaton key), the
+controller will stop to work. This will catched in CI.
+
+## Implementation History
+
+## Drawbacks
+
+## Alternatives
+
+- Not having this feature at all: cluster admins could pause machineHealthChecks manually, but that is unneeded work,
+  and too easy to forget.
+- Implement this in the machine healthcheck controller: this would be an obviuos choice on a first look. But since the
+  clusterVersion resource is Openshift specific, it would further diverge Openshift machine API and upstream Cluster
+  API. The goal is to align them over time.
+
+## Infrastructure Needed [optional]
+
+n/a

From ffefbad6b4543c1e500ebe957ac5b8101318e6a7 Mon Sep 17 00:00:00 2001
From: Marc Sluiter <msluiter@redhat.com>
Date: Thu, 15 Jul 2021 12:05:39 +0200
Subject: [PATCH 2/5] Extended goals, non-goals, drawbacks and more
 alternatives

and fixed some typos etc

Signed-off-by: Marc Sluiter <msluiter@redhat.com>
---
 .../machine-health-check-pause-on-upgrade.md  | 31 +++++++++++++------
 1 file changed, 22 insertions(+), 9 deletions(-)

diff --git a/enhancements/machine-api/machine-health-check-pause-on-upgrade.md b/enhancements/machine-api/machine-health-check-pause-on-upgrade.md
index 85b1206f3b..9f2144d479 100644
--- a/enhancements/machine-api/machine-health-check-pause-on-upgrade.md
+++ b/enhancements/machine-api/machine-health-check-pause-on-upgrade.md
@@ -48,16 +48,18 @@ workloads while they are transferred to other nodes.
 
 - Automatically pause all machineHealthChecks during cluster upgrades
 - Automatically unpause all machineHealthChecks when cluster upgrade finished
+- Get a simple solution done in time for OCP 4.9, by using what we have upstream already (the pause annotation), for
+  avoiding worst case behaviour during cluster upgrades. Look for better solutions for future releases
 
 ### Non-Goals
 
-tbd
+- Track the status of each single node (upgrading or not upgrading) for more fine-grained pausing of remediation
 
 ## Proposal
 
-Introduce a new controller, which watches the clusterVersion resource, and pauses all machineHealthChecks when the
-clusterVersion status indicates an ongoing cluster upgrade. The controller will unpause the machineHealthChecks when the
-clusterVersion status indicates that the upgrade process finished.
+Introduce a new controller, which watches clusterVersion and machineHealthCheck resources, and pauses all
+machineHealthChecks when the clusterVersion status indicates an ongoing cluster upgrade. The controller will unpause the
+machineHealthChecks when the clusterVersion status indicates that the upgrade process finished.
 
 ### User Stories
 
@@ -124,7 +126,7 @@ Nothing we are aware of today
 
 2. Today there is no way to coordinate multiple users of the pause feature of machineHealthChecks. The value of the
 annotation is supposed to be empty. This means that the annotation might already be set by someone else when the new
-controller want to set it as well. That introduces the question what to do in this case after cluster upgrade: remove
+controller want to set it as well. That introduces the question of what to do in this case after cluster upgrade: remove
 the annotation, or keep it? In the latter case: how to keep track whether the controller set the annotation or not (e.g.
 in another annotation?)
 
@@ -147,7 +149,7 @@ TODO I have no good idea yet
 
 ### Upgrade / Downgrade Strategy
 
-The new controller lives in the machine-api-operator image so the upgrades will be driven by the CVO which will fetch
+The new controller lives in the machine-api-operator image, so the upgrades will be driven by the CVO which will fetch
 the right image version as usual.
 The controller does not offer any API or configuration. The controller can be replaced with an older or newer
 version without side effects or risks.
@@ -155,20 +157,31 @@ version without side effects or risks.
 ### Version Skew Strategy
 
 This controller has some expectation to other APIs, which are the clusterVersion status conditions, and the
-machineHealthCheck pause annotation. In case any of those changes their API (new condition name or annotaton key), the
-controller will stop to work. This will catched in CI.
+machineHealthCheck pause annotation. In case any of those changes their API (new condition name or annotation key), the
+controller will stop to work. This will be caught in CI.
 
 ## Implementation History
 
 ## Drawbacks
 
+The controller can't stop ongoing external remediation, which is executed by an external remediation controller. For
+this to happen we would either need to introduce and implement a pause flag on the external remediation CRD, or the
+external remediation controller needs to look for the pause annotation on the machineHealthCheck resource.
+
 ## Alternatives
 
 - Not having this feature at all: cluster admins could pause machineHealthChecks manually, but that is unneeded work,
   and too easy to forget.
-- Implement this in the machine healthcheck controller: this would be an obviuos choice on a first look. But since the
+- Implement this in the machine-healthcheck-controller: this would be an obvious choice on a first look. But since the
   clusterVersion resource is Openshift specific, it would further diverge Openshift machine API and upstream Cluster
   API. The goal is to align them over time.
+- Track the status of nodes more fine-grained and continue to remediate nodes that currently are not upgrading: this
+  would be a perfect solution and might be investigated more deeply in the future. For now, with respect to timeline, we
+  want to concentrate on the simple solution of pausing MHCs completely.
+- Use node leases: there are efforts to introduce the concept of node leases
+  (https://github.com/kubernetes/enhancements/pull/1411). They can be used for coordinating parties which want to do
+  destructive actions on nodes, like e.g. upgrades or remediation. This would also result in more fine-grained "pausing"
+  of node remediation. With respect to the timeline we don't want to wait for this effort, but will keep an eye on it.
 
 ## Infrastructure Needed [optional]
 

From 9a40070df0e1fb23a713e9b289041f5061843cf4 Mon Sep 17 00:00:00 2001
From: Marc Sluiter <msluiter@redhat.com>
Date: Fri, 16 Jul 2021 10:11:26 +0200
Subject: [PATCH 3/5] Update
 enhancements/machine-api/machine-health-check-pause-on-upgrade.md

Co-authored-by: Joel Speed <Joel.speed@hotmail.co.uk>
---
 .../machine-api/machine-health-check-pause-on-upgrade.md        | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/enhancements/machine-api/machine-health-check-pause-on-upgrade.md b/enhancements/machine-api/machine-health-check-pause-on-upgrade.md
index 9f2144d479..7fefffda97 100644
--- a/enhancements/machine-api/machine-health-check-pause-on-upgrade.md
+++ b/enhancements/machine-api/machine-health-check-pause-on-upgrade.md
@@ -63,7 +63,7 @@ machineHealthChecks when the clusterVersion status indicates that the upgrade pr
 
 ### User Stories
 
-- As a cluster admin I don't want unneeded reboots of nodes during cluster upgrades
+- As a cluster admin I don't want unneeded remediation/reboots of nodes during cluster upgrades
 - As a cluster admin I don't want to manually pause machineHealthChecks during cluster upgrades
 - As a user I don't want my workloads to be transferred to new nodes more often than needed during cluster upgrades
 

From d3b9bbb2c215bf1a629e516ea671d02baccabf1d Mon Sep 17 00:00:00 2001
From: Marc Sluiter <msluiter@redhat.com>
Date: Fri, 16 Jul 2021 15:13:45 +0200
Subject: [PATCH 4/5] Clarify planned behaviour in case MHCs are paused
 already, or created or upaused during upgrades

Signed-off-by: Marc Sluiter <msluiter@redhat.com>
---
 .../machine-health-check-pause-on-upgrade.md  | 19 ++++++++++++++-----
 1 file changed, 14 insertions(+), 5 deletions(-)

diff --git a/enhancements/machine-api/machine-health-check-pause-on-upgrade.md b/enhancements/machine-api/machine-health-check-pause-on-upgrade.md
index 7fefffda97..d5dcae07ac 100644
--- a/enhancements/machine-api/machine-health-check-pause-on-upgrade.md
+++ b/enhancements/machine-api/machine-health-check-pause-on-upgrade.md
@@ -47,7 +47,7 @@ workloads while they are transferred to other nodes.
 ### Goals
 
 - Automatically pause all machineHealthChecks during cluster upgrades
-- Automatically unpause all machineHealthChecks when cluster upgrade finished
+- Automatically unpause all machineHealthChecks when cluster upgrade finished, in case they were paused for the upgrade
 - Get a simple solution done in time for OCP 4.9, by using what we have upstream already (the pause annotation), for
   avoiding worst case behaviour during cluster upgrades. Look for better solutions for future releases
 
@@ -57,9 +57,10 @@ workloads while they are transferred to other nodes.
 
 ## Proposal
 
-Introduce a new controller, which watches clusterVersion and machineHealthCheck resources, and pauses all
+Introduce a new controller, which watches clusterVersion and machineHealthCheck resources, and pauses all unpaused
 machineHealthChecks when the clusterVersion status indicates an ongoing cluster upgrade. The controller will unpause the
-machineHealthChecks when the clusterVersion status indicates that the upgrade process finished.
+machineHealthChecks, when the clusterVersion status indicates that the upgrade process finished, and the
+machineHealthChecks were paused by this controller.
 
 ### User Stories
 
@@ -107,10 +108,18 @@ metadata:
   ...
 ```
 
-The new controller will add the pause annotation on the machineHealthChecks when the clusterVersion's `Progressing`
-condition switches to `"True"`. When that condition switches back to `"False"`, the controller will remove the
+When the clusterVersion's `Progressing` condition switches to `"True"`, the new controller will iterate all
+machineHealthCheck resources, and add the pause annotation to all of them which are not paused already. For tracking
+which resources are being paused by the controller, it will also add a `machine.openshift.io/paused-for-upgrade: ""`
 annotation.
 
+During the cluster upgrade, the controller will also pause and mark all newly created machineHealthChecks.
+When a machineHealthCheck is unpaused by another party though, the controller will not pause it again, in order
+to allow admins to explicitly start remediation during upgrades.
+
+When the clusterVersion condition switches back to `"False"`, the controller will remove both annotations from all
+machineHealthChecks, which were paused by this controller.
+
 The new controller will be implemented in the `openshift/machine-api-operator` repository, and be deployed in the
 `machine-api-controllers` pod, both alongside the machine-healthcheck-controller.
 

From f32a8deb6f3b1177dc790fdfe6d327028906dcc7 Mon Sep 17 00:00:00 2001
From: Marc Sluiter <msluiter@redhat.com>
Date: Fri, 16 Jul 2021 15:15:04 +0200
Subject: [PATCH 5/5] Added risk, and drawback of tracking upgrades on each
 node instead of the cluster

Signed-off-by: Marc Sluiter <msluiter@redhat.com>
---
 .../machine-health-check-pause-on-upgrade.md         | 12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/enhancements/machine-api/machine-health-check-pause-on-upgrade.md b/enhancements/machine-api/machine-health-check-pause-on-upgrade.md
index d5dcae07ac..1cf5cd4de3 100644
--- a/enhancements/machine-api/machine-health-check-pause-on-upgrade.md
+++ b/enhancements/machine-api/machine-health-check-pause-on-upgrade.md
@@ -125,7 +125,8 @@ The new controller will be implemented in the `openshift/machine-api-operator` r
 
 ### Risks and Mitigations
 
-Nothing we are aware of today
+Hosts that are not successfully rebooted as part of the upgrade won't be remediated, until either the upgrade finishes,
+or an admin unpauses the machineHealthChecks manually.
 
 ## Design Details
 
@@ -184,9 +185,12 @@ external remediation controller needs to look for the pause annotation on the ma
 - Implement this in the machine-healthcheck-controller: this would be an obvious choice on a first look. But since the
   clusterVersion resource is Openshift specific, it would further diverge Openshift machine API and upstream Cluster
   API. The goal is to align them over time.
-- Track the status of nodes more fine-grained and continue to remediate nodes that currently are not upgrading: this
-  would be a perfect solution and might be investigated more deeply in the future. For now, with respect to timeline, we
-  want to concentrate on the simple solution of pausing MHCs completely.
+- Track the upgrading status on single nodes instead of the cluster: on one hand this would allow to remediate hosts
+  which are unhealthy but not caused by the upgrade, but on the other hand it might have the risk that both upgrade and
+  remediation drain nodes, and the cluster runs out of capacity. This can result in upgrade and remediation controllers
+  blocking each other from taking action, and the cluster wouldn't be upgradeable or fixable. This might be investigated
+  more deeply in the future, but for now, with respect to timeline, we want to concentrate on the simple solution of
+  pausing MHCs completely.
 - Use node leases: there are efforts to introduce the concept of node leases
   (https://github.com/kubernetes/enhancements/pull/1411). They can be used for coordinating parties which want to do
   destructive actions on nodes, like e.g. upgrades or remediation. This would also result in more fine-grained "pausing"