Skip to content
Closed
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
201 changes: 201 additions & 0 deletions enhancements/machine-api/machine-health-check-pause-on-upgrade.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,201 @@
---
title: machine-health-check-pause-on-cluster-upgrade

authors:
- @slintes

reviewers:
- @JoelSpeed
- @michaelgugino
- @beekhof

approvers:
- @JoelSpeed
- @michaelgugino
- @beekhof

creation-date: 2021-07-14
last-updated: 2021-07-14
status: implementable

see-also:
- "enhancements/machine-api/machine-health-checking.md"
---

# Pause MachineHealthChecks During Cluster Upgrades

## Release Signoff Checklist

- [ ] Enhancement is `implementable`
- [ ] Design details are appropriately documented from clear requirements
- [ ] Test plan is defined
- [ ] Operational readiness criteria is defined
- [ ] Graduation criteria for dev preview, tech preview, GA
- [ ] User-facing documentation is created in [openshift-docs](https://github.com/openshift/openshift-docs/)

## Summary

Automatically prevent unwanted remediation during cluster upgrades caused by temporarily unhealthy nodes.

## Motivation

During cluster upgrades nodes get temporarily unhealthy, which might trigger remediation when a machineHealthCheck
resource is targeting those nodes, and they don't get healthy before the configured timeout. This remediation delays
the cluster upgrade significantly especially on bare metal clusters, and causes needless interruption of customer
workloads while they are transferred to other nodes.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this a problem we've seen in practice? Do we know the cause?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This came up with a customer case, so we have seen it in the wild.

The issue was that the MCO reboot took longer than the configured timeout for the MHC. I don't know the full details of that MHC but I'm assuming they had a reasonable timeout configured. We should double check this and add some context here

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is that just misconfiguration, I configured a MHC without properly understanding the upper bound of the time required to reboot a host and the Kubelet to start posting Ready? Should our docs around adding MHCs make it clear you'll probably want to measure the time it takes your hosts to perform a MCO initiated reboot before setting these values?

Do we really need different rules applied during upgrades versus non upgrade times? If I apply an MC change that triggers a reboot that wouldn't be during an upgrade but would trigger the same behavior.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we really need different rules applied during upgrades versus non upgrade times? If I apply an MC change that triggers a reboot that wouldn't be during an upgrade but would trigger the same behavior.

I think I'm kind of leaning this direction myself. If you hosts typically take 30 minutes to reboot, set the MHC timeout to 45 minutes. IMO, MHC isn't meant for 'emergency' situations where a host has to be replaced ASAP. It's more of a nice-to-have when you have programmable infra, so if a host dies in the middle of the night, it will be replaced and you won't slowly lose machines over time.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we really need different rules applied during upgrades versus non upgrade times? If I apply an MC change that triggers a reboot that wouldn't be during an upgrade but would trigger the same behavior.

Fair point 👍🏼 We could catch this by looking at machineConfigPools MachineConfigPoolUpdating condition though (suggested at another place already) instead of the cluster version's status, right?

If you hosts typically take 30 minutes to reboot, set the MHC timeout to 45 minutes. IMO, MHC isn't meant for 'emergency' situations

That's also a fair point, but I know we have users / customers which want fencing as fast as possible.

@beekhof @JoelSpeed WDYT?

Copy link

@n1r1 n1r1 Jul 19, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MHC isn't meant for 'emergency' situations where a host has to be replaced ASAP. It's more of a nice-to-have when you have programmable infra, so if a host dies in the middle of the night, it will be replaced and you won't slowly lose machines over time.

This only true for the compute capacity of the unhealthy node but not for the workload that was assigned to that node.
if you have stateful workloads running on the unhealthy node, you'll have downtime until remediation takes place.
45 minutes downtime for a user application is a lot, IMO, and we saw this need coming from customers comparing OCP to other products.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

45 minutes downtime for a user application is a lot, IMO, and we saw this need coming from customers comparing OCP to other products.

If MHC is disabled for one reason or another during upgrades, this situation could persist indefinitely (eg, you're upgrading host 1, host 2 dies, problem upgrading host 1). Similarly, PDBs can prevent MHC from successfully deleting a failed node indefinitely.

Let's try to put some context around the MHC usecase/userstory. If you weren't running MHC, and you had a random worker go down in the middle of the night, you'd need to wait for X minutes to be paged, someone to log in, etc, and decide to remediate the node manually. That's probably going to take 30 minutes or so, and that's only if you decide to page for a single node failure or have 24/7 ops.

MHC is not a cluster saving or application saving utility. It's to slowly replace failed nodes. Pretty much everywhere but baremetal won't encounter this issue anyhow (other platform reboots are with a couple minutes), so like I said elsewhere, if we're going to add something like this, it should be opt-in.


### Goals

- Automatically pause all machineHealthChecks during cluster upgrades
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does OCP upgrade nodes one by one?
if yes, I'd say that a perfect solution would know if a specific node is unhealthy due to upgrade or something else, so we won't miss remediations that can take place even during upgrade (see lease in alternatives comment)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's try walking before we attempt to run

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't contradicts.
it helps to know what is the vision/end-goal so we can:

  1. know what are the limitations of the proposed solution
  2. Make steps at the right direction

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added this to non-goals and alternatives

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you actually care about the machineConfigPool being in progressing state rather than the upgrade state. A full hour of the upgrade has nothing to do with host reboots and if we deliver on paused-worker-pool upgrades most nodes in a cluster won't even reboot some of the time.

I think you should instead watch for nodes that have been cordoned and ignore them in some manner. That seems more broadly applicable.

@JoelSpeed do you have an opinion on this?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you actually care about the machineConfigPool being in progressing state rather than the upgrade state.

I agree, if we can track the MCP upgrade rather than whole cluster upgrade, that will be a smaller time window. Perhaps the easiest way to do this is to actually track the progressing state of the ClusterOperator for machine config?

I think you should instead watch for nodes that have been cordoned and ignore them in some manner. That seems more broadly applicable.

This seems more like the node maintenance lease idea that has floated around before, kubernetes/enhancements#1411, this is definitely something that could be more intuitive and avoid the dependency on upgrade specific watching, but is a much larger piece of work.

In terms of just watching cordoned nodes, to me this makes sense as a signal that a node is going into maintenance of some description. Need to think a bit more about if this would cause other issues 🤔

- Automatically unpause all machineHealthChecks when cluster upgrade finished, in case they were paused for the upgrade
- Get a simple solution done in time for OCP 4.9, by using what we have upstream already (the pause annotation), for
avoiding worst case behaviour during cluster upgrades. Look for better solutions for future releases

### Non-Goals

- Track the status of each single node (upgrading or not upgrading) for more fine-grained pausing of remediation

## Proposal

Introduce a new controller, which watches clusterVersion and machineHealthCheck resources, and pauses all unpaused
machineHealthChecks when the clusterVersion status indicates an ongoing cluster upgrade. The controller will unpause the
machineHealthChecks, when the clusterVersion status indicates that the upgrade process finished, and the
machineHealthChecks were paused by this controller.

### User Stories

- As a cluster admin I don't want unneeded remediation/reboots of nodes during cluster upgrades
- As a cluster admin I don't want to manually pause machineHealthChecks during cluster upgrades
- As a user I don't want my workloads to be transferred to new nodes more often than needed during cluster upgrades

### Implementation Details/Notes/Constraints [optional]

The clusterVersion object has several conditions in its status. Example:

```yaml
status:
conditions:
- lastTransitionTime: "2021-07-05T20:53:02Z"
message: Done applying 4.9.0-0.nightly-2021-07-05-092650
status: "True"
type: Available
- lastTransitionTime: "2021-07-06T12:03:02Z"
status: "False"
type: Failing
- lastTransitionTime: "2021-07-05T20:53:02Z"
message: Cluster version is 4.9.0-0.nightly-2021-07-05-092650
status: "False"
type: Progressing
- lastTransitionTime: "2021-07-05T20:05:15Z"
message: 'Unable to retrieve available updates: currently reconciling cluster
version 4.9.0-0.nightly-2021-07-05-092650 not found in the "stable-4.8" channel'
reason: VersionNotFound
status: "False"
type: RetrievedUpdates
...
```

The machineHealthCheck supports pausing by setting an annotation on it:

```yaml
apiVersion: machine.openshift.io/v1beta1
kind: MachineHealthCheck
metadata:
name: example
namespace: openshift-machine-api
annotations:
cluster.x-k8s.io/paused: ""
...
```

When the clusterVersion's `Progressing` condition switches to `"True"`, the new controller will iterate all
machineHealthCheck resources, and add the pause annotation to all of them which are not paused already. For tracking
which resources are being paused by the controller, it will also add a `machine.openshift.io/paused-for-upgrade: ""`
annotation.

During the cluster upgrade, the controller will also pause and mark all newly created machineHealthChecks.
When a machineHealthCheck is unpaused by another party though, the controller will not pause it again, in order
to allow admins to explicitly start remediation during upgrades.

When the clusterVersion condition switches back to `"False"`, the controller will remove both annotations from all
machineHealthChecks, which were paused by this controller.

The new controller will be implemented in the `openshift/machine-api-operator` repository, and be deployed in the
`machine-api-controllers` pod, both alongside the machine-healthcheck-controller.

### Risks and Mitigations

Hosts that are not successfully rebooted as part of the upgrade won't be remediated, until either the upgrade finishes,
or an admin unpauses the machineHealthChecks manually.

## Design Details

### Open Questions [optional]

1. Should admins be able to enable / disable this feature? (IMHO no, I don't see a usecase when this has disadvantages)

2. Today there is no way to coordinate multiple users of the pause feature of machineHealthChecks. The value of the
annotation is supposed to be empty. This means that the annotation might already be set by someone else when the new
controller want to set it as well. That introduces the question of what to do in this case after cluster upgrade: remove
the annotation, or keep it? In the latter case: how to keep track whether the controller set the annotation or not (e.g.
in another annotation?)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If a user has already paused remediation, I don't think we want an upgrade to remove that setting. I think we should use a secondary annotation to indicate that this automated controller has added the pause so that it knows to only remove the pause if it sees both annotations.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The value of the annotation is supposed to be empty.

Is this documented somewhere? Is this enforced anywhere?

I had imagined we would set the value to the owner of the pause, as it were, so the upgrade controller would put its name as the value, and then only remove the annotation if the value matches (this would allow others to pause using their own value as well)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should use a secondary annotation to indicate that this automated controller has added the pause

ack, agree, will update

The value of the annotation is supposed to be empty.

Is this documented somewhere? Is this enforced anywhere?

No and no, but that was the answer in the sig cluster api Slack when I asked about it.
https://kubernetes.slack.com/archives/C8TSNPY4T/p1626101734178700
also see #827 (comment)


### Test Plan

TODO (need to check existing e2e tests, and if we extend them with a test case for this, and how much extra load to CI
it would add)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add a test to an existing job?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can add this to the existing MHC tests that exist in cluster-api-actuator-pkg


### Graduation Criteria

TODO I have no good idea yet
- if this is alpha, beta or stable (do we need this at all, it doesn't introduce a new API)?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The annotation is technically an API, right?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, the annotation is an API IMO

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, but this enhancement is not about adding that annotation, it's about using it only
#827

- if this is Dev Preview, Tech Preview or GA?

#### Dev Preview -> Tech Preview

#### Tech Preview -> GA

#### Removing a deprecated feature

### Upgrade / Downgrade Strategy

The new controller lives in the machine-api-operator image, so the upgrades will be driven by the CVO which will fetch
the right image version as usual.
The controller does not offer any API or configuration. The controller can be replaced with an older or newer
version without side effects or risks.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If health checks are paused on a cluster before the upgrade to this version is started, then when the new controller comes online and sees the upgrade complete it will erase the pause annotations. I don't think we want that, because it won't leave the cluster in the same state as it was when we started and the change could cause unwanted degradation in service. See the comment above about the 2-annotation approach.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If health checks are paused on a cluster before the upgrade to this version is started

when everything goes as planned, it is very very unlikely that anyone added the pause annotation to machineHealthChecks already, because the machine-healthcheck-controller version, which respects the pause annotation (and so the relevant doc update), will only be introduced with the same OCP / machine-api-operator version as this new controller. But I will keep this in mind in case this does not go as planned, thanks.


### Version Skew Strategy

This controller has some expectation to other APIs, which are the clusterVersion status conditions, and the
machineHealthCheck pause annotation. In case any of those changes their API (new condition name or annotation key), the
controller will stop to work. This will be caught in CI.

## Implementation History

## Drawbacks

The controller can't stop ongoing external remediation, which is executed by an external remediation controller. For
this to happen we would either need to introduce and implement a pause flag on the external remediation CRD, or the
external remediation controller needs to look for the pause annotation on the machineHealthCheck resource.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What would happen if the remediation controller indicates that the cluster is not upgradeable when it starts trying to fix a host?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I can't follow. What do you mean with "the remediation controller indicates that the cluster is not upgradeable"?


## Alternatives

- Not having this feature at all: cluster admins could pause machineHealthChecks manually, but that is unneeded work,
and too easy to forget.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It could be scripted, right?

How widely used is the machine health check controller? Is that an optional component or is it always present when we have the machine API?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MHC comes by default in an OCP cluster, but it isn't configured unless a customer creates a MHC object in the cluster, so it's technically off by default

- Implement this in the machine-healthcheck-controller: this would be an obvious choice on a first look. But since the
clusterVersion resource is Openshift specific, it would further diverge Openshift machine API and upstream Cluster
API. The goal is to align them over time.
- Track the upgrading status on single nodes instead of the cluster: on one hand this would allow to remediate hosts
which are unhealthy but not caused by the upgrade, but on the other hand it might have the risk that both upgrade and
remediation drain nodes, and the cluster runs out of capacity. This can result in upgrade and remediation controllers
blocking each other from taking action, and the cluster wouldn't be upgradeable or fixable. This might be investigated
more deeply in the future, but for now, with respect to timeline, we want to concentrate on the simple solution of
pausing MHCs completely.
- Use node leases: there are efforts to introduce the concept of node leases
(https://github.com/kubernetes/enhancements/pull/1411). They can be used for coordinating parties which want to do
destructive actions on nodes, like e.g. upgrades or remediation. This would also result in more fine-grained "pausing"
of node remediation. With respect to the timeline we don't want to wait for this effort, but will keep an eye on it.

## Infrastructure Needed [optional]

n/a