# Propose new controller for pausing MHC during cluster upgrades (#834)
---
title: machine-health-check-pause-on-cluster-upgrade

authors:
  - @slintes

reviewers:
  - @JoelSpeed
  - @michaelgugino
  - @beekhof

approvers:
  - @JoelSpeed
  - @michaelgugino
  - @beekhof

creation-date: 2021-07-14
last-updated: 2021-07-14
status: implementable

see-also:
  - "enhancements/machine-api/machine-health-checking.md"
---
# Pause MachineHealthChecks During Cluster Upgrades

## Release Signoff Checklist

- [ ] Enhancement is `implementable`
- [ ] Design details are appropriately documented from clear requirements
- [ ] Test plan is defined
- [ ] Operational readiness criteria is defined
- [ ] Graduation criteria for dev preview, tech preview, GA
- [ ] User-facing documentation is created in [openshift-docs](https://github.com/openshift/openshift-docs/)

## Summary

Automatically prevent unwanted remediation of temporarily unhealthy nodes during cluster upgrades.

## Motivation

During cluster upgrades, nodes become temporarily unhealthy. This can trigger remediation when a machineHealthCheck resource targets those nodes and they don't recover before the configured timeout. Such remediation delays the cluster upgrade significantly, especially on bare metal clusters, and causes needless interruption of customer workloads while they are transferred to other nodes.

### Goals

- Automatically pause all machineHealthChecks during cluster upgrades
> **Review thread:**
>
> *Comment:* Does OCP upgrade nodes one by one?
>
> *Contributor:* Let's try walking before we attempt to run.
>
> *Comment:* It doesn't contradict.
>
> *Author:* Added this to non-goals and alternatives.
>
> *Member:* I think you actually care about the machineConfigPool being in a progressing state rather than the upgrade state. A full hour of the upgrade has nothing to do with host reboots, and if we deliver on paused-worker-pool upgrades, most nodes in a cluster won't even reboot some of the time. I think you should instead watch for nodes that have been cordoned and ignore them in some manner. That seems more broadly applicable. @JoelSpeed do you have an opinion on this?
>
> *Contributor:* I agree, if we can track the MCP upgrade rather than the whole cluster upgrade, that will be a smaller time window. Perhaps the easiest way to do this is to actually track the
>
> This seems more like the node maintenance lease idea that has floated around before (kubernetes/enhancements#1411). That is definitely something that could be more intuitive and avoid the dependency on upgrade-specific watching, but it is a much larger piece of work. In terms of just watching cordoned nodes, to me this makes sense as a signal that a node is going into maintenance of some description. Need to think a bit more about whether this would cause other issues 🤔
- Automatically unpause all machineHealthChecks when the cluster upgrade finishes, in case they were paused for the upgrade
- Get a simple solution done in time for OCP 4.9, by using what we already have upstream (the pause annotation), to avoid worst-case behaviour during cluster upgrades. Look for better solutions for future releases.

### Non-Goals

- Track the upgrade status of each single node (upgrading or not upgrading) for more fine-grained pausing of remediation
## Proposal

Introduce a new controller which watches clusterVersion and machineHealthCheck resources and pauses all unpaused machineHealthChecks when the clusterVersion status indicates an ongoing cluster upgrade. The controller will unpause the machineHealthChecks when the clusterVersion status indicates that the upgrade process has finished and the machineHealthChecks were paused by this controller.

### User Stories

- As a cluster admin, I don't want unneeded remediation/reboots of nodes during cluster upgrades.
- As a cluster admin, I don't want to manually pause machineHealthChecks during cluster upgrades.
- As a user, I don't want my workloads to be transferred to new nodes more often than needed during cluster upgrades.
### Implementation Details/Notes/Constraints [optional]

The clusterVersion object has several conditions in its status. Example:

```yaml
status:
  conditions:
  - lastTransitionTime: "2021-07-05T20:53:02Z"
    message: Done applying 4.9.0-0.nightly-2021-07-05-092650
    status: "True"
    type: Available
  - lastTransitionTime: "2021-07-06T12:03:02Z"
    status: "False"
    type: Failing
  - lastTransitionTime: "2021-07-05T20:53:02Z"
    message: Cluster version is 4.9.0-0.nightly-2021-07-05-092650
    status: "False"
    type: Progressing
  - lastTransitionTime: "2021-07-05T20:05:15Z"
    message: 'Unable to retrieve available updates: currently reconciling cluster
      version 4.9.0-0.nightly-2021-07-05-092650 not found in the "stable-4.8" channel'
    reason: VersionNotFound
    status: "False"
    type: RetrievedUpdates
  ...
```
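The condition check the controller needs can be sketched as a small helper. This is an illustrative sketch, not the actual implementation: the `condition` struct below is a simplified stand-in for the real ClusterVersion API types, and the function name is invented.

```go
package main

import "fmt"

// condition mirrors the relevant fields of a clusterVersion status condition.
// In the real controller these would come from the OpenShift API types; this
// struct is a simplified stand-in for illustration.
type condition struct {
	Type   string
	Status string
}

// isClusterUpgrading returns true when the Progressing condition is "True",
// which this proposal uses as the signal for an ongoing cluster upgrade.
func isClusterUpgrading(conditions []condition) bool {
	for _, c := range conditions {
		if c.Type == "Progressing" {
			return c.Status == "True"
		}
	}
	// No Progressing condition found: assume no upgrade is in progress.
	return false
}

func main() {
	conds := []condition{
		{Type: "Available", Status: "True"},
		{Type: "Progressing", Status: "False"},
	}
	fmt.Println(isClusterUpgrading(conds)) // false: upgrade finished
}
```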
The machineHealthCheck supports pausing by setting an annotation on it:

```yaml
apiVersion: machine.openshift.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: example
  namespace: openshift-machine-api
  annotations:
    cluster.x-k8s.io/paused: ""
  ...
```
When the clusterVersion's `Progressing` condition switches to `"True"`, the new controller will iterate over all machineHealthCheck resources and add the pause annotation to all of them which are not already paused. To track which resources were paused by the controller, it will also add a `machine.openshift.io/paused-for-upgrade: ""` annotation.

During the cluster upgrade, the controller will also pause and mark all newly created machineHealthChecks. When a machineHealthCheck is unpaused by another party though, the controller will not pause it again, in order to allow admins to explicitly start remediation during upgrades.

When the clusterVersion's `Progressing` condition switches back to `"False"`, the controller will remove both annotations from all machineHealthChecks which were paused by this controller.

The new controller will be implemented in the `openshift/machine-api-operator` repository and deployed in the `machine-api-controllers` pod, alongside the machine-healthcheck-controller.
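The pause/unpause bookkeeping described above can be sketched as a pure function over the annotation map. This is a minimal sketch under the proposal's two-annotation scheme, not the actual controller code; the function name is invented for illustration.

```go
package main

import "fmt"

// Annotation keys from the proposal: the upstream pause annotation and the
// controller-owned marker used to track which pauses this controller added.
const (
	pausedAnnotation           = "cluster.x-k8s.io/paused"
	pausedForUpgradeAnnotation = "machine.openshift.io/paused-for-upgrade"
)

// reconcileMHCAnnotations returns the updated annotation map for a single
// machineHealthCheck, given whether the cluster is currently upgrading.
// It never pauses an MHC that another party already paused, and it only
// unpauses MHCs that carry its own marker annotation.
func reconcileMHCAnnotations(annotations map[string]string, upgrading bool) map[string]string {
	out := map[string]string{}
	for k, v := range annotations {
		out[k] = v
	}
	_, paused := out[pausedAnnotation]
	_, pausedByUs := out[pausedForUpgradeAnnotation]

	if upgrading {
		// Pause only MHCs nobody has touched yet; if an admin removed our
		// pause mid-upgrade (pausedByUs without paused), leave it unpaused.
		if !paused && !pausedByUs {
			out[pausedAnnotation] = ""
			out[pausedForUpgradeAnnotation] = ""
		}
	} else if pausedByUs {
		// Upgrade finished: remove only the pauses we added ourselves.
		delete(out, pausedAnnotation)
		delete(out, pausedForUpgradeAnnotation)
	}
	return out
}

func main() {
	a := reconcileMHCAnnotations(map[string]string{}, true)
	fmt.Println(len(a)) // 2: pause annotation plus our marker
	b := reconcileMHCAnnotations(a, false)
	fmt.Println(len(b)) // 0: both annotations removed after the upgrade
}
```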
### Risks and Mitigations

Hosts that are not successfully rebooted as part of the upgrade won't be remediated until either the upgrade finishes or an admin unpauses the machineHealthChecks manually.

## Design Details

### Open Questions [optional]

1. Should admins be able to enable / disable this feature? (IMHO no, I don't see a use case where this has disadvantages.)

2. Today there is no way to coordinate multiple users of the pause feature of machineHealthChecks. The value of the annotation is supposed to be empty. This means that the annotation might already be set by someone else when the new controller wants to set it as well. That raises the question of what to do in this case after the cluster upgrade: remove the annotation, or keep it? In the latter case: how do we keep track of whether the controller set the annotation or not (e.g. in another annotation)?
> **Review thread:**
>
> *Contributor:* If a user has already paused remediation, I don't think we want an upgrade to remove that setting. I think we should use a secondary annotation to indicate that this automated controller has added the pause, so that it knows to only remove the pause if it sees both annotations.
>
> *Contributor:* Is this documented somewhere? Is this enforced anywhere? I had imagined we would set the value to the owner of the pause, as it were, so the upgrade controller would put its name as the value, and then only remove the annotation if the value matches (this would allow others to pause using their own value as well).
>
> *Author:* Ack, agree, will update. No and no, but that was the answer in the SIG Cluster API Slack when I asked about it.
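The owner-value variant suggested in the thread above (writing the controller's name as the annotation value and only removing a pause whose value matches) could look roughly like this hypothetical sketch; `ownerID` and both function names are invented for illustration, and, as noted in the discussion, upstream currently expects an empty annotation value, so this is not how the annotation is used today.

```go
package main

import "fmt"

const pausedAnnotation = "cluster.x-k8s.io/paused"

// ownerID is a hypothetical identifier this controller would write as the
// annotation value to claim ownership of the pause.
const ownerID = "machine.openshift.io/upgrade-pause-controller"

// pauseAsOwner sets the pause annotation only if nobody else holds it,
// recording our owner ID as the value.
func pauseAsOwner(annotations map[string]string) bool {
	if _, exists := annotations[pausedAnnotation]; exists {
		return false // someone else already paused; leave it alone
	}
	annotations[pausedAnnotation] = ownerID
	return true
}

// unpauseIfOwner removes the pause annotation only when its value matches
// our owner ID, so a pause set by another party survives the upgrade.
func unpauseIfOwner(annotations map[string]string) bool {
	if annotations[pausedAnnotation] != ownerID {
		return false
	}
	delete(annotations, pausedAnnotation)
	return true
}

func main() {
	a := map[string]string{}
	fmt.Println(pauseAsOwner(a))   // true: we claimed the pause
	fmt.Println(unpauseIfOwner(a)) // true: value matched, annotation removed

	b := map[string]string{pausedAnnotation: ""} // paused by an admin
	fmt.Println(pauseAsOwner(b))   // false: already paused by someone else
	fmt.Println(unpauseIfOwner(b)) // false: not ours, left untouched
}
```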
### Test Plan

TODO (need to check the existing e2e tests, whether we extend them with a test case for this, and how much extra load that would add to CI)
> **Review thread:**
>
> *Contributor:* Can we add a test to an existing job?
>
> *Contributor:* We can add this to the existing MHC tests that exist in cluster-api-actuator-pkg.
### Graduation Criteria

TODO, no good idea yet:
- is this alpha, beta, or stable (do we need this at all, since it doesn't introduce a new API)?
> **Review thread:**
>
> *Contributor:* The annotation is technically an API, right?
>
> *Contributor:* Agreed, the annotation is an API IMO.
>
> *Author:* Yes, but this enhancement is not about adding that annotation, it's only about using it.
- is this Dev Preview, Tech Preview, or GA?

#### Dev Preview -> Tech Preview

#### Tech Preview -> GA

#### Removing a deprecated feature

### Upgrade / Downgrade Strategy

The new controller lives in the machine-api-operator image, so upgrades will be driven by the CVO, which will fetch the right image version as usual. The controller does not offer any API or configuration. It can be replaced with an older or newer version without side effects or risks.
> **Review thread:**
>
> *Contributor:* If health checks are paused on a cluster before the upgrade to this version is started, then when the new controller comes online and sees the upgrade complete, it will erase the pause annotations. I don't think we want that, because it won't leave the cluster in the same state as it was when we started, and the change could cause unwanted degradation in service. See the comment above about the 2-annotation approach.
>
> *Author:* When everything goes as planned, it is very unlikely that anyone has already added the pause annotation to machineHealthChecks, because the machine-healthcheck-controller version which respects the pause annotation (and so the relevant doc update) will only be introduced with the same OCP / machine-api-operator version as this new controller. But I will keep this in mind in case this does not go as planned, thanks.
### Version Skew Strategy

This controller depends on other APIs: the clusterVersion status conditions and the machineHealthCheck pause annotation. In case any of those change their API (new condition name or annotation key), the controller will stop working. This will be caught in CI.
## Implementation History

## Drawbacks

The controller can't stop ongoing external remediation, which is executed by an external remediation controller. For this to happen, we would either need to introduce and implement a pause flag on the external remediation CRD, or the external remediation controller would need to look for the pause annotation on the machineHealthCheck resource.
> **Review thread:**
>
> *Contributor:* What would happen if the remediation controller indicates that the cluster is not upgradeable when it starts trying to fix a host?
>
> *Author:* Sorry, I can't follow. What do you mean by "the remediation controller indicates that the cluster is not upgradeable"?
## Alternatives

- Not having this feature at all: cluster admins could pause machineHealthChecks manually, but that is unneeded work and too easy to forget.
> **Review thread:**
>
> *Contributor:* It could be scripted, right? How widely used is the machine health check controller? Is that an optional component, or is it always present when we have the machine API?
>
> *Contributor:* MHC comes by default in an OCP cluster, but it isn't configured unless a customer creates an MHC object in the cluster, so it's technically off by default.
- Implement this in the machine-healthcheck-controller: this would be an obvious choice at first look. But since the clusterVersion resource is OpenShift specific, it would further diverge the OpenShift Machine API from the upstream Cluster API. The goal is to align them over time.
- Track the upgrade status on single nodes instead of the cluster: on one hand this would allow remediating hosts which are unhealthy for reasons other than the upgrade, but on the other hand it carries the risk that both upgrade and remediation drain nodes and the cluster runs out of capacity. This can result in the upgrade and remediation controllers blocking each other from taking action, leaving the cluster neither upgradeable nor fixable. This might be investigated more deeply in the future, but for now, with respect to the timeline, we want to concentrate on the simple solution of pausing MHCs completely.
- Use node leases: there are efforts to introduce the concept of node leases (https://github.com/kubernetes/enhancements/pull/1411). They can be used for coordinating parties which want to perform destructive actions on nodes, e.g. upgrades or remediation. This would also result in more fine-grained "pausing" of node remediation. With respect to the timeline, we don't want to wait for this effort, but will keep an eye on it.

## Infrastructure Needed [optional]

n/a
> **Review thread:**
>
> *Comment:* Is this a problem we've seen in practice? Do we know the cause?
>
> *Comment:* This came up in a customer case, so we have seen it in the wild. The issue was that the MCO reboot took longer than the configured timeout for the MHC. I don't know the full details of that MHC, but I'm assuming they had a reasonable timeout configured. We should double-check this and add some context here.
>
> *Comment:* Is that just misconfiguration, i.e. I configured an MHC without properly understanding the upper bound of the time required to reboot a host and for the kubelet to start posting Ready? Should our docs around adding MHCs make it clear that you'll probably want to measure the time it takes your hosts to perform an MCO-initiated reboot before setting these values? Do we really need different rules applied during upgrades versus non-upgrade times? If I apply a MachineConfig change that triggers a reboot, that wouldn't be during an upgrade but would trigger the same behavior.
>
> *Comment:* I think I'm kind of leaning in this direction myself. If your hosts typically take 30 minutes to reboot, set the MHC timeout to 45 minutes. IMO, MHC isn't meant for 'emergency' situations where a host has to be replaced ASAP. It's more of a nice-to-have when you have programmable infra, so if a host dies in the middle of the night, it will be replaced and you won't slowly lose machines over time.
>
> *Comment:* Fair point 👍🏼 We could catch this by looking at the machineConfigPool's `MachineConfigPoolUpdating` condition though (suggested at another place already) instead of the cluster version's status, right? That's also a fair point, but I know we have users / customers who want fencing as fast as possible. @beekhof @JoelSpeed WDYT?
>
> *Comment:* This is only true for the compute capacity of the unhealthy node, but not for the workload that was assigned to that node. If you have stateful workloads running on the unhealthy node, you'll have downtime until remediation takes place. 45 minutes of downtime for a user application is a lot, IMO, and we saw this need coming from customers comparing OCP to other products.
>
> *Comment:* If MHC is disabled for one reason or another during upgrades, this situation could persist indefinitely (e.g. you're upgrading host 1, host 2 dies, problem upgrading host 1). Similarly, PDBs can prevent MHC from successfully deleting a failed node indefinitely. Let's try to put some context around the MHC use case / user story. If you weren't running MHC, and you had a random worker go down in the middle of the night, you'd need to wait for X minutes to be paged, for someone to log in, etc., and then decide to remediate the node manually. That's probably going to take 30 minutes or so, and that's only if you decide to page for a single-node failure or have 24/7 ops. MHC is not a cluster-saving or application-saving utility. It's there to slowly replace failed nodes. Pretty much everywhere but bare metal won't encounter this issue anyhow (other platforms' reboots take a couple of minutes), so like I said elsewhere, if we're going to add something like this, it should be opt-in.