**File:** `enhancements/update/reduced_reboots.md` (328 additions, 0 deletions)

---
title: reduced-reboot-upgrades
authors:
- "@sdodson"
reviewers:
- @darkmuggle

> **Review comment (@darkmuggle, Feb 23, 2021):** Suggested change: remove `- @darkmuggle` from the reviewers list.
>
> **Review comment (Contributor):** Alex will be driving the MCO side of this discussion.

- @rphillips
- @derekwaynecarr
- @crawford
- @dcbw
- @miabbott
- @mrunalp
- @zvonkok
- @pweil-
- @wking
- @vrutkovs
approvers:
- @derekwaynecarr
- @crawford
creation-date: 2020-01-21
last-updated: 2020-01-21
status: provisional
see-also:
  - https://github.com/openshift/enhancements/pull/585
- "/enhancements/eus-mvp.md"

---

# Reduced Reboot Upgrades

## Release Signoff Checklist

- [ ] Enhancement is `implementable`
- [ ] Design details are appropriately documented from clear requirements
- [ ] Test plan is defined
- [ ] Operational readiness criteria is defined
- [ ] Graduation criteria for dev preview, tech preview, GA
- [ ] User-facing documentation is created in [openshift-docs](https://github.com/openshift/openshift-docs/)

## Summary

This enhancement is intended to reduce host reboots when upgrading across two or
more OpenShift minor versions by enabling an N-2 version skew policy between all
host components and cluster scoped resources.

## Motivation

While OpenShift is designed to minimize workload disruption and the risk associated
with rolling reboots, there exists a class of customers and workloads for which
reboots remain a disruptive and time consuming activity. Additionally, with the
introduction of Extended Update Support (EUS) a new upgrade pattern will emerge
where clusters run 4.6 for a year or more and then rapidly upgrade across multiple
minor versions in a short period of time. Those customers wish to complete their
upgrades in a condensed time frame and with as few reboots as possible; they do
not intend to run each minor version for an extended period of time.

### Goals

- Define testing requirements for N-2 host to cluster resource version skew
- Define version skew policies for host and cluster scoped resources
- Reduce reboots in accordance with our new tested policies

### Non-Goals

- Exceeding upstream's documented Kubelet version skew policies

## Proposal

### User Stories

#### Node - Improve Upstream Kubelet Version Skew Testing

Kubernetes defines a [version skew policy](https://kubernetes.io/docs/setup/release/version-skew-policy/#kubelet)
which allows a kubelet at version N-2 to be compatible with a kube-apiserver at
version N. At this point in time OpenShift is not comfortable with the level of
upstream testing as it intersects with OpenShift-specific features. We should work
to define and implement upstream testing changes which give us an appropriate level
of confidence that N-2 version skew issues would be identified in the community
whenever possible.

> **Review comment (Contributor), on lines +73 to +79:** Note that this is not currently verified in upstream CI -- I raised this for the SIG Architecture agenda tomorrow to ensure that this has test coverage or to clarify the support policy.
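
For concreteness, OpenShift 4.10 ships Kubernetes 1.23, so an N-2 kubelet
corresponds to 1.21, the version shipped with OpenShift 4.8. The effective skew on
a running cluster can be checked by comparing the control plane version with each
node's reported kubelet version; a minimal sketch using `oc`:

```sh
# Control plane (cluster) version as reported by the ClusterVersion object.
oc get clusterversion version -o jsonpath='{.status.desired.version}{"\n"}'

# Kubelet version currently running on each node.
oc get nodes -o custom-columns=NAME:.metadata.name,KUBELET:.status.nodeInfo.kubeletVersion
```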


#### OTA - Implement Downstream Paused Worker Pool Upgrade Tests

In parallel with efforts to revamp upstream version skew testing, we must also
implement downstream version skew testing which includes any additional tests
required for OpenShift-specific implementation details.

We will achieve this by delivering upgrade jobs which pause the Worker MachineConfigPool,
then upgrade from 4.x to 4.x+1 to 4.x+2. We will run conformance tests from 4.x
after upgrading to 4.x+1 in order to ensure that we continue to provide a baseline
feature set, then again after upgrading to 4.x+2, and finally, after unpausing the
Worker MCP, we will run the 4.x+2 tests.

> **Review comment (Member):** I'm not clear on why we couldn't run the 4.x+1 suite on the cluster once the control plane reached 4.x+1. It seems unlikely to me that the test suite is tightly tied to kubelet-side features, and that when it is, it requires 4.x+1 compute nodes. The distinction isn't critical, because leaving a pool paused long enough to port workloads to new APIs is unwise, but running the tests that match the current control plane seems convenient if it works.
>
> **Review comment (Member Author, @sdodson):** Yeah, it's not a straightforward question to answer. We know that under our current upgrade process most operators have to tolerate some incomplete upgrade state without exploding, because the MCO updates nodes (kubelets) toward the very end of the upgrade, but dodging problems there isn't quite the same as running a full test suite. At the same time, it's not truly a 4.x+1 cluster, so if we do choose to run the tests we should not be surprised if some portion of them, which leverage new features, fail.
>
> I have a feeling that where we'll run into problems is when we upgrade the control plane to 4.x+2 but still have kubelets at 4.x. This is where we'll likely run into operators which have always assumed a baseline feature set of 4.x+1.
>
> Perhaps the node team has an opinion here? @rphillips @ehashman @harche

Given the complexity of these test jobs, we should expect that they may take
longer than the current four-hour limit for test jobs. Rather than compromising
on test completeness, we will seek to extend test duration limits or find other
ways to meet these testing demands.
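
A rough sketch of the flow such a job would automate (the upgrade targets and
additional flags below are illustrative placeholders, not a committed job
definition):

```sh
# Pause the worker pool so compute nodes keep their 4.x host components.
oc patch machineconfigpool/worker --type merge --patch '{"spec":{"paused":true}}'

# Hop the control plane one minor version at a time; release pullspecs are
# placeholders, and CI jobs may need --force / --allow-explicit-upgrade.
oc adm upgrade --to-image "${RELEASE_4_X_PLUS_1}"     # wait for completion
openshift-tests run openshift/conformance/parallel    # 4.x suite, 4.x+1 control plane

oc adm upgrade --to-image "${RELEASE_4_X_PLUS_2}"     # wait for completion
openshift-tests run openshift/conformance/parallel    # 4.x suite, 4.x+2 control plane

# Unpause, let the MCO roll the workers, then run the 4.x+2 suite.
oc patch machineconfigpool/worker --type merge --patch '{"spec":{"paused":false}}'
openshift-tests run openshift/conformance/parallel
```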

#### Teams with Host Components - Allow N-2 Host Component Version Skew

All teams which own components that directly interface with or ship host-based
components will need to ensure that they broaden their compatibility to
allow for N-2 version skew between host and cluster scoped resources.

This would include, for example, the SDN DaemonSets in 4.10 remaining compatible
with OVS and any other host components in 4.10, 4.9, and 4.8. On a case-by-case
basis teams should decide whether it makes more sense to maintain a broader
compatibility matrix, or to backport N-1 bits and MachineConfig to N-2 and amend
the upgrade graph with these new minimum version requirements.

For instance, if 4.9.12 is the minimum version for 4.9 to 4.10 upgrades, we'd
ensure that the next 4.8.z shipping after 4.9.12 has RHCOS bits and MachineConfig
which offer parity with 4.9.12, so that rebooting into 4.9.12 is not required.
If teams choose to pursue this option they will need to ensure
that 4.7 to 4.8.z and 4.8.z-n to 4.8.z upgrades continue to work as well.
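
One way a team might spot-check that parity is to compare the RHCOS build recorded
in each release's metadata; a sketch, with example release versions:

```sh
# Compare the RHCOS (machine-os) build shipped in two releases; versions are examples.
oc adm release info quay.io/openshift-release-dev/ocp-release:4.9.12-x86_64 | grep machine-os
oc adm release info quay.io/openshift-release-dev/ocp-release:4.8.25-x86_64 | grep machine-os
```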

Teams which believe this is not achievable or the level of effort is extremely
high should document those findings.

Thus far RHCOS, Node, MCO, SDN, Containers, and PSAP teams are known to fall into
this group of teams which have components coupled to the host.

#### MCO - Widen Node Constraints to allow for N-2

Building upon the [EUS-to-EUS upgrade MVP work](https://github.com/openshift/enhancements/blob/master/enhancements/update/eus-upgrades-mvp.md#mco---enforce-openshifts-defined-host-component-version-skew-policies)
to allow the MCO to enforce host constraints, we will broaden those constraints to
enable the intended host component version skew.
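
However the constraints end up being expressed, the enforcement surface is the
standard `Upgradeable` condition that admins can already inspect; for example
(a sketch assuming `jq` is available):

```sh
# List ClusterOperators whose Upgradeable condition is currently False,
# along with the reason and message they report.
oc get clusteroperators -o json \
  | jq -r '.items[]
      | .metadata.name as $name
      | .status.conditions[]?
      | select(.type == "Upgradeable" and .status == "False")
      | "\($name): \(.reason) - \(.message)"'
```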


> **Review comment:** *"The MCO will set Upgradeable=False whenever any MachineConfigPool has one or more nodes present which fall outside of a defined list of constraints."* Where can I see the complete list of constraints? *"The MCO is not responsible for defining these constraints."*
>
> Reading both enhancements I see the following constraints:
>
> 1. MachineConfigPool is paused; MCO sets Upgradeable=False?
> 2. OpenShift defines the kubelet version skew (N-1 or N-2)
> 3. OLM inclusive range
> 4. OLM maxKube or maxOCP -> OLM sets Upgradeable=False
>
> Are other entities/operators also allowed to set Upgradeable=False based on self-defined constraints, such as "my kernel module will not build on kernel version x.y.z that is coming with update 4.y.z"? For some drivers I may know this; for some I cannot know it.


> **Review comment (@zvonkok, Feb 17, 2021):** Another constraint to consider would be `maxUnavailable: 0` in an MCP?


> **Review comment (Member Author, @sdodson):** We're iterating on the design with the MCO team; the original proposal to examine node-specific details like kubelet version or kernel version may not be accepted, in favor of something more abstract like requiring that the version of the MachineConfig templates currently applied to nodes be greater than or equal to some version. At a very high level we're just seeking to ensure that we enforce whatever host component version skew policy we come up with by setting Upgradeable=False to prevent upgrades that would violate those policies.
>
> In 1) above, the MCO would only set Upgradeable=False when those constraints would be violated by a minor version upgrade; within this context we don't actually care whether the pools are paused, just that they're at some minimum version.
>
> All operators are able to set Upgradeable=False; it's the primary mechanism used by operators to assert that the cluster is not in a state which would allow for minor version upgrades.


> **Review comment:** @sdodson That is fine; for such information we have an RFE for the machine-os-content to provide it via annotations as part of the extension system. Just trying to understand the high-level constraints and the best way to "guard" special resources in a cluster from unintended upgrades.


> **Review comment:** The annotations are needed to support #357 https://issues.redhat.com/browse/GRPA-2713


> **Review comment (@zvonkok, Feb 17, 2021):** Upgradeable=False -- where exactly is this set if one of the constraints is not met? In the conditions of the ClusterOperator object for the MCO?


> **Review comment:** @sdodson By "all operators" do you mean operators that are managed via the CVO and OLM? I had a question on my enhancement about whether OLM-managed operators are allowed to create a ClusterOperator object to "control" upgradeability.


> **Review comment (Member):** *"I had a question on my enhancement about whether OLM-managed operators are allowed to create a ClusterOperator object to 'control' upgradeability."*
>
> I didn't think this was possible before, but it turns out I was wrong: the CVO uses an unfiltered list of all ClusterOperator objects, regardless of whether they are from CVO manifests or added by third parties, when it calculates whether minor bumps are blocked.


> **Review comment:** @wking Thanks for pointing this out; I've started to read the code as well and am doing some tests with custom MCPs and custom ClusterOperator objects.

Admins who choose to do so would then be able to skip a host reboot by following
this pattern:

1. Starting with a 4.8 cluster, pause Worker MachineConfigPool
1. Upgrade to 4.9
1. Upgrade to 4.10
1. Unpause Worker MachineConfigPool
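
A minimal sketch of that pattern using `oc` (channel names and upgrade
invocations are illustrative; follow the documented EUS upgrade procedure for the
real steps):

```sh
# 1. Pause the worker pool so the 4.8 host components stay in place.
oc patch machineconfigpool/worker --type merge --patch '{"spec":{"paused":true}}'

# 2. Switch channels and upgrade the control plane to 4.9 (illustrative).
oc patch clusterversion/version --type merge --patch '{"spec":{"channel":"stable-4.9"}}'
oc adm upgrade --to-latest

# 3. Repeat for 4.10 once the 4.9 upgrade completes (illustrative).
oc patch clusterversion/version --type merge --patch '{"spec":{"channel":"stable-4.10"}}'
oc adm upgrade --to-latest

# 4. Unpause the worker pool; the MCO rolls and reboots workers once, directly to 4.10.
oc patch machineconfigpool/worker --type merge --patch '{"spec":{"paused":false}}'
```

In practice each hop must complete, and any required admin acknowledgements must
be provided, before starting the next.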

> **Review comment:** What about custom MachineConfigPools?


> **Review comment:** One example would be real-time kernel workers that are in a separate MCP with additional MachineConfigs.


> **Review comment:** All pools should inherit from worker, so if users have custom pools these would be updated when the MCO generates a new rendered config for worker on every step. Custom pools need to be paused as well.


> **Review comment (Member Author, @sdodson):** Do those inherit from the Worker pool? This does make me wonder if we need to define our policies on a per-pool basis, especially if we're considering that the control plane pool must be updated.
>
> @darkmuggle another point to consider here.


> **Review comment (@zvonkok, Feb 17, 2021):** OK, just for clarification: if I pause a custom MCP this property will not be back-propagated to the parent worker MCP. The MCO will roll out the upgrade on the worker MCP, and since my custom MCP is paused it will wait until availability?


> **Review comment (Member):** *"Custom pools need to be paused as well"*
>
> I don't think any pools need to be paused. Folks interested in reducing reboots can just pause the pools in which they want to reduce reboots, for the short time it takes to make a few consecutive update hops. I don't understand how pool-pausing interacts between the worker pool and its custom descendants; maybe the MCO folks can speak to that.


> **Review comment:** Users could have self-defined constraints that are not yet applied or implemented in a piece of software. Either an operator is preventing the upgrade, or an admin wishes to do some "day 2" customizations before upgrading, knowing that an upgrade could break something.


> **Review comment:** What about workloads that run longer (days, weeks) and cannot be checkpointed -- wouldn't this also be a reason to pause an MCP? A PodDisruptionBudget with minAvailable=100% would prevent draining from finishing, but the Node would already be cordoned. The workload may need additional Pods to be scheduled on the node to finish the task.


> **Review comment (Member):** *"PodDisruptionBudgets with minAvailable=100% would prevent draining to finish but the Node would already be cordoned."*
>
> There is a balance between wasting resources by not scheduling work on the node while you wait for the slow pod to wrap up, and getting stuck in an endless drain because the new work you let onto the node ends up also being guarded by a PDB or other mechanism and taking longer than the initial slow pod. Currently we weight in favor of the quickest possible drain at the expense of underutilizing the available capacity. Folks who want to minimize wasted capacity can set `maxUnavailable: 1` (the default) on their MachineConfigPools, although the net effect is just spreading the waste out over a longer wall-clock duration as the nodes roll one after the other. That spreading may still be useful if the overall pool doesn't have much extra capacity to spare. And if folks feel really strongly, they can arrange to fill each node with workloads that will all wrap up around the same point in time, or twiddle tolerations to allow faster work onto the cordoned node until the slow work gets closer to wrapping up.
>
> *"The workload may need additional Pods to be scheduled on the node to finish the task."*
>
> It's up to the actor managing that workload to either say "you know what, we should clear this slow work off so that poor node can update" or "I am unwilling to abandon this slow work, so I'm going to set sufficient tolerations and PDB guards on these new Pods so they can go join the slow worker on the cordoned node and help it push through to completion". I don't think that's something generic OpenShift tooling that doesn't understand the details of your workload can help you with out of the box. In both of these cases, pausing the pool is one way to reduce the likelihood of cordons and node reboots. But MachineConfigs are not the only reason your nodes might get cordoned or rolled. I think it's better for folks who have really expensive, slow work to write their own really smart controllers which can shepherd the work through a generic Kube-node environment, instead of using the coarse pool-pausing knob to try to orchestrate the dance they want to see.


> **Review comment:** @wking Right, we do not have to lengthen this. Either you want to update or you do not want to. If you want to update, then make a plan for how to fit the upgrade procedure into your daily/weekly business. We can offer the tools and mechanics, but "you" should do "your" homework.


Note that this can be decoupled in such a way that, when we ship 4.9, the initial MCO
could assert constraints which require 4.9 host components before upgrading to
4.10. Then, after we ship 4.10 and have sufficiently tested a 4.8 to 4.10 host
component version skew, a later 4.9.z MCO could have its constraints broadened.
This allows us additional time to test broader version skews if we so choose.


### Implementation Details/Notes/Constraints [optional]

What are the caveats to the implementation? What are some important details that
didn't come across above? Go into as much detail as necessary here. This might
be a good place to talk about core concepts and how they relate.

### Risks and Mitigations

This imposes significant risk due to a number of factors:
- We're currently not confident in upstream's testing matrix as it intersects with
  our specific feature set.
- We've never before expected teams to offer broader than N-1 compatibility.
  Teams have always assumed at most N-1, and even then, especially early after a
  GA release, it's not uncommon to find problems in N-1 compatibility.
- While N-1 is tested in the context of upgrades, it has not been tested in
  long-term use.
- If external integrations depend on the host components of all nodes having been
  updated then we'll run into problems. For instance, if the RHV cloud provider
  integration needs to be upgraded between 4.6 and 4.10 in order to ensure
  compatibility, and the components which interface with RHV are part of RHCOS,
  then those components may not be upgraded at the minor version previously
  expected.

We may mitigate some of this by further delaying EUS-to-EUS upgrades until after
normal minor version upgrades have been promoted to stable, and by allocating
significantly more time and effort to testing. Ultimately this introduces another
dimension to an already complex testing matrix.

## Design Details

### Open Questions [optional]

> **Review comment:** To prevent an upgrade of specific nodes -- the high-value assets of a cluster with special resources and constraints (kernel, OS, ...) -- one could create a custom MCP and set either `paused: true` or `maxUnavailable: 0`; this would prevent the MCO from updating that MCP for the next one or two releases, depending on the constraint.
>
> An operator that does preflight checks on a newer kernel or OS could use the new information that is rolled out to the other nodes to check whether the special resource would work after the update.


> **Review comment (Member Author, @sdodson, Feb 17, 2021):** I think if an operator has tighter constraints than can be expressed by a minimum version of MachineConfig, kubelet version, or kernel, then that logic should live outside of the MCO. The MCO cares about this only because it can affect the host component versions.


> **Review comment:** @sdodson Understood, this is related to #357. My question around this is how SRO can prevent upgrades and guard the special resources from unwanted upgrades. Some kmods are sensitive to any kernel version change, and some kmods (kABI-whitelisted symbols) only care about OS major changes (8.x -> 9.x).


1. Should these skew allowances be made between specific named versions, i.e. 4.6-4.8
   and 4.8-4.10, or should this be a standard N-2 rule, i.e. 4.6-4.8, 4.7-4.9, 4.8-4.10?


> **Review comment:** What happens with an MCP which is "never" unpaused? Will the MCO force an upgrade if the MCP violates the N-1 or N-2 constraint?


> **Review comment (Member Author, @sdodson):** No, we'd just inhibit upgrades that fall outside of our defined policy. That can still be forced around, but any time an admin chooses to force an upgrade they become responsible for the disaster they create.


> **Review comment (Member):** Never unpausing a pool will sooner or later fail the nodes out as the kubelets stop getting Kube-API CA rotations, even if the admins don't force an update. We should be very clear that pool pauses are acceptable practice for a few hours, maybe into days, but that the whole time pools are paused you are not getting CA rotations, bumped pull secrets, RHCOS bugfixes, and all the other good stuff that comes with having an unpaused pool. Folks who want to leave a pool paused for longer should consider all of those consequences and decide for themselves whether the reduced disruption is worth the risk.

### Test Plan

This is actually a major focus of the entire effort, so we'll fill this out
now but expect to bring more clarity in the future once we have a better test
plan.

- We must have upstream N-2 version skew testing. Which test suites should
  be run at completion? e2e?
- We must have downstream N-2 version skew testing which meets or exceeds our
  existing upgrade testing. We need to decide whether this means installing OCP 4.N
  with RHCOS 4.N-2, or installing OCP 4.N-2, pausing the Worker MCP, upgrading twice,
  and then testing. The former will be quicker but the latter is more representative
  of the customer use case.
- We must decide how many platforms must be covered: all of them? Tier 1?

### Graduation Criteria

**Note:** *Section not required until targeted at a release.*

Define graduation milestones.

These may be defined in terms of API maturity, or as something else. Initial proposal
should keep this high-level with a focus on what signals will be looked at to
determine graduation.

Consider the following in developing the graduation criteria for this
enhancement:

- Maturity levels
- [`alpha`, `beta`, `stable` in upstream Kubernetes][maturity-levels]
- `Dev Preview`, `Tech Preview`, `GA` in OpenShift
- [Deprecation policy][deprecation-policy]

Clearly define what graduation means by either linking to the [API doc definition](https://kubernetes.io/docs/concepts/overview/kubernetes-api/#api-versioning),
or by redefining what graduation means.

In general, we try to use the same stages (alpha, beta, GA), regardless how the functionality is accessed.

[maturity-levels]: https://git.k8s.io/community/contributors/devel/sig-architecture/api_changes.md#alpha-beta-and-stable-versions
[deprecation-policy]: https://kubernetes.io/docs/reference/using-api/deprecation-policy/

**Examples**: These are generalized examples to consider, in addition
to the aforementioned [maturity levels][maturity-levels].

#### Dev Preview -> Tech Preview

- Ability to utilize the enhancement end to end
- End user documentation, relative API stability
- Sufficient test coverage
- Gather feedback from users rather than just developers
- Enumerate service level indicators (SLIs), expose SLIs as metrics
- Write symptoms-based alerts for the component(s)

#### Tech Preview -> GA

- More testing (upgrade, downgrade, scale)
- Sufficient time for feedback
- Available by default
- Backhaul SLI telemetry
- Document SLOs for the component
- Conduct load testing

**For non-optional features moving to GA, the graduation criteria must include
end to end tests.**

#### Removing a deprecated feature

- Announce deprecation and support policy of the existing feature
- Deprecate the feature

### Upgrade / Downgrade Strategy

Upgrade expectations:
- **NEW** This is new; the rest is the standard boilerplate which still applies --
  Admins may pause Worker MachineConfig pools at specifically defined product versions,
  then apply multiple minor version upgrades before having to unpause the MCP in
  order to upgrade to the next minor.
- Each component should remain available for user requests and
workloads during upgrades. Ensure the components leverage best practices in handling [voluntary disruption](https://kubernetes.io/docs/concepts/workloads/pods/disruptions/). Any exception to this should be
identified and discussed here.
- Micro version upgrades - users should be able to skip forward versions within a
minor release stream without being required to pass through intermediate
versions - i.e. `x.y.N->x.y.N+2` should work without requiring `x.y.N->x.y.N+1`
as an intermediate step.
- Minor version upgrades - you only need to support `x.N->x.N+1` upgrade
steps. So, for example, it is acceptable to require a user running 4.3 to
upgrade to 4.5 with a `4.3->4.4` step followed by a `4.4->4.5` step.
- While an upgrade is in progress, new component versions should
continue to operate correctly in concert with older component
versions (aka "version skew"). For example, if a node is down, and
an operator is rolling out a daemonset, the old and new daemonset
pods must continue to work correctly even while the cluster remains
in this partially upgraded state for some time.

Downgrade expectations:
- When downgrading in conjunction with the reboot avoidance described in this
  enhancement, it is assumed that you will roll back at most one minor version; if
  you had upgraded 4.8 to 4.9 to 4.10 then you would only be able to downgrade
  back to 4.9.
- If you had paused MachineConfigPools they should remain paused. If you had
  unpaused MachineConfigPools then those should remain unpaused when rolling back,
  so that host-bound components similarly downgrade.
- If an `N->N+1` upgrade fails mid-way through, or if the `N+1` cluster is
misbehaving, it should be possible for the user to rollback to `N`. It is
acceptable to require some documented manual steps in order to fully restore
the downgraded cluster to its previous state. Examples of acceptable steps
include:
- Deleting any CVO-managed resources added by the new version. The
CVO does not currently delete resources that no longer exist in
the target version.

### Version Skew Strategy

We will need to extensively test this new host-to-cluster version skew.
For the time being we will only allow version skew across specific versions,
4.8 to 4.10. This should be enforced via MCO mechanisms defined in a
[previous enhancement](https://github.com/openshift/enhancements/blob/master/enhancements/update/eus-upgrades-mvp.md#mco---enforce-openshifts-defined-host-component-version-skew-policies).

Components which ship or interface directly with host-bound components must ensure
that they've tested across our defined version skews.

Consider the following in developing a version skew strategy for this
enhancement:
- During an upgrade, we will always have skew among components, how will this impact your work?
- Does this enhancement involve coordinating behavior in the control plane and
in the kubelet? How does an n-2 kubelet without this feature available behave
when this feature is used?
- Will any other components on the node change? For example, changes to CSI, CRI
or CNI may require updating that component before the kubelet.

## Implementation History

Major milestones in the life cycle of a proposal should be tracked in `Implementation
History`.

## Drawbacks

This introduces significant additional compatibility testing dimensions to much
of the product. We should strongly consider whether reducing reboots by 50% is
worth it.

## Alternatives

- We make no changes to our host component version skew policies.
- We find other ways to reduce workload disruption without expanding our compatibility testing matrices.

## Infrastructure Needed [optional]

This effort will expand our CI requirements with additional test jobs which must
run at least once a week, if not daily. Otherwise there are no net new projects or
repos expected.