Add Reduced Reboots enhancement #643
@@ -0,0 +1,328 @@
---
title: reduced-reboot-upgrades
authors:
  - "@sdodson"
reviewers:
  - "@darkmuggle"
  - "@rphillips"
  - "@derekwaynecarr"
  - "@crawford"
  - "@dcbw"
  - "@miabbott"
  - "@mrunalp"
  - "@zvonkok"
  - "@pweil-"
  - "@wking"
  - "@vrutkovs"
approvers:
  - "@derekwaynecarr"
  - "@crawford"
creation-date: 2020-01-21
last-updated: 2020-01-21
status: provisional
see-also:
  - "https://github.com/openshift/enhancements/pull/585"
  - "/enhancements/eus-mvp.md"
---

# Reduced Reboot Upgrades

## Release Signoff Checklist

- [ ] Enhancement is `implementable`
- [ ] Design details are appropriately documented from clear requirements
- [ ] Test plan is defined
- [ ] Operational readiness criteria is defined
- [ ] Graduation criteria for dev preview, tech preview, GA
- [ ] User-facing documentation is created in [openshift-docs](https://github.com/openshift/openshift-docs/)

## Summary

This enhancement is intended to reduce host reboots when upgrading across two or
more OpenShift minor versions by enabling an N-2 version skew policy between all
host components and cluster-scoped resources.

## Motivation

While OpenShift is designed to minimize workload disruption and the risk associated
with rolling reboots, there remains a class of customers and workloads for which
reboots are a disruptive and time-consuming activity. Additionally, with the
introduction of Extended Update Support (EUS), a new upgrade pattern will emerge
where clusters run 4.6 for a year or more and then rapidly upgrade across multiple
minor versions in a short period of time. These customers wish to complete their
upgrades in a condensed time frame and with as few reboots as possible; they do
not intend to run each minor version for an extended period of time.

### Goals

- Define testing requirements for N-2 host-to-cluster resource version skew
- Define version skew policies for host and cluster-scoped resources
- Reduce reboots in accordance with our new tested policies

### Non-Goals

- Exceeding upstream's documented kubelet version skew policies

## Proposal

### User Stories

#### Node - Improve Upstream Kubelet Version Skew Testing

Kubernetes defines a [version skew policy](https://kubernetes.io/docs/setup/release/version-skew-policy/#kubelet)
which allows a kubelet at version N-2 to remain compatible with a kube-apiserver at
version N. At this point in time OpenShift is not comfortable with the level of
upstream testing at the intersection of OpenShift-specific features. We should work
to define and implement upstream testing changes which give us an appropriate level
of confidence that N-2 version skew issues will be identified in the community
whenever possible.

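For reference, the skew this policy allows can be observed directly on a running
cluster. The following is a minimal sketch, assuming the `oc` CLI and cluster-admin
access, comparing the control-plane version against the kubelet version reported by
each node:

```sh
# Control-plane version as reported by the ClusterVersion object.
oc get clusterversion version -o jsonpath='{.status.desired.version}{"\n"}'

# Kubelet version reported by each node; under the upstream policy these may
# lag the kube-apiserver by up to two minor versions.
oc get nodes -o custom-columns='NAME:.metadata.name,KUBELET:.status.nodeInfo.kubeletVersion'
```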
> **Contributor** (comment on lines +73 to +79): Note that this is not currently
> verified in upstream CI -- I raised this for the SIG Architecture agenda tomorrow
> to ensure that this has test coverage or to clarify the support policy.

#### OTA - Implement Downstream Paused Worker Pool Upgrade Tests

In parallel with efforts to revamp upstream version skew testing we must also
implement downstream version skew testing which includes any additional tests
required for OpenShift-specific implementation details.

We will achieve this by delivering upgrade jobs which pause the worker MachineConfigPool,
then upgrade from 4.x to 4.x+1 to 4.x+2. We will run conformance tests from 4.x
after upgrading to 4.x+1 in order to ensure that we continue to provide a baseline
feature set, then again after upgrading to 4.x+2, and finally, after unpausing the
worker MCP, we will run the 4.x+2 tests. A sketch of this sequence follows below.

> **Member** (review comment): I'm not clear on why we couldn't run the 4.x+1 suite
> on the cluster once the control plane reached 4.x+1. It seems unlikely to me that
> the test suite is tightly tied to kubelet-side features, and that when it is, it
> requires 4.x+1 compute nodes. The distinction isn't critical, because leaving a
> pool paused long enough to port workloads to new APIs is unwise, but running the
> tests that match the current control plane seems convenient if it works.

> **Member (Author)**: Yeah, it's not a straightforward question to answer. We know
> that under our current upgrade process most operators have to tolerate some
> incomplete upgrade state without exploding, because the MCO updates nodes (kubelets)
> toward the very end of the upgrade, but dodging problems there isn't quite the same
> as running a full test suite. At the same time it's not truly a 4.x+1 cluster, so if
> we do choose to run the tests we should not be surprised if some portion which
> leverages new features fails. I have a feeling that where we'll run into problems is
> when we upgrade the control plane to 4.x+2 but we still have kubelets at 4.x. This
> is where we'll likely run into operators which have always assumed a baseline
> feature set of 4.x+1. Perhaps the node team has an opinion here? @rphillips
> @ehashman @harche

Given the complexity of these test jobs we should expect that they may take
longer than the current four-hour limit for test jobs. Rather than compromising
on test completeness we will seek to extend test duration limits or find other
ways to meet these testing demands.

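A rough shape for such a job, sketched in shell. The release pullspec variables and
the choice of conformance suite are illustrative assumptions, not a settled CI
configuration:

```sh
# Pause the worker pool so its nodes keep running the 4.x host components.
oc patch machineconfigpool/worker --type merge --patch '{"spec":{"paused":true}}'

# Hop the control plane forward one minor version at a time, waiting for each
# upgrade to complete, and run a conformance suite after each hop.
oc adm upgrade --to-image "${RELEASE_IMAGE_4_X_PLUS_1}"   # hypothetical release pullspec
openshift-tests run openshift/conformance/parallel

oc adm upgrade --to-image "${RELEASE_IMAGE_4_X_PLUS_2}"   # hypothetical release pullspec
openshift-tests run openshift/conformance/parallel

# Unpause so the worker pool rolls forward to the 4.x+2 MachineConfig, then run
# the 4.x+2 suite against the fully upgraded cluster.
oc patch machineconfigpool/worker --type merge --patch '{"spec":{"paused":false}}'
openshift-tests run openshift/conformance/parallel
```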
#### Teams with Host Components - Allow N-2 Host Component Version Skew

All teams which own components that directly interface with or ship host-based
components will need to ensure that they broaden their compatibility to allow
for N-2 version skew between host and cluster-scoped resources.

This would include, for example, the SDN DaemonSets in 4.10 remaining compatible
with OVS and any other host components in 4.10, 4.9, and 4.8. On a case-by-case
basis teams should decide whether it makes more sense to maintain a broader
compatibility matrix, or to backport N-1 bits and MachineConfig to N-2 and amend
the upgrade graph with these new minimum version requirements.

For instance, if 4.9.12 is the minimum version for 4.9 to 4.10 upgrades, we'd
ensure that the next 4.8.z shipping after 4.9.12 has RHCOS bits and MachineConfig
which offer parity with 4.9.12, so that rebooting into 4.9.12 is not required.
Teams which choose to pursue this option will need to continue to ensure that
4.7 to 4.8.z and 4.8.z-n to 4.8.z upgrades continue to work as well.

Teams which believe this is not achievable, or that the level of effort is
extremely high, should document those findings.

Thus far the RHCOS, Node, MCO, SDN, Containers, and PSAP teams are known to fall
into this group of teams with components coupled to the host.

#### MCO - Widen Node Constraints to Allow for N-2

Building upon the [EUS-to-EUS upgrade MVP work](https://github.com/openshift/enhancements/blob/master/enhancements/update/eus-upgrades-mvp.md#mco---enforce-openshifts-defined-host-component-version-skew-policies)
which allows the MCO to enforce host constraints, we will broaden those constraints
to enable host component version skew.

> **Reviewer**: Where can I see the complete list of constraints? Reading both
> enhancements I see the following constraints: [...] Are other entities/operators
> also allowed to set `Upgradeable=False`?

> **Reviewer**: Another constraint to consider would be [...]

> **Member (Author)**: We're iterating on the design with the MCO team; the original
> proposal to examine node-specific details like kubelet version or kernel version
> may not be accepted, in favor of something more abstract such as requiring that the
> version of the MachineConfig templates currently applied to nodes be greater than
> or equal to some version. At a very high level we're just seeking to ensure that we
> enforce whatever host component version skew policy we come up with by setting
> Upgradeable=False to prevent upgrades that would violate those policies. In 1)
> above, the MCO would only set Upgradeable=False when those constraints would be
> violated by a minor version upgrade; within this context we don't actually care
> whether the pools are paused, just that they're at some minimum version. All
> operators are able to set Upgradeable=False; it's the primary mechanism used by
> operators to assert that the cluster is not in a state which would allow a minor
> version upgrade.

> **Reviewer**: @sdodson That is fine; for such information we have an RFE for the
> machine-os-content to provide it via annotations as part of the extension system.
> Just trying to understand the high-level constraints and the best way to "guard"
> special resources in a cluster from unintended upgrades.

> **Reviewer**: The annotations are needed to support #357
> https://issues.redhat.com/browse/GRPA-2713

> **Reviewer**: @sdodson By "all operators" do you mean operators that are managed via
> the CVO and OLM? I had a question on my enhancement about whether OLM-managed
> operators are allowed to create a ClusterOperator object to "control" upgradability.

> **Member**: I didn't think this was possible before, but it turns out I was wrong:
> the CVO uses an unfiltered list of all ClusterOperator objects, regardless of whether
> they are from CVO manifests or added by third parties, when it calculates whether
> minor bumps are blocked.

> **Reviewer**: @wking Thanks for pointing this out; I've started to read the code as
> well and am doing some tests with custom MCPs and custom ClusterOperator objects.

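For context, a small sketch, assuming `oc` and `jq` are available, of how an admin
(or CI) could list which ClusterOperators are currently reporting Upgradeable=False
and why:

```sh
# List ClusterOperators reporting Upgradeable=False along with their reasons; the
# CVO aggregates these conditions when deciding whether to permit a minor bump.
oc get clusteroperators -o json | jq -r '
  .items[]
  | .metadata.name as $name
  | .status.conditions[]?
  | select(.type == "Upgradeable" and .status == "False")
  | "\($name): \(.reason) \(.message)"'
```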
Admins who choose to could then skip a host reboot by following this pattern
(sketched in the example below):

1. Starting with a 4.8 cluster, pause the worker MachineConfigPool
1. Upgrade to 4.9
1. Upgrade to 4.10
1. Unpause the worker MachineConfigPool

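A minimal CLI sketch of that pattern. The channels and version numbers are
illustrative placeholders, not prescribed values:

```sh
# 1. Pause the worker pool so its nodes are not rebooted during the hops.
oc patch machineconfigpool/worker --type merge --patch '{"spec":{"paused":true}}'

# 2. Upgrade the control plane to 4.9 and wait for the upgrade to complete.
oc adm upgrade channel stable-4.9
oc adm upgrade --to 4.9.12          # illustrative version

# 3. Upgrade the control plane to 4.10 and wait again.
oc adm upgrade channel stable-4.10
oc adm upgrade --to 4.10.3          # illustrative version

# 4. Unpause; the MCO rolls the workers directly to the 4.10 MachineConfig, so
#    each worker reboots once rather than once per minor version.
oc patch machineconfigpool/worker --type merge --patch '{"spec":{"paused":false}}'
```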
> **Reviewer**: What about custom MachineConfigPools?

> **Reviewer**: One example would be, e.g., real-time kernel workers that are in a
> separate MCP with additional MachineConfigs.

> **Reviewer**: All pools should inherit from `worker`.

> **Member (Author)**: Do those inherit from the worker pool? This does make me wonder
> if we need to define our policies on a per-pool basis, especially if we're
> considering that the control-plane pool must be updated. @darkmuggle another point
> to consider here.

> **Reviewer**: Ok, just for clarification: if I pause a custom MCP, this property
> will not be back-propagated to the parent worker MCP. The MCO will roll out the
> upgrade on the worker MCP, and since my custom MCP is paused it will wait until
> availability?

> **Member**: I don't think any pools need to be paused. Folks interested in reducing
> reboots can just pause the pools in which they want to reduce reboots, for the short
> time it takes to make a few consecutive update hops. I don't understand how
> pool-pausing interacts between the [...]

> **Reviewer**: Users could have self-defined constraints that are not yet applied or
> implemented in a piece of software. Either an operator is preventing the upgrade or
> an admin wishes to do some "day 2" customizations before upgrading, knowing that an
> upgrade could break something.

> **Reviewer**: What about workloads that run longer (days, weeks) and cannot be
> checkpointed; wouldn't this also be a reason to pause an MCP? PodDisruptionBudgets
> with minAvailable=100% would prevent draining from finishing, but the node would
> already be cordoned. The workload may need additional pods to be scheduled on the
> node to finish the task.

> **Member**: There is a balance between wasting resources by not scheduling work on
> the node while you wait for the slow pod to wrap up, and getting stuck in an endless
> drain because the new work you let onto the node ends up also being guarded by a PDB
> or other mechanism and taking longer than the initial slow pod. Currently we weight
> in favor of quickest-possible-drain at the expense of underutilizing the available
> capacity. Folks who want to minimize wasted capacity can set [...]
>
> It's up to the actor managing that workload to either say "you know what, we should
> clear this slow work off so that poor node can update" or "I am unwilling to abandon
> this slow work, so I'm going to set sufficient tolerations and PDB guards on these
> new Pods so they can go join the slow worker on the cordoned node and help it push
> through to completion". I don't think that's something generic OpenShift tooling
> that doesn't understand the details of your workload can help you with out of the
> box. In both of these cases, pausing the pool is one way to reduce the likelihood of
> cordons and node reboots. But MachineConfigs are not the only reason your nodes
> might get cordoned or rolled. I think it's better for folks who have really
> expensive, slow work to write their own really smart controllers that can shepherd
> the work through a generic Kube-node environment instead of using the coarse
> pool-pausing knob to try to orchestrate the dance they want to see.

> **Reviewer**: @wking Right, we do not have to lengthen this. Either you want to
> update or you do not want to. If you want to update then make a plan for how to fit
> the upgrade procedure into your daily/weekly business. We can offer the tools and
> mechanics, but "you" should do "your" homework.

Note that this can be decoupled: when we ship 4.9, the initial MCO could assert
constraints which require 4.9 host components before upgrading to 4.10. Then, after
we ship 4.10 and have sufficiently tested a 4.8-to-4.10 host component version skew,
a later 4.9.z MCO could have its constraints broadened. This allows us additional
time to test broader version skews if we so choose.

### Implementation Details/Notes/Constraints [optional]

What are the caveats to the implementation? What are some important details that
didn't come across above? Go into as much detail as necessary here. This might
be a good place to talk about core concepts and how they relate.

### Risks and Mitigations

This imposes significant risk due to a number of factors:

- We're currently not confident in upstream's testing matrix as it intersects with
  our specific feature sets.
- We've never before expected teams to offer broader than N-1 compatibility.
  Teams have always assumed at most N-1, and even then, especially early after a
  GA release, it's not uncommon to find problems in N-1 compatibility.
- While N-1 is tested in the context of upgrades, it has not been tested in
  long-term use.
- If external integrations depend on the host components of all nodes having been
  updated then we'll run into problems. For instance, if there's an upgrade
  scenario where the RHV cloud provider integration needs to be upgraded between
  4.6 and 4.10 in order to ensure compatibility, and the components which interface
  with RHV are components of RHCOS, then we may not upgrade those components at
  the same minor version expected previously.

We may mitigate some of this by further delaying EUS-to-EUS upgrades until after
normal minor version upgrades have been promoted to stable, and by allocating
significantly more time and effort to testing. Ultimately this introduces another
dimension to an already complex testing matrix.

## Design Details

### Open Questions [optional]

> **Reviewer**: To prevent an upgrade of specific nodes (the high-value assets of a
> cluster with special resources and constraints: kernel, OS, ...), one could create a
> custom MCP and set either [...]. An operator that does preflight checks on a newer
> kernel or OS could use the new information that is rolled out to the other nodes to
> check whether the special resource would work after the update.

> **Member (Author)**: I think if an operator has tighter constraints than can be
> expressed by a minimum version of MachineConfig, kubelet version, or kernel, then
> that logic should live outside of the MCO. The MCO cares about this only because it
> can affect the host component versions.

> **Reviewer**: @sdodson Understood, this is related to #357. My question around this
> is how SRO can prevent upgrades and guard the special resources from unwanted
> upgrades. Some kmods are sensitive to any kernel version change and some kmods (kABI
> whitelisted symbols) only care about OS major changes (8.x -> 9.x).

1. Should we make these constraints between specific named versions, i.e. 4.6-4.8 and
   4.8-4.10, or should this be a standard N-2 rule, i.e. 4.6-4.8, 4.7-4.9, 4.8-4.10?

> **Reviewer**: What happens with an MCP which is "never" unpaused? Will the MCO force
> an upgrade if the MCP violates the N-1 or N-2 constraint?

> **Member (Author)**: No, we'd just inhibit upgrades that fall outside of our defined
> policy. That can still be forced around, but any time an admin chooses to force an
> upgrade they become responsible for the disaster they create.

> **Member**: Never unpausing a pool will sooner or later fail the nodes out as the
> kubelets stop getting Kube-API CA rotations, even if the admins don't force an
> update. We should be very clear that pool pauses are acceptable practice for a few
> hours, maybe into days, but that the whole time pools are paused, you are not
> getting CA rotations, bumped pull secrets, RHCOS bugfixes, and all the other good
> stuff that comes with having an unpaused pool. Folks who want to leave a pool paused
> for longer should be considering all of those consequences, and deciding for
> themselves if the reduced disruption is worth the risk.

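As an operational aside (not part of the proposal itself), a small sketch using
standard `oc` commands to check whether a pool is paused and whether it has fallen
behind the latest rendered MachineConfig:

```sh
# The UPDATED/UPDATING/DEGRADED columns show whether the pool's nodes match the
# newest rendered MachineConfig; a long-paused pool will report UPDATED as False.
oc get machineconfigpool worker

# Whether the pool is currently paused.
oc get machineconfigpool worker -o jsonpath='{.spec.paused}{"\n"}'
```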
### Test Plan

This is actually a major focus of the entire effort, so we'll fill this out now
but expect to bring more clarity in the future once we have a better test plan.

- We must have upstream N-2 version skew testing; which test suites should be run
  at completion? e2e?
- We must have downstream N-2 version skew testing which meets or exceeds our
  existing upgrade testing. We need to decide whether this means installing OCP 4.N
  with RHCOS 4.N-2, or installing OCP 4.N-2, pausing the worker MCP, upgrading
  twice, and then testing. The former will be quicker but the latter will be more
  representative of the customer use case.
- We must decide how many platforms must be covered: all of them? Tier 1?

### Graduation Criteria

**Note:** *Section not required until targeted at a release.*

Define graduation milestones.

These may be defined in terms of API maturity, or as something else. Initial proposal
should keep this high-level with a focus on what signals will be looked at to
determine graduation.

Consider the following in developing the graduation criteria for this
enhancement:

- Maturity levels
  - [`alpha`, `beta`, `stable` in upstream Kubernetes][maturity-levels]
  - `Dev Preview`, `Tech Preview`, `GA` in OpenShift
- [Deprecation policy][deprecation-policy]

Clearly define what graduation means by either linking to the [API doc definition](https://kubernetes.io/docs/concepts/overview/kubernetes-api/#api-versioning),
or by redefining what graduation means.

In general, we try to use the same stages (alpha, beta, GA), regardless of how the functionality is accessed.

[maturity-levels]: https://git.k8s.io/community/contributors/devel/sig-architecture/api_changes.md#alpha-beta-and-stable-versions
[deprecation-policy]: https://kubernetes.io/docs/reference/using-api/deprecation-policy/

**Examples**: These are generalized examples to consider, in addition
to the aforementioned [maturity levels][maturity-levels].

#### Dev Preview -> Tech Preview

- Ability to utilize the enhancement end to end
- End user documentation, relative API stability
- Sufficient test coverage
- Gather feedback from users rather than just developers
- Enumerate service level indicators (SLIs), expose SLIs as metrics
- Write symptoms-based alerts for the component(s)

#### Tech Preview -> GA

- More testing (upgrade, downgrade, scale)
- Sufficient time for feedback
- Available by default
- Backhaul SLI telemetry
- Document SLOs for the component
- Conduct load testing

**For non-optional features moving to GA, the graduation criteria must include
end to end tests.**

#### Removing a deprecated feature

- Announce deprecation and support policy of the existing feature
- Deprecate the feature

### Upgrade / Downgrade Strategy

Upgrade expectations:
- **NEW** This is new; the rest is the standard boilerplate, which still applies --
  Admins may pause worker MachineConfigPools at specifically defined product
  versions, then apply multiple minor version upgrades before having to unpause the
  MCP in order to upgrade to the next minor version.
- Each component should remain available for user requests and
  workloads during upgrades. Ensure the components leverage best practices in handling [voluntary disruption](https://kubernetes.io/docs/concepts/workloads/pods/disruptions/). Any exception to this should be
  identified and discussed here.
- Micro version upgrades - users should be able to skip forward versions within a
  minor release stream without being required to pass through intermediate
  versions - i.e. `x.y.N->x.y.N+2` should work without requiring `x.y.N->x.y.N+1`
  as an intermediate step.
- Minor version upgrades - you only need to support `x.N->x.N+1` upgrade
  steps. So, for example, it is acceptable to require a user running 4.3 to
  upgrade to 4.5 with a `4.3->4.4` step followed by a `4.4->4.5` step.
- While an upgrade is in progress, new component versions should
  continue to operate correctly in concert with older component
  versions (aka "version skew"). For example, if a node is down, and
  an operator is rolling out a daemonset, the old and new daemonset
  pods must continue to work correctly even while the cluster remains
  in this partially upgraded state for some time.

Downgrade expectations:
- When downgrading in conjunction with the reboot avoidance described in this
  enhancement, it is assumed that you will roll back at most one minor version: if
  you had upgraded 4.8 to 4.9 to 4.10, then you would only be able to downgrade
  back to 4.9.
- If you had paused MachineConfigPools they should remain paused. If you had
  unpaused MachineConfigPools then those should remain unpaused when rolling back
  so that host-bound components are similarly downgraded.
- If an `N->N+1` upgrade fails mid-way through, or if the `N+1` cluster is
  misbehaving, it should be possible for the user to rollback to `N`. It is
  acceptable to require some documented manual steps in order to fully restore
  the downgraded cluster to its previous state. Examples of acceptable steps
  include:
  - Deleting any CVO-managed resources added by the new version. The
    CVO does not currently delete resources that no longer exist in
    the target version.

### Version Skew Strategy

We will need to extensively test this new host-to-cluster-scoped version skew.
For the time being we will only allow version skew across specific versions,
4.8 to 4.10. This should be enforced via the MCO mechanisms defined in a
[previous enhancement](https://github.com/openshift/enhancements/blob/master/enhancements/update/eus-upgrades-mvp.md#mco---enforce-openshifts-defined-host-component-version-skew-policies).

Components which ship or interface directly with host-bound components must ensure
that they've tested across our defined version skews.

Consider the following in developing a version skew strategy for this
enhancement:
- During an upgrade, we will always have skew among components; how will this impact your work?
- Does this enhancement involve coordinating behavior in the control plane and
  in the kubelet? How does an n-2 kubelet without this feature available behave
  when this feature is used?
- Will any other components on the node change? For example, changes to CSI, CRI
  or CNI may require updating that component before the kubelet.

## Implementation History

Major milestones in the life cycle of a proposal should be tracked in `Implementation
History`.

## Drawbacks

This introduces significant additional compatibility testing dimensions to much
of the product. We should strongly consider whether reducing reboots by 50% is
worth it.

## Alternatives

- We make no changes to our host component version skew policies.
- We find other ways to reduce workload disruption without expanding our compatibility testing matrices.

## Infrastructure Needed [optional]

This effort will expand our CI requirements with additional test jobs which must
run at least once a week, if not daily. Otherwise there are no net-new projects or
repos expected.

> Alex will be driving the MCO side of this discussion.