-
Notifications
You must be signed in to change notification settings - Fork 535
OTA-1544: update: support accepted risks #1807
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
Merged
Changes from all commits
Commits
Show all changes
2 commits
Select commit
Hold shift + click to select a range
File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Some comments aren't visible on the classic Files Changed page.
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,336 @@ | ||
| --- | ||
| title: accepted-risks | ||
| authors: | ||
| - "@hongkailiu" | ||
| reviewers: # Include a comment about what domain expertise a reviewer is expected to bring and what area of the enhancement you expect them to focus on. For example: - "@networkguru, for networking aspects, please look at IP bootstrapping aspect" | ||
| - "@wking" | ||
| - "@JoelSpeed" | ||
| approvers: # A single approver is preferred, the role of the approver is to raise important questions, help ensure the enhancement receives reviews from all applicable areas/SMEs, and determine when consensus is achieved such that the EP can move forward to implementation. Having multiple approvers makes it difficult to determine who is responsible for the actual approval. | ||
| - "@PratikMahajan" | ||
| api-approvers: | ||
| - "@JoelSpeed" | ||
| creation-date: 2025-06-11 | ||
| last-updated: 2025-12-22 | ||
| tracking-link: | ||
| - https://issues.redhat.com/browse/OTA-1544 | ||
| see-also: | ||
| - "/enhancements/update/targeted-update-edge-blocking.md" | ||
| --- | ||
|
|
||
| # accepted-risks | ||
|
|
||
| ## Summary | ||
|
|
||
| A cluster admin can express accepted risks for a cluster so that when [a conditional update](https://docs.redhat.com/en/documentation/openshift_container_platform/4.18/html/updating_clusters/understanding-openshift-updates-1#conditional-updates-overview_understanding-update-channels-releases) of cluster is requested, it can be accepted only if the risks exposed to the conditional update are all accepted. | ||
|
|
||
| ## Motivation | ||
|
|
||
| It is to reduce the time and effort for cluster administrators who manage many clusters, enabling them to pre-approve certain risks that they deem acceptable across some or all of their environments. Although the evaluation of risks cannot be avoided, the evaluation result may be reused for other clusters. | ||
|
|
||
| ### User Stories | ||
|
|
||
| The enhancement is useful, e.g., in the following scenario: | ||
|
|
||
| - A cluster admin gets conditional updates of a cluster and evaluates the risks exposed to the conditional updates. | ||
| - After evaluation, the admin decides that some risks can be accepted on all or some of the managed clusters. | ||
| - The administrator configures relevant clusters to allow for updates, as long as any identified risks have all been accepted. | ||
|
|
||
| ### Goals | ||
|
|
||
| - Reduce the overall effort of managing a great number of clusters and evaluating risks from a conditional update. | ||
|
|
||
| ### Non-Goals | ||
|
|
||
| - It does not change which impact a specific update risk may bring to a cluster and how it is evaluated by a cluster administrator. | ||
|
|
||
| ## Proposal | ||
|
|
||
| This enhancement proposes a way of specifying accepted risks regarding updating an OpenShift cluster. | ||
|
|
||
| ### Workflow Description | ||
|
|
||
| The workflow and its implementation are based on [the API extension](#api-extensions) for the following scenario as the starting point: The cluster administrator decides that those risks `DualStackNeedsController`, `OldBootImagesPodmanMissingAuthFlag`, and `LeakedMachineConfigBlocksMCO` are acceptable and wants to tell the cluster version operator to accept an update if all its associated risks are among those three (otherwise block it). | ||
|
|
||
| The update to `4.18.16` is not recommended because of the risks `DualStackNeedsController` `OldBootImagesPodmanMissingAuthFlag`, and `RHELKernelHighLoadIOWait` while `4.18.17` contains the fixes of `DualStackNeedsController` and `RHELKernelHighLoadIOWait`. The only remained risk of `4.18.17` is `OldBootImagesPodmanMissingAuthFlag`. | ||
|
|
||
|
|
||
| #### Workflow | ||
|
|
||
| Provide a new (sub-)command `accept` of `oc adm upgrade`, e.g., | ||
|
|
||
| > oc adm upgrade accept DualStackNeedsController,LeakedMachineConfigBlocksMCO,OldBootImagesPodmanMissingAuthFlag | ||
|
|
||
| whose only result is to append the provided values to the field `.spec.desiredUpdate.accept` on the `clusterversion/version` manifest: | ||
|
|
||
| ```console | ||
| $ oc clusterversion version -o yaml | ||
| apiVersion: config.openshift.io/v1 | ||
| kind: ClusterVersion | ||
| metadata: | ||
| name: version | ||
| spec: | ||
| channel: candidate-4.18 | ||
| clusterID: 1c182977-5663-428d-92a3-3d2bdf3fffb6 | ||
| desiredUpdate: | ||
| acceptRisks: | ||
| - name: DualStackNeedsController | ||
| - name: LeakedMachineConfigBlocksMCO | ||
| - name: OldBootImagesPodmanMissingAuthFlag | ||
| force: false | ||
| version: 4.18.15 | ||
| ``` | ||
|
|
||
| Currently each item in `cv.spec.desiredUpdate.acceptRisks` has only one field `name` but it may have more such as `reason` in the future if needed. | ||
|
|
||
| Then the upgrade to `4.18.16` by the command `oc adm upgrade --to 4.18.16` will be blocked as only the first two of them are accepted by the administrator and to `4.18.17` by `oc adm upgrade --to 4.18.17` will be accepted. | ||
|
|
||
| The risk name like `OldBootImagesPodmanMissingAuthFlag` is unique cross all releases and thus the growing accepted risks over time should not cause any problems. For any reason, if it is desired to clear them out, we may run the following command: | ||
|
|
||
| > oc adm upgrade accept --clear | ||
|
|
||
| and | ||
|
|
||
| > oc adm upgrade accept -Foo | ||
|
|
||
| can be used to remove `Foo` from the accepted risks which is no-op if `Foo` is currently not in `cv.spec.desiredUpdate.acceptRisks`. | ||
|
|
||
| No arguments (either in form of `Foo` or `-Foo`) are provided to the command lead to an error. | ||
|
|
||
| The `--replace` in the following following command is to replace the current accept risks (instead of appending by default) with `RiskA` and `RiskB`: | ||
|
|
||
| > oc adm upgrade accept --replace=true RiskA,RiskB | ||
|
|
||
| Note that we have to modify the [`patchDesiredUpdate`](https://github.com/openshift/oc/blob/f9d98d644110d3413dc4862002395d0c6dfc1da7/pkg/cli/admin/upgrade/upgrade.go#L682-L696) function so that it does not clobber the existing `acceptedRisks` property. | ||
|
|
||
| #### Workflow: Alternative | ||
|
|
||
| Provide the existing `oc adm upgrade` command with a new and optional `--accept` option to update a cluster, i.e., it take effect only when performing an cluster update. | ||
|
|
||
| > oc adm upgrade --to 4.18.16 --accept DualStackNeedsController,LeakedMachineConfigBlocksMCO,OldBootImagesPodmanMissingAuthFlag | ||
|
|
||
|
|
||
| The result of `--accept` in the above command is to set up values for the field `.spec.desiredUpdate.acceptRisks` on the `clusterversion/version` manifest: | ||
|
|
||
| ```console | ||
| $ oc clusterversion version -o yaml | ||
| apiVersion: config.openshift.io/v1 | ||
| kind: ClusterVersion | ||
| metadata: | ||
| name: version | ||
| spec: | ||
| channel: candidate-4.18 | ||
| clusterID: 1c182977-5663-428d-92a3-3d2bdf3fffb6 | ||
| desiredUpdate: | ||
| acceptRisks: | ||
| - name: DualStackNeedsController | ||
| - name: LeakedMachineConfigBlocksMCO | ||
| - name: OldBootImagesPodmanMissingAuthFlag | ||
| force: false | ||
| version: 4.18.16 | ||
| ``` | ||
|
|
||
| Note that missing `--accept` in the above command means accepting no risks at all and all existing accepted risks specified in `cv.spec.desiredUpdate.acceptRisks` is going to be removed. A cluster admin who chooses to do GitOps on the ClusterVersion manifest should not use `oc adm upgrade` to perform a cluster update. | ||
|
|
||
| The cluster update to `4.18.16` is blocked. When the cluster is updated to `4.18.17`, e.g., by the following command: | ||
|
|
||
| > oc adm upgrade --to 4.18.17 --accept DualStackNeedsController,LeakedMachineConfigBlocksMCO,OldBootImagesPodmanMissingAuthFlag | ||
|
|
||
| the CVO accepts the upgrade to `4.18.17` is accepted. | ||
|
|
||
| #### Comparing Two Workflows: | ||
|
|
||
| Benefits of `oc adm upgrade --to 4.y.z --accept RiskA,RiskB`: | ||
|
|
||
| * No need for a new subcommand, so manipulating `cv.spec.desiredUpdate` stays consolidated. | ||
| * Encourages cluster-admins to build their own systems for managing acceptance information (e.g. GitOps), because while the new `cv.spec.desiredUpdate.acceptRisks` allows admins to store the risks they accept, it doesn't have space (`name` is the only field at the moment) for them to explain why they find those risks acceptable, and that seems like important information that a team of admins sharing risk-acceptance decision-making would want to have available. | ||
|
|
||
| The reasons that we did not go with this workflow: The reviews are in favor of an explicit way of manipulating the accepted risks, without intervening with the command that triggers a cluster upgrade. | ||
|
|
||
| Benefits of `oc adm upgrade accept [--replace] RiskA,RiskB`: | ||
|
|
||
| * Convenient appending, so an admin can roll forward with the things previous admins have already accepted, without having to worry about what those were. | ||
| * Convenient way to share accepted risk names via ClusterVersion, without having to build your own system to share those between multiple cluster-admins. | ||
|
|
||
| ### API Extensions | ||
|
|
||
| This enhancement | ||
| - adds a new field 'clusterversion.spec.desiredUpdate.acceptRisks': It contains | ||
| the names of conditional update risks that are considered acceptable. | ||
| - moves `clusterversion.status.conditionalUpdates.risks` up and rename it as | ||
| `clusterversion.status.conditionalUpdateRisks`. It contains all the risks | ||
| for `clusterversion.status.conditionalUpdates`. | ||
| - adds a new field 'clusterversion.status.conditionalUpdates.riskNames': It | ||
| contains the names of risks for the conditional update. We have the intension | ||
| to deprecate `clusterversion.status.conditionalUpdates.risks` when the proposed | ||
| TechPreview-gated feature is promoted. Even after promotion, we still have to populate | ||
| the deprecated field until OpenShift 5 because it is a public v1 API. | ||
| - adds a new field 'clusterversion.status.conditionalUpdateRisks.conditions': It | ||
| contains the observations of the conditional update risk's current status. | ||
|
|
||
| For example, | ||
|
|
||
| ```console | ||
| $ oc clusterversion version -o yaml | ||
| apiVersion: config.openshift.io/v1 | ||
| kind: ClusterVersion | ||
| metadata: | ||
| name: version | ||
| spec: | ||
| channel: candidate-4.18 | ||
| clusterID: 1c182977-5663-428d-92a3-3d2bdf3fffb6 | ||
| desiredUpdate: | ||
| acceptedRisks: | ||
| - name: DualStackNeedsController | ||
| - name: LeakedMachineConfigBlocksMCO | ||
| - name: OldBootImagesPodmanMissingAuthFlag | ||
| force: false | ||
| version: 4.18.15 | ||
| status: | ||
| conditionalUpdateRisks: # include every risk in the conditional updates (moved up and renamed) | ||
hongkailiu marked this conversation as resolved.
Show resolved
Hide resolved
|
||
| - name: DualStackNeedsController | ||
| message: Upgrade can get stuck on clusters that use multiple networks together with dual stack. | ||
| url: https://issues.redhat.com/browse/SDN-3996 | ||
| matchingRules: | ||
| - type: Always | ||
| conditions: | ||
| - status: True # Apply=True if the risk is applied to the current cluster | ||
| type: Applies | ||
| reason: MatchingRule | ||
| message: The matchingRules[0] matches | ||
| lastTransitionTime: 2021-09-13T17:03:05Z | ||
| - name: LeakedMachineConfigBlocksMCO | ||
| message: MCO update might be blocked if some condition holds | ||
| url: https://issues.redhat.com/browse/SDN-2212 | ||
| matchingRules: | ||
| - type: PromQL | ||
| promql: | ||
| promql: prom_query | ||
| conditions: | ||
| - status: False | ||
| type: Applies | ||
| reason: NoMatchingRule | ||
| message: None of the matching rules matches | ||
| lastTransitionTime: 2021-09-13T17:03:05Z | ||
| - name: DualStackNeedsController | ||
| ... | ||
| - name: RHELKernelHighLoadIOWait | ||
| ... | ||
| conditionalUpdates: | ||
| - release: | ||
| version: 4.18.16 | ||
| image: quay.io/openshift-release-dev/ocp-release@sha256:abc123 | ||
| riskNames: | ||
| - DualStackNeedsController | ||
| - OldBootImagesPodmanMissingAuthFlag | ||
| - RHELKernelHighLoadIOWait | ||
| risks: # deprecated by riskNames | ||
| - release: | ||
| version: 4.18.17 | ||
| riskNames: | ||
| - OldBootImagesPodmanMissingAuthFlag | ||
| risks: # deprecated by riskNames | ||
| - release: | ||
| version: 4.19.1 | ||
| riskNames: | ||
| - LeakedMachineConfigBlocksMCO | ||
| risks: # deprecated by riskNames | ||
| ``` | ||
|
|
||
| For every risk appearing in `conditionalUpdateRisksconditionalUpdates`, there is an element in `conditionalUpdateRisks.conditionalUpdateRisks` with the same name. Its condition tells if the risk applies to the current cluster. | ||
|
|
||
| When a conditional update is accepted, the names of its associated risks are going to be merged into `clusterversion.status.history.acceptedRisks` which is an existing field before this enhancement. For example, CVO's acceptance of `4.18.17` leads to `OldBootImagesPodmanMissingAuthFlag` being a part of value of `clusterversion.status.history.acceptedRisks`: The wording might look like "The target release ... is exposed to the risks [OldBootImagesPodmanMissingAuthFlag] and accepted by CVO because all of them are considered acceptable". | ||
hongkailiu marked this conversation as resolved.
Show resolved
Hide resolved
|
||
|
|
||
| As a future work, we might consider extending `clusterversion.status.history` with a new field `clusterversion.status.history.acceptedByAdmin`, instead of re-using `clusterversion.status.history.acceptedRisks`, e.g., | ||
|
|
||
| ```yaml | ||
| acceptedByAdmin: | ||
| - type: known-issue | ||
| riskName: LeakedMachineConfigBlocksMCO | ||
| - type: warnings | ||
| warnings: | ||
| - warning-one | ||
| - warning-two | ||
| ``` | ||
|
|
||
| where each `acceptedByAdmin.riskName` stands for a risk exposed to the update in the history. | ||
|
|
||
| When a conditional update is blocked, there is a condition in `clusterversion.status.conditions` with `ReleaseAccepted=False`, e.g., | ||
hongkailiu marked this conversation as resolved.
Show resolved
Hide resolved
|
||
|
|
||
| ```yaml | ||
| - lastTransitionTime: 2022-10-11T14:16:13Z | ||
| message: Payload loaded version="4.18.16" ... is blocked because it is exposed to unaccepted risks [RHELKernelHighLoadIOWait] | ||
| reason: UnacceptedRisks | ||
| status: False | ||
| type: ReleaseAccepted | ||
| ``` | ||
|
|
||
| ### Topology Considerations | ||
|
|
||
| #### Hypershift / Hosted Control Planes | ||
|
|
||
| Are there any unique considerations for making this change work with | ||
| Hypershift? | ||
|
|
||
| The current [OCPSTRAT-2118](https://issues.redhat.com/browse/OCPSTRAT-2118) covers only Standalone clusters and HyperShift will be a followup to it. We need to ensure that he work for [OCPSTRAT-2118](https://issues.redhat.com/browse/OCPSTRAT-2118) does not break HyperShift. This can be achieved by disabling the function on a HyperShift cluster, like we do on a TechPreview disabled standalone cluster. | ||
|
|
||
| #### Standalone Clusters | ||
|
|
||
| Is the change relevant for standalone clusters? | ||
hongkailiu marked this conversation as resolved.
Show resolved
Hide resolved
|
||
|
|
||
| Yes, the changes of ClusterVersion and `oc adm upgrade` we're proposing here are directly applicable to standalone clusters without needing further integration work in other components. | ||
|
|
||
| #### Single-node Deployments or MicroShift | ||
|
|
||
| This proposal is applicable to single-node but not to MicroShift which lacks a ClusterVersion and CVO, and manages updates via RPMs. | ||
|
|
||
| #### OpenShift Kubernetes Engine | ||
|
|
||
| The feature from this proposal does not depend on features that are excluded from the OKE product offering. | ||
|
|
||
| ### Implementation Details/Notes/Constraints | ||
|
|
||
| ### Risks and Mitigations | ||
|
|
||
| ### Drawbacks | ||
|
|
||
| ## Alternatives (Not Implemented) | ||
|
|
||
| Currently more developers is in favor of Workflow 2 above. It is more likely we are going to implement it. | ||
|
|
||
| ## Open Questions [optional] | ||
|
|
||
| ## Test Plan | ||
|
|
||
| **Note:** *Section not required until targeted at a release.* | ||
|
|
||
| Consider the following in developing a test plan for this enhancement: | ||
| - Will there be e2e and integration tests, in addition to unit tests? Yes. | ||
| - How will it be tested in isolation vs with other components? The e2e test will stay in CVO repo. | ||
| - What additional testing is necessary to support managed OpenShift service-based offerings? We will need to ensure HyperShift still works with or without the feature enabled. | ||
|
|
||
|
|
||
| ## Graduation Criteria | ||
|
|
||
|
|
||
| ### Dev Preview -> Tech Preview | ||
|
|
||
| N/A. | ||
|
|
||
| ### Tech Preview -> GA | ||
|
|
||
| ### Removing a deprecated feature | ||
|
|
||
| ## Upgrade / Downgrade Strategy | ||
hongkailiu marked this conversation as resolved.
Show resolved
Hide resolved
|
||
|
|
||
| In order to unlock this feature, cluster admins need to set up the environment variable `OC_ENABLE_CMD_UPGRADE_ACCEPT_RISKS=true` when `oc` command is issued and the feature gate `ClusterUpdateAcceptedRisks` has to be enabled on the cluster. Any one of the these two is missing, the feature is disabled. | ||
|
|
||
| As long as the implementation from this enhancement stays as a TechPreview-gated feature, our end users do not see any UX difference. After the feature promotion, we will display a deprecation message to `clusterversion.status.conditionalUpdates.risks` to encourage our users to use 'clusterversion.status.conditionalUpdates.riskNames' instead. Since `clusterversion.status.conditionalUpdates.risks` is a public v1 API, we still have to populate it until the next major version: OpenShift 5. | ||
|
|
||
|
|
||
| ## Version Skew Strategy | ||
|
|
||
| ## Operational Aspects of API Extensions | ||
|
|
||
| ## Support Procedures | ||
|
|
||
| ## Infrastructure Needed [optional] | ||
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Uh oh!
There was an error while loading. Please reload this page.