OTA-1544: update: support accepted risks #1807
Conversation
Force-pushed from 247c6c4 to 5778409, then from 5c2ed41 to 77497e7.
…isk"

Make it clearer how '--accept ConditionalUpdateRisk' maps to a risk like NonZonalAzureMachineSetScaling getting accepted, by turning the previous:

    Reason: accepted NonZonalAzureMachineSetScaling

into:

    Reason: accepted NonZonalAzureMachineSetScaling via ConditionalUpdateRisk

Eventually we'll have an API that allows us to use the conditional-update risk name itself (e.g. '--accept NonZonalAzureMachineSetScaling') [1,2], but this 'via...' context will hopefully help avoid confusion in the meantime.

[1]: openshift/enhancements#1807
[2]: https://issues.redhat.com/browse/OTA-1543
Force-pushed from dd72591 to 191fcab.
> Note that missing `--accept` in the above command means accepting no risks at all, and `cv.spec.desiredUpdate.accept` is going to be set to the empty set. A cluster admin who chooses to do GitOps on the ClusterVersion manifest should not use `oc adm upgrade` to perform a cluster update.

> The cluster-version operator finds that the update to `4.18.16` is not recommended because of the risks `DualStackNeedsController`, `OldBootImagesPodmanMissingAuthFlag`, and `RHELKernelHighLoadIOWait`, and only the first two of them are accepted by the administrator. Thus, the cluster update to `4.18.16` is blocked. After a couple of weeks, `4.18.17` is released, which contains the fixes for `DualStackNeedsController` and `RHELKernelHighLoadIOWait`. The only remaining risk of `4.18.17` is `OldBootImagesPodmanMissingAuthFlag`. When the cluster is updated to `4.18.17`, e.g., by the following command:
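The gating described in this quoted scenario boils down to set arithmetic on risk names. A minimal illustrative sketch (hypothetical data structures, not CVO code; the real evaluation happens inside the cluster-version operator):

```python
# Hypothetical data mirroring the quoted scenario; risk names come from
# the enhancement text, the data layout here is made up for illustration.
accepted = {"DualStackNeedsController", "OldBootImagesPodmanMissingAuthFlag"}

risks = {
    "4.18.16": {"DualStackNeedsController",
                "OldBootImagesPodmanMissingAuthFlag",
                "RHELKernelHighLoadIOWait"},
    "4.18.17": {"OldBootImagesPodmanMissingAuthFlag"},
}

for version in sorted(risks):
    unaccepted = risks[version] - accepted
    if unaccepted:
        print(f"{version}: blocked by {sorted(unaccepted)}")
    else:
        print(f"{version}: all risks accepted, update proceeds")
```

Run against this data, `4.18.16` is blocked by `RHELKernelHighLoadIOWait` while `4.18.17` proceeds, matching the scenario above.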
> Thus, the cluster update to `4.18.16` is blocked.
How would this "blocked" be reported to the user? I am asking because the current check is client-side AFAIK, and is part of the `oc adm upgrade` client code. If there is a risk on the selected version, `oc adm upgrade` will not let you proceed without supplying the `--allow-not-recommended` option, and will not update `.spec.desiredUpdate`. But if you, e.g., GitOps or `oc edit` the ClusterVersion manually, providing `.spec.desiredUpdate`, there is nothing in the CVO that would stop the update (the CVO relies on the client-side `oc adm upgrade` check).
This indicates the check moves to the controller; how would it report the "blocked" state?
And also, would we port the `oc adm upgrade` client-side check to this new mechanism? `--allow-not-recommended` is basically equivalent to "accept all risks on this path". Do I understand correctly that after the change, a bare `oc adm upgrade --to=1.2.3` would no longer be refused through the client-side check; it would edit `.spec.desiredUpdate`, but the list of accepted risks would be empty, so the CVO would not start the update until the risks evaluate away or the admin accepts them post-hoc?
> a bare `oc adm upgrade --to=1.2.3` would no longer be refused through the client-side check

If this went through, and therefore explicitly cleared `accept`, then how does adding the explicit names of the accepted risks become re-usable over the course of several upgrades?
Or does this only get accepted if the `accept` list already contains all of the related risks?
In my mind, we will add the guard to the CVO and do not have to change the existing client-side guard.
Assume a cluster admin wants to do a conditional update to 1.2.3.
Today, the cluster admin will run `oc adm upgrade --to 1.2.3 --allow-not-recommended` (as Petr pointed out, `--allow-not-recommended` is needed because the update to 1.2.3 is conditional).
After the enhancement, the cluster admin will still run the above command to trigger the update. Thus, `.spec.desiredUpdate` is modified accordingly.
Then the CVO sees the new `.spec.desiredUpdate` and checks if it is a conditional update:
- If yes, then check if its risks are included in `spec.desiredUpdate.accept`:
  - If yes, accept and proceed with the update.
  - If no, block the update. We will set `ReleaseAccepted=False`, as Petr realized already here.
- If no, proceed with the update.
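The decision procedure sketched above could be modeled as follows. This is an illustrative sketch under assumed data shapes, not CVO source code; the function and parameter names are made up:

```python
def release_accepted(desired_version, conditional_risks, accepted_risks):
    """Model of the proposed CVO guard (hypothetical helper, not real CVO code).

    conditional_risks maps a version to the set of risk names attached to it;
    a version absent from the map is an unconditional (recommended) update.
    """
    risks = conditional_risks.get(desired_version)
    if risks is None:
        return True, "not a conditional update"          # proceed
    unaccepted = risks - set(accepted_risks)
    if not unaccepted:
        return True, "all risks accepted"                # proceed
    # The CVO would report this blocked state via ReleaseAccepted=False.
    return False, "unaccepted risks: " + ", ".join(sorted(unaccepted))

ok, reason = release_accepted("1.2.3", {"1.2.3": {"RiskA", "RiskB"}}, ["RiskA"])
# ok is False here; reason names RiskB as the unaccepted risk
```

Note that clearing `accept` (the no-`--accept` case) simply makes `accepted_risks` the empty list, so any conditional update stays blocked until its risks evaluate away or are accepted.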
Joel, I hope the above addresses your questions too. Let me know otherwise.
> If this went through, and therefore explicitly cleared `accept`, then how does adding the explicit names of the accepted risks become re-usable over the course of several upgrades?
After re-reading your comment above again today, I think I understand the question better now.
Since `spec.desiredUpdate.accept` is going to be cleared if `--accept` is not used in the command, that means the admin decided NOT to reuse the accepted risks, i.e., the risks are no longer accepted by the admin. Otherwise, `--accept` would have been provided.
Moreover, a more common use case is to reuse the risk evaluation across clusters. Once an admin upgrades a cluster with `oc adm upgrade --to 1.2.3 --accept RiskA,RiskB` and is happy with the result, the same command could be applied to other (similar) clusters.
Hmm, thinking about this, we recently talked about this in the API PR, didn't we? The behaviour you've described isn't aligning with my expectations from that discussion.
In the API, we've said it's up to the end user to clear the list. Do we still expect this behaviour in the CLI?
I'm wondering if it makes sense for the accepted risks to just be persisted, and to have an easier way to iterate over multiple upgrades with the same risk names? Or does that not happen?
If we are all agreed, can we get the EP updated to reflect that? And move the workflow we decided against into the alternatives section and explain why we didn't implement it?
I would like to get you on board before moving Option 1 into the alternatives section.
IIUC, you like Option 1 slightly more than the other.
> And also, would we port the `oc adm upgrade` client-side check to this new mechanism? `--allow-not-recommended` is basically equivalent to "accept all risks on this path". Do I understand correctly that after the change, a bare `oc adm upgrade --to=1.2.3` would no longer be refused through the client-side check; it would edit `.spec.desiredUpdate`, but the list of accepted risks would be empty, so the CVO would not start the update until the risks evaluate away or the admin accepts them post-hoc?
This one is troubling me. Does the CVO have any status that reports the known risks for an upgrade? Is there a way that the CLI could check the accepted risks before returning? I think it would be a real shame to have to resort to asynchronous "yes your upgrade will start now" feedback because of this feature when this is something our end users have been used to for some time
> Does the CVO have any status that reports the known risks for an upgrade?
Yes. Say `4.x.y` contains RiskA, i.e., RiskA is in `cv.status.conditionalUpdates.riskNames` and `cv.status.conditionalUpdateRisks["RiskA"]` shows its condition.
A small paragraph was added in the new patch to state this explicitly.
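As a sketch of how a client could surface those risks, assuming the status shape described in this thread (`riskNames` on each `conditionalUpdates` entry, risk details keyed by name under `conditionalUpdateRisks`; the final field names are being settled in openshift/api#2360, so this layout is an assumption):

```python
# Hypothetical ClusterVersion status fragment; the field names follow the
# description in this thread and may differ from the final API shape.
status = {
    "conditionalUpdates": [
        {"release": {"version": "4.x.y"}, "riskNames": ["RiskA"]},
    ],
    "conditionalUpdateRisks": {
        "RiskA": {"message": "example risk condition"},
    },
}

def risks_for(status, version):
    """Collect the risk details attached to a target version, if any."""
    for update in status["conditionalUpdates"]:
        if update["release"]["version"] == version:
            return {name: status["conditionalUpdateRisks"].get(name, {})
                    for name in update.get("riskNames", [])}
    return {}
```

A CLI wrapper could use something like this to show the known risks for a target version before (or after) the admin edits `spec.desiredUpdate.accept`.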
> Is there a way that the CLI could check the accepted risks before returning?
Currently there isn't a CLI command to show it, but we plan to add one as a follow-up to the current epic.
> move the workflow we decided against into the alternatives section and explain why we didn't implement it?
Done
JoelSpeed
left a comment
Reviewing the comments from @petr-muller, it seems there is a significant UX question about this enhancement. I agree with pretty much every one of the UX issues raised by @petr-muller, and I think we need to have some further discussion here before we commit to a particular API change.
I am still in the process of working on the next patch to address the comments from reviews.
@hongkailiu Do you have time to get this updated so we can get this merged? IIUC the feature is far enough along that we should really be merging this for documentation purposes.
Joel, I am back to work on this for 4.22. I am polishing the pull to address the remaining comments. You will see my new commit soon.
Force-pushed from 42095b2 to 9a6e677.
@hongkailiu: This pull request references OTA-1544, which is a valid Jira issue. Warning: the referenced Jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.21.0" version, but no target version was set.
The new CLI workflow described LGTM. I think I have no further feedback on this one for now. Should we aim to get this merged? @wking are you also happy enough with this doc in its current state?
Yup! There are enough folks who are opinionated about the finer details of these APIs, and in order to avoid slowing the process further, I'm personally very pragmatic about deferring to folks with strong opinions. I'm happy with this doc merging at any stage, if that unblocks openshift/api#2360 to get us a tech-preview feature gate and initial, gated ClusterVersion properties to work with. If that tech-preview work turns up anything we want to adjust in the enhancement doc, that's what tech-preview investigation is designed to do 🤷. Certainly don't wait on me for anything here; I'm happy to reserve my semi-gating opinions for repositories where I'm an approver.
@JoelSpeed and @wking , thanks for the feedback. |
wking
left a comment
/lgtm
PratikMahajan
left a comment
/approve
@hongkailiu: all tests passed!
wking
left a comment
New OpenShift Kubernetes Engine section makes sense to me (we'll work there too). Otherwise unchanged from the previously-LGTMed content.
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: petr-muller, PratikMahajan, wking.
The proposed API extension is in openshift/api#2360