OTA-1544: update: support accepted risks #1807
Conversation
Force-pushed from 247c6c4 to 5778409, then from 5c2ed41 to 77497e7.
…isk"

Make it clearer how '--accept ConditionalUpdateRisk' maps to a risk like NonZonalAzureMachineSetScaling getting accepted, by turning the previous:

    Reason: accepted NonZonalAzureMachineSetScaling

into:

    Reason: accepted NonZonalAzureMachineSetScaling via ConditionalUpdateRisk

Eventually we'll have an API that allows us to use the conditional-update risk name itself (e.g. '--accept NonZonalAzureMachineSetScaling') [1,2], but this 'via...' context will hopefully help avoid confusion in the meantime.

[1]: openshift/enhancements#1807
[2]: https://issues.redhat.com/browse/OTA-1543
Force-pushed from dd72591 to 191fcab.
> Note that missing `--accept` in the above command means accepting no risks at all, and `cv.spec.desiredUpdate.accept` is going to be set to the empty set. A cluster admin who chooses to do GitOps on the ClusterVersion manifest should not use `oc adm upgrade` to perform a cluster update.

> The cluster-version operator finds that the update to `4.18.16` is not recommended because of the risks `DualStackNeedsController`, `OldBootImagesPodmanMissingAuthFlag`, and `RHELKernelHighLoadIOWait`, and only the first two of them are accepted by the administrator. Thus, the cluster update to `4.18.16` is blocked. After a couple of weeks, `4.18.17` is released, which contains the fixes for `DualStackNeedsController` and `RHELKernelHighLoadIOWait`. The only remaining risk of `4.18.17` is `OldBootImagesPodmanMissingAuthFlag`. When the cluster is updated to `4.18.17`, e.g., by the following command:
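The gating described in this quoted scenario boils down to set arithmetic on risk names. A minimal illustrative sketch (hypothetical data structures, not CVO code; the real evaluation happens inside the cluster-version operator):

```python
# Hypothetical data mirroring the quoted scenario; risk names come from
# the enhancement text, the data layout here is made up for illustration.
accepted = {"DualStackNeedsController", "OldBootImagesPodmanMissingAuthFlag"}

risks = {
    "4.18.16": {"DualStackNeedsController",
                "OldBootImagesPodmanMissingAuthFlag",
                "RHELKernelHighLoadIOWait"},
    "4.18.17": {"OldBootImagesPodmanMissingAuthFlag"},
}

for version in sorted(risks):
    unaccepted = risks[version] - accepted
    if unaccepted:
        print(f"{version}: blocked by {sorted(unaccepted)}")
    else:
        print(f"{version}: all risks accepted, update proceeds")
```

Run against this data, `4.18.16` is blocked by `RHELKernelHighLoadIOWait` while `4.18.17` proceeds, matching the scenario above.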
> Thus, the cluster update to `4.18.16` is blocked.
How would this "blocked" be reported to the user? I am asking because the current check is client-side AFAIK, and is part of the `oc adm upgrade` client code. If there is a risk on the selected version, `oc adm upgrade` will not let you proceed without supplying the `--allow-not-recommended` option, and will not update `.spec.desiredUpdate`. But if you, e.g., GitOps or `oc edit` the ClusterVersion manually, providing `.spec.desiredUpdate`, there is nothing in the CVO that would stop the update (the CVO relies on the client-side `oc adm upgrade` check).
This indicates the check moves to the controller; how would it report the "blocked" state?
And also, would we port the `oc adm upgrade` client-side check to this new mechanism? `--allow-not-recommended` is basically equivalent to "accept all risks on this path". Do I understand correctly that after the change, a bare `oc adm upgrade --to=1.2.3` would no longer be refused through the client-side check; it would edit `.spec.desiredUpdate`, but the list of accepted risks would be empty, so the CVO would not start the update until the risks evaluate away or the admin accepts them post-hoc?
> a bare `oc adm upgrade --to=1.2.3` would no longer be refused through the client-side check

If this went through, and therefore explicitly cleared `accept`, then how does adding the explicit names of the accepted risks become re-usable over the course of several upgrades?
Or does this only get accepted if the `accept` list already contains all of the related risks?
In my mind, we will add the guard to the CVO and do not have to change the existing client-side guard.
Assume a cluster admin wants to do a conditional update to 1.2.3.
Today, the cluster admin will run `oc adm upgrade --to 1.2.3 --allow-not-recommended` (as Petr pointed out, `--allow-not-recommended` is needed because the update to 1.2.3 is conditional).
After the enhancement, the cluster admin will still run the above command to trigger the update. Thus, `.spec.desiredUpdate` is modified accordingly.
Then the CVO sees the new `.spec.desiredUpdate` and checks if it is a conditional update:
- If yes, then check if its risks are included in `spec.desiredUpdate.accept`:
  - If yes, accept and proceed with the update.
  - If no, block the update. We will set `ReleaseAccepted=False`, as Petr realized already here.
- If no, proceed with the update.
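The decision procedure sketched above could be modeled as follows. This is an illustrative sketch under assumed data shapes, not CVO source code; the function and parameter names are made up:

```python
def release_accepted(desired_version, conditional_risks, accepted_risks):
    """Model of the proposed CVO guard (hypothetical helper, not real CVO code).

    conditional_risks maps a version to the set of risk names attached to it;
    a version absent from the map is an unconditional (recommended) update.
    """
    risks = conditional_risks.get(desired_version)
    if risks is None:
        return True, "not a conditional update"          # proceed
    unaccepted = risks - set(accepted_risks)
    if not unaccepted:
        return True, "all risks accepted"                # proceed
    # The CVO would report this blocked state via ReleaseAccepted=False.
    return False, "unaccepted risks: " + ", ".join(sorted(unaccepted))

ok, reason = release_accepted("1.2.3", {"1.2.3": {"RiskA", "RiskB"}}, ["RiskA"])
# ok is False here; reason names RiskB as the unaccepted risk
```

Note that clearing `accept` (the no-`--accept` case) simply makes `accepted_risks` the empty list, so any conditional update stays blocked until its risks evaluate away or are accepted.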
Joel, I hope the above addresses your questions too. Let me know otherwise.
> If this went through, and therefore explicitly cleared `accept`, then how does adding the explicit names of the accepted risks become re-usable over the course of several upgrades?
After re-reading your comment above again today, I think I understand the question better now.
Since `spec.desiredUpdate.accept` is going to be cleared if `--accept` is not used in the command, that means the admin decided NOT to reuse the accepted risks, i.e., the risks are no longer accepted by the admin. Otherwise, `--accept` would have been provided.
Moreover, a more common use case is to reuse the risk evaluation across clusters. Once an admin upgrades a cluster with `oc adm upgrade --to 1.2.3 --accept RiskA,RiskB` and is happy with the result, the same command could be applied to other (similar) clusters.
Hmm, thinking about this, we recently talked about this in the API PR, didn't we? The behaviour you've described isn't aligning with my expectations from that discussion.
In the API, we've said it's up to the end user to clear the list. Do we still expect this behaviour in the CLI?
I'm wondering if it makes sense for the accepted risks to just be persisted, and to have an easier way to iterate over multiple upgrades with the same risk names? Or does that not happen?
If we are all agreed, can we get the EP updated to reflect that? And move the workflow we decided against into the alternatives section and explain why we didn't implement it?
I would like to get you on board before moving Option 1 into the alternatives section.
IIUC, you like Option 1 slightly more than the other.
> And also, would we port the `oc adm upgrade` client-side check to this new mechanism? `--allow-not-recommended` is basically equivalent to "accept all risks on this path". Do I understand correctly that after the change, a bare `oc adm upgrade --to=1.2.3` would no longer be refused through the client-side check; it would edit `.spec.desiredUpdate`, but the list of accepted risks would be empty, so the CVO would not start the update until the risks evaluate away or the admin accepts them post-hoc?
This one is troubling me. Does the CVO have any status that reports the known risks for an upgrade? Is there a way that the CLI could check the accepted risks before returning? I think it would be a real shame to have to resort to asynchronous "yes your upgrade will start now" feedback because of this feature when this is something our end users have been used to for some time
> Does the CVO have any status that reports the known risks for an upgrade?
Yes. Say `4.x.y` contains RiskA, i.e., RiskA is in `cv.status.conditionalUpdates.riskNames` and `cv.status.conditionalUpdateRisks["RiskA"]` shows its condition.
A small paragraph was added in the new patch to state this explicitly.
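As a sketch of how a client could surface those risks, assuming the status shape described in this thread (`riskNames` on each `conditionalUpdates` entry, risk details keyed by name under `conditionalUpdateRisks`; the final field names are being settled in openshift/api#2360, so this layout is an assumption):

```python
# Hypothetical ClusterVersion status fragment; the field names follow the
# description in this thread and may differ from the final API shape.
status = {
    "conditionalUpdates": [
        {"release": {"version": "4.x.y"}, "riskNames": ["RiskA"]},
    ],
    "conditionalUpdateRisks": {
        "RiskA": {"message": "example risk condition"},
    },
}

def risks_for(status, version):
    """Collect the risk details attached to a target version, if any."""
    for update in status["conditionalUpdates"]:
        if update["release"]["version"] == version:
            return {name: status["conditionalUpdateRisks"].get(name, {})
                    for name in update.get("riskNames", [])}
    return {}
```

A CLI wrapper could use something like this to show the known risks for a target version before (or after) the admin edits `spec.desiredUpdate.accept`.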
> Is there a way that the CLI could check the accepted risks before returning?
Currently there isn't a CLI command to show it, but we plan to add one as a follow-up to the current epic.
> move the workflow we decided against into the alternatives section and explain why we didn't implement it?
Done
JoelSpeed
left a comment
Reviewing the comments from @petr-muller, it seems there is a significant UX question about this enhancement. I agree with pretty much every one of the UX issues raised by @petr-muller, and I think we need to have some further discussion here before we commit to a particular API change.
I am still in the process of working on the next patch to address the comments from reviews.
@hongkailiu Do you have time to get this updated so we can get this merged? IIUC the feature is far enough along that we should really be merging this for documentation purposes.
Joel, I am back to work on this for 4.22. I am polishing the pull to address the remaining comments. You will see my new commit soon.
Force-pushed from 42095b2 to 9a6e677.
@hongkailiu: This pull request references OTA-1544, which is a valid Jira issue. Warning: the referenced Jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.21.0" version, but no target version was set.
The new CLI workflow described LGTM. I think I have no further feedback on this one for now. Should we aim to get this merged? @wking are you also happy enough with this doc in its current state?
Yup! There are enough folks who are opinionated about the finer details of these APIs, and in order to avoid slowing the process further, I'm personally very pragmatic about deferring to folks with strong opinions. I'm happy with this doc merging at any stage, if that unblocks openshift/api#2360 to get us a tech-preview feature gate and initial, gated ClusterVersion properties to work with. If that tech-preview work turns up anything we want to adjust in the enhancement doc, that's what tech-preview investigation is designed to do 🤷. Certainly don't wait on me for anything here; I'm happy to reserve my semi-gating opinions for repositories where I'm an approver.
@JoelSpeed and @wking , thanks for the feedback. |
wking
left a comment
/lgtm
PratikMahajan
left a comment
/approve
@hongkailiu: all tests passed!
wking
left a comment
New OpenShift Kubernetes Engine section makes sense to me (we'll work there too). Otherwise unchanged from the previously-LGTMed content.
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: petr-muller, PratikMahajan, wking.
The proposed API extension is in openshift/api#2360