Skip to content

Conversation

@wking
Copy link
Member

@wking wking commented Sep 30, 2024

Building on the CI that merged in #1885.

We had been showing all the recommended updates in decreasing order with --include-not-recommended to see all the updates-with-assessed-risks in decreasing order. But sometimes users want to update to the longest-hop, even if there are known risks. Or they want to read about the assessed risks, in case there's something they can do to their cluster to mitigate a currently-assessed risk before kicking off the update. This commit adjusts the output to order roughly by release freshness For example, for a 4.y cluster in a 4.(y+1) channel:

  • 4.(y+1).tip
  • 4.(y+1).(tip-1)
  • 4.y.tip
  • 4.y.(tip-1)
  • ...

Because users are more likely to care about 4.(y+1).tip, even if it has assessed risks, than they are to care about 4.y.reallyOld, even if it doesn't have assessed risks.

I'm showing two releases per z-stream at the moment, with --show-outdated-releases to see all the results ("releases" to leave the door open to having things like a single-arch and a multi-arch target with the same version string). And I'm dropping --include-not-recommended, now that we are only filtering by age.

I'm also dropping "where are we now?" details, like Progressing=False, because those are less relevant for folks who are trying to pick which release they want to target next. I still call out Failing=True and Progressing=True, because oc adm upgrade ... has client-side checks for those. Folks who are concerned with how their cluster is currently doing can run oc adm upgrade ... or the tech-preview oc adm upgrade status ...).

I'm also dropping ReleaseAccepted, which is in a weird space after the admin requests an update but before the cluster-version operator accepts the retarget request. Because oc adm upgrade is the subcommand used to request an update (--to and --to-image), I think that's the appropriate place to surface that condition (and it already does), so we don't need to surface that condition in the recommend subcommand (which is only relevant in helping the admin decide if and how to request an update).

Renaming 4.14.1-both-available-and-conditional to
4.14.1-all-recommended, because it has no Recommended!=True
conditionalUpdates.  And all-recommended seems like a more clear way
to express that, while "both-available-and-conditional" could sound
like "some are conditional and Recommended!=True".

Also bringing in some new examples from a 'launch 4.12.16 azure' Cluster
Bot install, followed by:

  $ oc adm upgrade channel stable-4.13

I collected the examples with:

  $ oc get clusterversion version -o yaml >pkg/cli/admin/upgrade/recommend/examples/4.12.16-...-cv.yaml

one early after the install, when the cluster-version operator was
slowly populating its cache of PromQL evaluations (later fixed in
4.12.43 [1]).  I've named the early-after-install
longest-not-recommended, because it has long hops that are not
recommended because risks have not been evaluated yet.  I've named the
while-after-install longest-recommended, because risks have been
evaluated.

[1]: https://issues.redhat.com/browse/OCPBUGS-22408?focusedId=23491059&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-23491059
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Sep 30, 2024
@openshift-ci-robot
Copy link

openshift-ci-robot commented Sep 30, 2024

@wking: This pull request references OTA-1272 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.18.0" version, but no target version was set.

Details

In response to this:

Building on the CI that merged in #1885.

We had been showing all the recommended updates in decreasing order with --include-not-recommended to see all the updates-with-assessed-risks in decreasing order. But sometimes users want to update to the longest-hop, even if there are known risks. Or they want to read about the assessed risks, in case there's something they can do to their cluster to mitigate a currently-assessed risk before kicking off the update. This commit adjusts the output to order roughly by release freshness For example, for a 4.y cluster in a 4.(y+1) channel:

  • 4.(y+1).tip
  • 4.(y+1).(tip-1)
  • 4.y.tip
  • 4.y.(tip-1)
  • ...

Because users are more likely to care about 4.(y+1).tip, even if it has assessed risks, than they are to care about 4.y.reallyOld, even if it doesn't have assessed risks.

I'm showing two releases per z-stream at the moment, with --show-outdated-releases to see all the results ("releases" to leave the door open to having things like a single-arch and a multi-arch target with the same version string). And I'm dropping --include-not-recommended, now that we are only filtering by age.

I'm also dropping "where are we now?" details, like Progressing=False, because those are less relevant for folks who are trying to pick which release they want to target next. I still call out Failing=True and Progressing=True, because oc adm upgrade ... has client-side checks for those. Folks who are concerned with how their cluster is currently doing can run oc adm upgrade ... or the tech-preview oc adm upgrade status ...).

I'm also dropping ReleaseAccepted, which is in a weird space after the admin requests an update but before the cluster-version operator accepts the retarget request. Because oc adm upgrade is the subcommand used to request an update (--to and --to-image), I think that's the appropriate place to surface that condition (and it already does), so we don't need to surface that condition in the recommend subcommand (which is only relevant in helping the admin decide if and how to request an update).

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 30, 2024
@wking wking force-pushed the oc-update-recommend-by-freshness branch from 3a0601a to 46d8a1c Compare September 30, 2024 17:55
@DavidHurta
Copy link
Contributor

/cc

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Oct 1, 2024

@DavidHurta: GitHub didn't allow me to request PR reviews from the following users: DavidHurta.

Note that only openshift members and repo collaborators can review this PR, and authors cannot review their own PRs.

Details

In response to this:

/cc

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@petr-muller
Copy link
Member

/cc

@openshift-ci openshift-ci bot requested a review from petr-muller October 2, 2024 12:06
@DavidHurta
Copy link
Contributor

/cc

@openshift-ci openshift-ci bot requested a review from DavidHurta October 4, 2024 11:03
Copy link
Member

@petr-muller petr-muller left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Several minor code suggestions inline, most are fine to ignore.

Thinking out loud about the proposed output:

  • Feels like the section that contains available paths should be more prominently separated. Right now there's only the Updates: "header" but it is small and drowns in clutter.
  • I think we should help the users in understanding that the full list now has version "subsections". There's a little hint about that in the "... and X older 4.13 updates" line that suggests that the above lines are 4.13 updates but I don't think it's enough.
  • I think interleaving the non-recommended updates as 4-line - minimum - paragraphs with recommended updates rendered as two-item lines formatted as a table is not a great UX, especially when we start interleaving many of them like in the 4.12.16-longest-recommended.show-outdated-releases-output example. I wonder if we could come with some unified presentation for both types (maybe a table with just the version/recommended status/known risk reason in the short form, and a paragraph-per-item for both types of them in long form?

I don't think these need to block this specific PR because we're building the new command in stealth mode and can improve UX as we go, but I wonder if we have a UX target eventually.

Comment on lines 460 to 485
Version: 4.13.0
Image: quay.io/openshift-release-dev/ocp-release@sha256:74b23ed4bbb593195a721373ed6693687a9b444c97065ce8ac653ba464375711
Reason: MultipleReasons
Message: Exposure to AcceleratedNetworkingRace is unknown due to an evaluation failure: client-side throttling: only 8m30.898043529s has elapsed since the last match call completed for this cluster condition backend; this cached cluster condition request has been queued for later execution
Adding a new worker node may fail for clusters running on Azure with Accelerated Networking. https://issues.redhat.com/browse/OPNET-479

Exposure to MultiNetworkAttachmentsWhereaboutsVersion is unknown due to an evaluation failure: client-side throttling: only 8m30.898047129s has elapsed since the last match call completed for this cluster condition backend; this cached cluster condition request has been queued for later execution
Upgrade can get stuck on clusters that have multiple network attachments. https://access.redhat.com/solutions/7024726

Exposure to PerformanceProfilesCPUQuota is unknown due to an evaluation failure: client-side throttling: only 8m30.898050029s has elapsed since the last match call completed for this cluster condition backend; this cached cluster condition request has been queued for later execution
Pods may not execute properly because of CPU quota issues on clusters using PerformanceProfiles. https://issues.redhat.com/browse/OCPNODE-1705

New PersistentVolumes are created bound to an invalid symlink instead of the correct one and thus fail once updated to 4.13.4 or later as the invalid symlink dispears. https://issues.redhat.com/browse/COS-2349

Pods scheduled to nodes running a RHCOS/kernel version with a known defect may fail to start with a CreateContainerError. https://issues.redhat.com/browse/COS-2437

VERSION IMAGE
4.12.64 quay.io/openshift-release-dev/ocp-release@sha256:669170342e3ae3456b2eb00dd0c45cd817d62af310e0aa2bf214f1da65fae9d0
4.12.63 quay.io/openshift-release-dev/ocp-release@sha256:5db6f8dd1db6b9d07dacfa74574a38a6e518145a3c0ab5d895e9c89e029a39e4
4.12.61 quay.io/openshift-release-dev/ocp-release@sha256:faa201ea7790bb101de7b9b6f01543136e8c589463b8437a10678da0cdd0197a
4.12.60 quay.io/openshift-release-dev/ocp-release@sha256:22174072f8aaef46f80bded25d5a2bdfa6fd8bb01fc5242a07e550f132cafc5a
4.12.59 quay.io/openshift-release-dev/ocp-release@sha256:8908e5a46f106e7520faa5f5ee679033c4c2f0026f1b8a339e6b3ea4becdd969
4.12.58 quay.io/openshift-release-dev/ocp-release@sha256:b622d947ade4a13754e9d9f6803107f1b321eecd925c897f3c12de1aafccbdc5
4.12.57 quay.io/openshift-release-dev/ocp-release@sha256:ebd20cd66e0bbb5bcd7d16559ff4856918b7172e29d6a7bd34d8cc49a556b8f7
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interleaving a multi-paragraph item for a single 4.13.0 release with a table is confusing to me - I know this is a full output but this could happen even in normal output - one item would be heavy like this and then we would have a single-item VERSION/IMAGE table with a version without a risk? I think we should try finding a more consistent presentation.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with your volume concerns (also these). Follow-up work planned in OTA-1273 will add --to and give us the option to avoid rendering messages for some releases. For example, we could go down to one-line-per-target tables, and have a new column for matching risk slugs. But for now, without --to, I think the volume is acceptable while we step towards a more complete solution. Because part of the push here is to put at least the longest-hop risk message in front of the admin, without them having to opt-in to see that text.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and now #1897 is up, and part of what it does is drop the multi-line entries from --show-outdated-releases, which hopefully addresses this concern :).

@wking wking force-pushed the oc-update-recommend-by-freshness branch from 46d8a1c to 07d7b03 Compare October 4, 2024 16:33
We had been showing all the recommended updates in decreasing order
with --include-not-recommended to see all the
updates-with-assessed-risks in decreasing order.  But sometimes users
want to update to the longest-hop, even if there are known risks.  Or
they want to read about the assessed risks, in case there's something
they can do to their cluster to mitigate a currently-assessed risk
before kicking off the update.  This commit adjusts the output to
order roughly by release freshness For example, for a 4.y cluster in a
4.(y+1) channel:

  4.(y+1).tip
  4.(y+1).(tip-1)
  4.y.tip
  4.y.(tip-1)
  ...

Because users are more likely to care about 4.(y+1).tip, even if it
has assessed risks, than they are to care about 4.y.reallyOld, even if
it doesn't have assessed risks.

I'm showing two releases per z-stream at the moment, with
--show-outdated-releases to see all the results ("releases" to leave
the door open to having things like a single-arch and a multi-arch
target with the same version string).  And I'm dropping
--include-not-recommended, now that we are only filtering by age.

I'm also dropping "where are we now?" details, like Progressing=False,
because those are less relevant for folks who are trying to pick which
release they want to target next.  I still call out Failing=True and
Progressing=True, because 'oc adm upgrade ...' has client-side checks
for those ("the cluster is experiencing an error reconciling" and "the
cluster is already upgrading", with --allow-upgrade-with-warnings).
Folks who are concerned with how their cluster is currently doing can
run 'oc adm upgrade ...' or the tech-preview 'oc adm upgrade status ...').

I'm also dropping ReleaseAccepted, which is in a weird space after the
admin requests an update but before the cluster-version operator
accepts the retarget request.  Because 'oc adm upgrade' is the
subcommand used to request an update (--to and --to-image), I think
that's the appropriate place to surface that condition (and it already
does), so we don't need to surface that condition in the 'recommend'
subcommand (which is only relevant in helping the admin decide if and
how to request an update).
@wking wking force-pushed the oc-update-recommend-by-freshness branch from 07d7b03 to d8cac25 Compare October 4, 2024 17:06
Copy link
Member

@petr-muller petr-muller left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/meow

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Oct 4, 2024

@petr-muller: cat image

Details

In response to this:

/meow

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@petr-muller
Copy link
Member

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Oct 4, 2024
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Oct 4, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: petr-muller, wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@wking
Copy link
Member Author

wking commented Oct 4, 2024

e2e-agnostic-ovn-cmd:

: [bz-etcd][invariant] alert/etcdHighCommitDurations should not be at or above info	0s
{  etcdHighCommitDurations was at or above info for at least 12s on platformidentification.JobType{Release:"4.18", FromRelease:"", Platform:"azure", Architecture:"amd64", Network:"ovn", Topology:"ha"} (maxAllowed=0s): pending for 5m56s, firing for 12s:

Oct 04 18:59:12.584 - 12s   W namespace/openshift-etcd node/10.0.0.4:9979 pod/etcd-ci-op-lz2hjldb-67d47-2h66q-master-0 alert/etcdHighCommitDurations alertstate/firing severity/warning ALERTS{alertname="etcdHighCommitDurations", alertstate="firing", endpoint="etcd-metrics", instance="10.0.0.4:9979", job="etcd", namespace="openshift-etcd", pod="etcd-ci-op-lz2hjldb-67d47-2h66q-master-0", prometheus="openshift-monitoring/k8s", service="etcd", severity="warning"}}

But my tech-preview command isn't overheating etcd, that's between etcd and Azure. Try again:

/test e2e-agnostic-ovn-cmd

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Oct 4, 2024

@wking: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@openshift-merge-bot openshift-merge-bot bot merged commit 026e481 into openshift:master Oct 4, 2024
@wking wking deleted the oc-update-recommend-by-freshness branch October 4, 2024 22:14
@openshift-bot
Copy link
Contributor

[ART PR BUILD NOTIFIER]

Distgit: openshift-enterprise-cli
This PR has been included in build openshift-enterprise-cli-container-v4.18.0-202410042310.p0.g026e481.assembly.stream.el9.
All builds following this will include this PR.

@openshift-bot
Copy link
Contributor

[ART PR BUILD NOTIFIER]

Distgit: ose-tools
This PR has been included in build ose-tools-container-v4.18.0-202410042310.p0.g026e481.assembly.stream.el9.
All builds following this will include this PR.

@openshift-bot
Copy link
Contributor

[ART PR BUILD NOTIFIER]

Distgit: openshift-enterprise-deployer
This PR has been included in build openshift-enterprise-deployer-container-v4.18.0-202410042310.p0.g026e481.assembly.stream.el9.
All builds following this will include this PR.

@openshift-bot
Copy link
Contributor

[ART PR BUILD NOTIFIER]

Distgit: ose-cli-artifacts
This PR has been included in build ose-cli-artifacts-container-v4.18.0-202410042310.p0.g026e481.assembly.stream.el9.
All builds following this will include this PR.

wking added a commit to wking/oc that referenced this pull request Oct 14, 2024
This allows us to drop the detailed, multi-line entries from the long
--show-outdated-releases output.  Now that form just gives us an
overview of releases with 'reason' slugs for any updates with assessed
issues.  Cluster administrators who see something in that overview
that they are interested in can make a subsequent call with '--to
VERSION' to get the long-form details on that particular target.

I'm keeping the long-form output for release-exposed risks in the
default rendering though, because part of the motivation behind
d8cac25 (pkg/cli/admin/upgrade/recommend: Show most-recent update
options, 2024-09-26, openshift#1889) is to surface those longest-hop risk
messages for cluster administrators, in case the issue is something
like EvaluationFailed, which says more about the current cluster state
than it does about the health of the update.

Because the '--to 4.12.51' runs fail for some of our ClusterVersion
example fixtures, I've adjusted examples_test.go to catch and compare
output when opts.Run fails.
wking added a commit to wking/oc that referenced this pull request Oct 14, 2024
…leases

This allows us to drop the detailed, multi-line entries from the long
--show-outdated-releases output.  Now that form just gives us an
overview of releases with 'reason' slugs for any updates with assessed
issues.  Cluster administrators who see something in that overview
that they are interested in can make a subsequent call with
'--version VERSION' to get the long-form details on that particular
target.

I'm keeping the long-form output for release-exposed risks in the
default rendering though, because part of the motivation behind
d8cac25 (pkg/cli/admin/upgrade/recommend: Show most-recent update
options, 2024-09-26, openshift#1889) is to surface those longest-hop risk
messages for cluster administrators, in case the issue is something
like EvaluationFailed, which says more about the current cluster state
than it does about the health of the update.

Because the '--version 4.12.51' runs fail for some of our
ClusterVersion example fixtures, I've adjusted examples_test.go to
catch and compare output when opts.Run fails.
@JianLi-RH
Copy link

hi @wking could you let me know how to enable the new option --show-outdated-releases? I tried set OC_ENABLE_CMD_UPGRADE_STATUS=true and set OC_ENABLE_CMD_UPGRADE_RECOMMEND=true, but still get error:

 [jianl@jianl-thinkpadt14gen4 418]$ oc adm upgrade --show-outdated-releases
error: unknown flag: --show-outdated-releases
See 'oc adm upgrade --help' for usage.
[jianl@jianl-thinkpadt14gen4 418]$

And how subcommand recommend works, what's the expected output of oc adm upgrade recommend?

Thanks!

wking added a commit to wking/oc that referenced this pull request Oct 22, 2024
…leases

This allows us to drop the detailed, multi-line entries from the long
--show-outdated-releases output, and the long IMAGE pullspecs from the
tabular output.  Now those forms just give us an overview of releases
with 'reason' slugs for any updates with assessed issues.  Cluster
administrators who see something in that overview that they are
interested in can make a subsequent call with '--version VERSION' to
get the long-form details on that particular target.

I'm keeping the long-form output for release-exposed risks in the
default rendering though, because part of the motivation behind
d8cac25 (pkg/cli/admin/upgrade/recommend: Show most-recent update
options, 2024-09-26, openshift#1889) is to surface those longest-hop risk
messages for cluster administrators, in case the issue is something
like EvaluationFailed, which says more about the current cluster state
than it does about the health of the update.

Because the '--version 4.12.51' runs fail for some of our
ClusterVersion example fixtures, I've adjusted examples_test.go to
catch and compare output when opts.Run fails.
wking added a commit to wking/oc that referenced this pull request Feb 20, 2025
…-outdated-releases

Catching up with:

* 0cd2e2c (pkg/cli/admin/upgrade/recommend/examples: Update
  examples, 2024-09-26, openshift#1889)'s rename from
  4.14.1-both-available-and-conditional to 4.14.1-all-recommended.
* d8cac25 (pkg/cli/admin/upgrade/recommend: Show most-recent update
  options, 2024-09-26, openshift#1889)'s pivot from --include-not-recommended
  to --show-outdated-releases.
* c1ef328 (pkg/cli/admin/upgrade/recommend: Add --version option
  for specific releases, 2024-10-14, openshift#1897) addition of --version.
wking added a commit to wking/oc that referenced this pull request Feb 25, 2025
…-outdated-releases

Catching up with:

* 0cd2e2c (pkg/cli/admin/upgrade/recommend/examples: Update
  examples, 2024-09-26, openshift#1889)'s rename from
  4.14.1-both-available-and-conditional to 4.14.1-all-recommended.
* d8cac25 (pkg/cli/admin/upgrade/recommend: Show most-recent update
  options, 2024-09-26, openshift#1889)'s pivot from --include-not-recommended
  to --show-outdated-releases.
* c1ef328 (pkg/cli/admin/upgrade/recommend: Add --version option
  for specific releases, 2024-10-14, openshift#1897) addition of --version.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants