@skopacz1 skopacz1 commented Aug 30, 2023

OSDOCS-6630

Versions: 4.11+

This PR further refines the documentation for how cluster updates work, since there were a lot of feedback items that were deferred when the original documentation was implemented.

QE review:

  • QE has approved this change.

Preview: How cluster updates work

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Aug 30, 2023

openshift-ci-robot commented Aug 30, 2023

@skopacz1: This pull request references OSDOCS-6630 which is a valid jira issue.

Details

In response to this:

OSDOCS-6630

Versions: 4.10+

This PR further refines the documentation for how cluster updates work, since there were a lot of feedback items that were deferred when the original documentation was implemented.

QE review:

  • QE has approved this change.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci openshift-ci bot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Aug 30, 2023
@skopacz1
Contributor Author

Certain conditions can prevent updates from proceeding.
These conditions are either determined by the CVO itself, or reported by individual cluster Operators that detect some details about the cluster that the Operator considers problematic for the update.

// to do: potentially add an example of a precondition to the bullet above.
Contributor Author

This is in reference to this discussion thread.

Member

I think coming up with a specific example would be helpful, I'll try to provide one

Member

one possible option would be LowDesiredVersion.

Contributor Author

Since you said that LowDesiredVersion only applies to 4.14 and later: unless there's another good example that applies to earlier versions, I think I'll save the inclusion of this specific example for another PR so I can merge this current PR into all live versions of the doc

While the additional update actions take place, these cluster Operators temporarily set their `Progressing` condition to `True`.
====

// to do: potentially reword the note above to clarify that specific resources are being applied at one time, and not necessarily all the resources for that component.
Contributor Author

This is in reference to this discussion thread.

Member

I think the point about making the doc use "Cluster Operator" (an OpenShift component == piece of software with a bunch of manifests) and ClusterOperator (is a cluster resource created with a single manifest) consistently (and tuning the descriptions around) could be a useful improvement.

Message: Nodes with substantial numbers of containers and CPU contention may not reconcile machine configuration https://bugzilla.redhat.com/show_bug.cgi?id=2111817#c22
----

// to do: determine whether the rest of the lines in this module should still be included, since this is pretty in-depth even for this sort of descriptive doc, according to Trevor.
Contributor Author

This is in reference to this discussion thread.

Member

I'm not sure if doc-tooling allows this, but a possible compromise between my desire to focus on the abstract description and @petr-muller's desire to give folks a peek under the hood would be to say that everything in the above oc adm upgrade output is sourced from ClusterVersion, and link folks over to the API docs where they can learn about the structure of status.availableUpdates on their own.

Member

Personally, I'm leaning towards keeping the two examples because I think they convey the idea that the oc adm upgrade output is sourced from ClusterVersion, and I believe a concrete example works slightly better than abstract descriptions. But I could live with what Trevor proposes.

Member

I'm ok with a separate "how does oc adm upgrade work?" with the paired examples that demonstrate that currently it's just a pretty-printer for ClusterVersion status. That is interesting information for folks who think the command is too magical. I just don't think that delving into that implementation belongs in this Evaluation of update availability section.
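
To make the "pretty-printer" point concrete, here is a minimal sketch that turns a ClusterVersion JSON document into a short listing. The function name, the output layout, and the sample document (including the placeholder image reference) are assumptions for illustration only; this is not how oc adm upgrade is implemented and not its real output format.

```python
import json
from typing import List


def summarize_available_updates(cluster_version_json: str) -> List[str]:
    """Format status.availableUpdates as one line per update (hypothetical layout)."""
    cluster_version = json.loads(cluster_version_json)
    # availableUpdates can be absent or null when no updates are recommended.
    updates = cluster_version.get("status", {}).get("availableUpdates") or []
    return [f"{u['version']}\t{u['image']}" for u in updates]


# Hypothetical sample document; the image reference is a placeholder.
sample = json.dumps({
    "status": {
        "availableUpdates": [
            {"version": "4.13.1", "image": "example.invalid/release@sha256:placeholder"},
        ]
    }
})
lines = summarize_available_updates(sample)
```

The sketch only reads the resource, which matches the concern raised in this thread that nobody should get into the habit of modifying ClusterVersion directly.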

Member

We should discourage folks from running commands like oc get clusterversion version -o json | jq '.status.availableUpdates' unless there is no other way to get the information. If the information can be retrieved through an oc command, then we do not need to talk about directly querying clusterversion, because we do not want users modifying clusterversion or any underlying resources. If they get into the habit of doing that, it is risky for multiple reasons. The UX is hard. They need to parse information which might not be useful for them. They might accidentally modify a resource, which might cause issues because QE does not test by directly modifying or querying the resources.

Member

We should discourage folks from running commands like ... unless there is no other way to get the information.

It's hard to make people understand how upgrade works without actually showing them the guts of ClusterVersion. Context is important. These examples are presented in the context of "run command to see internals / how things work", not "run command to get the list of available updates".

I just don't think that delving into that implementation belongs in this Evaluation of update availability section

I agree with this. My synthesis of all these opinions is that we could have a short section on the ClusterVersion object itself: its existence, the fact that you should never modify it directly, that CVO operates over it and that oc adm upgrade pretty prints it. Then we could clean the other sections from mentioning the resource.


ocpdocs-previewbot commented Aug 30, 2023

🤖 Updated build preview is available at:
https://64077--docspreview.netlify.app

Build log: https://circleci.com/gh/ocpdocs-previewbot/openshift-docs/28307


openshift-ci-robot commented Aug 30, 2023

@skopacz1: This pull request references OSDOCS-6630 which is a valid jira issue.

@skopacz1 skopacz1 changed the title from "OSDOCS-6630: second iteration of how updates work doc" to "OSDOCS#6630: second iteration of how updates work doc" Aug 30, 2023
@openshift-ci-robot openshift-ci-robot removed the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Aug 30, 2023
@openshift-ci-robot

@skopacz1: No Jira issue is referenced in the title of this pull request.
To reference a jira issue, add 'XYZ-NNN:' to the title of this pull request and request another refresh with /jira refresh.

@openshift-ci openshift-ci bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Sep 11, 2023
One of the resources that the Cluster Version Operator (CVO) monitors is `ClusterVersion`.

`ClusterVersion` is a custom resource object that contains information relating to the cluster's version, such as the current and desired versions of the cluster.
When the CVO observes that the desired version does not match the current version in the `ClusterVersion` resource, it attempts to initiate an update to reconcile the cluster with this new desired state.
Member

can we de-emphasize updates here with something like:

The CVO continually reconciles the cluster with the target state declared in ClusterVersion spec. When the desired release differs from the current one, that reconciliation updates the cluster.

to make it clear that "updating or not?" is a subset of the reconciliation the CVO is always doing?

Contributor Author

Reading this now (with more coffee in my system): is sentence 1 meant to cover "reconciliation in general" and sentence 2 meant to cover "updates as a subset of that reconciliation"?

When the CVO observes that the desired version does not match the current version in the `ClusterVersion` resource, it attempts to initiate an update to reconcile the cluster with this new desired state.


//to-do: this might be heading overload, consider deleting this heading if the context switch from the previous paragraph to this content is smooth enough to not require one.
Member

one way to structure would be a generic "consume spec, reconcile, report in status" to underline how we match the usual Kubernetes pattern. Another way to structure would be to have a few sections:

  • Reconciling the currently accepted target release, which effectively consumes status.desired and reports via Progressing, Failing, etc.
  • Providing next-hop advice, which consumes spec.upstream and channel and reports in status.availableUpdates and conditionalUpdates and RetrievedUpdates.
    • Also in this line, we're consuming ClusterOperator Upgradeable and producing ClusterVersion Upgradeable.
  • Accepting a proposed next hop, which consumes spec.desiredUpdate, status.availableUpdates, conditionalUpdates, Upgradeable, and release signatures, and reports in status.desired and RetrievePayload.

Each of those touches up against admin activity. During an update, an admin will bump up against all of those controller loops. Outside of updates, admins will mostly care if there are issues reconciling the currently accepted target release, until they start planning and preparing for their next round of updates.

I'm personally agnostic about whether it's easier to explain ClusterVersion as a generic Kube spec/status resource that happens to be about cluster reconciliation and updates, or if it's easier to explain ClusterVersion as interacting with a series of controllers broken down by use-case.

Contributor Author

Although the purpose of the PR is to tie up loose ends, I'm thinking this new loose end is worth some deliberation before it's implemented. I like the structure as it is now, so if you're fine with it as well, I think I'll save this feedback for a v3 of this doc

The CVO continuously evaluates its cluster characteristics against the conditional risk information for each conditional update. If the CVO finds that the cluster matches the criteria, the CVO stores this information in the `conditionalUpdates` field of its `ClusterVersion` resource.
If the CVO finds that the cluster does not match the risks of an update, or that there are no risks associated with the update, it stores the target version in the `availableUpdates` field of its `ClusterVersion` resource.
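
The evaluation described above can be sketched in a few lines. This is a minimal illustration, not the CVO's implementation: the `Risk` class, the `partition_updates` function, the dictionary shapes, and the sample risk names are all hypothetical, and real `conditionalUpdates` entries carry more detail than shown here.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Risk:
    """Hypothetical stand-in for one conditional update risk."""
    name: str
    matches: Callable[[dict], bool]  # returns True when the cluster is exposed


def partition_updates(
    candidates: Dict[str, List[Risk]], cluster: dict
) -> Tuple[List[str], List[dict]]:
    """Split candidate versions into available and conditional updates."""
    available: List[str] = []
    conditional: List[dict] = []
    for version, risks in candidates.items():
        matched = [risk.name for risk in risks if risk.matches(cluster)]
        if matched:
            # The cluster matches at least one risk: record it as conditional.
            conditional.append({"version": version, "risks": matched})
        else:
            # No risks declared, or none match this cluster.
            available.append(version)
    return available, conditional


# Hypothetical cluster characteristics and candidate updates.
cluster = {"platform": "aws"}
candidates = {
    "4.13.1": [],
    "4.13.2": [Risk("ExampleAWSRisk", lambda c: c["platform"] == "aws")],
    "4.13.3": [Risk("ExampleGCPRisk", lambda c: c["platform"] == "gcp")],
}
available, conditional = partition_updates(candidates, cluster)
```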

The user interface, either the web console or the OpenShift CLI (`oc`), presents this information in sectioned headings to the administrator.
Member

Do we want to link out from this concept section to "here's the docs for driving those interfaces to consume this information"? Or do we only do procedure -> context links, and not context -> procedure links?

Contributor Author

It's tricky, do you mean linking to the CLI and web console update procedures, where there's a step or two showing how to view the available updates?

I can't link inline in this module, it would have to be in an additional resources list at the end of the section. And if you mean to link to these pages, the context might be lost by the time they get to the additional resources section and see links to update procedures. Maybe I can preface the links with "To learn more about viewing available updates, see the following....".

@petr-muller
Member

LGTM

One of the resources that the Cluster Version Operator (CVO) monitors is `ClusterVersion`.

The `ClusterVersion` custom resource object is the primary interface for managing the CVO.
Cluster administrators and other controllers can declare their desired state through `ClusterVersion` `spec` and observe how the CVO is delivering those requests in `status`.
Member

What exactly do we want to communicate to users through this paragraph?

I think I understand the intent behind this paragraph, but I do not think we are communicating it in a way which is easy to understand.

Member

My main concern is with the text that says "cluster administrators can declare their desired state through clusterversion". @wking WDYT?

Member

This sentence is just pointing out that ClusterVersion follows Kubernetes' usual spec/status pattern. I think it's worth pointing out that the intended data flows are:

  • Admin desires -> ClusterVersion spec declarations -> CVO attempts to deliver the desired state.
  • CVO has opinions on current state and progress -> ClusterVersion status reporting -> admin clarity on current situation.

but I'm open to rephrasing if we can express those two directions of data flow more clearly.
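
Those two directions of data flow can be sketched as follows. This is a toy model under loud assumptions: `declare_desired_version` and `reconcile_once` are hypothetical helpers, the dictionary is a heavily simplified stand-in for a ClusterVersion object (the `progressing` flag stands in for the real conditions list), and a real update takes many reconciliation passes rather than one.

```python
import copy


def declare_desired_version(cluster_version: dict, version: str) -> dict:
    """Admin side: write intent into spec only (hypothetical helper)."""
    updated = copy.deepcopy(cluster_version)
    updated["spec"]["desiredUpdate"] = {"version": version}
    return updated


def reconcile_once(cluster_version: dict) -> dict:
    """Controller side: read spec, report in status only (toy model)."""
    updated = copy.deepcopy(cluster_version)
    requested = updated["spec"].get("desiredUpdate")
    status = updated["status"]
    if requested and requested["version"] != status["desired"]["version"]:
        # Accept the requested target and report that work is under way.
        status["desired"] = {"version": requested["version"]}
        status["progressing"] = True
    else:
        # Nothing new was requested: report a settled state.
        status["progressing"] = False
    return updated


# Hypothetical starting state.
cv = {"spec": {}, "status": {"desired": {"version": "4.13.0"}, "progressing": False}}
cv = declare_desired_version(cv, "4.13.1")
after_first = reconcile_once(cv)
after_second = reconcile_once(after_first)
```

The admin helper never touches `status` and the controller helper never touches `spec`, which is the separation the two bullets describe.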

Member

Here is my suggested text.

OpenShift components and administrators can communicate and interact with the CVO through the ClusterVersion object. The desired CVO state should be declared through the ClusterVersion object, and the current CVO state can be seen through the status of the ClusterVersion object.

Note: We do not recommend that users directly modify the ClusterVersion object. They should use the standard interfaces, for example the CLI and web console, to declare their desired update.

Contributor Author

I'm happy with that suggested text, I can implement it if there's no opposition to it.

@skopacz1
Contributor Author

@shellyyang1989 could you PTAL when you have a chance? Thanks!

@shellyyang1989

LGTM

@skopacz1
Contributor Author

/label peer-review-needed

@kelbrown20 kelbrown20 added peer-review-done Signifies that the peer review team has reviewed this PR and removed peer-review-in-progress Signifies that the peer review team is reviewing this PR peer-review-needed Signifies that the peer review team needs to review this PR labels Oct 16, 2023
@skopacz1
Contributor Author

/label merge-review-needed

@openshift-ci openshift-ci bot added the merge-review-needed Signifies that the merge review team needs to review this PR label Oct 17, 2023
Contributor

@jldohmann jldohmann left a comment

LGTM

@jldohmann jldohmann merged commit 92e47a3 into openshift:main Oct 17, 2023
@jldohmann
Contributor

/cherrypick enterprise-4.14

@jldohmann
Contributor

/cherrypick enterprise-4.13

@jldohmann
Contributor

/cherrypick enterprise-4.12

@jldohmann
Contributor

/cherrypick enterprise-4.11

@openshift-cherrypick-robot

@jldohmann: new pull request created: #66397

@openshift-cherrypick-robot

@jldohmann: #64077 failed to apply on top of branch "enterprise-4.13":

Applying: OSDOCS-6630: second iteration of how updates work doc
Using index info to reconstruct a base tree...
M	updating/understanding_updates/how-updates-work.adoc
Falling back to patching base and 3-way merge...
Auto-merging updating/understanding_updates/how-updates-work.adoc
CONFLICT (content): Merge conflict in updating/understanding_updates/how-updates-work.adoc
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0001 OSDOCS-6630: second iteration of how updates work doc
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".

@openshift-cherrypick-robot

@jldohmann: #64077 failed to apply on top of branch "enterprise-4.12":

Applying: OSDOCS-6630: second iteration of how updates work doc
Using index info to reconstruct a base tree...
M	modules/update-manifest-application.adoc
M	updating/understanding_updates/how-updates-work.adoc
Falling back to patching base and 3-way merge...
Auto-merging updating/understanding_updates/how-updates-work.adoc
CONFLICT (content): Merge conflict in updating/understanding_updates/how-updates-work.adoc
Auto-merging modules/update-manifest-application.adoc
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0001 OSDOCS-6630: second iteration of how updates work doc
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".

@openshift-cherrypick-robot

@jldohmann: #64077 failed to apply on top of branch "enterprise-4.11":

Applying: OSDOCS-6630: second iteration of how updates work doc
Using index info to reconstruct a base tree...
M	modules/update-manifest-application.adoc
M	updating/understanding_updates/how-updates-work.adoc
Falling back to patching base and 3-way merge...
Auto-merging updating/understanding_updates/how-updates-work.adoc
CONFLICT (content): Merge conflict in updating/understanding_updates/how-updates-work.adoc
Auto-merging modules/update-manifest-application.adoc
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0001 OSDOCS-6630: second iteration of how updates work doc
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".

@jldohmann
Contributor

/cherrypick enterprise-4.13

@openshift-cherrypick-robot

@jldohmann: #64077 failed to apply on top of branch "enterprise-4.13":

Applying: OSDOCS-6630: second iteration of how updates work doc
Using index info to reconstruct a base tree...
M	updating/understanding_updates/how-updates-work.adoc
Falling back to patching base and 3-way merge...
Auto-merging updating/understanding_updates/how-updates-work.adoc
CONFLICT (content): Merge conflict in updating/understanding_updates/how-updates-work.adoc
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0001 OSDOCS-6630: second iteration of how updates work doc
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".

@jldohmann jldohmann removed the merge-review-needed Signifies that the merge review team needs to review this PR label Oct 17, 2023
@jldohmann
Contributor

@skopacz1 it looks like all the auto CPs to every branch but 4.14 failed, so you'll need to manually CP. Please lmk if you have any questions and feel free to ping me once those CPs are up and i'll merge them 🙂 thanks!

jldohmann added a commit that referenced this pull request Oct 17, 2023
jldohmann added a commit that referenced this pull request Oct 17, 2023
jldohmann added a commit that referenced this pull request Oct 17, 2023
@skopacz1 skopacz1 deleted the OSDOCS-6630 branch October 31, 2023 13:46
Labels

branch/enterprise-4.11 branch/enterprise-4.12 branch/enterprise-4.13 branch/enterprise-4.14 peer-review-done Signifies that the peer review team has reviewed this PR size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

10 participants