OSDOCS#2429: documenting how updates work #60091
Conversation
@skopacz1: This pull request references OSDOCS-2429, which is a valid Jira issue.
/cc

@shellyyang1989 PTAL, thank you!

/cc
modules/update-mco-process.adoc
Outdated
= Machine Config Operator node updates

The Machine Config Operator (MCO) applies a new machine configuration to each control plane and compute node. During this process, the MCO performs the following sequential actions in the cluster:

. Cordon and drain all of the nodes
Should we mention that this happens one by one?
Also, a paused machine config pool, for example in an EUS-to-EUS update, is an exception.
Could you clarify the exact order of operations here? Is it "Cordon and drain each node, one at a time, then update the OS of each node one at a time, then reboot one at a time, etc."? Or is a single node cordoned, drained, updated, rebooted, uncordoned, and scheduled before the next node initiates the same process?
It's not one at a time; it's configurable for each pool (.spec.maxUnavailable). One is the default behavior.

Or is a single node cordoned, drained, updated, rebooted, uncordoned, and scheduled before the next node initiates the same process?

^^^ This is the process when maxUnavailable is 1.
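For illustration, a minimal sketch of how an administrator could raise that limit for the worker pool; the pool name `worker` and the value `2` are example choices, not recommendations:

```shell
# Allow up to 2 nodes in the "worker" pool to be cordoned/drained at once.
# The MCO default is maxUnavailable: 1 (one node at a time).
oc patch machineconfigpool/worker --type merge \
  --patch '{"spec":{"maxUnavailable":2}}'
```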
I think there's a card for the MCO team to provide detailed documentation on their part of the upgrade.
It's not one at a time, it's configurable for each pool

You're right! In any case, never all of them at once, AFAIK.

I think there's a card for MCO team

Hmm, if they end up having a detailed explanation somewhere on another page, perhaps we eventually just link to that page from this article?
...one pool of cordoned nodes at a time...
MachineConfigPools have a maxUnavailable knob, defaulting to 1. When the machine-config controller notices that a MachineConfigPool spec calls for an update, and fewer than maxUnavailable Nodes in that pool are unavailable, it will cordon that Node, take some actions which may, but do not necessarily, include draining and/or rebooting, and uncordon the Node. For OCP updates, there are almost always RHCOS updates getting rolled out, and those need the full drain and reboot. But every once in a while we'll ship the same RHCOS and relevant MachineConfigs in two separate 4.y.z, and then you can have an OCP update that requires no drains or reboots. Unclear to me how much of this context is worth including. Personally, I'm fine with anything including:
- Don't even mention the machine-config operator.
- Mention that the machine-config operator exists, that it updates (unpaused) MachineConfigPools, and that that can take some time.
- Explaining that the time MachineConfigPools can take depends (almost always) on the time it takes to drain them, and also on the number of nodes you've ok'ed draining in parallel via maxUnavailable.
- Explaining how to figure out which MachineConfigPool updates will trigger drains, which will trigger reboots, and which will trigger neither.

But it's "up to maxUnavailable nodes from a MachineConfigPool" being cordoned at any given time, not the whole pool.
Ah, reading your comment here, I now realize that your "one pool of cordoned nodes" means "up to maxUnavailable nodes from a MachineConfigPool". I'd rather avoid the word "pool" for that, to keep folks from falling into the same misconception I did and assuming that when you said "pool of ... nodes" you meant the whole MachineConfigPool.
I agree with Trevor; the wording "pool" may mislead in this context. Going back to line 10, "Cordon and drain all of the nodes", we may rather remove the word "all".
If there aren't any particular rewording suggestions, I can use "group" in place of "pool". I can also remove the word "all" to further avoid confusion.
Also, I will save further refinement along the lines of Trevor's first comment for a second iteration of this doc.
modules/update-mco-process.adoc
Outdated
When a node is cordoned, workloads cannot be scheduled to it.
====

The time to complete this process depends on several factors including the node and infrastructure configuration. This process might take 5 or more minutes to complete per node.
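To make the cordon/drain step concrete, here is a hedged sketch of the equivalent manual commands; the MCO's machine-config daemon performs this work itself during an update, and `<node-name>` is a placeholder:

```shell
# Mark the node unschedulable so no new workloads are placed on it
oc adm cordon <node-name>

# Evict existing workloads from the node before the OS update and reboot
oc adm drain <node-name> --ignore-daemonsets --delete-emptydir-data

# Once the node is running the new configuration, allow scheduling again
oc adm uncordon <node-name>
```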
I wonder what's the maximal time acceptable for the task... are we omitting it intentionally?
I think there is deeper context to this from when this content was first made for the "Understanding update duration" page.
@LalatenduMohanty, @petr-muller, or @wking might know, but my memory/understanding is that it's intentionally omitted to communicate that: "This process may take a long time but there are too many variations and factors for us to guarantee a specific timeframe this would be completed by". I could be wrong though
It's very dependent on circumstances - user workloads specifically. Your understanding is correct; it's hard to guarantee. It is also possible to configure the workloads in a way that completely stalls the upgrade, unfortunately.
Hmm, then it makes sense to omit it, since we cannot guarantee it... (unless there's some high-water mark that we can actually consider as: OK, it took way too much time at this point, it's probably stuck and not just taking longer than usual).
The machine-config operator will fire a warning MCDDrainError after an hour. But 🤷, many workloads like the Prow CI jobs we run as CVO presubmits take multiple hours in unevictable, PodDisruptionBudget-protected pods, and that's normal too. I would rather not get opinionated about how long "surprisingly long" is, and leave that up to the cluster administrator and workload administrators to negotiate between themselves.
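For an administrator who wants to investigate a slow drain themselves, a hedged sketch of a couple of spot checks (the namespace, label, and container name below are the usual ones for the machine-config daemon, but treat them as assumptions):

```shell
# List PodDisruptionBudgets cluster-wide; a tight PDB can legitimately block eviction
oc get poddisruptionbudget --all-namespaces

# Tail the machine-config daemon logs, which report drain attempts and errors
oc logs -n openshift-machine-config-operator \
  -l k8s-app=machine-config-daemon -c machine-config-daemon --tail=50
```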
modules/update-mco-process.adoc
Outdated
:_content-type: CONCEPT
[id="mco-update-process_{context}"]
= Machine Config Operator node updates
@rioliu-rh could you please take a look at the MCO update section? Thank you.
s/Machine Config Operator node updates/Understanding how the Machine Config Operator updates nodes
Currently, it sounds like we are updating the Machine Config Operator node.
/label merge-review-needed

/cherrypick enterprise-4.10

/cherrypick enterprise-4.11

/cherrypick enterprise-4.12
@kcarmichael08: #60091 failed to apply on top of branch "enterprise-4.10".

/cherrypick enterprise-4.13

/cherrypick enterprise-4.14

@kcarmichael08: #60091 failed to apply on top of branch "enterprise-4.11".

@kcarmichael08: #60091 failed to apply on top of branch "enterprise-4.12".

@kcarmichael08: new pull request created: #61663

@kcarmichael08: new pull request created: #61664
@skopacz1 Sorry, we had another cherry pick failure for 4.10, 4.11, and 4.12. If you can create those manually, I can merge them like we did yesterday.

@kcarmichael08 No worries, I'll make the manual cherry pick PRs now. Not my luckiest week!
OSDOCS-2429
Version(s):
4.10+
This PR introduces detailed content about how OpenShift updates work in the background. It makes the following changes to the Updating clusters documentation:
Link to docs preview:
QE review: