
Conversation

@skopacz1
Contributor

@skopacz1 skopacz1 commented May 16, 2023

OSDOCS-2429

Version(s):
4.10+

This PR introduces detailed content about how OpenShift updates work in the background. This PR makes the following changes to the Updating clusters documentation:

  • The previous Understanding OpenShift updates assembly has been renamed to "Introduction to OpenShift updates" and has been nested within a new subsection titled "Understanding OpenShift Updates".
  • A summarized description of how updates work has been added to the top of the "intro to updates" assembly
  • A new assembly titled "How cluster updates work" has been created, containing more detailed information about each major aspect of the OCP update process.

Link to docs preview:

QE review:

  • QE has approved this change.

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label May 16, 2023
@openshift-ci-robot

openshift-ci-robot commented May 16, 2023

@skopacz1: This pull request references OSDOCS-2429 which is a valid jira issue.

Details

In response to this:

OSDOCS-2429

Version(s):
4.10+

This PR introduces detailed content about how OpenShift Updates work in the background. This PR has made the following changes to the Updating clusters documentation:

  • The previous Understanding OpenShift updates assembly has been renamed to "Introduction to OpenShift updates" and has been nested within a new subsection titled "Understanding OpenShift Updates".
  • A summarized description of how updates work has been added to the top of the "intro to updates" assembly
  • A new assembly titled "How cluster updates work" has been created, containing more detailed information about each major aspect of the OCP update process.

Link to docs preview:

QE review:

  • QE has approved this change.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci openshift-ci bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label May 16, 2023

@petr-muller
Member

/cc

@openshift-ci openshift-ci bot requested a review from petr-muller May 18, 2023 11:29
@skopacz1
Contributor Author

@shellyyang1989 PTAL, thank you!

@evakhoni

/cc

@openshift-ci openshift-ci bot requested a review from evakhoni May 22, 2023 14:51
= Machine Config Operator node updates
The Machine Config Operator (MCO) applies a new machine configuration to each control plane and compute node. During this process, the MCO performs the following sequential actions in the cluster:

. Cordon and drain all of the nodes


should we mention one by one?
also, a paused machine pool in EUS-to-EUS for example, is an exception.

Contributor Author

Could you clarify the exact order of operations here? Is it "Cordon and drain each node, one at a time, and then update the OS of each node one at a time, and then reboot one at a time, etc."? Or is a single node cordoned, drained, updated, rebooted, uncordoned, and scheduled before the next node initiates the same process?

Member

@petr-muller petr-muller May 23, 2023

It's not one at a time, it's configurable for each pool (.spec.maxUnavailable). One is the default behavior.

Or is a single node cordoned, drained, updated, rebooted, uncordoned, and scheduled before the next node initiates the same process?

^^^ this is the process when maxUnavailable is one
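For reference, the knob being discussed lives on the MachineConfigPool spec. A sketch of what it looks like on the stock worker pool (the value 2 here is just an illustrative override of the default):

```yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  name: worker
spec:
  # Number of nodes from this pool that the MCO may take unavailable
  # (cordon/drain/reboot) at the same time. Defaults to 1.
  maxUnavailable: 2
```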

Member

I think there's a card for MCO team to provide detailed documentation on their part of the upgrade


It's not one at a time, it's configurable for each pool

You're right! In any case, never all of them at once, as far as I know.

I think there's a card for MCO team

Hmm, if they end up having a detailed explanation somewhere on another page, perhaps we eventually just link to that page from this article?

Member

...one pool of cordoned nodes at a time...

MachineConfigPools have a maxUnavailable knob, defaulting to 1. When the machine-config controller notices that a MachineConfigPool spec calls for an update, and fewer than maxUnavailable Nodes in that pool are unavailable, it will cordon that Node, take some actions which may, but do not necessarily, include draining and/or rebooting, and uncordon the Node. For OCP updates, there are almost always RHCOS updates getting rolled out, and those need the full drain and reboot. But every once in a while we'll ship the same RHCOS and relevant MachineConfigs in two separate 4.y.z, and then you can have an OCP update that requires no drains or reboots. Unclear to me how much of this context is worth including. Personally, I'm fine with anything including:

But it's "up to maxUnavailable nodes from a MachineConfigPool" being cordoned at any given time, not the whole pool.
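A toy sketch (not the real machine-config controller code) of the pacing rule described above: another node is picked for update only while fewer than maxUnavailable nodes in the pool are currently unavailable.

```python
def pick_nodes_to_cordon(nodes_needing_update, unavailable_count, max_unavailable=1):
    """Toy model of the machine-config controller's pacing rule: it may take
    nodes unavailable (cordon/drain/reboot) only while the pool's unavailable
    count stays below maxUnavailable. Names and structure are illustrative."""
    budget = max(0, max_unavailable - unavailable_count)
    return nodes_needing_update[:budget]

# Default pool (maxUnavailable=1): one node at a time.
print(pick_nodes_to_cordon(["node-a", "node-b"], unavailable_count=0))  # ['node-a']
# While that node is down, no further nodes are started.
print(pick_nodes_to_cordon(["node-b"], unavailable_count=1))  # []
```

This is why "up to maxUnavailable nodes from a MachineConfigPool" is the accurate phrasing: with the default of 1 the rollout is strictly one node at a time, but raising the knob lets batches proceed in parallel.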

Member

Ah, reading your comment here I now realize that your "one pool of cordoned nodes" means "up to maxUnavailable nodes from a MachineConfigPool". I'd rather avoid the word "pool" for that, to keep folks from falling into the same misconception I did and assume that when you said "pool of ... nodes" you mean the whole MachineConfigPool.


Agree with Trevor, the wording "pool" may mislead in this context. Back to line 10 ("Cordon and drain all of the nodes"): we may rather remove the word "all".

Contributor Author

If there aren't any particular rewording suggestions, I can use "group" in place of "pool". I can also remove the word "all" to further avoid confusion.

Contributor Author

Also, I will save further refinement along the lines of Trevor's first comment for a second iteration of this doc.

When a node is cordoned, workloads cannot be scheduled to it.
====

The time to complete this process depends on several factors, including the node and infrastructure configuration. This process might take 5 or more minutes to complete per node.
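As a rough illustration (not from the docs, and real durations vary widely with workloads), a lower-bound estimate for a rollout follows from batching the nodes by maxUnavailable:

```python
import math

def estimated_rollout_minutes(node_count, max_unavailable=1, minutes_per_node=5):
    """Rough lower-bound estimate for an MCO rollout: nodes are updated in
    batches of up to max_unavailable, each batch taking at least
    minutes_per_node (drain + OS update + reboot). Illustrative only;
    workloads and PodDisruptionBudgets can stretch this arbitrarily."""
    batches = math.ceil(node_count / max_unavailable)
    return batches * minutes_per_node

# e.g. 9 workers, default maxUnavailable of 1, ~5 minutes each
print(estimated_rollout_minutes(9))  # 45
```

The thread below discusses why the docs deliberately give no upper bound: the per-node time is only a floor.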


Wonder what's the maximum time acceptable for the task... are we omitting it intentionally?

Contributor Author

I think there is deeper context to this from when this content was first made for the "Understanding update duration" page.

@LalatenduMohanty, @petr-muller, or @wking might know, but my memory/understanding is that it's intentionally omitted to communicate that: "This process may take a long time but there are too many variations and factors for us to guarantee a specific timeframe this would be completed by". I could be wrong though

Member

It's very dependent on circumstances, user workloads specifically. Your understanding is correct, it's hard to guarantee. It is also possible to configure the workloads in a way that completely stalls the upgrade, unfortunately.


Hmm, then it makes sense to omit it since we cannot guarantee it... (unless there's some high-water mark we can actually treat as: OK, it took way too much time at this point, it's probably stuck and not just taking longer than usual)

Member

The machine-config operator will fire a warning MCDDrainError after an hour. But 🤷, many workloads like the Prow CI jobs we run as CVO presubmits take multiple hours in unevictable, PodDisruptionBudget-protected pods, and that's normal too. I would rather not get opinionated about how long "surprisingly long" is, and leave that up to the cluster administrator and workload administrators to negotiate between themselves.
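For context, the kind of PodDisruptionBudget-protected workload described above might look like the sketch below (all names are illustrative). With minAvailable equal to the replica count, the eviction API refuses to evict the pod, so a drain waits until the workload finishes or the budget is relaxed:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: ci-job-pdb        # illustrative name
spec:
  # With a single-replica job, this means no pod may ever be evicted,
  # so a node drain blocks for as long as the job runs.
  minAvailable: 1
  selector:
    matchLabels:
      app: ci-job
```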


:_content-type: CONCEPT
[id="mco-update-process_{context}"]
= Machine Config Operator node updates


@rioliu-rh could you please take a look at the mco update section? Thank you.

Contributor

s/Machine Config Operator node updates/Understanding how the Machine Config Operator updates nodes
Currently, it sounds like we are updating the Machine Config Operator node.

@skopacz1
Contributor Author

/label merge-review-needed

@openshift-ci openshift-ci bot added the merge-review-needed Signifies that the merge review team needs to review this PR label Jun 23, 2023
@petr-muller
Member

:shipit: :shipit: :shipit:

@kcarmichael08 kcarmichael08 added the merge-review-in-progress Signifies that the merge review team is reviewing this PR label Jun 23, 2023
@kcarmichael08 kcarmichael08 merged commit 3a53aff into openshift:main Jun 23, 2023
@kcarmichael08
Contributor

/cherrypick enterprise-4.10

@kcarmichael08
Contributor

/cherrypick enterprise-4.11

@kcarmichael08
Contributor

/cherrypick enterprise-4.12

@openshift-cherrypick-robot

@kcarmichael08: #60091 failed to apply on top of branch "enterprise-4.10":

Applying: OSDOCS-2429: documenting how updates work
Using index info to reconstruct a base tree...
M	_topic_maps/_topic_map.yml
M	modules/update-service-overview.adoc
M	security/container_security/security-hosts-vms.adoc
M	updating/index.adoc
M	updating/understanding-openshift-updates.adoc
M	updating/updating-restricted-network-cluster/restricted-network-update-osus.adoc
Falling back to patching base and 3-way merge...
Auto-merging updating/updating-restricted-network-cluster/restricted-network-update-osus.adoc
CONFLICT (modify/delete): updating/understanding-openshift-updates.adoc deleted in OSDOCS-2429: documenting how updates work and modified in HEAD. Version HEAD of updating/understanding-openshift-updates.adoc left in tree.
Auto-merging updating/index.adoc
Auto-merging security/container_security/security-hosts-vms.adoc
CONFLICT (content): Merge conflict in security/container_security/security-hosts-vms.adoc
Auto-merging modules/update-service-overview.adoc
Auto-merging _topic_maps/_topic_map.yml
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0001 OSDOCS-2429: documenting how updates work
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".

Details

In response to this:

/cherrypick enterprise-4.10

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@kcarmichael08
Contributor

/cherrypick enterprise-4.13

@kcarmichael08
Contributor

/cherrypick enterprise-4.14

@openshift-cherrypick-robot

@kcarmichael08: #60091 failed to apply on top of branch "enterprise-4.11":

Applying: OSDOCS-2429: documenting how updates work
Using index info to reconstruct a base tree...
M	_topic_maps/_topic_map.yml
M	modules/update-service-overview.adoc
M	security/container_security/security-hosts-vms.adoc
M	updating/index.adoc
M	updating/understanding-openshift-updates.adoc
M	updating/updating-restricted-network-cluster/restricted-network-update-osus.adoc
Falling back to patching base and 3-way merge...
Auto-merging updating/updating-restricted-network-cluster/restricted-network-update-osus.adoc
CONFLICT (modify/delete): updating/understanding-openshift-updates.adoc deleted in OSDOCS-2429: documenting how updates work and modified in HEAD. Version HEAD of updating/understanding-openshift-updates.adoc left in tree.
Auto-merging updating/index.adoc
Auto-merging security/container_security/security-hosts-vms.adoc
CONFLICT (content): Merge conflict in security/container_security/security-hosts-vms.adoc
Auto-merging modules/update-service-overview.adoc
Auto-merging _topic_maps/_topic_map.yml
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0001 OSDOCS-2429: documenting how updates work
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".

Details

In response to this:

/cherrypick enterprise-4.11

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-cherrypick-robot

@kcarmichael08: #60091 failed to apply on top of branch "enterprise-4.12":

Applying: OSDOCS-2429: documenting how updates work
Using index info to reconstruct a base tree...
M	_topic_maps/_topic_map.yml
M	modules/update-service-overview.adoc
M	security/container_security/security-hosts-vms.adoc
M	updating/index.adoc
M	updating/understanding-openshift-updates.adoc
Falling back to patching base and 3-way merge...
CONFLICT (modify/delete): updating/understanding-openshift-updates.adoc deleted in OSDOCS-2429: documenting how updates work and modified in HEAD. Version HEAD of updating/understanding-openshift-updates.adoc left in tree.
Auto-merging updating/index.adoc
Auto-merging security/container_security/security-hosts-vms.adoc
CONFLICT (content): Merge conflict in security/container_security/security-hosts-vms.adoc
Auto-merging modules/update-service-overview.adoc
Auto-merging _topic_maps/_topic_map.yml
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0001 OSDOCS-2429: documenting how updates work
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".

Details

In response to this:

/cherrypick enterprise-4.12

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.


@openshift-cherrypick-robot

@kcarmichael08: new pull request created: #61663

Details

In response to this:

/cherrypick enterprise-4.13

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-cherrypick-robot

@kcarmichael08: new pull request created: #61664

Details

In response to this:

/cherrypick enterprise-4.14

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@kcarmichael08
Contributor

@skopacz1 Sorry, we had another cherry pick failure for 4.10, 4.11, and 4.12. If you can create those manually, I can merge them like we did yesterday.

@skopacz1
Contributor Author

@kcarmichael08 No worries, I'll make the manual cherry pick PRs now. Not my luckiest week!

@kcarmichael08 kcarmichael08 removed merge-review-in-progress Signifies that the merge review team is reviewing this PR merge-review-needed Signifies that the merge review team needs to review this PR labels Jun 23, 2023
@skopacz1 skopacz1 deleted the OSDOCS-2429 branch October 31, 2023 14:01