OSDOCS#2429: documenting how updates work #60091
Conversation
@skopacz1: This pull request references OSDOCS-2429, which is a valid Jira issue.
/cc

@shellyyang1989 PTAL, thank you!

/cc
modules/update-mco-process.adoc
Outdated
= Machine Config Operator node updates

The Machine Config Operator (MCO) applies a new machine configuration to each control plane and compute node. During this process, the MCO performs the following sequential actions in the cluster:

. Cordon and drain all of the nodes
Should we mention that this happens one by one?
Also, a paused machine config pool, for example in an EUS-to-EUS update, is an exception.
Could you clarify the exact order of operations here? Is it "Cordon and drain each node, one at a time, then update the OS of each node one at a time, then reboot one at a time, etc."? Or is a single node cordoned, drained, updated, rebooted, uncordoned, and scheduled before the next node initiates the same process?
It's not one at a time; it's configurable for each pool (.spec.maxUnavailable). One is the default behavior.

Or is a single node cordoned, drained, updated, rebooted, uncordoned, and scheduled before the next node initiates the same process?

^^^ This is the process when maxUnavailable is 1.
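For illustration, a minimal sketch of how an administrator could raise that limit for the worker pool; the pool name `worker` and the value `2` are example choices, not recommendations:

```shell
# Allow up to 2 nodes in the "worker" pool to be cordoned/drained at once.
# The MCO default is maxUnavailable: 1 (one node at a time).
oc patch machineconfigpool/worker --type merge \
  --patch '{"spec":{"maxUnavailable":2}}'
```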
I think there's a card for the MCO team to provide detailed documentation on their part of the upgrade.
It's not one at a time, it's configurable for each pool

You're right! In any case, never all of them at once, AFAIK.

I think there's a card for MCO team

Hmm, if they end up having a detailed explanation somewhere on another page, perhaps we eventually just link to that page from this article?
...one pool of cordoned nodes at a time...
MachineConfigPools have a maxUnavailable knob, defaulting to 1. When the machine-config controller notices that a MachineConfigPool spec calls for an update, and fewer than maxUnavailable Nodes in that pool are unavailable, it will cordon that Node, take some actions which may, but do not necessarily, include draining and/or rebooting, and uncordon the Node. For OCP updates, there are almost always RHCOS updates getting rolled out, and those need the full drain and reboot. But every once in a while we'll ship the same RHCOS and relevant MachineConfigs in two separate 4.y.z, and then you can have an OCP update that requires no drains or reboots. Unclear to me how much of this context is worth including. Personally, I'm fine with anything including:
- Don't even mention the machine-config operator.
- Mention that the machine-config operator exists, that it updates (unpaused) MachineConfigPools, and that that can take some time.
- Explaining that the time MachineConfigPools can take depends (almost always) on the time it takes to drain them, and also on the number of nodes you've ok'ed draining in parallel via maxUnavailable.
- Explaining how to figure out which MachineConfigPool updates will trigger drains, which will trigger reboots, and which will trigger neither.

But it's "up to maxUnavailable nodes from a MachineConfigPool" being cordoned at any given time, not the whole pool.
Ah, reading your comment here, I now realize that your "one pool of cordoned nodes" means "up to maxUnavailable nodes from a MachineConfigPool". I'd rather avoid the word "pool" for that, to keep folks from falling into the same misconception I did and assuming that when you said "pool of ... nodes" you meant the whole MachineConfigPool.
I agree with Trevor; the wording "pool" may mislead in this context. Going back to line 10, "Cordon and drain all of the nodes", we may rather remove the word "all".
If there aren't any particular rewording suggestions, I can use "group" in place of "pool". I can also remove the word "all" to further avoid confusion.
Also, I will save further refinement along the lines of Trevor's first comment for a second iteration of this doc.
modules/update-mco-process.adoc
Outdated
When a node is cordoned, workloads cannot be scheduled to it.
====

The time to complete this process depends on several factors including the node and infrastructure configuration. This process might take 5 or more minutes to complete per node.
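To make the cordon/drain step concrete, here is a hedged sketch of the equivalent manual commands; the MCO's machine-config daemon performs this work itself during an update, and `<node-name>` is a placeholder:

```shell
# Mark the node unschedulable so no new workloads are placed on it
oc adm cordon <node-name>

# Evict existing workloads from the node before the OS update and reboot
oc adm drain <node-name> --ignore-daemonsets --delete-emptydir-data

# Once the node is running the new configuration, allow scheduling again
oc adm uncordon <node-name>
```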
I wonder what's the maximal time acceptable for the task... are we omitting it intentionally?
I think there is deeper context to this from when this content was first made for the "Understanding update duration" page.
@LalatenduMohanty, @petr-muller, or @wking might know, but my memory/understanding is that it's intentionally omitted to communicate that: "This process may take a long time but there are too many variations and factors for us to guarantee a specific timeframe this would be completed by". I could be wrong though
It's very dependent on circumstances - user workloads specifically. Your understanding is correct; it's hard to guarantee. It is also possible to configure the workloads in a way that completely stalls the upgrade, unfortunately.
Hmm, then it makes sense to omit it, since we cannot guarantee it... (unless there's some high-water mark that we can actually consider as: OK, it took way too much time at this point, it's probably stuck and not just taking longer than usual).
The machine-config operator will fire a warning MCDDrainError after an hour. But 🤷, many workloads like the Prow CI jobs we run as CVO presubmits take multiple hours in unevictable, PodDisruptionBudget-protected pods, and that's normal too. I would rather not get opinionated about how long "surprisingly long" is, and leave that up to the cluster administrator and workload administrators to negotiate between themselves.
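For an administrator who wants to investigate a slow drain themselves, a hedged sketch of a couple of spot checks (the namespace, label, and container name below are the usual ones for the machine-config daemon, but treat them as assumptions):

```shell
# List PodDisruptionBudgets cluster-wide; a tight PDB can legitimately block eviction
oc get poddisruptionbudget --all-namespaces

# Tail the machine-config daemon logs, which report drain attempts and errors
oc logs -n openshift-machine-config-operator \
  -l k8s-app=machine-config-daemon -c machine-config-daemon --tail=50
```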
modules/update-mco-process.adoc
Outdated
:_content-type: CONCEPT
[id="mco-update-process_{context}"]
= Machine Config Operator node updates
@rioliu-rh could you please take a look at the MCO update section? Thank you.
s/Machine Config Operator node updates/Understanding how the Machine Config Operator updates nodes
Currently, it sounds like we are updating the Machine Config Operator node.
/label merge-review-needed

/cherrypick enterprise-4.10

/cherrypick enterprise-4.11

/cherrypick enterprise-4.12
@kcarmichael08: #60091 failed to apply on top of branch "enterprise-4.10".

/cherrypick enterprise-4.13

/cherrypick enterprise-4.14

@kcarmichael08: #60091 failed to apply on top of branch "enterprise-4.11".

@kcarmichael08: #60091 failed to apply on top of branch "enterprise-4.12".

@kcarmichael08: new pull request created: #61663

@kcarmichael08: new pull request created: #61664
@skopacz1 Sorry, we had another cherry pick failure for 4.10, 4.11, and 4.12. If you can create those manually, I can merge them like we did yesterday.

@kcarmichael08 No worries, I'll make the manual cherry pick PRs now. Not my luckiest week!
OSDOCS-2429
Version(s):
4.10+
This PR introduces detailed content about how OpenShift updates work in the background. It makes the following changes to the Updating clusters documentation:
Link to docs preview:
QE review: