doc/dev/upgrades: Add a blurb about restarting upgrades #388
Conversation
I was going to make this a blog entry, but it got shot down as sounding like a bug and we didn't want customers to worry. Let's stash it here in upstream git where few customers will find it, and the ones that do are hopefully more technically inclined and will understand the logic. (Originally stored at https://hackmd.io/Yph-TnRmR9ekzEJht0IzUA )
sdodson left a comment
Is this the only reason it will restart? I thought we'd also restart after making no progress for a certain length of time, starting over in the hope of making more progress.
> The MCO is just one of a number of "second level" operators that the CVO manages. However, the relationship between the CVO and MCO is somewhat special because the MCO [updates the operating system itself](https://github.com/openshift/machine-config-operator/blob/master/docs/OSUpgrades.md) for the control plane.
>
> If the new release image has an updated operating system (`machine-os-content`), the CVO pulling down an update ends up causing it to (indirectly) restart itself.
Idle curiosity, what percentage of OCP releases bump machine-os-content?
/approve
I'd rather cache state in ClusterVersion so we can pick up where we left off (#264 was a step in that direction). I am agnostic about landing docs in the meantime.
I'd go ahead and take this until such time as we persist state somehow. Our track record for fixing non-critical issues in the CVO doesn't suggest we'll get to this anytime soon.
> Hence, the fact that the CVO is terminated and restarted is visible to components watching the `clusterversion` object as the status is recalculated.
>
> I could imagine at some point adding clarification for this; perhaps a basic boolean flag state in e.g. a `ConfigMap` or so that denoted that the pod was drained due to an upgrade, and the new CVO pod would "consume" that flag and include "Resuming upgrade..." text in its status. But I think that's probably all we should do.
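As a rough sketch of the ConfigMap-flag idea quoted above, something like the following client-go code could set and consume such a flag. The namespace, ConfigMap name, and key below are hypothetical placeholders for discussion; the CVO does not implement any of this today.

```go
package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// Hypothetical names, for illustration only; the CVO does not use any of these.
const (
	cvoNamespace  = "openshift-cluster-version"
	flagConfigMap = "cvo-restart-state"
	flagKey       = "drained-for-upgrade"
)

// markDrainedForUpgrade would be called by the outgoing CVO pod before its node
// is drained for an OS update, leaving a breadcrumb for its successor.
func markDrainedForUpgrade(ctx context.Context, client kubernetes.Interface) error {
	cm := &corev1.ConfigMap{
		ObjectMeta: metav1.ObjectMeta{Name: flagConfigMap, Namespace: cvoNamespace},
		Data:       map[string]string{flagKey: "true"},
	}
	_, err := client.CoreV1().ConfigMaps(cvoNamespace).Create(ctx, cm, metav1.CreateOptions{})
	if apierrors.IsAlreadyExists(err) {
		return nil
	}
	return err
}

// consumeDrainedFlag would be called by the new CVO pod on startup: if the flag
// exists, delete it and report that the upgrade is being resumed.
func consumeDrainedFlag(ctx context.Context, client kubernetes.Interface) (bool, error) {
	cm, err := client.CoreV1().ConfigMaps(cvoNamespace).Get(ctx, flagConfigMap, metav1.GetOptions{})
	if apierrors.IsNotFound(err) {
		return false, nil
	}
	if err != nil {
		return false, err
	}
	resuming := cm.Data[flagKey] == "true"
	if resuming {
		// Clear the flag so a later, unrelated restart does not misreport itself.
		_ = client.CoreV1().ConfigMaps(cvoNamespace).Delete(ctx, flagConfigMap, metav1.DeleteOptions{})
		fmt.Println("Resuming upgrade...")
	}
	return resuming, nil
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	if _, err := consumeDrainedFlag(context.Background(), client); err != nil {
		panic(err)
	}
}
```

The main design point in this sketch is that the flag is deleted as soon as it is consumed, so a later, unrelated CVO restart would not incorrectly report that it is resuming an upgrade.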
It knows it's in an update, but I don't know how much state it saves (e.g. `RetrievedUpdates`) to know it has already loaded the update.
I am in favor of storing information about synced/failing manifests in ClusterVersion, which would allow us to pick back up where the previous CVO left off. But there's no such stored state today.
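As a purely illustrative sketch of what such persisted per-manifest state might look like if it were added to ClusterVersion status, the following Go type is invented for discussion; nothing like it exists in the openshift/api types today.

```go
// Hypothetical sketch only; the real ClusterVersionStatus in
// github.com/openshift/api/config/v1 has no such field today.
package sketch

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// ManifestSyncState records the outcome of applying one manifest from the
// release payload, so a restarted CVO could skip work it already finished.
type ManifestSyncState struct {
	// Path identifies the manifest within the release payload.
	Path string `json:"path"`
	// Synced is true once the manifest has been applied for the target version.
	Synced bool `json:"synced"`
	// LastError holds the most recent apply failure, if any.
	LastError string `json:"lastError,omitempty"`
	// LastAttempt records when the CVO last tried to apply this manifest.
	LastAttempt metav1.Time `json:"lastAttempt,omitempty"`
}
```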
This discussion should be tracked in a Jira.
No need to track this in Jira; it's already tracked in the bug that these docs link.
LalatenduMohanty left a comment
/lgtm
/hold in case others have a different opinion. If we do not receive any comments requesting changes to the PR within a day, we can remove the hold.
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cgwalters, LalatenduMohanty, sdodson

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing `/approve` in a comment.
/hold cancel
/retest Please review the full test history for this PR and help us cut down flakes.
/override ci/prow/e2e-upgrade
@sdodson: Overrode contexts on behalf of sdodson: ci/prow/e2e-upgrade

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
@cgwalters: The following test failed, say `/retest` to rerun all failed tests:

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
/retest Please review the full test history for this PR and help us cut down flakes.