doc/dev/upgrades: Add a blurb about restarting upgrades #388

cgwalters · 2020-06-17T17:06:35Z

I was going to make this a blog entry, but it got shot down
as sounding like a bug and we didn't want customers to worry.

Let's stash it here in upstream git where few customers
will find it, and the ones that do are hopefully more technically
inclined and will understand the logic.

(Originally stored at https://hackmd.io/Yph-TnRmR9ekzEJht0IzUA )

sdodson

This is the only reason it will restart? I thought we'd restart after we made no progress for a certain length of time and we start over hoping to make more progress?

sdodson · 2020-06-17T17:11:55Z

docs/dev/upgrades.md

+
+The MCO is just one of a number of "second level" operators that the CVO manages.  However, the relationship between the CVO and MCO is somewhat special because the MCO [updates the operating system itself](https://github.com/openshift/machine-config-operator/blob/master/docs/OSUpgrades.md) for the control plane.
+
+If the new release image has an updated operating system (`machine-os-content`), the CVO pulling down an update ends up causing it to (indirectly) restart itself.


Idle curiosity, what percentage of OCP releases bump machine-os-content?

sdodson · 2020-06-17T17:17:42Z

/approve

wking · 2020-06-18T02:01:38Z

I'd rather cache state in ClusterVersion so we can pick up where we left off (#264 was a step in that direction). I am agnostic about landing docs in the meantime.

sdodson · 2020-06-18T13:07:35Z

> I'd rather cache state in ClusterVersion so we can pick up where we left off (#264 was a step in that direction). I am agnostic about landing docs in the meantime.

I'd go ahead and take this until such a time as we persist state somehow. Our track record for fixing non-critical issues in the CVO doesn't suggest we'll get to this anytime soon.

vrutkovs · 2020-07-07T07:59:50Z

docs/dev/upgrades.md

+
+Hence, the fact that the CVO is terminated and restarted is visible to components watching the `clusterversion` object as the status is recalculated.
+
+I could imagine at some point adding clarification for this; perhaps a basic boolean flag state in e.g. a `ConfigMap` or so that denoted that the pod was drained due to an upgrade, and the new CVO pod would "consume" that flag and include "Resuming upgrade..." text in its status. But I think that's probably all we should do.


@jottofar @wking correct me if I'm wrong, but a boolean flag for this is not necessary. The new CVO can detect an in-progress update when it starts (since the current version doesn't yet match desiredVersion).

Not sure if it can autodetect the percentage though

jottofar

It knows it's in an update, but I don't know how much state it saves, e.g. RetrievedUpdates, to know it has already loaded the update.

wking

I am in favor of storing information about synced/failing manifests in ClusterVersion, which would allow us to pick back up where the previous CVO left off. But there's no such stored state today.
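
For readers following the thread, the detection vrutkovs describes can be sketched roughly as below. This is not the CVO's code: the dict is assumed to mirror the `status.desired` and `status.history` layout of a ClusterVersion object, and `current_version` and `update_in_progress` are hypothetical helper names. It also illustrates the limitation raised above: the status can show that an update is underway, but nothing in it records which manifests were already applied.

```python
from typing import Optional


def current_version(cluster_version: dict) -> Optional[str]:
    """Return the most recent fully applied version, or None.

    Assumes history is ordered newest first and that each entry
    carries a "state" of either "Completed" or "Partial".
    """
    for entry in cluster_version.get("status", {}).get("history", []):
        if entry.get("state") == "Completed":
            return entry.get("version")
    return None


def update_in_progress(cluster_version: dict) -> bool:
    """Return True when the desired version has not been fully applied yet."""
    desired = cluster_version.get("status", {}).get("desired", {}).get("version")
    return desired is not None and desired != current_version(cluster_version)


# A freshly restarted process sees only this much (illustrative data):
mid_update = {
    "status": {
        "desired": {"version": "4.6.0"},
        "history": [
            {"version": "4.6.0", "state": "Partial"},
            {"version": "4.5.8", "state": "Completed"},
        ],
    }
}

if __name__ == "__main__":
    print(update_in_progress(mid_update))
```

With only this information, a restarted process can tell that it should resume, but not how far the previous process got, which is why progress is recalculated from scratch.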

LalatenduMohanty

This discussion should be tracked in a Jira.

wking

No need to track in Jira; this is already tracked in the bug that these docs link.

LalatenduMohanty

/lgtm
/hold in case others have any different opinion. If we do not receive any comment to change the PR after a day we can remove the hold.

openshift-ci-robot · 2020-09-08T14:12:13Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cgwalters, LalatenduMohanty, sdodson

Needs approval from an approver in each of these files:

~~OWNERS~~ [LalatenduMohanty,sdodson]

LalatenduMohanty · 2020-09-09T18:08:00Z

/hold cancel

openshift-bot · 2020-09-09T18:58:55Z

/retest

Please review the full test history for this PR and help us cut down flakes.

sdodson · 2020-09-09T19:47:56Z

/override ci/prow/e2e-upgrade
docs only changes

openshift-ci-robot · 2020-09-09T19:47:59Z

@sdodson: Overrode contexts on behalf of sdodson: ci/prow/e2e-upgrade

openshift-ci-robot · 2020-09-10T07:41:41Z

@cgwalters: The following test failed, say /retest to rerun all failed tests:

| Test name | Commit | Rerun command |
| --- | --- | --- |
| ci/prow/e2e-gcp-upgrade | `3523eb5` | `/test e2e-gcp-upgrade` |

openshift-ci-robot requested review from smarterclayton and vrutkovs Jun 17, 2020

sdodson reviewed Jun 17, 2020

openshift-ci-robot added the `approved` label Jun 17, 2020

vrutkovs reviewed Jul 7, 2020

LalatenduMohanty approved these changes Sep 8, 2020

openshift-ci-robot added the `do-not-merge/hold` label Sep 8, 2020

openshift-ci-robot assigned LalatenduMohanty Sep 8, 2020

openshift-ci-robot added the `lgtm` label Sep 8, 2020

openshift-ci-robot removed the `do-not-merge/hold` label Sep 9, 2020

sdodson added the `bugzilla/valid-bug` label Sep 9, 2020

openshift-merge-robot merged commit f9f5591 into openshift:master Sep 10, 2020

