
Conversation

@yuqi-zhang
Contributor

An implementation of https://issues.redhat.com/browse/MCO-126. This should allow dual support for the existing in-cluster OS image and the new OCI-format base OS images.

To test, first scale down the CVO so it doesn't revert the change, with something like:
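
oc -n openshift-cluster-version scale deploy/cluster-version-operator --replicas=0

Then: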

oc -n openshift-machine-config-operator edit configmap/machine-config-osimageurl

And switch to Colin's base image in openshift/os#657 (comment):

osImageURL: registry.ci.openshift.org/coreos/walters-rhcos-ostreecontainer@sha256:3f57a0b046c023f837ae1c6d00f28e44a2a3c6201df556630698da29c942b2c8

And the MCPs should update to something like:

# rpm-ostree status
State: idle
Deployments:
* ostree-unverified-registry:registry.ci.openshift.org/coreos/walters-rhcos-ostreecontainer@sha256:3f57a0b046c023f837ae1c6d00f28e44a2a3c6201df556630698da29c942b2c8
                    Digest: sha256:3f57a0b046c023f837ae1c6d00f28e44a2a3c6201df556630698da29c942b2c8
                   Version: 410.84.202201172236-0 (2022-01-17T22:38:34Z)

  pivot://registry.build01.ci.openshift.org/ci-ln-4zjjy02/stable@sha256:cd6b8a84e7609d99fcdc4c3005c73a1a7434a2610dc262f63b4497782f774b42
              CustomOrigin: Managed by machine-config-operator
                   Version: 410.84.202201201741-0 (2022-01-20T17:43:40Z)

Somewhat WIP due to some remaining uncertainty; comments inline.

openshift-merge-robot and others added 17 commits December 6, 2021 20:43
This is *mainly* to validate that
openshift/release#24225
worked.

But, this code may be useful as a sanity check going forward.

See also coreos/rpm-ostree#3251

(I also may try to expose e.g. `ex-container` as a feature flag
 that we can query instead of version-parsing)
…tree-version

[mcbs] Validate rpm-ostree version is new enough
IsCoreOSVariant and compareOSImageURL have already been called, or
equivalent checks have been performed, for all cases where updateOS is called

Since updateOS no longer requires any members of Daemon, it can be made
a helper function instead of a method on Daemon
Call IsCoreOSVariant once in applyOSChanges instead of in every helper
function
The comment on the function says it's probably unnecessary, and it adds
unnecessary complexity to logic that must be maintained in two
separate OS update paths (one in update() and one in
checkStateOnFirstRun())
Certain helper methods should only be called on CoreOS, and it is more
reliable to type check this than rely on method preconditions
daemon.go/update.go: various cleanup surrounding OS updates
…se-1

[layering] update mcbs branch with master
Add e2e test that
1. creates image stream and pushes build to that image stream
2. uses that build with rpm-ostree rebase
3. successfully reboots into that image

Closes https://issues.redhat.com/browse/MCO-127
Otherwise e2e tests fail with "panic: test timed out after 1h30m0s"
Create PoC for booting from in-cluster built image
In preparation for CoreOS Layering, refactor our various locations
of OS updates into one function.

Left some investigations for later inline.
@openshift-ci openshift-ci bot requested review from cgwalters and jkyros February 2, 2022 03:08
@openshift-ci
Contributor

openshift-ci bot commented Feb 2, 2022

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: yuqi-zhang

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 2, 2022
@mkenigs
Contributor

mkenigs commented Feb 2, 2022

Do you think we could figure out whether we can drop the bootstrapping and pivot cases now? I think that would allow clearer code in applyOSChanges. Right now it's unclear which logic is for layering and which is old logic. ExtractAndUpdateOS and applyExtensions behave differently based on isLayeredUpdate, but that's hidden within the function calls. I think it would be a bit cleaner if we could say:

if isLayeredUpdate() {
	UpdateLayeredOS()
} else {
	ExtractAndUpdateOS()
	applyExtensions()
}

@mkenigs
Contributor

mkenigs commented Feb 2, 2022

I think that would also make it easier to add some type safety for layered vs. non-layered, similar to 7c5a1e6

Comment on lines +1815 to +1817
if err = addExtensionsRepo(osImageContentDir); err != nil {
	return
}
Contributor

@mkenigs commented Feb 2, 2022


Is there a reason you moved this here? It's unneeded by the pivot and bootstrap paths, correct? And do they clean it up? I only see the cleanup in applyOSChanges. +1 for removing pivot and bootstrap if possible 😉

Contributor Author


I just put it here as a placeholder since we no longer ship the extensions repo in the base OS. I can always move it into a separate function and add a check.

The other paths don't properly do a full OS update, so I thought it wasn't really a big issue either way; if it isn't used, it's just a no-op.

Contributor Author


The "bootstrap" path isn't really bootstrapping either, so I think that behaviour was technically wrong. Again, I think it isn't used, so I can try removing it

Member

@cgwalters left a comment


Nice, thanks so much for starting this!

Thinking about this a bit more, perhaps instead of inspecting the container dynamically, it makes sense to use a new field in MachineConfig (osContainer instead of osImageURL?) and in the configmap.

Then we know from quite early on which path we'll take. We can make everything more "type safe" too...something like:

type OSUpdateSource struct {
  old *string
  new *docker.ImageReference
}

(Taking the opportunity to use a proper type for a container image reference instead of string)

In Rust of course this would be a nice enum

enum OSUpdateSource {
  Old(String),
  New(ImageReference),
}

which means the (IMO invalid) states of "no source" and "both sources" are unrepresentable.

Though...I just wrote that, but I am now thinking that in order to "ratchet" this into place, we probably need a bit of time (hopefully not long) where we ship both container formats. And so we may need to at least transiently represent and handle the "both" case.
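
For illustration, a minimal Go sketch of that transitional shape (the reference type and the Validate helper are assumptions, not actual MCO code):

import (
	"fmt"

	"github.com/containers/image/v5/docker/reference"
)

// OSUpdateSource carries the old-format URL, the new-format image
// reference, or (transiently, during the ratchet) both.
type OSUpdateSource struct {
	oldURL   string          // legacy osImageURL; "" if unset
	newImage reference.Named // new-format container image; nil if unset
}

// Validate rejects the "no source" state; "both" is allowed transiently.
func (s *OSUpdateSource) Validate() error {
	if s.oldURL == "" && s.newImage == nil {
		return fmt.Errorf("no OS update source set")
	}
	return nil
}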

OTOH, an advantage of dynamic inspection is we wouldn't need that ratchet; we could just do the swap, then afterwards delete the code in the MCO doing the inspection and handling the old format.

So...dunno. I am good either way I guess.

@yuqi-zhang
Contributor Author

Thinking about this a bit more, perhaps instead of inspecting the container dynamically, it makes sense to use a new field in MachineConfig (osContainer instead of osImageURL?) and in the configmap.

This could definitely work and potentially makes the transition smoother (or at least, gives us the flexibility to fall back). I just did it the current way in accordance with the card; it was mostly meant as a start for in-cluster MCD testing. By no means final.

Just thinking through what the other format would look like: the MC would potentially have 3 fields for now?

  1. osImageURL -> old, for backwards compatibility for now
  2. osContainer -> new format... base image? which then we expand into the full coreos-derive-built container later?
  3. Or do we do a 3rd field, osContainerLayered or something that contains the full layered update?

I wanted to move towards (3) anyway, so I can draft up a separate PR (or build on top of this one) to see what that might look like; rough sketch below.
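
For illustration, those fields might look something like this in the spec (names are placeholders taken from the list above, not a final API):

type MachineConfigOSFields struct {
	// 1. Old format, kept for backwards compatibility for now.
	OSImageURL string `json:"osImageURL,omitempty"`
	// 2. New-format base image, to be expanded into the full coreos-derive-built container later.
	OSContainer string `json:"osContainer,omitempty"`
	// 3. Possible third field carrying the full layered update.
	OSContainerLayered string `json:"osContainerLayered,omitempty"`
}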

@jkyros
Member

jkyros commented Feb 3, 2022

Just to make sure I have this right:

  1. The presence of osContainer supersedes osImageURL
  2. The presence of osContainerLayered would supersede osImageURL, osContainer and the ignition from spec.config? (Roughly the resolution sketched below.)
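
If so, in code that precedence might look something like this (reusing the hypothetical MachineConfigOSFields from the sketch above; not actual MCO code):

func resolveOSSource(f MachineConfigOSFields) string {
	switch {
	case f.OSContainerLayered != "":
		return f.OSContainerLayered
	case f.OSContainer != "":
		return f.OSContainer
	default:
		return f.OSImageURL
	}
}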

@cgwalters
Member

Or do we do a 3rd field, osContainerLayered or something that contains the full layered update?

Hmm, interesting. I think some of the current strawman designs call for e.g. reworking the node controller and annotations to use the image as the source of truth (e.g. instead of currentConfig we have currentImage, which is just a container digest). And then the "rest of the MC" is in the image.

OTOH, maybe it actually is lower risk to basically just retain the node controller and pools etc. as is, and just add a new field. What confuses me a bit is that it seems like we'd be making things a bit circular, because the MC would be both an input to generating the image and be changed when an image is output? Maybe we can address that by moving the finalImage/layeredImage field to a separate space outside of spec (e.g. an annotation?)

func ExtractAndUpdateOS(newOSImageURL string) (osImageContentDir string, err error) {
	if isLayeredUpdate(newOSImageURL) {
		// support new flow here
		// Not sure if these commands are enough?
Member


Yep that should be it!

That said...there is a whole interesting discussion here because what we could actually do is rebase once to docker-registry.default.svc:5000/machine-config-operator/coreos-layered:latest and then just run rpm-ostree upgrade to pull the new image.
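
i.e., something like:

# rpm-ostree rebase ostree-unverified-registry:docker-registry.default.svc:5000/machine-config-operator/coreos-layered:latest
# rpm-ostree upgrade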

OTOH, almost everything inside the platform uses explicit @sha256 digests for total predictability, so doing a rebase each time here is fine.

@cgwalters
Member

Looks like TestBootInClusterImage failed; we definitely want that to work. Otherwise I think this LGTM.

@cgwalters
Member

The presence of osContainer supersedes osImageURL

Right; this is just handling old format vs new format.

The presence of osContainerLayered would supersede osImageURL, osContainer and the ignition from spec.config?

This though is a much more nuanced part, and gets into exactly how we do the handoff between the rendered config, the image, and the MCD. The way I was thinking of this: if the MCD uses an image as the source of truth, then the MCD stops reading MachineConfig from the API. Instead it just pulls /etc/machine-config-operator/machine-config-encapsulated.json out of the image (after pulling it down and staging the update, but before rebooting) and does "the rest of the stuff" (e.g. kernel arguments) at that time.

Another way to say this is it's having the MCD handle the stuff not handled by rpm-ostree rebase.

But...we could perhaps say let's try to get the MCD entirely out of that business; we should have a way to ship kernel arguments in the container image and have the ostree stack handle that. (It makes sense, but needs design.)
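
A rough sketch of the MCD-side step described above, reading machine-config-encapsulated.json out of the staged image (the import path and function shape are assumptions, not the actual implementation):

import (
	"encoding/json"
	"os"
	"path/filepath"

	mcfgv1 "github.com/openshift/machine-config-operator/pkg/apis/machineconfiguration.openshift.io/v1"
)

// Path of the encapsulated MachineConfig shipped inside the image.
const encapsulatedPath = "etc/machine-config-operator/machine-config-encapsulated.json"

// readEncapsulatedConfig pulls the MachineConfig out of a staged deployment
// root so the MCD can apply the non-rpm-ostree pieces (e.g. kernel arguments)
// before rebooting.
func readEncapsulatedConfig(deploymentRoot string) (*mcfgv1.MachineConfig, error) {
	data, err := os.ReadFile(filepath.Join(deploymentRoot, encapsulatedPath))
	if err != nil {
		return nil, err
	}
	var mc mcfgv1.MachineConfig
	if err := json.Unmarshal(data, &mc); err != nil {
		return nil, err
	}
	return &mc, nil
}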

@cgwalters
Member

OK, so the build didn't complete, but we don't know why. I can't find it in the must-gather, though.

@yuqi-zhang
Contributor Author

Funnily enough, we had a quick chat today that went through basically the same thought process. I will try to draw some diagrams of what this flow could look like. I am definitely not against changing this to a higher-level field; the current approach just avoids altering the regular workflow for now.

Will be looking at tests after revisiting based on discussions.

@kikisdeliveryservice
Contributor

Just for safety, adding a hold here :)

/hold

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 8, 2022
@cgwalters
Member

OK sorry, I had lost track of this one. I think we can probably proceed with this?

@cgwalters cgwalters changed the base branch from mcbs to layering March 16, 2022 14:08
@mkenigs
Contributor

mkenigs commented Mar 16, 2022

I think we might need to drop and/or combine some of this in favor of what's on @jkyros's branch. But the consolidated update paths are probably still helpful.

@openshift-ci openshift-ci bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 16, 2022
@openshift-ci
Contributor

openshift-ci bot commented Mar 16, 2022

@yuqi-zhang: PR needs rebase.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci
Contributor

openshift-ci bot commented Mar 16, 2022

@yuqi-zhang: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name                       Commit   Details  Required  Rerun command
ci/prow/e2e-aws                 19916a0  link     true      /test e2e-aws
ci/prow/images                  19916a0  link     true      /test images
ci/prow/e2e-aws-upgrade         19916a0  link     false     /test e2e-aws-upgrade
ci/prow/e2e-gcp-layering        19916a0  link     true      /test e2e-gcp-layering
ci/prow/unit                    19916a0  link     true      /test unit
ci/prow/e2e-gcp-single-node     19916a0  link     false     /test e2e-gcp-single-node
ci/prow/verify                  19916a0  link     true      /test verify
ci/prow/e2e-gcp-op              19916a0  link     true      /test e2e-gcp-op
ci/prow/e2e-gcp-op-single-node  19916a0  link     false     /test e2e-gcp-op-single-node

Full PR test history. Your PR dashboard.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@yuqi-zhang
Contributor Author

Sorry, I also lost track of this. I think this is probably outdated at this point. If there are any helpful parts we'd like to extract out specifically, I can look at doing so. Otherwise I think we should close this in favour of John's updated work.

@openshift-bot
Contributor

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci openshift-ci bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 14, 2022
@openshift-bot
Contributor

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci openshift-ci bot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jul 14, 2022
@openshift-bot
Contributor

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

@openshift-ci openshift-ci bot closed this Aug 14, 2022
@openshift-ci
Contributor

openshift-ci bot commented Aug 14, 2022

@openshift-bot: Closed this PR.


In response to this:

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
