
Conversation

@yuqi-zhang
Contributor

An implementation of https://issues.redhat.com/browse/MCO-126. This should allow dual support for the existing in-cluster OS image and the new OCI-format base OS images.

To test, first scale down the CVO so it doesn't revert the change, with something like:
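
oc -n openshift-cluster-version scale deploy/cluster-version-operator --replicas=0

Then: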

oc -n openshift-machine-config-operator edit configmap/machine-config-osimageurl

And switch to Colin's base image in openshift/os#657 (comment):

osImageURL: registry.ci.openshift.org/coreos/walters-rhcos-ostreecontainer@sha256:3f57a0b046c023f837ae1c6d00f28e44a2a3c6201df556630698da29c942b2c8

And the MCPs should update to something like:

# rpm-ostree status
State: idle
Deployments:
* ostree-unverified-registry:registry.ci.openshift.org/coreos/walters-rhcos-ostreecontainer@sha256:3f57a0b046c023f837ae1c6d00f28e44a2a3c6201df556630698da29c942b2c8
                    Digest: sha256:3f57a0b046c023f837ae1c6d00f28e44a2a3c6201df556630698da29c942b2c8
                   Version: 410.84.202201172236-0 (2022-01-17T22:38:34Z)

  pivot://registry.build01.ci.openshift.org/ci-ln-4zjjy02/stable@sha256:cd6b8a84e7609d99fcdc4c3005c73a1a7434a2610dc262f63b4497782f774b42
              CustomOrigin: Managed by machine-config-operator
                   Version: 410.84.202201201741-0 (2022-01-20T17:43:40Z)

Somewhat WIP due to some remaining uncertainty; comments inline.

openshift-merge-robot and others added 17 commits December 6, 2021 20:43
This is *mainly* to validate that
openshift/release#24225
worked.

But, this code may be useful as a sanity check going forward.

See also coreos/rpm-ostree#3251

(I also may try to expose e.g. `ex-container` as a feature flag
 that we can query instead of version-parsing)
…tree-version

[mcbs] Validate rpm-ostree version is new enough
IsCoreOSVariant and compareOSImageURL have already been called, or
equivalent checks have been performed, for all cases where updateOS is called

Since updateOS no longer requires any members of Daemon, it can be made
a helper function instead of a method on Daemon
Call IsCoreOSVariant once in applyOSChanges instead of in every helper
function
The comment on the function says it's probably unnecessary, and it adds
unnecessary complexity to logic that must be maintained in two
separate OS update paths (one in update() and one in
checkStateOnFirstRun())
Certain helper methods should only be called on CoreOS, and it is more
reliable to type check this than rely on method preconditions
daemon.go/update.go: various cleanup surrounding OS updates
…se-1

[layering] update mcbs branch with master
Add e2e test that
1. creates image stream and pushes build to that image stream
2. uses that build with rpm-ostree rebase
3. successfully reboots into that image

Closes https://issues.redhat.com/browse/MCO-127
Otherwise e2e tests fail with "panic: test timed out after 1h30m0s"
Create PoC for booting from in-cluster built image
In preparation for CoreOS Layering, refactor our various locations
of OS updates into one function.

Left some investigations for later inline.
@openshift-ci openshift-ci bot requested review from cgwalters and jkyros February 2, 2022 03:08
@openshift-ci
Contributor

openshift-ci bot commented Feb 2, 2022

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: yuqi-zhang

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 2, 2022
@mkenigs
Contributor

mkenigs commented Feb 2, 2022

Do you think we could figure out whether we can drop the bootstrapping and pivot cases now? I think that would allow clearer code in applyOSChanges. Right now it's unclear which logic is for layering and which is old logic. ExtractAndUpdateOS and applyExtensions behave differently based on isLayeredUpdate, but that's hidden within the function calls. I think it would be a bit cleaner if we could say:

if isLayeredUpdate() {
	UpdateLayeredOS()
} else {
	ExtractAndUpdateOS()
	applyExtensions()
}

@mkenigs
Contributor

mkenigs commented Feb 2, 2022

I think that would also make it easier to add some type safety for layered vs. non-layered, similar to 7c5a1e6

Comment on lines +1815 to +1817
if err = addExtensionsRepo(osImageContentDir); err != nil {
	return
}
Contributor

@mkenigs commented Feb 2, 2022


Is there a reason you moved this here? It's unneeded by the pivot and bootstrap paths, correct? And do they clean it up? I only see the cleanup in applyOSChanges. +1 for removing pivot and bootstrap if possible 😉

Contributor Author


I just put it here as a placeholder since we no longer ship the extensions repo in the base OS. I can always move it into a separate function and add a check.

The other paths don't properly do a full OS update, so I thought it wasn't really a big issue either way; if it isn't used, it's just a no-op.

Contributor Author


The "bootstrap" path isn't really bootstrapping either, so I think that behaviour was technically wrong. Again, I think it isn't used, so I can try removing it

Member

@cgwalters left a comment


Nice, thanks so much for starting this!

Thinking about this a bit more, perhaps instead of inspecting the container dynamically, it makes sense to use a new field in MachineConfig (osContainer instead of osImageURL?) and in the configmap.

Then we know from quite early on which path we'll take. We can make everything more "type safe" too...something like:

type OSUpdateSource struct {
  old *string
  new *docker.ImageReference
}

(Taking the opportunity to use a proper type for a container image reference instead of string)

In Rust of course this would be a nice enum

enum OSUpdateSource {
  Old(String),
  New(ImageReference),
}

which means the (IMO invalid) states of "no source" and "both sources" are unrepresentable.

Though...I just wrote that, but I am now thinking that in order to "ratchet" this into place, we probably need a bit of time (hopefully not long) where we ship both container formats. And so we may need to at least transiently represent and handle the "both" case.
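
For illustration, a minimal Go sketch of that transitional shape (the reference type and the Validate helper are assumptions, not actual MCO code):

import (
	"fmt"

	"github.com/containers/image/v5/docker/reference"
)

// OSUpdateSource carries the old-format URL, the new-format image
// reference, or (transiently, during the ratchet) both.
type OSUpdateSource struct {
	oldURL   string          // legacy osImageURL; "" if unset
	newImage reference.Named // new-format container image; nil if unset
}

// Validate rejects the "no source" state; "both" is allowed transiently.
func (s *OSUpdateSource) Validate() error {
	if s.oldURL == "" && s.newImage == nil {
		return fmt.Errorf("no OS update source set")
	}
	return nil
}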

OTOH, an advantage of dynamic inspection is we wouldn't need that ratchet; we could just do the swap, then afterwards delete the code in the MCO doing the inspection and handling the old format.

So...dunno. I am good either way I guess.

@yuqi-zhang
Contributor Author

Thinking about this a bit more, perhaps instead of inspecting the container dynamically, it makes sense to use a new field in MachineConfig (osContainer instead of osImageURL?) and in the configmap.

This could definitely work and potentially makes the transition smoother (or at least, gives us the flexibility to fall back). I just did it the current way in accordance with the card; it was mostly meant as a start for in-cluster MCD testing. By no means final.

Just thinking through what the other format would look like: the MC would potentially have 3 fields for now?

  1. osImageURL -> old, for backwards compatibility for now
  2. osContainer -> new format... base image? which then we expand into the full coreos-derive-built container later?
  3. Or do we do a 3rd field, osContainerLayered or something that contains the full layered update?

I wanted to move towards (3) anyway, so I can draft up a separate PR (or build on top of this one) to see what that might look like; rough sketch below.
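
For illustration, those fields might look something like this in the spec (names are placeholders taken from the list above, not a final API):

type MachineConfigOSFields struct {
	// 1. Old format, kept for backwards compatibility for now.
	OSImageURL string `json:"osImageURL,omitempty"`
	// 2. New-format base image, to be expanded into the full coreos-derive-built container later.
	OSContainer string `json:"osContainer,omitempty"`
	// 3. Possible third field carrying the full layered update.
	OSContainerLayered string `json:"osContainerLayered,omitempty"`
}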

@jkyros
Member

jkyros commented Feb 3, 2022

Just to make sure I have this right:

  1. The presence of osContainer supersedes osImageURL
  2. The presence of osContainerLayered would supersede osImageURL, osContainer and the ignition from spec.config? (Roughly the resolution sketched below.)
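
If so, in code that precedence might look something like this (reusing the hypothetical MachineConfigOSFields from the sketch above; not actual MCO code):

func resolveOSSource(f MachineConfigOSFields) string {
	switch {
	case f.OSContainerLayered != "":
		return f.OSContainerLayered
	case f.OSContainer != "":
		return f.OSContainer
	default:
		return f.OSImageURL
	}
}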

@cgwalters
Member

Or do we do a 3rd field, osContainerLayered or something that contains the full layered update?

Hmm, interesting. I think some of the current strawman designs call for e.g. reworking the node controller and annotations to use the image as the source of truth (e.g. instead of currentConfig we have currentImage, which is just a container digest). And then the "rest of the MC" is in the image.

OTOH, maybe it actually is lower risk to basically just retain the node controller and pools etc. as is, and just add a new field. What confuses me a bit is that it seems like we'd be making things a bit circular, because the MC would be both an input to generating the image and be changed when an image is output? Maybe we can address that by moving the finalImage/layeredImage field to a separate space outside of spec (e.g. an annotation?)

func ExtractAndUpdateOS(newOSImageURL string) (osImageContentDir string, err error) {
	if isLayeredUpdate(newOSImageURL) {
		// support new flow here
		// Not sure if these commands are enough?
Member


Yep that should be it!

That said...there is a whole interesting discussion here because what we could actually do is rebase once to docker-registry.default.svc:5000/machine-config-operator/coreos-layered:latest and then just run rpm-ostree upgrade to pull the new image.
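
i.e., something like:

# rpm-ostree rebase ostree-unverified-registry:docker-registry.default.svc:5000/machine-config-operator/coreos-layered:latest
# rpm-ostree upgrade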

OTOH, almost everything inside the platform uses explicit @sha256 digests for total predictability, so doing a rebase each time here is fine.

@cgwalters
Member

Looks like TestBootInClusterImage failed; we definitely want that to work. Otherwise I think this LGTM.

@cgwalters
Member

The presence of osContainer supersedes osImageURL

Right; this is just handling old format vs new format.

The presence of osContainerLayered would supersede osImageURL, osContainer and the ignition from spec.config?

This though is a much more nuanced part, and gets into exactly how we do the handoff between the rendered config, the image, and the MCD. The way I was thinking of this: if the MCD uses an image as the source of truth, then the MCD stops reading MachineConfig from the API. Instead it just pulls /etc/machine-config-operator/machine-config-encapsulated.json out of the image (after pulling it down and staging the update, but before rebooting) and does "the rest of the stuff" (e.g. kernel arguments) at that time.

Another way to say this is it's having the MCD handle the stuff not handled by rpm-ostree rebase.

But...we could perhaps say let's try to get the MCD entirely out of that business; we should have a way to ship kernel arguments in the container image and have the ostree stack handle that. (It makes sense, but needs design.)
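
A rough sketch of the MCD-side step described above, reading machine-config-encapsulated.json out of the staged image (the import path and function shape are assumptions, not the actual implementation):

import (
	"encoding/json"
	"os"
	"path/filepath"

	mcfgv1 "github.com/openshift/machine-config-operator/pkg/apis/machineconfiguration.openshift.io/v1"
)

// Path of the encapsulated MachineConfig shipped inside the image.
const encapsulatedPath = "etc/machine-config-operator/machine-config-encapsulated.json"

// readEncapsulatedConfig pulls the MachineConfig out of a staged deployment
// root so the MCD can apply the non-rpm-ostree pieces (e.g. kernel arguments)
// before rebooting.
func readEncapsulatedConfig(deploymentRoot string) (*mcfgv1.MachineConfig, error) {
	data, err := os.ReadFile(filepath.Join(deploymentRoot, encapsulatedPath))
	if err != nil {
		return nil, err
	}
	var mc mcfgv1.MachineConfig
	if err := json.Unmarshal(data, &mc); err != nil {
		return nil, err
	}
	return &mc, nil
}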

@cgwalters
Member

OK, so the build didn't complete, but we don't know why. I can't find it in the must-gather, though.

@yuqi-zhang
Contributor Author

Funnily enough, we had a quick chat today that went through basically the same thought process. I will try to draw some diagrams of what this flow could look like. I am definitely not against changing this to a higher-level field; the current approach just avoids altering the regular workflow for now.

Will be looking at tests after revisiting based on discussions.

@kikisdeliveryservice
Contributor

Just for safety, adding a hold here :)

/hold

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 8, 2022
@cgwalters
Member

OK sorry, I had lost track of this one. I think we can probably proceed with this?

@cgwalters cgwalters changed the base branch from mcbs to layering March 16, 2022 14:08
@mkenigs
Contributor

mkenigs commented Mar 16, 2022

I think we might need to drop and/or combine some of this in favor of what's on @jkyros's branch. But the consolidated update paths are probably still helpful.

@openshift-ci openshift-ci bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 16, 2022
@openshift-ci
Contributor

openshift-ci bot commented Mar 16, 2022

@yuqi-zhang: PR needs rebase.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci
Contributor

openshift-ci bot commented Mar 16, 2022

@yuqi-zhang: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name                       Commit   Details  Required  Rerun command
ci/prow/e2e-aws                 19916a0  link     true      /test e2e-aws
ci/prow/images                  19916a0  link     true      /test images
ci/prow/e2e-aws-upgrade         19916a0  link     false     /test e2e-aws-upgrade
ci/prow/e2e-gcp-layering        19916a0  link     true      /test e2e-gcp-layering
ci/prow/unit                    19916a0  link     true      /test unit
ci/prow/e2e-gcp-single-node     19916a0  link     false     /test e2e-gcp-single-node
ci/prow/verify                  19916a0  link     true      /test verify
ci/prow/e2e-gcp-op              19916a0  link     true      /test e2e-gcp-op
ci/prow/e2e-gcp-op-single-node  19916a0  link     false     /test e2e-gcp-op-single-node

Full PR test history. Your PR dashboard.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@yuqi-zhang
Contributor Author

Sorry, I also lost track of this. I think this is probably outdated at this point. If there are any helpful parts we'd like to extract out specifically, I can look at doing so. Otherwise I think we should close this in favour of John's updated work.

@openshift-bot
Contributor

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci openshift-ci bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 14, 2022
@openshift-bot
Contributor

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci openshift-ci bot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jul 14, 2022
@openshift-bot
Contributor

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

@openshift-ci openshift-ci bot closed this Aug 14, 2022
@openshift-ci
Contributor

openshift-ci bot commented Aug 14, 2022

@openshift-bot: Closed this PR.


In response to this:

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
