idea: "upper layer" (erofs) inside of initramfs #332

Closed
allisonkarlitskaya opened this issue Sep 5, 2024 · 33 comments
Labels
area/booting Issues related to booting with composefs enhancement New feature or request

Comments

@allisonkarlitskaya
Collaborator

This is a really vague idea that I discussed with @cgwalters and @travier today. They both said it belongs here as an issue. At this point this is little more than a raw braindump. There's a lot to think through and discuss.

The erofs produced by mkcomposefs on a reasonably complete /usr is on the order of double digits MB. I've seen ~50MB generally, and it compresses well (down to more like 10MB). The initramfs+kernel on my Silverblue system is low triple digits (~150MB, most of which is the initramfs).

It wouldn't be completely unreasonable, then, to have a complete static copy of the composefs "upper layer" erofs image inside of the UKI. This would completely side-step quite a lot of thorny issues around binding the UKI to the correct deployment: all you'd need is the kernel image and the digest store.

How we get a UKI with this erofs inside of it could go two ways:

  • generate this on the end-user system by (deterministic magic) which lets us get a UKI which is bit-for-bit the same as the one we were expecting it to be. We'd have some out-of-band signature somewhere (in some metadata that doesn't become part of the image) that we could then use for signing this.

  • push everything to the container image creation: the kernel image would be created as the last step of the image creation process. This would involve running mkcomposefs inside of the container, on the contents of the container itself, and embedding the resulting blob into the UKI, which we'd then write to the container image at a well-known path. Any signing that we might want to do as part of creating the image could happen at this point, inside of the image (or in another build stage and copied back into the final image).

The second approach has an extremely simple deployment strategy: just extract the container .tar directly into a composefs digest store (without creating the erofs). The backing store should now contain all of the files that the erofs referred to. Install the kernel image into the EFI ESP and you're done.
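A minimal sketch of that deployment step, in Python, assuming the digest store is a plain content-addressed directory. The helper name `store_objects` and the two-level `ab/cdef...` layout are hypothetical, and SHA-256 of the file contents stands in for the fs-verity digest that composefs actually uses (the two are not the same value).

```python
import hashlib
import io
import tarfile
from pathlib import Path


def store_objects(tar_bytes: bytes, store: Path) -> dict[str, str]:
    """Extract regular files from a tar into a content-addressed store.

    Returns a mapping of tar member name -> hex digest.
    NOTE: SHA-256 of the content is a stand-in; real composefs keys
    objects by their fs-verity digest.
    """
    index = {}
    with tarfile.open(fileobj=io.BytesIO(tar_bytes)) as tar:
        for member in tar:
            if not member.isfile():
                continue  # directories, symlinks etc. carry only metadata
            data = tar.extractfile(member).read()
            digest = hashlib.sha256(data).hexdigest()
            obj = store / digest[:2] / digest[2:]
            if not obj.exists():  # identical content is stored only once
                obj.parent.mkdir(parents=True, exist_ok=True)
                obj.write_bytes(data)
            index[member.name] = digest
    return index
```

No erofs is created here: the store only holds file contents, and all metadata is expected to come from the image embedded in the UKI.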

The second way seems wonderfully simple until you realize that there are some very serious drawbacks there:

  • we're essentially creating a new container format: the metadata about which files are part of the image is stored in the .tar of the image, but now also in the erofs that we put inside of the UKI.
  • which means, of course, that it's no longer possible to make casual modifications to the container to add a file or install an extra package or so: you need to regenerate the kernel image. Maybe that's not so bad?

I think the second approach could be extremely nice for specific deployment scenarios, but it's a very different flavour than what has been promised for the "FROM fedora / ADD / RUN / ..." approach to OS customization.

So that takes us back to a reality where we probably want to support the first scenario of building the composefs and assembling the UKI on the end system. That needs a lot of thinking...

This also intersects with the question about what a signature from an OS vendor on a particular kernel image means. Today it's possible to have a signed kernel boot an unsigned root filesystem. Tomorrow we seem to want to go in a direction where there are additional assurances about the root filesystem contents as well, but if it remains possible to continue booting arbitrary root filesystems with a different version of the same kernel, then this promise is a whole lot less meaningful. In fact, the entire "look how easy it is to customize your system!" bootc ideal sort of depends on being able to modify the root filesystem without needing to re-sign the kernel... @travier mentioned that we can support both scenarios with kernel variations which produce unique PCR measurements, allowing the data partition to be encrypted by a key that will only be available if we boot a "trusted" rootfs. There are some very deep product-level decisions here...

@allisonkarlitskaya
Collaborator Author

One note about performance/memory trade-offs: having the erofs as part of the UKI (and then permanently stored in RAM) would mean that the entire metadata of the system partition is in RAM. ls -lR /usr would always happen without touching the disk. It's more data to load when booting the kernel image, but having that data pre-loaded as a small blob up front seems like it should probably be a net win. It would have to be measured. It also means that we have a chunk of RAM that we've "wasted"...

@allisonkarlitskaya
Collaborator Author

Another requirement of the "UKI inside the OCI container" approach (and maybe the "UKI generated locally" approach as well): we'd probably want a tool that could scan the UKI to find out which blobs it refers to in the digest store. This is important for pruning the store when removing old images.
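The pruning side can be sketched as a mark-and-sweep over the store. This assumes the scanning tool already exists and yields, per image, the set of object digests it refers to; `prune_store`, the flat one-file-per-digest layout and the input format are all hypothetical.

```python
from pathlib import Path


def prune_store(store: Path, referenced_by_image: dict[str, set[str]]) -> list[str]:
    """Delete objects that no remaining image refers to.

    `referenced_by_image` maps an image name to the digests found by
    scanning it (the scanner itself is assumed, not shown).
    Returns the sorted list of digests that were removed.
    """
    # Mark: union of everything any live image still needs.
    live = set().union(*referenced_by_image.values())
    removed = []
    # Sweep: objects are assumed to be stored as files named by digest.
    for obj in list(store.iterdir()):
        if obj.is_file() and obj.name not in live:
            obj.unlink()
            removed.append(obj.name)
    return sorted(removed)
```

Removing an old image then amounts to dropping its entry from the mapping and running the sweep again.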

@travier
Member

travier commented Sep 5, 2024

One part of implementing this idea is to adapt https://github.com/ostreedev/ostree/blob/main/src/switchroot/ostree-prepare-root.c to use this EROFS instead of looking at the sysroot.

@travier
Member

travier commented Sep 9, 2024

Here is a potential flow where we could use that feature, which would help us work around SELinux issues and remove the need for build-time commits:

  1. Build via a Containerfile:
# "Normal" build part where you customize your image
FROM base-image as target
RUN Make changes here as needed

# Use a side image to build the composefs & UKI
FROM target as builder
RUN Rebuild SELinux policy
RUN - Do an ostree commit with the changes (i.e. we need to figure out what changed)
    - using the context from the updated SELinux policy
    - and get the full composefs EROFS for the final root
RUN Compress and append the EROFS blob to the initramfs in a pre-defined place
RUN Install ukify & Secure Boot signing tools
RUN Build a UKI with the kernel, initramfs, command line config from the container image and sign it, output to /uki

# Go back to the final image and include just the UKI
FROM target
COPY --from=builder /uki /uki
  2. Then on the final system we would do:
  • ostree container image pull which will import all the objects from the "target" image, including the UKI. We will just ignore the xattrs and SELinux labels.
  • Copy the UKI from the imported ostree commit to the ESP
  • Do the rename dance to get it in the right order for boot
  • Reboot

We tried something similar while prototyping: https://github.com/travier/fedora-coreos-uki/blob/main/fcos-uki/Containerfile

@travier
Member

travier commented Sep 9, 2024

The major change with this approach is that we clearly split the file content from the metadata and the container becomes a way to only transport object data plus a UKI which includes all the metadata. Thus the deployed rootfs becomes an object store only and we don't "care" about ostree commits anymore as we don't need to sign them or use them to regenerate the composefs metadata on the systems.

@cgwalters
Contributor

Another requirement of the "UKI inside the OCI container" approach (and maybe the "UKI generated locally" approach as well): we'd probably want a tool that could scan the UKI to find out which blobs it refers to in the digest store. This is important for pruning the store when removing old images.

Yes. Combining with this comment in general it argues for some new tooling - not too large or complex tooling but new tooling nevertheless. One option is to implement it in this repo as a build-time option - a variant of that is to implement it in Rust (also in this repo). Maybe something like a composefs-boot crate?

@cgwalters
Contributor

I chatted with @allisonkarlitskaya about this and there's a lot to like about the simplicity of this approach - I'm 100% on board with continuing investigation of this direction.

My biggest concern was that I'd also really like to build the story of using composefs for apps/extensions/configmaps etc. and this model reduces the alignment between those two approaches.

Combining, this issue also intersects strongly with #294 where I was trying hard to think of a way to bring OCI metadata under verity protection. Hmmm...I guess probably the simplest variant that would work for this is to require the UKI to always be in a distinct layer (with a special annotation like composefs.boot or something), and the manifest that gets included inside the image doesn't have that layer.
Also worth thinking about here is the related issue I was thinking about around how we store individual layers. We must support only fetching changed layers across upgrades.

@cgwalters cgwalters added area/booting Issues related to booting with composefs enhancement New feature or request labels Sep 9, 2024
@travier
Member

travier commented Sep 13, 2024

In #332 (comment), I forgot that we still need to do the 3-way merge for /etc, so we still need a "deployment" of it, which makes this a bit more complex.

@travier
Member

travier commented Sep 13, 2024

We've also realized that including the composefs EROFS file in the UKI means that it is now public, thus the file listing and metadata are public. This is not really an issue but just something to be aware of.

@jbtrystram

We've also realized that including the composefs EROFS file in the UKI means that it is now public, thus the file listing and metadata are public. This is not really an issue but just something to be aware of.

(when using LUKS on the rootfs)

@cgwalters
Contributor

In #332 (comment), I forgot that we still need to do the 3-way merge for /etc, so we still need a "deployment" of it, which makes this a bit more complex.

For ostree yes, though we also support etc.transient where that wouldn't be needed.

I think in theory we could ship initramfs glue in this project such that "mount composefs from initramfs" logic could in theory be very agnostic, i.e. we have:

  • sysroot.mount
  • composefs-mount.service (replaces /sysroot with a composefs setup, with backing objects in something native like /composefs/objects maybe? But the backing store can be configured in some way (an xattr on the cfs? a config file?))
  • ostree-prepare-root.service (mounts /etc and /var in the way ostree does it today using the physical root; note also the intersection with "Canonical method to find backing filesystem (and block device)" #280), but the ostree bits could obviously be replaced with something else for non-ostree consumers
  • initrd-root-fs.target
  • ...
  • switchroot

@cgwalters
Contributor

We've also realized that including the composefs EROFS file in the UKI means that it is now public, thus the file listing and metadata are public. This is not really an issue but just something to be aware of.

Instead of "public" I would say "not encrypted on disk" to be clear. "public" often implies to me "accessible to the whole Internet" but for images generated on premise and deployed to servers that are physically secured, I wouldn't say the UKIs here are "public".

That said...AFAIK there's nothing that would block someone from encrypting the erofs in the initramfs, and decrypting using e.g. a key stored in the TPM or something.

@ericcurtin

ericcurtin commented Sep 27, 2024

As regards composefs/erofs inside of a UKI, this wouldn't work so well for CentOS Automotive Stream Distribution/Red Hat In-Vehicle OS. Two reasons.

We spent a lot of time minimising initramfs for super-fast boots, we are talking < 10M in size and < 2 seconds in boot time. Now we do have to read the whole composefs eventually for verification. During the initial read of the UKI, userspace cannot proceed with anything until the whole UKI is read, decompressed and the kernel populates the initramfs filesystem.

The other reason is we run on some platforms that have a hard limit of 64M/32M for kernel+cmdline+dtb+initramfs combined. We have a stripped down kernel for this purpose also.

In Automotive we can fork a little from the technique decided on here; we already do that as one of the users of composefs.

All we need to do is store a digest in initramfs to ensure what we are booting is what we intend and many of these concerns go away.

Also tagging @alexlarsson he'd likely be interested in a read here.

@ericcurtin

ericcurtin commented Sep 27, 2024

In fact, I've discussed this with the systemd guys once or twice and they agree: initramfs is a dated filesystem, and we should keep it as small (and as irrelevant) as possible. There are more efficient ways of creating volatile throwaway verified filesystems these days (composefs, erofs, overlayfs, fs-verity, dm-verity, etc.). Also, if one is referencing an erofs inside it, you cannot unmount the initramfs.

@cgwalters
Contributor

cgwalters commented Sep 27, 2024

We're not going to break C9S auto. We will continue to support the way things work today. A big advantage of composefs is flexibility - there's multiple ways to do things (at the same time of course we don't want to support too many paths).

The advantage of the "rootfs-meta-in-initramfs" model in a nutshell is that there are no extra keys/signatures required other than the Secure Boot one. But again, the existing way ostree+composefs works will clearly continue to work - and isn't specific to ostree, it's just "key in the initramfs verifies signature of digest of composefs".

The other reason is we run on some platforms that have a hard limit of 64M/32M for kernel+cmdline+dtb+initramfs combined. We have a stripped down kernel for this purpose also.

That said I think a general approach many use cases (including yours) should be going to is keeping the main root small anyways and having most of the bits in containers, i.e. mount real root, pivot, then go into the real root, mount further container images via composefs dynamically (verifying their signature however one wants...etc.)

Let's be a bit more specific: how big is your initramfs today? (nevermind, < 10M), How big is the composefs for it?

@amnoni

amnoni commented Sep 27, 2024 via email

@ericcurtin

I recommend people take a look at the Android Boot Image and composefs implementation in cs9 auto, FWIW. Android Boot Image is a kernel+dtb+cmdline+initramfs blob; it's very similar to a UKI.

We're not going to break C9S auto. We will continue to support the way things work today. A big advantage of composefs is flexibility - there's multiple ways to do things (at the same time of course we don't want to support too many paths).

Understood, this feedback is not intended to block any efforts.

The advantage of the "rootfs-meta-in-initramfs" model in a nutshell is that there are no extra keys/signatures required other than the Secure Boot one. But again, the existing way ostree+composefs works will clearly continue to work - and isn't specific to ostree, it's just "key in the initramfs verifies signature of digest of composefs".

I think it's easy to extend the trust from secure boot key to rootfs, just chain checksums/digests.

The other reason is we run on some platforms that have a hard limit of 64M/32M for kernel+cmdline+dtb+initramfs combined. We have a stripped down kernel for this purpose also.

That said I think a general approach many use cases (including yours) should be going to is keeping the main root small anyways and having most of the bits in containers, i.e. mount real root, pivot, then go into the real root, mount further container images via composefs dynamically (verifying their signature however one wants...etc.)

We should favour containers when possible, but there are cases where we cannot use containers. I think there are more scalable solutions than this.

Let's be a bit more specific: how big is your initramfs today? (nevermind, < 10M), How big is the composefs for it?

I'll build an OS image sometime for exact measurements, need to leave early today for a wedding, so it will be Monday...

There are also no definite sizes for these things in the Automotive OS, but I'll post the minimal sizes anyway. A partner may want to add something to initramfs, may want to add a camera application to composefs (these advanced camera applications can be huge and are not suitable for containers).

@cgwalters
Contributor

I think it's easy to extend the trust from secure boot key to rootfs, just chain checksums/digests.

One detail here is that assuming you do the "transient key" model, you throw away reproducible builds - was mentioned in an ASG talk. The "static key" model solves that, but...hmm, I think has other problems.


Anyways...wait...why don't we just have the expected fsverity digest of the composefs in the UKI as e.g. /usr/lib/composefs/rootfs.digest or something like that, and then we know to look in /objects/<digest>, verify its digest against the expected one and mount that? Why would it be any harder than that? I feel like I must be missing something...a bit sleep deprived but I can't think of any problems.
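A rough sketch of that lookup-and-verify step. Everything here is illustrative: `find_verified_image` is a hypothetical helper, the store is assumed to hold one file per digest, and SHA-256 over the file content stands in for the fs-verity digest, which on a real system would be measured and enforced by the kernel rather than recomputed in userspace.

```python
import hashlib
from pathlib import Path


def find_verified_image(digest_file: Path, objects: Path) -> Path:
    """Return the path of the composefs image named by `digest_file`.

    Raises FileNotFoundError if the object is missing and ValueError if
    its content does not match the expected digest.
    NOTE: SHA-256 is a stand-in for the fs-verity digest.
    """
    expected = digest_file.read_text().strip()
    image = objects / expected
    actual = hashlib.sha256(image.read_bytes()).hexdigest()
    if actual != expected:
        raise ValueError(f"digest mismatch: expected {expected}, got {actual}")
    return image  # the caller would now mount this image
```

The only trusted input is the digest file, which is covered by the UKI signature; the object store itself can live on unverified storage.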

@ericcurtin

ericcurtin commented Sep 27, 2024

I think it's easy to extend the trust from secure boot key to rootfs, just chain checksums/digests.

One detail here is that assuming you do the "transient key" model, you throw away reproducible builds - was mentioned in an ASG talk. The "static key" model solves that, but...hmm, I think has other problems.

Anyways...wait...why don't we just have the expected fsverity digest of the composefs in the UKI as e.g. /usr/lib/composefs/rootfs.digest or something like that, and then we know to look in /objects/<digest>, verify its digest against the expected one and mount that? Why would it be any harder than that? I feel like I must be missing something...a bit sleep deprived but I can't think of any problems.

^ This is what I mean by the "rootfs.digest" file concept... That's basically what we do in the automotive distro, it scales better...

@cgwalters
Contributor

That's basically what we do in the automotive distro, it scales better...

Is it though? Aren't you using the ostree+composefs integration which does this with signature covering the ostree commit, which has the composefs digest? That's all that ostree-prepare-root.service does today...and hence it requires a key.

I think what happened is probably a conceptual overlap between the ostree commit and the composefs. Today composefs is just awkwardly glued onto the side of ostree (not a criticism, doing more starts to get hard, but now we're at that point where doing the hard things is worth it for a cleanup).

But yes we could change ostree-prepare-root.service to look for /usr/lib/ostree/composefs.meta in the initramfs which would be a pair of:

  • path to composefs blob (which today is deployment specific)
  • its expected digest

Hmm maybe yes the conceptual conflict was between ostree commits and the composefs blob, but if we're treating it as canonical then yeah my instinct here is:

  • Invent /composefs/objects as a recommended standard thing (or well, I guess it could be /usr/lib/composefs/objects in the physical root...dunno)
  • Change ostree to also link() the ostree-composefs into that directory based on its fsverity digest
  • Now we don't need both a path and a digest in the initramfs, and can standardize /usr/lib/composefs/rootfs.digest per above

@alexlarsson
Collaborator

Anyways...wait...why don't we just have the expected fsverity digest of the composefs in the UKI as e.g. /usr/lib/composefs/rootfs.digest or something like that, and then we know to look in /objects/<digest>, verify its digest against the expected one and mount that? Why would it be any harder than that? I feel like I must be missing something...a bit sleep deprived but I can't think of any problems.

Generally this doesn't work with ostree because the UKI is stored in the ostree tree, so it becomes a recursive cycle. We break the cycle by using the one-time key.

@alexlarsson
Collaborator

It could work in a system where the UKI and the rootfs are completely independent though.

@cgwalters
Contributor

cgwalters commented Sep 27, 2024 via email

@jbtrystram

jbtrystram commented Sep 27, 2024

But we can also break that cycle via just excluding the UKI from the composefs. That seems quite simple to do.

That is the conclusion we came to with @travier . It's okay because the uki is signed so not having it covered by fsverity does not matter

@allisonkarlitskaya
Copy link
Collaborator Author

It could work in a system where the UKI and the rootfs are completely independent though.

This is the situation I have in my head. I imagine that we have a system image in the form of a container and a UKI somewhere in a "special" path in that image that does not become part of the composefs, but goes directly into the EFI ESP.

@alexlarsson
Collaborator

Ok, so the plan is something like this:

  • Generate the container rootfs without the uki
  • Convert to OCI tarball
  • Create a composefs for the tarball
  • Generate the UKI, stuff the composefs image in it, sign the UKI
  • Append the UKI file to the tarball as /boot.uki
  • On deploy, extract UKI into ESP partition, and all files except /boot.uki into the object store.

I think this works, although I would also like to propose this alternative that may work better for the automotive usecase where we want to minimize the initrd:

  • Generate the container rootfs without the uki
  • Convert to OCI tarball
  • Create a composefs for the tarball
  • Compute the composefs digest
  • Append the composefs image to the tarball as /.rootfs.cfs
  • Generate the UKI, stuff the composefs digest in it, sign the UKI
  • Append the UKI file to the tarball as /.boot.uki
  • On deploy, extract UKI into ESP partition, and all files except /.boot.uki into the object store (note, this will include the composefs image in the object store).
  • On boot, the UKI knows the composefs digest, which it uses to find and validate the composefs image in the object store.

In fact, I can imagine mount.cfs supporting this mode of "composefs image is in object store" approach natively, so you just say "here is the object store, mount $digest at $path".

This approach does store the metadata for the rootfs twice in the tarball, so we could alternatively regenerate the composefs from the tarball metadata if we just skip the uki.

@cgwalters
Contributor

cgwalters commented Sep 30, 2024

Ok, so the plan is something like this:

That "composefs in UKI" was indeed something like what was originally proposed here, although I think the "digest in UKI" is not much harder to implement.

I think this works, although I would also like to propose this alternative that may work better for the automotive usecase where we want to minimize the initrd:

Yes, let's call this "digest in UKI"

Convert to OCI tarball

Not sure what you mean here, I don't like the implication it's just one tarball - that's throwing away a huge advantage of OCI. Let me try outlining the steps as I see them:

  • Start with all basic content needed for the rootfs and UKI (i.e. we otherwise have the structure for the initramfs, selinux labeling for the rootfs - everything before this penultimate step). We also have ready the setup for "chunking" the image into layers (optional in theory but it's really not hard to do and hence only the simplest/dumbest build system wouldn't do it). At this point in the process let's say that the rootfs is an OCI image, and the UKI is its otherwise prepared component parts (kernel, initramfs, etc)
  • Compute the canonical composefs digest of the rootfs (flattening the layers for that computation)
  • Append that digest as /usr/lib/composefs/sysroot.digest to the initramfs. We have a composefs-mount.service with ConditionPathExists=/usr/lib/composefs/sysroot.digest that keys off the presence of that and mounts it at /sysroot in the initramfs.
  • Stitch the UKI together, make it a new layer in the OCI image that stores it in the standard place. That layer also has an annotation like composefs.uki-digest=<digest> for the rootfs, and must be the final layer in order to be used
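To make the layer convention concrete, here is a small sketch that splits an OCI manifest into the rootfs layers and the UKI layer. The annotation key `composefs.uki-digest` is the placeholder name floated above, not an existing standard, and `split_uki_layer` is a hypothetical helper; only the general shape of an OCI manifest (a `layers` list of descriptors with optional `annotations`) is taken from the OCI image spec.

```python
import json

UKI_ANNOTATION = "composefs.uki-digest"  # placeholder name from the discussion


def split_uki_layer(manifest_json: str):
    """Return (rootfs_layers, uki_layer, expected_digest).

    The UKI layer only counts if it is the final layer; it is excluded
    from the layers used to compute or regenerate the composefs.
    If no final UKI layer exists, uki_layer and expected_digest are None.
    """
    layers = json.loads(manifest_json).get("layers", [])
    if layers:
        annotations = layers[-1].get("annotations") or {}
        if UKI_ANNOTATION in annotations:
            return layers[:-1], layers[-1], annotations[UKI_ANNOTATION]
    return layers, None, None
```

A deploy tool could flatten `rootfs_layers` to regenerate the composefs locally and compare the result against `expected_digest` before installing the UKI.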

On deploy, extract UKI into ESP partition, and all files except /.boot.uki into the object store (note, this will include the composefs image in the object store).

Right, although I'd clarify that's just because the process will regenerate the composefs, not because it was extracted. For GC purposes, it will be handy to have the digest linked from the manifest.

In fact, I can imagine mount.cfs supporting this mode of "composefs image is in object store" approach natively, so you just say "here is the object store, mount $digest at $path".

Yeah.

This approach does store the metadata for the rootfs twice in the tarball, so we could alternatively regenerate the composefs from the tarball metadata if we just skip the uki.

My proposed variant doesn't, we regenerate the composefs locally. I don't see the need to ship it explicitly in this model (any more than we do with ostree today).


EDIT: I should emphasize none of this whole design requires or is tied to OCI - one could replicate something like this with DDIs or whatever else too. It's just useful to use OCI as a reference design target.

@alexlarsson
Collaborator

  • Stitch the UKI together, make it a new layer in the OCI image that stores it in the standard place. That layer also has an annotation like composefs.uki-digest=<digest> for the rootfs, and must be the final layer in order to be used

Isn't a layer like this problematic when combining images, like in a Dockerfile? Each bootable image (including a bootc base image) must have a layer like this, and when you use Dockerfiles to create new layers you will end up with multiple layers like this, no?

Otherwise I think that sounds fine.

@cgwalters
Contributor

For this case let's assume that the build system is capable of full control over the final image structure; this means that building OCI images for this won't look like a plain Dockerfile. The two paths for full control are the FROM oci-archive trick or a process which accepts an existing already built container image as input and rewrites it.

@allisonkarlitskaya
Collaborator Author

I'm fine with having a digest or the entire file in the UKI. I think I mostly want to kill this signing-key idea and turn it into some kind of a definitive hard link where we don't have to trust that the signing key was only used once.

I think we maybe think too much about what's in the object database and what's not. I'd be happy putting the erofs image itself into the object database, or also the UKI. At the end of the day, these things are just blobs of data which have identifiers and some of those blobs can sometimes refer to other blobs in certain ways. I think the (only) thing we need to abandon to eliminate the dependency loop is the idea that the kernel image will appear as part of the root filesystem at runtime.

@allisonkarlitskaya
Collaborator Author

I think we've decided not to do this.

@allisonkarlitskaya closed this as not planned Oct 7, 2024
@travier
Member

travier commented Oct 7, 2024

Could you clarify why you think we should not do this?

Never mind, I read the backlog of comments that I had missed. Sorry.

@travier
Member

travier commented Oct 7, 2024

#332 (comment) looks like a good plan. Note that the actual place where the UKI is stored in the image does not really matter as it won't be part of the final rootfs as it won't be in the composefs, but putting it in a standard place is also good. We however will have to make sure that we don't include the content of this special layer when regenerating the composefs blob.
