OCI Image boot quirks or "The road to FROM fedora: 40" #2

Open
antheas opened this issue Aug 5, 2024 · 4 comments
@antheas
Collaborator

antheas commented Aug 5, 2024

Let's use this issue to track quirks inherent to booting an OCI image. Hopefully, when this issue closes, it will be possible for someone to boot an image made with FROM fedora:40.

First, let's begin with some background on the quirks.

Background

Currently, an OSTree-based system can only boot an OSTree commit. OSTree commits are essentially a serialization format for a filesystem, similar to a tarball, with the added benefit that they can be deduplicated at the file level.

To make that directory bootable and memoryless ("without hysteresis"), the OSTree project contains a variety of setup steps in which, e.g., the initramfs is generated and placed in /usr/lib, and /etc files are moved to /usr/etc.

These steps are performed by rpm-ostree's image generation backend and can currently only be done with that tool. In addition, rpm-ostree contains a couple of systemd services that fix up OS quirks (e.g., generating /var from a location called the var factory).

Then, the filesystem is wrapped into a commit and placed on an HTTP/2-enabled server, from which users download new system files when an update happens.

While revolutionary, this system had the following disadvantages:

  • Cannot keep up with internet speeds. Regardless of whether HTTP/2 is used, performing random file requests on an HTTP host is CPU intensive.
  • Not possible to extend
  • The tree file format, while logical, is very hard to adopt.

OCI extension

Therefore, ostree-rs-ext was developed with a new serialization format, which converts an OSTree commit to an OCI image. This standard embeds the OSTree commit as an OSTree repository with xattr format in the /sysroot/ostree directory. Then, as the commit is written to the tar stream, the ostree files are hardlinked to the location they would have in the system (e.g., /usr/etc OSTree files are hardlinked to /etc).
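To illustrate the hardlinking described above, here is a minimal sketch. The function names and the paths in the usage note are hypothetical, and this is not how ostree-rs-ext itself writes the tar stream; it only shows that a hardlinked path shares its data with the stored object instead of copying it.

```shell
# link_object OBJECT DEST: expose OBJECT at DEST as a hardlink (no data is copied).
# Both names are hypothetical helpers for illustration only.
link_object() {
    mkdir -p "$(dirname "$2")"
    ln "$1" "$2"
}

# same_inode PATH_A PATH_B: succeed when both paths refer to the same file on disk.
same_inode() {
    [ "$1" -ef "$2" ]
}
```

For example, after `link_object ./sysroot/ostree/example.file ./etc/example.conf` (made-up paths), `same_inode` succeeds for the two paths, whereas it would fail for a plain `cp` copy.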

The benefit of this format is that it makes it possible to run the result as a container and extend it.

This is why Bazzite is possible.

A trivial compression format splits this across 64 layers to make it easier to download and make some bandwidth savings possible.

When rpm-ostree receives that image, it first checks whether it is a commit that has not been extended. If it has not been extended, it imports it as usual. If it has been extended, it imports the OSTree layers as an original "base" commit. The directory permissions are also sourced from the commit, and these can (and do) differ from those in the final container.

Then, for the extension layers, OSTree converts them to small commits on the fly, by using the base commit for SELinux labelling and moving /etc files to /usr/etc. This means that any extensions added over OCI have not been postprocessed and have quirks.

For example, the /etc/passwd file has drift. And since only the base commit is used for SELinux labelling, any package additions with custom SELinux rules break.

And of course, if there is no base commit, rpm-ostree will not load the image.

Bootc

Now, bootc comes along and formalizes the notion of OCI as OS images. Initially, it uses ostree-rs-ext to do the unencapsulation. However, soon it will use podman to pull and expand the container, which is then fed to OSTree (containers/bootc#215). This solves the SELinux issues but introduces a set of new ones.

The codebase of that PR was referenced when building rechunk and, surprisingly, the resulting image did not boot. Therefore, when that PR merges bootc will stop being able to boot extended images.

Why?

A lot of minor reasons:

  • The OCI container might have wrong permissions in certain systemd dirs, which makes it fail to boot.
  • The container might have both an /etc and a /usr/etc dir, which OSTree does not like at all. It works right now only because of the way rpm-ostree is implemented (/etc files are transparently merged into /usr/etc).
  • The polkitd folder might have lost the polkitd group and broken.
  • Podman rootless may break because newuidmap has broken capabilities.

And so on (see https://github.com/hhd-dev/rechunk/blob/master/1_prune.sh), with even more quirks we do not know about.

TLDR

In order for FROM fedora:40 to be possible, the following need to happen:

  • The postprocessing applied by rpm-ostree needs to be documented
  • The loss of attributes needs to be documented (file capabilities, xattrs, SELinux)
  • In case attributes are missing, before the image is deployed it needs to be "quirked" to have correct permissions (e.g., adding polkitd to /usr/etc/polkit-1/rules.d)
  • Both bootc (when deploying arbitrary images) and rechunk (when preparing OSTree commits) need to implement them so that there is no drift between the two implementations (e.g., users can skip rechunk when testing an image and deploy it straight with bootc).

Of course, there is still value in using ostree encapsulated commits in a bootc world:

  • Maintains compat with rpm-ostree
  • Lower layer invalidation means fewer lookups when updating AND fewer files committed to OSTree
  • There is no need for unrolling the original image for SELinux labelling
  • There is no need for quirking the original image, which takes time
  • Composefs can be precomputed
  • Users can still extend the image arbitrarily

For most users who are not developers, it does not make sense to have to eat the update cost for the sake of distro maintainer DX, especially when rechunk can fix up the image in 7 min.

Tagging @cgwalters as the discussion with containers/bootc#215 affects bootc

@antheas
Collaborator Author

antheas commented Aug 5, 2024

Right now, the following permissions differences have been identified between booting an OCI image after stripping the base commit and an OCI image with a base commit:

The following dirs have different permissions:

chmod 750 ./usr/etc/audit
chmod 750 ./usr/etc/audit/rules.d
chmod 755 ./usr/etc/bluetooth
chmod 750 ./usr/etc/dhcp
chmod 750 ./usr/etc/firewalld
chmod 700 ./usr/etc/grub.d
chmod 700 ./usr/etc/nftables
chmod 700 ./usr/etc/nftables/osf
chmod 555 ./usr/etc/pki/ca-trust/extracted/pem/directory-hash
chmod 750 ./usr/etc/polkit-1/rules.d
chmod 700 ./usr/etc/ssh/sshd_config.d
chmod 700 ./usr/lib/containers/storage/overlay-images
chmod 700 ./usr/lib/containers/storage/overlay-layers
chmod 700 ./usr/lib/ostree-boot/efi
chmod 700 ./usr/lib/ostree-boot/efi/EFI
chmod 700 ./usr/lib/ostree-boot/efi/EFI/BOOT
chmod 700 ./usr/lib/ostree-boot/efi/EFI/fedora
chmod 700 ./usr/lib/ostree-boot/grub2
chmod 700 ./usr/lib/ostree-boot/grub2/fonts
chmod 750 ./usr/libexec/initscripts/legacy-actions/auditd

These differences make systemd panic and prevent sddm from launching.
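A list like the one above can be produced by comparing two unpacked root trees. The following is a minimal sketch, assuming GNU `stat` and two already extracted directories; the function name is hypothetical and this is not necessarily how the list in this comment was generated.

```shell
# print_dir_mode_fixups REFERENCE_ROOT TARGET_ROOT
# For every directory present in both trees, print a "chmod MODE ./path" line
# when the mode in TARGET_ROOT differs from the one in REFERENCE_ROOT.
# Assumes GNU stat (stat -c '%a'); the function name is made up for this sketch.
print_dir_mode_fixups() {
    ref=$1
    target=$2
    (cd "$ref" && find . -type d | sort) | while IFS= read -r dir; do
        # Skip directories that only exist in the reference tree.
        [ -d "$target/$dir" ] || continue
        ref_mode=$(stat -c '%a' "$ref/$dir")
        target_mode=$(stat -c '%a' "$target/$dir")
        if [ "$ref_mode" != "$target_mode" ]; then
            printf 'chmod %s %s\n' "$ref_mode" "$dir"
        fi
    done
}
```

The output is meant to be run from the root of the target tree, which is why paths are printed relative to it.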

The following bins have had their capabilities stripped (probably due to ostree-rs-ext's gzip encoding):

setcap cap_dac_override,cap_net_admin,cap_net_raw=eip ./usr/bin/dumpcap
setcap cap_sys_nice=ep ./usr/bin/kwin_wayland
setcap cap_setgid=ep ./usr/bin/newgidmap
setcap cap_setuid=ep ./usr/bin/newuidmap
setcap cap_net_bind_service=ep ./usr/bin/rcp
setcap cap_net_bind_service=ep ./usr/bin/rlogin
setcap cap_net_bind_service=ep ./usr/bin/rsh
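One way to obtain such a list is to record the capabilities of a known-good tree with `getcap -r` and turn the output into `setcap` commands. The filter below is a sketch under assumptions: it expects input lines shaped either `PATH CAPS` or `PATH = CAPS` (to my knowledge, the two shapes printed by newer and older libcap respectively), it does not handle paths containing spaces, and the function name is hypothetical.

```shell
# getcap_to_setcap: read `getcap` output on stdin, print `setcap CAPS PATH` lines.
# Handles "PATH CAPS" and "PATH = CAPS"; paths with spaces are not supported.
getcap_to_setcap() {
    awk 'NF >= 2 {
        if ($2 == "=") print "setcap", $3, $1
        else print "setcap", $2, $1
    }'
}
```

A possible use would be `getcap -r ./usr 2>/dev/null | getcap_to_setcap`, run from the root of the reference tree.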

The following dirs lose their polkitd group ownership, since polkitd is no longer in /etc/group but in /usr/lib/group instead:

chgrp $POLKIT_ID ./usr/etc/polkit-1/localauthority
chgrp $POLKIT_ID ./usr/etc/polkit-1/rules.d

This causes polkit to stop working.
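The comment does not say how $POLKIT_ID is obtained. One possible approach, sketched here as an assumption, is to look the group up in the image's group files, trying each file in order. `lookup_gid` is a hypothetical helper name.

```shell
# lookup_gid GROUP FILE...
# Print the numeric GID of GROUP from the first group file that defines it.
# Returns non-zero when no file contains the group.
lookup_gid() {
    name=$1
    shift
    for file in "$@"; do
        [ -f "$file" ] || continue
        gid=$(awk -F: -v n="$name" '$1 == n { print $3; exit }' "$file")
        if [ -n "$gid" ]; then
            printf '%s\n' "$gid"
            return 0
        fi
    done
    return 1
}
```

For example, `POLKIT_ID=$(lookup_gid polkitd ./etc/group ./usr/lib/group)` would fall back to /usr/lib/group when the entry is missing from /etc/group.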

@antheas
Collaborator Author

antheas commented Aug 5, 2024

Neither rpm-ostree nor the bootc PR does the following (when the additions are made through OCI):

Remove /etc lockfiles:

rm -rf \
    ./etc/.pwd.lock \
    ./etc/passwd- \
    ./etc/group- \
    ./etc/shadow- \
    ./etc/gshadow- \
    ./etc/subuid- \
    ./etc/subgid- \
    ./.dockerenv

Update /usr/lib/passwd and /usr/lib/group based on the additions to /etc/passwd and /etc/group. In addition, ostree-rs-ext does not copy said files to /etc, so they are not available in the container runtime.
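The passwd/group update could look roughly like the sketch below. It assumes the simplest possible rule: append to the /usr/lib file every entry whose name it does not already contain. The exact rules rpm-ostree or rechunk apply are not stated here and may differ, and `append_missing_entries` is a hypothetical name.

```shell
# append_missing_entries SRC DST
# Append to DST every colon-separated entry from SRC whose first field (the
# user or group name) is not yet present in DST. Safe to run repeatedly.
append_missing_entries() {
    src=$1
    dst=$2
    [ -f "$dst" ] || : > "$dst"
    tmp=$(mktemp)
    # First pass over DST records known names, second pass over SRC prints new ones.
    awk -F: 'FILENAME == ARGV[1] { seen[$1] = 1; next } !($1 in seen)' "$dst" "$src" > "$tmp"
    cat "$tmp" >> "$dst"
    rm -f "$tmp"
}
```

For example: `append_missing_entries ./etc/passwd ./usr/lib/passwd` followed by `append_missing_entries ./etc/group ./usr/lib/group`.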

They do not handle moving /var/lib to /usr/lib or /var to /usr/share/factory/var.

Bootc does not merge /usr/etc into /etc before moving /etc to /usr/etc, which rpm-ostree does implicitly. That means that if someone creates a file in /usr/etc, bootc will fail (all ublue images scatter files into /usr/etc, mostly out of confusion, and would fail to boot).
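A minimal sketch of that merge on an unpacked root tree follows. It assumes that /etc should win when the same file exists in both places, which is a guess and not something stated in this thread; the function name is hypothetical. Copying /etc on top of /usr/etc and then removing /etc gives the same end state as merging first and moving afterwards.

```shell
# merge_etc_into_usr_etc ROOT
# Fold ROOT/etc into ROOT/usr/etc so that only /usr/etc remains.
# Assumption: on conflict, the file from /etc overrides the one in /usr/etc.
merge_etc_into_usr_etc() {
    root=$1
    [ -d "$root/etc" ] || return 0
    mkdir -p "$root/usr/etc"
    # "etc/." copies the directory contents, including dotfiles, preserving attributes.
    cp -a "$root/etc/." "$root/usr/etc/"
    rm -rf "$root/etc"
}
```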

rpm-ostree stashes around 300 MB of data in /usr/lib/sysimage/rpm-ostree-base-db/, and it is unclear what it is used for. Rechunk removes it.

@cgwalters

The postprocessing applied by rpm-ostree needs to be documented

This relates strongly to https://gitlab.com/fedora/bootc/base-images-experimental and https://gitlab.com/fedora/bootc/tracker/-/issues/32

@cgwalters

There's a whole lot going on in this project (thanks for starting it!)...I think though we are going to need to tease some of these sub-problems apart and tackle them more clearly individually.

Especially the:

In order for FROM fedora:40 to be possible

part.


BTW, when I was looking at this, one of the fundamental bits of "sand in the gears" here is containers/buildah#5592, and really the only way to work around that today is to step "outside" and reserialize (or at least fix up) the tar streams generated by podman/docker.

(Yes, we should fix that bug)
