-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
rootfs: make pivot_root(2) dance handle initramfs case #4434
base: main
Are you sure you want to change the base?
Conversation
948a7d9
to
28d2e47
Compare
28d2e47
to
39bb12c
Compare
Okay, I managed to test this and this version definitely works. The setup is not too complicated, but I'm not sure if we could practically test this within our CI (what would be a nice way of verifying the container ran inside the VM?). Script used to create the initramfs (openSUSE)#!/bin/bash
set -Eeuo pipefail
#sudo zypper in -y busybox syslinux skopeo umoci
sudo rm -rf boot-img/
[ -e runc ] || curl -sSL "https://github.com/opencontainers/runc/releases/download/v1.2.0/runc.amd64" -o runc
[ -d opensuse ] || skopeo copy docker://opensuse/tumbleweed:latest oci:opensuse:tumbleweed
[ -d bundle ] || sudo umoci unpack --image opensuse:tumbleweed ./bundle
mkdir -vp boot-img/
pushd boot-img
mkdir -p ./usr/bin
ln -sv usr/bin ./bin
# Copy runc.
ln ../runc ./usr/bin/runc
# Copy the rootfs bundle.
mkdir -p run
sudo cp -aR ../bundle ./run/bundle
# openSUSE makes /usr/bin/busybox non-static, and you can't ask busybox.install
# to install the static version. So install busybox using symlinks and then
# replace busybox with busybox-static.
busybox.install . --symlinks
cp -v /usr/bin/busybox-static ./usr/bin/busybox
# Boot into a shell.
cat >./init <<EOF
#!/bin/sh
echo "HELLO WORLD"
mkdir -p /proc
mount -t proc proc /proc
mkdir -p /sys
mount -t sysfs sysfs /sys
mkdir -p /sys/fs/cgroup
mount -t cgroup2 cgroup2 /sys/fs/cgroup
mkdir -p /tmp
mount -t tmpfs tmpfs /tmp
mkdir -p /dev
mount -t devtmpfs devtmpfs /dev
mkdir -p /dev/pts
mount -t devpts -o newinstance devpts /dev/pts
mkdir -p /dev/shm
mount --bind /tmp /dev/shm
/bin/sh
EOF
chmod +x ./init
# Build our init.cpio.
sudo find . | sudo cpio -o -H newc > ../init.cpio
popd And you can then just do |
47dffa5
to
e27cce8
Compare
@kolyshkin @AkihiroSuda Do you want me to try to come up with a CI test for this case? I'm not really sure if there is a nice way of testing the output of qemu (doesn't GitHub block nested VMs as well?). Then again, maybe we could make the init script run runc and just parse the |
I'm afraid it's going to be tough. In case qemu is not working in GHA (last time I checked it was only working on Mac OS X instances, but it was ~2y ago), try cirrus-ci, it uses GCP and with this instance it's possible to use nested virt (which we do to run vagrant-libvirt): Lines 22 to 29 in 4ad9f7f
|
The default Linux instances of GHA now supports nested virt. |
5a291f2
to
0dfe91b
Compare
3c366c2
to
5cd283a
Compare
if err != nil { | ||
// Always try to do pivot_root(2) because it's safe, and only fallback | ||
// to the unsafe MS_MOVE+chroot(2) dance if pivot_root(2) fails. | ||
logrus.Warnf("your container failed to start with pivot_root(2) (%v) -- please open a bug report to let us know about your usecase", err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
End-users may not understand who are "us", as they don't execute runc directly.
We may link https://github.com/opencontainers/runc/issues for explicitness, but we may potentially get a report about some third-party product that we have never heard of.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We already provide a link to the tracking issue for this deprecation if you pass --no-pivot
(regardless of whether pivot_root(2)
works), I don't think we need to provide the same link twice?
e3acf8d
to
beaf64a
Compare
Okay, I managed to get this working on AlmaLinux 9 so now it's ready for review again @kolyshkin @AkihiroSuda. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just one trivial nit; LGTM. This is nice!
While pivot_root(2) normally refuses to pivot a mount if you are running with / as initramfs (because initramfs doesn't have a parent mount), you can create a bind-mount of / and make that your new root to work around this problem. This does use chroot(2), but this is only done temporarily to set current->fs->root to the new mount. Once pivot_root(2) finishes, the chroot(2) and / are gone. Variants of this hack are fairly well known and is used all over the place (see [1,2]) but until now we have forced users to have a far less secure configuration with --no-pivot. This is a slightly modified version that uses the container rootfs as the temporary spot for the / clone -- this allows runc to continue working with read-only image-based OS images. [1]: containers/bubblewrap#592 (comment) [2]: https://aconz2.github.io/2024/07/29/container-from-initramfs.html Signed-off-by: Aleksa Sarai <[email protected]>
3563055
to
a99f0a8
Compare
run.go
Outdated
Usage: "do not use pivot root to jail process inside rootfs. This should be used whenever the rootfs is on top of a ramdisk", | ||
Usage: "do not use pivot root to jail process inside rootfs to work around limitations when running in an initramfs (this option is deprecated, insecure, and unnecessary now that pivot_root works with initramfs -- see <https://github.com/opencontainers/runc/issues/4435>)", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need the same change for create.go
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The (slightly more bold) alternative is to make the option hidden (in this case the description can be removed).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking of hiding it for 1.3.0
but keeping it visible for 1.2.z
. I've added a new patch deprecating it, and I'll exclude that from the backport.
Despite the hardenings we've added to the MS_MOVE+chroot dance over the years like commit 28a697c ("rootfs: umount all procfs and sysfs with --no-pivot"), --no-pivot is fundamentally insecure and the primary reason why people use it (to run containers from initramfs) can now be done safely with pivot_root(2). So we should always try to pivot_root(2) and give a warning to the user that their configuration is insecure if we have to use the --no-pivot fallback (users should not see this message in practice, because the primary users that couldn't use pivot_root(2) now can and will transparently use it if possible). Signed-off-by: Aleksa Sarai <[email protected]>
This includes bats_pipe and some other nice features we can use in our tests. Signed-off-by: Aleksa Sarai <[email protected]>
The runc wrapper we have had for a long time is useful for debugging errors (because bats's built-in "run" doesn't provide the output of the program in case of an error). However, it might be necessary to run other programs in a similar wrapper. In addition, it might be needed to add timeouts or use the bats_pipe feature (bats v1.10.0) with similar wrapping. Adding support for this in the sane_run wrapper makes it a little easier to use these things when writing tests. This is somewhat adapted from umoci's sane_run helper. Signed-off-by: Aleksa Sarai <[email protected]>
The only way to run in the proper initramfs is to start a VM using a custom initrd that runs runc. This should be a fairly reasonable smoke-test that matches what minikube and kata do. Unfortunately, running the right qemu for the native architecture on various distros is a little different, so we need a helper function to get it to work on both Debian and AlmaLinux. Signed-off-by: Aleksa Sarai <[email protected]>
Existing users will still be able to use it, but new users won't be tempted into using this flag (and their containers should be able to run without issue without this flag anyway). Signed-off-by: Aleksa Sarai <[email protected]>
a99f0a8
to
363768c
Compare
return &os.PathError{Op: "pivot_root", Path: ".", Err: err} | ||
pivotErr := unix.PivotRoot(".", ".") | ||
if errors.Is(pivotErr, unix.EINVAL) { | ||
// If pivot_root(2) failed with -EINVAL, one of the possible reasons is |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are six reasons will cause -EINVAL
:
EINVAL new_root is not a mount point.
EINVAL put_old is not at or underneath new_root.
EINVAL The current root directory is not a mount point (because
of an earlier [chroot(2)](https://man7.org/linux/man-pages/man2/chroot.2.html)).
EINVAL The current root is on the rootfs (initial ramfs) mount;
see NOTES.
EINVAL Either the mount point at new_root, or the parent mount of
that mount point, has propagation type MS_SHARED.
EINVAL put_old is a mount point and has the propagation type
MS_SHARED.
This PR tries to deal with the forth one, does this pr would cause some unexpected behaviors for the fifth one? (Maybe need to check once have a time)
The user case:
For docker, containerd, kubernetes, the rootfs is always mount as a overlay fs, it will always be changed as MS_PRIVATE
by runc, but the parent mount of rootfs may have propagation type MS_SHARED.
Before this patch, it will return an error immediately, but after this patch, it will do some magic operation to let pivot_root(2)
work, I think we should be careful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The user case:
For docker, containerd, kubernetes, the rootfs is always mount as a overlay fs, it will always be changed asMS_PRIVATE
by runc, but the parent mount of rootfs may have propagation type MS_SHARED.
For this case, this patch has the same security issue like --no-pivot
, even this flag is not provided. I can escape from the container.
So, I think we should split this PR into two:
- Deprecate the flag
no-pivot
, this should be inmilestone 1.2.1
; - Find a new way to handle initramfs case, this one can be moved to the next version if we can't fix it in a short time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For this case, this patch has the same security issue like --no-pivot, even this flag is not provided. I can escape from the container.
How? After pivot_root(2)
the whole host mount tree is gone, no? chroot_fs_refs
should move the current->fs->root
to the pivot_root(2)
'd path so the intermediate chroot(2)
should not have any effect on the final mount setup AFAICS. Can you share a reproducer?
For docker, containerd, kubernetes, the rootfs is always mount as a overlay fs, it will always be changed as MS_PRIVATE by runc, but the parent mount of rootfs may have propagation type MS_SHARED.
We also set the parent to MS_PRVIATE
in rootfsParentMountPrivate
. (It already should not be the case that we get -EINVAL
for that reason because we explicitly make sure that it works with rootfsParentMountPrivate
.) Is that not sufficient?
The bind-mount of /
keeps the propagation settings the same as well. So if the propagation was actually wrong, the second pivot_root(2)
would still fail. I also don't see how you could use that to break out of the container -- AFAIK the reason the kernel requires the mount to be !shared
is so that the mount tree changes during pivot_root(2)
don't propagate to the host (and break it).
Find a new way to handle initramfs case, this one can be moved to the next version if we can't fix it in a short time.
The "magic" done is just creating a bind-mount of the root that can be used for pivot_root(2)
(in an MS_PRIVATE
mount so the mount itself will be MS_PRIVATE
, to answer your question).
This is the only way of doing it (I spoke to the VFS maintainers a few weeks ago, and this is the only "right" way of doing it -- FWIW Lennart implied that systemd does this but they don't it seems).
It's a bit hard to understand what the problem is without an actual reproducer.
Deprecate the flag no-pivot, this should be in milestone 1.2.1;
We can't deprecate it without having a solution for minikube and kata...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can't deprecate it without having a solution for minikube and kata...
Mark it deprecate not means we remove it right now, it's only to remind users that they should not use it except there is no way to run their case, and to inform users that we can't guarantee the container's security with --no-privot
.
But yes, it's a little strange do it in the situation that we still have no solution for minikube and kata or other projects. I think we can find a solution to fix it. I suggest to split into two because maybe we need more time to find out this solution, but I think we should release v1.2.1
ASAP, then we can let runc v1.2.*
could be tested by more and more users, especially in some downstream rc versions.
We also set the parent to
MS_PRVIATE
inrootfsParentMountPrivate
. (It already should not be the case that we get-EINVAL
for that reason because we explicitly make sure that it works withrootfsParentMountPrivate
.) Is that not sufficient?
No. I have concern about it for a long time after #4417 merged, the commit is: 13a6f56. This commit has changed the behavior of rootfsParentMountPrivate
:
1. Before this commit, we make /
to MS_PRIVATE
mount;
2. After this commit, if the rootfs
is a overlay mount, we will only make rootfs
to MS_PRIVATE
mount.
@kolyshkin I don't know this behavior change will cause some issues, but we will not set parent to MS_PRIVATE
is a fact, but maybe make rootfs
MS_PRIVATE
is enough.
EDIT: Sorry, I'm wrong, the parent mount point has always been changed to MS_PRIVATE
by rootfsParentMountPrivate
; So this bug is not related to this, I have sent the email to @cyphar, if you don't receive it, please let me know.
It's a bit hard to understand what the problem is without an actual reproducer.
I'll send you a email ASAP.
I'm assuming we won't include this in 1.2.1 as this is still a draft. We can include it later, though. |
This looks more like a "new feature" than a "bug fix" for me, and so it is probably 1.3 material. |
Yeah, 1.2.1 was the wrong thing to label it. (Especially since there are some subtleties I need to look into...) |
Removed the @cyphar is this still a draft? |
@kolyshkin Yes, I still need to figure out how to fix the issue that @lifubang found. With the right setup you can end up with the wrong root being mounted in the container, and it's not obvious looking at the source of |
While
pivot_root(2)
normally refuses to pivot a mount if you are runningwith
/
as initramfs (because initramfs doesn't have a parent mount), youcan create a bind-mount of
/
and make that your new root to work aroundthis problem. This does use
chroot(2)
, but this is only done temporarilyto set
current->fs->root
to the new mount. Oncepivot_root(2)
finishes,the
chroot(2)
and/
are gone.Variants of this hack are fairly well known and is used all over the
place (see 1, 2) but until now we have forced users to have a far less
secure configuration with
--no-pivot
. This is a slightly modifiedversion that uses the container rootfs as the temporary spot for the
/
clone -- this allows runc to continue working with read-only image-based
OS images.
Signed-off-by: Aleksa Sarai [email protected]