@AkihiroSuda
Member

@AkihiroSuda AkihiroSuda commented Sep 8, 2022

Dockerfile has VOLUME /home/user/.local/share/buildkit by default too, but the default VOLUME does not work with rootless on Google's Container-Optimized OS as it is mounted with nosuid,nodev.

So the volume has to be explicitly mounted as an emptyDir volume.

Tested with GKE Autopilot 1.24.3-gke.200 (kernel 5.10.123+, containerd 1.6.6).

Fix #879

Thanks to Andrew Grigorev (@ei-grad) and Ben Cressey (@bcressey).

@AkihiroSuda AkihiroSuda changed the title rootless: support Google Container-Optimized OS and Amazon Bottlerocket OS rootless: support Google Container-Optimized OS and Amazon Bottlerocket OS (Fix Options:[rbind ro]}]: operation not permitted errors) Sep 8, 2022
@bcressey

bcressey commented Sep 9, 2022

Interesting! This doesn't actually fix the issue on Bottlerocket; the emptyDir mount still has the problematic nosuid,nodev flags:

/dev/nvme1n1p1 on /home/user/.local/share/buildkit type ext4 (rw,seclabel,nosuid,nodev,noatime)

It's great that it works on GKE and GCOS though. I wonder if it's because the backing directory for emptyDir mounts there is a bind mount that's been remounted with dev,suid. Rather than a change here, that would point to the need for a corresponding fix in Bottlerocket so this works as expected.

@AkihiroSuda any chance you could check your GCOS host (via findmnt -o target,vfs-options or mount) to see if either /var/lib/kubelet or the pod-specific kubernetes.io~empty-dir volume is mounted with different options?

@AkihiroSuda AkihiroSuda marked this pull request as draft September 9, 2022 07:07
Dockerfile has `VOLUME /home/user/.local/share/buildkit` by default too,
but the default VOLUME does not work with rootless on Google's Container-Optimized OS
as it is mounted with `nosuid,nodev`.

So the volume has to be explicitly mounted as an `emptyDir` volume.

Tested with GKE Autopilot 1.24.3-gke.200 (kernel 5.10.123+, containerd 1.6.6).

Fix issue 879

Thanks to Andrew Grigorev (ei-grad) and Ben Cressey (bcressey).

Signed-off-by: Akihiro Suda <[email protected]>
@AkihiroSuda AkihiroSuda changed the title rootless: support Google Container-Optimized OS and Amazon Bottlerocket OS (Fix Options:[rbind ro]}]: operation not permitted errors) rootless: support Google Container-Optimized OS (Fix Options:[rbind ro]}]: operation not permitted errors) Sep 9, 2022
@AkihiroSuda AkihiroSuda marked this pull request as ready for review September 9, 2022 08:25
@AkihiroSuda
Member Author

> This doesn't actually fix the issue on Bottlerocket; the emptyDir mount still has the problematic nosuid,nodev flags:
>
> /dev/nvme1n1p1 on /home/user/.local/share/buildkit type ext4 (rw,seclabel,nosuid,nodev,noatime)

Thanks for the info 👀 , removed Bottlerocket from the PR description.

> any chance you could check your GCOS host (via findmnt -o target,vfs-options or mount) to see if either /var/lib/kubelet or the pod-specific kubernetes.io~empty-dir volume is mounted with different options?

With emptyDir: /dev/sda1 on /home/user/.local/share/buildkit type ext4 (rw,relatime,commit=30)

$ kubectl exec buildkitd -- mount
W0909 17:28:16.661963    2250 gcp.go:119] WARNING: the gcp auth plugin is deprecated in v1.22+, unavailable in v1.26+; use gcloud instead.
To learn more, consult https://cloud.google.com/blog/products/containers-kubernetes/kubectl-auth-changes-in-gke
overlay on / type overlay (rw,relatime,lowerdir=/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/208/fs:/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/207/fs:/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/206/fs:/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/205/fs:/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/204/fs:/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/203/fs,upperdir=/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/209/fs,workdir=/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/209/work)
proc on /proc type proc (rw,nosuid,nodev,noexec,relatime)
tmpfs on /dev type tmpfs (rw,nosuid,size=65536k,mode=755)
devpts on /dev/pts type devpts (rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=666)
mqueue on /dev/mqueue type mqueue (rw,nosuid,nodev,noexec,relatime)
sysfs on /sys type sysfs (ro,nosuid,nodev,noexec,relatime)
tmpfs on /sys/fs/cgroup type tmpfs (rw,nosuid,nodev,noexec,relatime,mode=755)
cgroup on /sys/fs/cgroup/systemd type cgroup (ro,nosuid,nodev,noexec,relatime,xattr,name=systemd)
cgroup on /sys/fs/cgroup/blkio type cgroup (ro,nosuid,nodev,noexec,relatime,blkio)
cgroup on /sys/fs/cgroup/cpu,cpuacct type cgroup (ro,nosuid,nodev,noexec,relatime,cpu,cpuacct)
cgroup on /sys/fs/cgroup/perf_event type cgroup (ro,nosuid,nodev,noexec,relatime,perf_event)
cgroup on /sys/fs/cgroup/net_cls,net_prio type cgroup (ro,nosuid,nodev,noexec,relatime,net_cls,net_prio)
cgroup on /sys/fs/cgroup/pids type cgroup (ro,nosuid,nodev,noexec,relatime,pids)
cgroup on /sys/fs/cgroup/cpuset type cgroup (ro,nosuid,nodev,noexec,relatime,cpuset)
cgroup on /sys/fs/cgroup/freezer type cgroup (ro,nosuid,nodev,noexec,relatime,freezer)
cgroup on /sys/fs/cgroup/hugetlb type cgroup (ro,nosuid,nodev,noexec,relatime,hugetlb)
cgroup on /sys/fs/cgroup/memory type cgroup (ro,nosuid,nodev,noexec,relatime,memory)
cgroup on /sys/fs/cgroup/devices type cgroup (ro,nosuid,nodev,noexec,relatime,devices)
cgroup on /sys/fs/cgroup/rdma type cgroup (ro,nosuid,nodev,noexec,relatime,rdma)
shm on /dev/shm type tmpfs (rw,nosuid,nodev,noexec,relatime,size=65536k)
/dev/sda1 on /etc/hosts type ext4 (rw,relatime,commit=30)
/dev/sda1 on /dev/termination-log type ext4 (rw,relatime,commit=30)
/dev/sda1 on /etc/hostname type ext4 (rw,nosuid,nodev,relatime,commit=30)
/dev/sda1 on /etc/resolv.conf type ext4 (rw,nosuid,nodev,relatime,commit=30)
tmpfs on /run/secrets/kubernetes.io/serviceaccount type tmpfs (ro,relatime,size=2097152k)
/dev/sda1 on /home/user/.local/share/buildkit type ext4 (rw,relatime,commit=30)
proc on /proc/bus type proc (ro,nosuid,nodev,noexec,relatime)
proc on /proc/fs type proc (ro,nosuid,nodev,noexec,relatime)
proc on /proc/irq type proc (ro,nosuid,nodev,noexec,relatime)
proc on /proc/sys type proc (ro,nosuid,nodev,noexec,relatime)
proc on /proc/sysrq-trigger type proc (ro,nosuid,nodev,noexec,relatime)
tmpfs on /proc/acpi type tmpfs (ro,relatime)
tmpfs on /proc/kcore type tmpfs (rw,nosuid,size=65536k,mode=755)
tmpfs on /proc/keys type tmpfs (rw,nosuid,size=65536k,mode=755)
tmpfs on /proc/timer_list type tmpfs (rw,nosuid,size=65536k,mode=755)
tmpfs on /proc/scsi type tmpfs (ro,relatime)
tmpfs on /sys/firmware type tmpfs (ro,relatime)

Without emptyDir: /dev/sda1 on /home/user/.local/share/buildkit type ext4 (rw,nosuid,nodev,relatime,commit=30)

$ kubectl exec buildkitd-bad -- mount
W0909 17:31:03.192574    2257 gcp.go:119] WARNING: the gcp auth plugin is deprecated in v1.22+, unavailable in v1.26+; use gcloud instead.
To learn more, consult https://cloud.google.com/blog/products/containers-kubernetes/kubectl-auth-changes-in-gke
overlay on / type overlay (rw,relatime,lowerdir=/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/210/fs:/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/209/fs:/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/208/fs:/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/207/fs:/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/206/fs:/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/205/fs,upperdir=/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/211/fs,workdir=/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/211/work)
proc on /proc type proc (rw,nosuid,nodev,noexec,relatime)
tmpfs on /dev type tmpfs (rw,nosuid,size=65536k,mode=755)
devpts on /dev/pts type devpts (rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=666)
mqueue on /dev/mqueue type mqueue (rw,nosuid,nodev,noexec,relatime)
sysfs on /sys type sysfs (ro,nosuid,nodev,noexec,relatime)
tmpfs on /sys/fs/cgroup type tmpfs (rw,nosuid,nodev,noexec,relatime,mode=755)
cgroup on /sys/fs/cgroup/systemd type cgroup (ro,nosuid,nodev,noexec,relatime,xattr,name=systemd)
cgroup on /sys/fs/cgroup/freezer type cgroup (ro,nosuid,nodev,noexec,relatime,freezer)
cgroup on /sys/fs/cgroup/net_cls,net_prio type cgroup (ro,nosuid,nodev,noexec,relatime,net_cls,net_prio)
cgroup on /sys/fs/cgroup/cpu,cpuacct type cgroup (ro,nosuid,nodev,noexec,relatime,cpu,cpuacct)
cgroup on /sys/fs/cgroup/devices type cgroup (ro,nosuid,nodev,noexec,relatime,devices)
cgroup on /sys/fs/cgroup/rdma type cgroup (ro,nosuid,nodev,noexec,relatime,rdma)
cgroup on /sys/fs/cgroup/memory type cgroup (ro,nosuid,nodev,noexec,relatime,memory)
cgroup on /sys/fs/cgroup/cpuset type cgroup (ro,nosuid,nodev,noexec,relatime,cpuset)
cgroup on /sys/fs/cgroup/hugetlb type cgroup (ro,nosuid,nodev,noexec,relatime,hugetlb)
cgroup on /sys/fs/cgroup/blkio type cgroup (ro,nosuid,nodev,noexec,relatime,blkio)
cgroup on /sys/fs/cgroup/perf_event type cgroup (ro,nosuid,nodev,noexec,relatime,perf_event)
cgroup on /sys/fs/cgroup/pids type cgroup (ro,nosuid,nodev,noexec,relatime,pids)
/dev/sda1 on /etc/hosts type ext4 (rw,relatime,commit=30)
/dev/sda1 on /dev/termination-log type ext4 (rw,relatime,commit=30)
/dev/sda1 on /etc/hostname type ext4 (rw,nosuid,nodev,relatime,commit=30)
/dev/sda1 on /etc/resolv.conf type ext4 (rw,nosuid,nodev,relatime,commit=30)
shm on /dev/shm type tmpfs (rw,nosuid,nodev,noexec,relatime,size=65536k)
tmpfs on /run/secrets/kubernetes.io/serviceaccount type tmpfs (ro,relatime,size=2097152k)
/dev/sda1 on /home/user/.local/share/buildkit type ext4 (rw,nosuid,nodev,relatime,commit=30)
proc on /proc/bus type proc (ro,nosuid,nodev,noexec,relatime)
proc on /proc/fs type proc (ro,nosuid,nodev,noexec,relatime)
proc on /proc/irq type proc (ro,nosuid,nodev,noexec,relatime)
proc on /proc/sys type proc (ro,nosuid,nodev,noexec,relatime)
proc on /proc/sysrq-trigger type proc (ro,nosuid,nodev,noexec,relatime)
tmpfs on /proc/acpi type tmpfs (ro,relatime)
tmpfs on /proc/kcore type tmpfs (rw,nosuid,size=65536k,mode=755)
tmpfs on /proc/keys type tmpfs (rw,nosuid,size=65536k,mode=755)
tmpfs on /proc/timer_list type tmpfs (rw,nosuid,size=65536k,mode=755)
tmpfs on /proc/scsi type tmpfs (ro,relatime)
tmpfs on /sys/firmware type tmpfs (ro,relatime)

@AkihiroSuda
Member Author

As a long-term solution, we will have to port the following function into containerd's `mount` package:

https://github.com/moby/moby/blob/v20.10.17/daemon/oci_linux.go#L420-L470

// Get the set of mount flags that are set on the mount that contains the given
// path and are locked by CL_UNPRIVILEGED. This is necessary to ensure that
// bind-mounting "with options" will not fail with user namespaces, due to
// kernel restrictions that require user namespace mounts to preserve
// CL_UNPRIVILEGED locked flags.
func getUnprivilegedMountFlags(path string) ([]string, error) {

@AkihiroSuda
Member Author

Can we merge this? (w/ docker/buildx#1310)

Development

Successfully merging this pull request may close these issues.

Rootless mode doesn't work on Google Container-Optimized OS kernel (CONFIG_SECURITY_CHROMIUMOS_NO_UNPRIVILEGED_UNSAFE_MOUNTS?)
