Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

checkProcMount() is too strict #2826

Open
JonathonReinhart opened this issue Feb 26, 2021 · 5 comments · May be fixed by #3224
Open

checkProcMount() is too strict #2826

JonathonReinhart opened this issue Feb 26, 2021 · 5 comments · May be fixed by #3224
Milestone

Comments

@JonathonReinhart
Copy link

JonathonReinhart commented Feb 26, 2021

TL;DR checkProcMount() won't let me mount /proc/sys/net as read-write.

I'm trying to run a libvirt KVM VM inside of a docker container without using --privileged. I've worked around a lot of other errors by:

  • Adding /dev/kvm and /dev/net/tun devices
  • Granting CAP_NET_ADMIN (safe: net-namespaced)
  • Mounting /sys/fs/cgroup/* read-write (safe?)
  • Mounting /sys/devices/virtual/net read-write (safe: net-namespaced)

But there's one error I can't work around:

libvirt.libvirtError: cannot write to /proc/sys/net/ipv6/conf/virbr2/disable_ipv6 to enable/disable IPv6 on bridge virbr2: Read-only file system

What I would like to do is allow /proc/sys/net to be mounted read-write inside of the container. My understanding is that this is safe because everything in that subdirectory is net-namespaced, so a container can't affect the host net ns. (I would have to audit some kernel code to be sure, but it's certainly better than --privileged).

The problem is that checkProcMount() won't let me:

$ docker version
Client: Docker Engine - Community
 Version:           20.10.3
 API version:       1.41
 Go version:        go1.13.15
 Git commit:        48d30b5
 Built:             Fri Jan 29 14:33:25 2021
 OS/Arch:           linux/amd64
 Context:           default
 Experimental:      true

Server: Docker Engine - Community
 Engine:
  Version:          20.10.3
  API version:      1.41 (minimum version 1.12)
  Go version:       go1.13.15
  Git commit:       46229ca
  Built:            Fri Jan 29 14:31:38 2021
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.4.3
  GitCommit:        269548fa27e0089a8b8278fc4fc781d7f65a939b
 runc:
  Version:          1.0.0-rc92
  GitCommit:        ff819c7e9184c13b7c2607fe6c30ae19403a7aff
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0


$ docker run --rm -it -v "/proc/sys/net:/proc/sys/net:rw" debian:10
docker: Error response from daemon: OCI runtime create failed: container_linux.go:370: starting container process caused: process_linux.go:459: container init caused: rootfs_linux.go:59: mounting "/proc/sys/net" to rootfs at "/proc/sys/net" caused: "/var/lib/docker/overlay2/9fd477a20091dd5d9babf5ce2bddb8d517349c89ba9a4e6c7f74f275f0c370c9/merged/proc/sys/net" cannot be mounted because it is inside /proc: unknown.

My only alternative to --privileged is granting CAP_SYS_ADMIN (for mount(2)) and remounting /proc/sys inside the container. This is a horrible alternative because:

  • CAP_SYS_ADMIN is terribly overloaded
  • /proc/sys has lots of kernel global options which aren't namespaced
@kolyshkin
Copy link
Contributor

Looks like we first need to audit the kernel, and then allow this mount in case the container is running inside its own netns (similar to the netns check done in libcontainer/configs/validate/).

In any case, this looks like a post-1.0 material to me.

@kolyshkin kolyshkin added this to the post-1.0 milestone Mar 4, 2021
@JonathonReinhart
Copy link
Author

I guess a question I would have is: Is it the responsibility of runc to protect the user from making an unsafe bind mount? If we're going for absolute safety, then shouldn't docker's --privileged command be removed?

@cyphar cyphar modified the milestones: post-1.0, 1.1.0 Jun 11, 2021
@kolyshkin
Copy link
Contributor

So, yes, I think we can make an exception for /proc/sys/net mount, in case NETNS is enabled.

It's too late for 1.1.0 I'm afraid though, as we do not even have a PR yet.

Moving milestone to 1.2.0. @JonathonReinhart feel free to open a PR.

@kolyshkin kolyshkin modified the milestones: 1.1.0, 1.2.0 Sep 9, 2021
@JonathonReinhart
Copy link
Author

Hi @kolyshkin. While I was away from this issue, I started down the road of auditing the kernel to verify that everything under /proc/sys/net was netns-safe, meaning either read-only, or implemented per-netns. The results are:

  • I developed this tool: linux-netns-sysctl-verify which tries to find netns-unsafe sysctls
  • I found and fixed three bugs in the kernel where sysctls were not netns-safe
  • I also added a safety check to try and prevent future such mistakes

The details are available on the linux-netns-sysctl-verify README. This was far more work than I ever imagined to get these fixes into the kernel!

With these bugs fixed and making their way into the distros, I am now confident that allowing /proc/sys/net to be mounted read-write inside a NETNS-enabled container is safe.

When I find some more time, I will try to put a PR together for this enhancement.

@kolyshkin
Copy link
Contributor

@JonathonReinhart wow, great work, much appreciated! Looks like we're on the right track with #3224.

JonathonReinhart added a commit to JonathonReinhart/runc that referenced this issue Oct 2, 2021
@kolyshkin kolyshkin modified the milestones: 1.2.0, 1.3.0 Aug 10, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants