-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
runc run: fix mount leak #4417
runc run: fix mount leak #4417
Conversation
1263560
to
b9ec48e
Compare
What happens if the parent mount is a bind-mount and the container rootfs is also a bind-mount from the same filesystem? Unless I'm mistaken, this device-checking logic will skip those mounts and produce errors when we There are a few ways we could handle this AFAICS:
Maybe we can try to do (1) for newer kernels and fall back to (3) for older kernels, with a final fallback to |
dc0ca3a
to
f3f7aa4
Compare
I just came up with a stupid but simpler alternative. |
f3f7aa4
to
08105c9
Compare
When preparing to mount container root, we need to make its parent mount private (i.e. disable propagation), otherwise the new in-container mounts are leaked to the host. To find a parent mount, we use to read mountinfo and find the longest entry which can be a parent of the container root directory. Unfortunately, due to kernel bug in all Linux kernels older than v5.8 (see [1], [2]), sometimes mountinfo can't be read in its entirety. In this case, getParentMount may occasionally return a wrong parent mount. As a result, we do not change the mount propagation to private, and container mounts are leaked. Alas, we can not fix the kernel, and reading mountinfo a few times to ensure its consistency (like it's done in, say, Kubernetes) does not look like a good solution for performance reasons. Fortunately, we don't need mountinfo. Let's just traverse the directory tree, trying to remount it private until we find a mount point (any error other than EINVAL means we just found it). Fixes issue 2404. [1]: https://github.com/kolyshkin/procfs-test [2]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=9f6c61f96f2d97cbb5f Signed-off-by: Kir Kolyshkin <[email protected]>
08105c9
to
13a6f56
Compare
@cyphar WDYT? Ah, I just found crun does the same thing. This probably means the approach is sound (yet I'm a tad unhappy I'm not the first one who came with this idea 😿). |
I also think we can make a 1.1 backport since the issue, if rare, is quite nasty, and the fix seems to be simple. |
I checked and |
1.1 backport: #4425 |
When preparing to mount container root, we need to make its parent mount private (i.e. disable mount propagation), otherwise the new in-container mounts are leaked to the host.
To find a parent mount, we use to read mountinfo and find the longest entry which can be a parent of the container root directory.
Unfortunately, due to kernel bug in all Linux kernels older than v5.8 (see 1, 2), sometimes mountinfo can't be read in its entirety. In this case, getParentMount may occasionally return a wrong parent mount.
As a result, we do not change the mount propagation to private, and container mounts are leaked.
Alas, we can not fix the kernel, and reading mountinfo a few times to ensure its consistency (like it's done in, say, Kubernetes) does not look like a good solution for performance reasons.
Fortunately, we don't need mountinfo. Let's just traverse the directory tree, trying to remount it private until we find a mount point (any error other than EINVAL means we just found it).
Fixes: #2404.