@AkihiroSuda AkihiroSuda commented Feb 21, 2025

Fix #5763

  • Discourage `--oci-worker-no-process-sandbox`, because it leaks processes by design. Instead, encourage setting `systempaths=unconfined` in `docker run`. This corresponds to `securityContext.procMount: Unmasked` in Kubernetes, but that is harder to configure there because it has to be used in conjunction with `hostUsers: false`.

  • Remove `--device /dev/fuse`, as fuse-overlayfs is typically no longer used.

  • Use the new Kubernetes struct for AppArmor.

  • Add a hint about `kernel.apparmor_restrict_unprivileged_userns`.

  • Remove `$` from command snippets for ease of copy-pasting.

  • Make `job.*.yaml` more practical.

  • Add `*.userns.yaml`. It needs the `UserNamespacesSupport` feature gate to be enabled.
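
For illustration, the `docker run` recommendation and the AppArmor sysctl hint from the list above might look roughly like the following. This is a sketch, not the exact text added to the docs: the container name `buildkitd`, the `moby/buildkit:rootless` image tag, and both function names are assumptions made for the example.

```shell
# Sketch only: function names, container name, and image tag are assumptions.

# Print a hint when AppArmor restricts unprivileged user namespaces.
# $1: optional path to the sysctl file (overridable so the check can be
#     exercised without reading the real /proc).
check_userns_sysctl() {
  f="${1:-/proc/sys/kernel/apparmor_restrict_unprivileged_userns}"
  if [ -r "$f" ] && [ "$(cat "$f")" = "1" ]; then
    echo "hint: kernel.apparmor_restrict_unprivileged_userns=1 may prevent rootless BuildKit from creating a user namespace"
    return 1
  fi
  return 0
}

# Start a rootless buildkitd with /proc unmasked, instead of passing
# --oci-worker-no-process-sandbox to the daemon.
run_rootless_buildkitd() {
  docker run --name buildkitd -d \
    --security-opt seccomp=unconfined \
    --security-opt apparmor=unconfined \
    --security-opt systempaths=unconfined \
    moby/buildkit:rootless
}
```

A typical call would be `check_userns_sysctl; run_rootless_buildkitd`. The point of the design is that the process sandbox stays enabled, so build processes should not outlive the build.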


TODO: update buildx to support UserNS mode too
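
As a rough picture of the UserNS mode mentioned here, a pod that combines the settings named in the list above could be generated as follows. This is a hedged sketch and not the `*.userns.yaml` file from this PR: the pod name, the image tag, and the function name `print_userns_pod` are assumptions, and the cluster must have the relevant user-namespace feature gate enabled.

```shell
# Sketch only: pod name, image tag, and function name are assumptions.
# Prints a minimal pod manifest to stdout.
print_userns_pod() {
  cat <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: buildkitd                      # hypothetical name
spec:
  hostUsers: false                     # run the pod in a user namespace
  containers:
    - name: buildkitd
      image: moby/buildkit:rootless    # assumed image tag
      securityContext:
        seccompProfile:
          type: Unconfined
        appArmorProfile:               # field-based AppArmor setting
          type: Unconfined
        procMount: Unmasked            # must be combined with hostUsers: false
EOF
}
```

The output can be piped to `kubectl apply -f -` on a cluster that supports these fields.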

Signed-off-by: Akihiro Suda <[email protected]>

@tonistiigi tonistiigi left a comment

I'm ok with the updates but isn't there a way to fix the specific process leak case if we have a reproducer? Even if we can't make it 100% guaranteed for other cases.

@AkihiroSuda

> I'm ok with the updates but isn't there a way to fix the specific process leak case if we have a reproducer? Even if we can't make it 100% guaranteed for other cases.

Potentially we may use seccomp (or ptrace) to catch fork, clone, execve, etc. to track the leaked processes?

@tonistiigi

> Potentially we may use seccomp (or ptrace) to catch fork, clone, execve, etc. to track the leaked processes?

That's an option, but maybe there is something simpler. What about cgroups? I don't know what the exact case is in here.

@AkihiroSuda

AkihiroSuda commented Feb 26, 2025

> > Potentially we may use seccomp (or ptrace) to catch fork, clone, execve, etc. to track the leaked processes?
>
> That's an option, but maybe there is something simpler. What about cgroups? I don't know what the exact case is in here.

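Not sure that works: creating a child cgroup fails with "Permission denied", even in a new cgroup namespace: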

$ docker run -it --rm --security-opt seccomp=unconfined --security-opt apparmor=unconfined --user ubuntu ubuntu
ubuntu@552de0932c50:/$ unshare -rmC     
# mount -t cgroup2 none /sys/fs/cgroup
mount: /sys/fs/cgroup: none already mounted or mount point busy.
       dmesg(1) may have more information after failed mount system call.
# mount -t tmpfs none /sys/fs/cgroup
# mount -t cgroup2 none /sys/fs/cgroup
# mkdir /sys/fs/cgroup/foo
mkdir: cannot create directory '/sys/fs/cgroup/foo': Permission denied
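
If child cgroups were writable, the cgroup-based tracking discussed above could be sketched along these lines. This is only an illustration under that assumption, not BuildKit's implementation: the function names and the idea of one cgroup directory per build step are hypothetical.

```shell
# Sketch only: function names and the per-build cgroup layout are hypothetical.

# List PIDs still attached to a cgroup v2 directory.
# $1: cgroup directory
list_cgroup_pids() {
  cat "$1/cgroup.procs"
}

# Report processes that outlived the build step tracked by the cgroup.
# Returns 0 when the cgroup is empty, 1 when leftover PIDs were found.
report_leaked() {
  pids="$(list_cgroup_pids "$1")"
  if [ -n "$pids" ]; then
    echo "leftover PIDs in $1:"
    echo "$pids"
    return 1
  fi
  return 0
}
```

On kernels that provide `cgroup.kill` (Linux 5.14 and later), writing `1` to that file in the same directory would terminate whatever `report_leaked` finds.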

@AkihiroSuda

In containerd v2.1 we will get writable cgroups though:

@AkihiroSuda

Can we merge?

@crazy-max crazy-max merged commit 1c41f9b into moby:master Mar 4, 2025
104 checks passed
@crazy-max crazy-max added this to the v0.20.1 milestone Mar 4, 2025

Successfully merging this pull request may close these issues.

processes can remain active after build finishes with --oci-worker-no-process-sandbox
