Specifying cgroup limits on a child container fails with cgroups v2 #6288

Open
3 tasks done
Scipi opened this issue Apr 19, 2022 · 17 comments

@Scipi

Scipi commented Apr 19, 2022

  • I have tried with the latest version of Docker Desktop
  • I have tried disabling enabled experimental features
  • I have uploaded Diagnostics
  • Diagnostics ID: 349C1670-A8E6-4837-B3CC-070AC29DDCC5/20220421135113

Expected behavior

Specifying cgroup limits to a child container should work as expected when using cgroups v2

Actual behavior

Specifying cgroup limits to a child container when using cgroups v2 causes an error

docker: Error response from daemon: failed to create shim: OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:385: applying cgroup configuration for process caused: cannot enter cgroupv2 "/sys/fs/cgroup/docker/daf122c11ae7f14b6a8174ec6338e9740f1c26a60b332a6094527ffbfa7f302f" with domain controllers -- it is in domain threaded mode: unknown.

Information

  • macOS Version: Monterey 12.2.1
  • Intel chip or Apple chip: Intel
  • Docker Desktop Version: 4.7.0

Output of /Applications/Docker.app/Contents/MacOS/com.docker.diagnose check

Starting diagnostics

[PASS] DD0027: is there available disk space on the host?
[PASS] DD0028: is there available VM disk space?
[PASS] DD0031: does the Docker API work?
[PASS] DD0004: is the Docker engine running?
[PASS] DD0011: are the LinuxKit services running?
[PASS] DD0016: is the LinuxKit VM running?
[PASS] DD0001: is the application running?
[PASS] DD0018: does the host support virtualization?
[PASS] DD0017: can a VM be started?
[PASS] DD0015: are the binary symlinks installed?
[PASS] DD0003: is the Docker CLI working?
[PASS] DD0013: is the $PATH ok?
[PASS] DD0007: is the backend responding?
[PASS] DD0014: are the backend processes running?
[PASS] DD0008: is the native API responding?
[PASS] DD0009: is the vpnkit API responding?
[PASS] DD0010: is the Docker API proxy responding?
[PASS] DD0012: is the VM networking working?
[PASS] DD0032: do Docker networks overlap with host IPs?
[SKIP] DD0030: is the image access management authorized?
[PASS] DD0019: is the com.docker.vmnetd process responding?
[PASS] DD0033: does the host have Internet access?
No fatal errors detected.

Steps to reproduce the behavior

  1. Ensure cgroups v2 is being used (e.g., check the "Cgroup Version" line in docker info)
  2. Run a container with a cgroup limit
    docker run --rm -d --name pause -p 8080:80 --ipc=shareable --cpus=1 gcr.io/google_containers/pause-amd64:3.0
  3. Run a child container with limits specified as well
    docker run --rm --name stress --net=container:pause --ipc=container:pause --pid=container:pause --cgroup-parent="/docker/<parent-id>" --cpus=2 alexeiled/stress-ng --cpu=2

An error is produced:
docker: Error response from daemon: failed to create shim: OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:385: applying cgroup configuration for process caused: cannot enter cgroupv2 "/sys/fs/cgroup/docker/daf122c11ae7f14b6a8174ec6338e9740f1c26a60b332a6094527ffbfa7f302f" with domain controllers -- it is in domain threaded mode: unknown.

Running the child container without --cpu=2 allows the container to work. The failing case also works fine on Linux.
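The reproduction steps above can be wrapped in a small script. This is only a sketch: cgroup_parent_for and reproduce are hypothetical names that do not come from the report, and the /docker/<id> layout is assumed from the path shown in the error message.

```shell
# Hypothetical helper: build the --cgroup-parent value for a container ID,
# assuming the /docker/<id> layout seen in the error message.
cgroup_parent_for() {
    # $1: full container ID, as printed by `docker run -d`
    if [ -z "$1" ]; then
        echo "usage: cgroup_parent_for <container-id>" >&2
        return 1
    fi
    printf '/docker/%s\n' "$1"
}

# The two steps from the report, wrapped in a function so that nothing
# runs until reproduce is called explicitly.
reproduce() {
    # `docker run -d` prints the new container's full ID on stdout.
    parent_id=$(docker run --rm -d --name pause -p 8080:80 --ipc=shareable \
        --cpus=1 gcr.io/google_containers/pause-amd64:3.0) || return 1
    docker run --rm --name stress --net=container:pause --ipc=container:pause \
        --pid=container:pause \
        --cgroup-parent="$(cgroup_parent_for "$parent_id")" \
        --cpus=2 alexeiled/stress-ng --cpu=2
}
```

Calling reproduce on a cgroups v2 host should hit the "cannot enter cgroupv2" error quoted above, if the setup matches the one in this report.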

@fredericdalleau

fredericdalleau commented May 18, 2022

Hi, this seems to be an error reported by runc https://github.com/opencontainers/runc/blob/eddf35e5462e2a9f24d8279874a84cfc8b8453c2/libcontainer/cgroups/fs2/create.go#L132-L133

Could you try the Docker option --cgroupns=host to allow sharing? By default, containers now run with --cgroupns=private with cgroups v2
(see moby/moby#38377 and moby/moby#41072)

@pete-woods

Thanks for the suggestion - unfortunately I see the same error with that additional option. I'm far from knowledgeable on cgroups, but it felt to me like there's some restriction on child cgroups in cgroups v2 by default, which needs changing with settings like echo '+cpu -memory' > /sys/fs/cgroup/cg1/cgroup.subtree_control.
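To see what state a given cgroup is in, it can help to read its cgroup.type and cgroup.subtree_control files, both of which are standard cgroup v2 interface files for non-root cgroups. A minimal sketch follows; show_cgroup_state is a hypothetical name, and it has to be run wherever the cgroup filesystem is actually visible.

```shell
# Print the type and the controllers enabled for children of a cgroup v2
# directory.
# $1: cgroup directory, e.g. /sys/fs/cgroup/docker/<parent-id>
show_cgroup_state() {
    dir=$1
    if [ ! -r "$dir/cgroup.type" ]; then
        echo "no readable cgroup.type in $dir" >&2
        return 1
    fi
    # cgroup.type holds one of: domain, domain threaded, domain invalid, threaded
    printf 'type: %s\n' "$(cat "$dir/cgroup.type")"
    # cgroup.subtree_control lists the controllers delegated to child cgroups
    printf 'subtree_control: %s\n' "$(cat "$dir/cgroup.subtree_control" 2>/dev/null)"
}
```

A type of "domain threaded" would match the mode named in the error message.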

@fredericdalleau

fredericdalleau commented May 26, 2022

Hi @pete-woods thanks for this pointer.
My understanding at the moment is that runc refuses to configure the child cgroup because one of the cgroup requirements of the child container is not compatible with the parent cgroup. AFAICT the decision comes from runc, not from the cgroups subsystem.
IIUC, dockerd requests CPU resources in terms of NanoCpus, which is translated to CPUPeriod and CPUQuota, and this causes the isCPUset(r) test in containsDomainController to fail.
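For illustration, the translation from NanoCpus to a quota/period pair can be sketched as below. The fixed 100ms period is an assumption about dockerd's default, and nanocpus_to_cpu_max is a hypothetical name. The "<quota> <period>" output follows the format of the cgroup v2 cpu.max file.

```shell
# Convert a NanoCPUs value (1 CPU = 1000000000) to a "<quota> <period>"
# pair in microseconds.
nanocpus_to_cpu_max() {
    nano=$1
    period=100000    # 100ms in microseconds; assumed default, not confirmed here
    quota=$(( nano * period / 1000000000 ))
    printf '%s %s\n' "$quota" "$period"
}

# --cpus=2 corresponds to NanoCPUs=2000000000:
nanocpus_to_cpu_max 2000000000    # prints "200000 100000"
```

Any non-zero quota means a CPU limit is set on the container, which is the condition the comment above refers to.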

@mzarismithnyc

mzarismithnyc commented Jun 3, 2022

Docker confirmed there is a limitation with runc, and the workaround (until Docker fixes this) is to use cgroups v1 for macOS v 4.3.0. We're still waiting to hear back on when they will have this fixed, but they are aware of the issue.

@fredericdalleau

According to https://docs.docker.com/desktop/release-notes/#bug-fixes-and-minor-changes-12,

Added a deprecated option to settings.json: "deprecatedCgroupv1": true, which switches the Linux environment back to cgroups v1. If your software requires cgroups v1, you should update it to be compatible with cgroups v2. Although cgroups v1 should continue to work, it is likely that some future features will depend on cgroups v2. It is also possible that some Linux kernel bugs will only be fixed with cgroups v2.

should allow you to run newer Docker Desktop with cgroups v1 on mac.
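A quick way to check whether a settings file already has that option enabled is sketched below. uses_cgroup_v1 is a hypothetical name, and the settings.json location is not given in the release note, so the path is passed in as an argument. On macOS it is often reported as ~/Library/Group Containers/group.com.docker/settings.json, but treat that as an assumption.

```shell
# Succeed if the given Docker Desktop settings file sets
# "deprecatedCgroupv1": true.
# $1: path to settings.json (location varies; supplied by the caller)
uses_cgroup_v1() {
    grep -Eq '"deprecatedCgroupv1"[[:space:]]*:[[:space:]]*true' "$1"
}
```

For example: uses_cgroup_v1 "$HOME/Library/Group Containers/group.com.docker/settings.json" && echo "cgroups v1 enabled". After changing the setting, docker info should report "Cgroup Version: 1".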

@pete-woods

Hello! Is there a moby ticket representing this work anywhere? Thanks!

@fredericdalleau

fredericdalleau commented Jun 16, 2022

@Scipi can you provide the commands you used on Linux? I couldn't get it to work. Can you share the docker info too?

@Scipi
Author

Scipi commented Jun 16, 2022

@fredericdalleau docker info

 Context:    default
 Debug Mode: false
 Plugins:
  buildx: Docker Buildx (Docker Inc., v0.8.2-docker)
  compose: Docker Compose (Docker Inc., 2.5.1)

Server:
 Containers: 63
  Running: 0
  Paused: 0
  Stopped: 63
 Images: 44
 Server Version: 20.10.16
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Native Overlay Diff: false
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: systemd
 Cgroup Version: 2
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 io.containerd.runtime.v1.linux runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 212e8b6fa2f44b9c21b2798135fc6fb7c53efc16.m
 runc version: 
 init version: de40ad0
 Security Options:
  apparmor
  seccomp
   Profile: default
  cgroupns
 Kernel Version: 5.10.117-1-MANJARO
 Operating System: Manjaro Linux
 OSType: linux
 Architecture: x86_64
 CPUs: 8
 Total Memory: 7.733GiB
 Name: dev-vmwarevirtualplatform
 ID: XISC:6UBE:KYNX:7IUF:I634:XJDL:WQKB:LUYX:FHOU:HGPZ:EFRE:WVR6
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Username: scipii
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

I used the commands in the original post, but for the parent-id I used /sys/fs/cgroup/<parentid>.slice. I poked around again after your original suggestion, and it turned out I wasn't able to get it fully working on Linux either (in hindsight, I should probably have updated this ticket after that). The issue on Linux seems to be different, though. With the systemd driver, the container cgroup is named docker-<id>.scope, but the daemon is adamant that the parent ID must be named xxx.slice:

docker: Error response from daemon: cgroup-parent for systemd cgroup should be a valid slice named as "xxx.slice".
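The check behind that message can be approximated as "the value must be a non-empty name ending in .slice". The sketch below is my approximation of that rule, not the daemon's actual code, and is_valid_slice_parent is a hypothetical name.

```shell
# Approximate the systemd-driver rule for --cgroup-parent: at least one
# character followed by the ".slice" suffix.
is_valid_slice_parent() {
    case $1 in
        ?*.slice) return 0 ;;
        *) return 1 ;;
    esac
}

is_valid_slice_parent "docker-abc.scope" || echo "rejected: not a slice"
```

This is why a container's own docker-<id>.scope cgroup cannot be passed as the parent with the systemd driver.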

I just tried with the cgroupfs driver, and the original issue appears:

docker run --rm --name stress --net=container:pause --ipc=container:pause --pid=container:pause --cgroup-parent="/docker/36104d048d76eac6d65f53e1a6ed7b9b4e03206fb30cab43e732dcea8a5ba9d2" --cpus=2 alexeiled/stress-ng --cpu=2

docker: Error response from daemon: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: unable to apply cgroup configuration: cannot enter cgroupv2 "/sys/fs/cgroup/docker/36104d048d76eac6d65f53e1a6ed7b9b4e03206fb30cab43e732dcea8a5ba9d2" with domain controllers -- it is in domain threaded mode: unknown.

@sebastian-lerner

@fredericdalleau Gentle ping, is there a moby ticket representing this work?

@sebastian-lerner

@fredericdalleau should we create a ticket on the runc project instead to track this?

@thaJeztah
Member

@lifubang @AkihiroSuda this error was added in opencontainers/runc#2390 - do one of you know if this is not possible or if it's something that just was not (yet) implemented in runc?

@kolyshkin

So, cgroup v2 has a constraint -- you can only have processes in leaf nodes*.

In particular, this means that if you want two containers to be limited by the same set of resources, you have to create these containers as sub-cgroups of a particular cgroup with resources.
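As a rough sketch of that layout: a parent cgroup that holds the limits but no processes, with one leaf cgroup per container underneath it. make_pod_cgroups, pod1, container-a and container-b are hypothetical names. On a real system the root would be /sys/fs/cgroup, the script needs root privileges, and the cpu controller must already be enabled in the root's own cgroup.subtree_control.

```shell
# Lay out a process-free parent cgroup with two leaf children.
# $1: cgroup v2 root (normally /sys/fs/cgroup)
# $2: name of the shared parent cgroup (hypothetical, e.g. "pod1")
make_pod_cgroups() {
    root=$1
    pod=$2
    mkdir -p "$root/$pod" || return 1
    # Shared limit on the parent: 2 CPUs as "<quota> <period>" in microseconds.
    echo "200000 100000" > "$root/$pod/cpu.max" || return 1
    # Delegate the cpu controller to children. The kernel only allows this
    # while the parent itself contains no processes.
    echo "+cpu" > "$root/$pod/cgroup.subtree_control" || return 1
    # Leaf cgroups: these are where the container processes would live.
    mkdir -p "$root/$pod/container-a" "$root/$pod/container-b"
}
```

Processes would then be placed only in the leaf cgroups, for example by writing a PID to container-a/cgroup.procs, and never in the parent itself.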

@mzarismithnyc

So, cgroup v2 has a constraint -- you can only have processes in leaf nodes*.

In particular, this means that if you want two containers to be limited by the same set of resources, you have to create these containers as sub-cgroups of a particular cgroup with resources.

So going back to @thaJeztah's comment, is this NOT possible at all?

@pete-woods

pete-woods commented Aug 30, 2022

This is exactly our use case. On our CI platform, we run a "pod-like" construct that shares CPU and RAM limits between all containers in the "pod". We do this by creating a "parent" container with e.g. 2CPUs and 4GiB of RAM limits, noting its container ID, and then starting the other containers by guessing its cgroup path, and specifying that as the parent cgroup.

(Of course we would far prefer built-in support for starting a group of containers with a shared resource quota, than our current solution)

@thaJeztah
Member

The slightly longer answer to the above, from a conversation I had with @kolyshkin at the time. My (simplified) interpretation of the issue is that v2 cgroups do not allow a cgroup tree to join / become a child of a tree that has a process running in it. In the example case, "container A" is created, which creates its cgroup tree with the container's process running in it. When the second container joins the cgroup, it attempts to join as a child of that cgroup, but this is refused because that cgroup contains the process for container A.

Changing this is complicated, and may even require changes in the OCI Runtime Specification; a possible solution would be for containers to always set up a "parent" cgroup without a process in it, then put their own group inside it; effectively, each container would become a "pod", and setting resource constraints would now have to be done on the "wrapper" (parent) cgroup. This is a divergence from the OCI spec as it's currently defined.
A second container "joining" the first container's cgroup would now require it to join that wrapper cgroup (instead of the container's own cgroup), which could add additional challenges (should the second container's parent cgroup join, or only the "child"?).

I was told crun had (some?) support for such constructs; I haven't looked into that, but it may be worth exploring to see if that addresses the use case. In either case, changes in the OCI specification are likely needed to facilitate this, and to preserve interoperability between systems.

@docker-robott
Collaborator

There hasn't been any activity on this issue for a long time.
If the problem is still relevant, mark the issue as fresh with a /remove-lifecycle stale comment.
If not, this issue will be closed in 30 days.

Prevent issues from auto-closing with a /lifecycle frozen comment.

/lifecycle stale

@thaJeztah
Member

/lifecycle frozen
