Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Newer runc versions break support for NVIDIA GPUs #3708

Closed
klueska opened this issue Jan 23, 2023 · 21 comments
Closed

Newer runc versions break support for NVIDIA GPUs #3708

klueska opened this issue Jan 23, 2023 · 21 comments

Comments

@klueska
Copy link

klueska commented Jan 23, 2023

The following logic originally introduced in libcontainer/cgroups/systemd/common.go and then later moved / extended to the following in libcontainer/cgroups/devices/systemd.go will never match against NVIDIA GPU devices.

                } else {
                        // "_ n:m _" rules are just a path in /dev/{block,char}/.
                        switch rule.Type {
                        case devices.BlockDevice:
                                entry.Path = fmt.Sprintf("/dev/block/%d:%d", rule.Major, rule.Minor)
                        case devices.CharDevice:
                                entry.Path = fmt.Sprintf("/dev/char/%d:%d", rule.Major, rule.Minor)
                        }
                        // systemd will issue a warning if the path we give here doesn't exist.
                        // Since all of this logic is best-effort anyway (we manually set these
                        // rules separately to systemd) we can safely skip entries that don't
                        // have a corresponding path.
                        if _, err := os.Stat(entry.Path); err != nil {
                                // Also check /sys/dev so that we don't depend on /dev/{block,char}
                                // being populated. (/dev/{block,char} is populated by udev, which
                                // isn't strictly required for systemd). Ironically, this happens most
                                // easily when starting containerd within a runc created container
                                // itself.

                                // We don't bother with securejoin here because we create entry.Path
                                // right above here, so we know it's safe.
                                if _, err := os.Stat("/sys" + entry.Path); err != nil {
                                        logrus.Warnf("skipping device %s for systemd: %s", entry.Path, err)
                                        continue
                                }
                        }
                }
                deviceAllowList = append(deviceAllowList, entry)

The end result is that containers have access to GPUs when they first come online, but then lose access to GPUs as soon as systemd is triggered to run some reevaluation of the cgroups it manages (e.g. with something as simple as a systemctl daemon-reload).

The reason being that the NVIDIA driver does not register any of its devices with /sys/dev/char. The NVIDIA driver is closed source / non-GPL, and thus is unable to call the standard linux helpers to register the devices it manages here. In fact, the device nodes that ultimately get created under /dev for all NVIDIA devices are all triggered from user-space code.

I have filed an issue internal to NVIDIA to investigate what can be done on the driver front here, but until something gets put in place, GPU support on newer runcs (with systemd cgroup integration) remains to be broken.

We are also investigating workarounds to manually create the necessary symlinks under /dev/char, but so far nothing we have come up with is "fool-proof".

Would it be possible to use the device node under /dev for the DeviceAllowList here rather than relying on the existence of a node under /dev/char or /sys/dev/char?

Here is a simple reproducer using docker on ubuntu22.04 (with a single K80 GPU on it):

$ docker run -d --rm --gpus all \
    --device=/dev/nvidia-uvm \
    --device=/dev/nvidia-uvm-tools \
    --device=/dev/nvidia-modeset \
    --device=/dev/nvidiactl \
    --device=/dev/nvidia0 \
    nvidia/cuda:11.0.3-base-ubuntu20.04 bash -c "while [ true ]; do nvidia-smi -L; sleep 5; done"
bc045274b44bdf6ec2e4cc10d2968d1d2a046c47cad0a1d2088dc0a430add24b

$ docker logs bc045274b44bdf6ec2e4cc10d2968d1d2a046c47cad0a1d2088dc0a430add24b
GPU 0: Tesla K80 (UUID: GPU-05ea3312-64dd-a4e7-bc72-46d2f6050147)
GPU 0: Tesla K80 (UUID: GPU-05ea3312-64dd-a4e7-bc72-46d2f6050147)

ubuntu@ip-172-31-57-160:~$ sudo systemctl daemon-reload

ubuntu@ip-172-31-57-160:~$ docker logs bc045274b44bdf6ec2e4cc10d2968d1d2a046c47cad0a1d2088dc0a430add24b
GPU 0: Tesla K80 (UUID: GPU-05ea3312-64dd-a4e7-bc72-46d2f6050147)
GPU 0: Tesla K80 (UUID: GPU-05ea3312-64dd-a4e7-bc72-46d2f6050147)
GPU 0: Tesla K80 (UUID: GPU-05ea3312-64dd-a4e7-bc72-46d2f6050147)
GPU 0: Tesla K80 (UUID: GPU-05ea3312-64dd-a4e7-bc72-46d2f6050147)
Failed to initialize NVML: Unknown Error
Failed to initialize NVML: Unknown Error

Which is then fixed if I (manually) do the following:

$ sudo ln -s ../nvidia0 /dev/char/195:0
$ sudo ln -s ../nvidiactl /dev/char/195:255
$ sudo ln -s ../nvidia-modeset /dev/char/195:254
$ sudo ln -s ../nvidia-uvm /dev/char/510:0
$ sudo ln -s ../nvidia-uvm-tools /dev/char/510:1
$ sudo ln -s ../nvidia-caps/nvidia-cap1 /dev/char/235:1
$ sudo ln -s ../nvidia-caps/nvidia-cap2 /dev/char/235:2

Rerunning the above -- the issue is now resolved:

$ docker run -d --rm --gpus all \
    --device=/dev/nvidia-uvm \
    --device=/dev/nvidia-uvm-tools \
    --device=/dev/nvidia-modeset \
    --device=/dev/nvidiactl \
    --device=/dev/nvidia0 \
    nvidia/cuda:11.0.3-base-ubuntu20.04 bash -c "while [ true ]; do nvidia-smi -L; sleep 5; done"
cb1a10d2467b4974006bd8cec66c7528cbf244c4b724b2325390bd0d9ae4e605

$ docker logs cb1a10d2467b
GPU 0: Tesla K80 (UUID: GPU-05ea3312-64dd-a4e7-bc72-46d2f6050147)
GPU 0: Tesla K80 (UUID: GPU-05ea3312-64dd-a4e7-bc72-46d2f6050147)

$ sudo systemctl daemon-reload

$ docker logs cb1a10d2467b
GPU 0: Tesla K80 (UUID: GPU-05ea3312-64dd-a4e7-bc72-46d2f6050147)
GPU 0: Tesla K80 (UUID: GPU-05ea3312-64dd-a4e7-bc72-46d2f6050147)
GPU 0: Tesla K80 (UUID: GPU-05ea3312-64dd-a4e7-bc72-46d2f6050147)
GPU 0: Tesla K80 (UUID: GPU-05ea3312-64dd-a4e7-bc72-46d2f6050147)
GPU 0: Tesla K80 (UUID: GPU-05ea3312-64dd-a4e7-bc72-46d2f6050147)
GPU 0: Tesla K80 (UUID: GPU-05ea3312-64dd-a4e7-bc72-46d2f6050147)
@klueska
Copy link
Author

klueska commented Jan 23, 2023

/cc @kolyshkin

@kolyshkin
Copy link
Contributor

Well, all I can say that this is complicated; I was looking into that while reviewing PR #3568 (see its comments).

The problem is, runtime spec (https://github.com/opencontainers/runtime-spec/blob/main/config-linux.md to be precise) defines a host device by its major:minor pair (path is only used for the in-container device node). Similarly, both cgroup v1 and cgroup v2 device controllesr use major:minor pair.

OTOH systemd uses path to refer to a device. Since we don't have a way to specify a path (in runtime-spec, that is), we need a way to convert from a major:minor pair to a device name (to supply that to systemd). Fortunately, we have /dev/{block,char}/MM:mm nodes so we can translate easily.

Unfortunately, this is not working for NVidia devices. A workaround is proposed in #3568 but in essence this is a hardcoded kludge, and unfortunately I don't see a clean way to handle this (other than introducing host path to runtime-spec, that is).

The non-hardcoded kludge would be to read and parse the whole /dev, trying to find the node with the needed major:minor. This is time consuming (we need to stat(2) every file) and may have some security implications.

Now, all this is happening because Nvidia doesn't follow the standard practice of device registration. So, for me, the best solution seems to be doing those /dev/char symlinks you demonstrated above in your workaround in some post module loading scripts.

If you have any ideas @klueska please let us know.

@klueska
Copy link
Author

klueska commented Jan 24, 2023

Yeah, the hardcoded kludge in #3568 is definitely not the answer. It's not even guaranteed to work as the device majors (for some of the devices) can change, and the nvidia-caps device major is not even considered.

Two options I see that could help are:

  1. Re-evaluate the set of allowed-devices after the prestart/createRuntime hooks have run
  2. Introduce a new hook before the initial cgroup setup happens

This would allow us to insert a hook that creates the necessary /dev/char files "just-in-time" before the deviceAllowList is populated.

Unfortunately, this won't work for the rootless case, but it would at least provide a well-defined "out" for most of our users.

Regarding:

Now, all this is happening because Nvidia doesn't follow the standard practice of device registration.

I agree the best solution is for NVIDIA to start registering its devices under /sys/dev. This problem would simply go away if this was done properly. Unfortunately, there is internal resistance to do so because of GPL compatibility issues. It could be done in the new open source version of the NVIDIA kernel module (https://github.com/NVIDIA/open-gpu-kernel-modules), but that is not production ready on all systems yet, and will likely take years for it to become so.

Regarding:

So, for me, the best solution seems to be doing those /dev/char symlinks you demonstrated above in your workaround in some post module loading scripts.

Unfortunately simply running a post-module-loading script is not sufficient for two reasons:

  1. Some device nodes only get created long after the module has loaded (in support of something called MIG, which allows you to dynamically divide a GPU into a set of smaller mini-gpus).
  2. We often load the NVIDIA kernel module from within a container, meaning that this script would end up creating symlinks under /dev/char relative to the container's root, not the host root. All of the NVIDIA device nodes that we create in this scenario are also relative to the container's /dev, so the script would somehow need to know the actual host path to these devices to create the symlinks in /dev/char anyway.

Workarounds for both (1) and (2) are feasible (i.e. start a long-running script (on the host) that uses inotifywait to watch for the creation of nvidia device nodes at either /dev or a well-defined location where our "driver container" has mapped its /dev directory back onto the host).

However, we'd much prefer a "just-in-time" solution (if possible) until the NVIDIA driver team is able to come up with a proper solution at the driver level.

@klueska
Copy link
Author

klueska commented Jan 24, 2023

Thinking about it a bit more....

Is there any particular reason you can't just use the absolute path to the device node inside the container? From what I understood in your response, device.Path contains this information (modulo the container root, which I assume shouldn't be too hard to make available here). Presumably you create this device node inside the container with the correct major / minor number, so why not just point systemd at this instead of its "original" reference on the host at /dev/char?

@kolyshkin
Copy link
Contributor

I might finally have a fix for that: #3842

@kolyshkin
Copy link
Contributor

@klueska can you please check if the patch from #3842 / #3845 fixes this issue?

@klueska
Copy link
Author

klueska commented Apr 25, 2023

@elezar can you take a look while I am away.

@elezar
Copy link

elezar commented Apr 25, 2023

Sure. Will have a look tomorrow.

@kolyshkin
Copy link
Contributor

With the fix I was able to give access to a device which does not have an appropriate /dev/{char,block}/MM:mm entry, using a systemd cgroup manager, so I guess this is now fixed, but just want to confirm.

@elezar
Copy link

elezar commented Apr 26, 2023

I can confirm that the failure mode is no longer triggered with the fixes in place.

I first started to reproduce the behaviour that @klueska described using:

$ runc --version
runc version 1.1.5
commit: v1.1.5-0-gf19387a
spec: 1.0.2-dev
go: go1.19.7
libseccomp: 2.5.3

that was installed on my test system.

I then cloned the runc repo and set up runc-dev as a runtime in my docker daemon config:

{
    "runtimes": {
        "runc-dev": {
            "args": [],
            "path": "/home/ubuntu/src/runc/runc"
        }
    }
}

For the tips of main and the release-1.1 branch, I then built runc from source and started the test using the runc-dev runtime for the following versions:

$ ./runc --version
runc version 1.1.0+dev
commit: v1.1.0-515-g30f9f808
spec: 1.1.0-rc.1
go: go1.20.3
libseccomp: 2.5.3
$ ./runc --version
runc version 1.1.6+dev
commit: v1.1.6-4-gf72cd0a6
spec: 1.0.2-dev
go: go1.20.3
libseccomp: 2.5.3

This was from a clean boot and no nvidia-specific symlinks are present in /dev/char:

$ ls -al /dev/char | grep nvidia | wc -l
0

@cyphar
Copy link
Member

cyphar commented Apr 28, 2023

Can we close this now that #3842 / #3845 have been merged (and 1.1.7 is out)? The fix is only going to work with systemd 240+ but there's not much we can do about that.

@elezar
Copy link

elezar commented Apr 28, 2023

I will leave it to @klueska to make that call. I do believe he's out for a couple of weeks, so it won't be immediately.

@kolyshkin
Copy link
Contributor

No response from @klueska, so I consider this fixed. A comment can still be added to a closed bug, so feel free to share your findings.

@vsxen
Copy link

vsxen commented Jul 9, 2024

runc 1.12 ,Redhat9 systemd 252 still has issues.

@kolyshkin
Copy link
Contributor

runc 1.12 ,Redhat9 systemd 252 still has issues.

Please file a separate bug with a detailed reproducer.

@vsxen
Copy link

vsxen commented Jul 9, 2024

# install nvidia driver & nvidia-container-toolkit

nvidia-smi
Tue Jul  9 16:30:21 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Tesla T4                       Off |   00000000:2F:00.0 Off |                    0 |
| N/A   50C    P8             16W /   70W |       1MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  Tesla T4                       Off |   00000000:86:00.0 Off |                    0 |
| N/A   47C    P8             16W /   70W |       1MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+


wget https://github.com/containerd/containerd/releases/download/v1.7.16/cri-containerd-cni-1.7.16-linux-amd64.tar.gz

#  run a container

ctr -n k8s.io run --net-host --cgroup "k8s.slice:demo:nvidia" --runc-systemd-cgroup --rm -t --env NVIDIA_VISIBLE_DEVICES=all --runc-binary /usr/bin/nvidia-container-runtime nvidia/k8s-device-plugin  nvidia-smi bash

# daemon  reload
systemctl daemon-reload


nvidia-smi
Failed to initialize NVML: Unknown Error


nvidia-container-runtime -v
runc version 1.1.12
commit: v1.1.12-0-g51d5e946
spec: 1.0.2-dev
go: go1.20.13
libseccomp: 2.5.2

#systemctl --version
systemd 252 (252-32.el9_4)

Another thing I am curious about is that DeviceAllow has never had Nvidia configuration

# /run/systemd/transient/demo-nvidia.scope.d/50-DeviceAllow.conf
# This is a drop-in unit file extension, created via "systemctl set-property"
# or an equivalent operation. Do not edit.
[Scope]
DeviceAllow=
DeviceAllow=char-pts rwm
DeviceAllow=/dev/char/10:200 rwm
DeviceAllow=/dev/char/5:2 rwm
DeviceAllow=/dev/char/5:1 rwm
DeviceAllow=/dev/char/5:0 rwm
DeviceAllow=/dev/char/1:9 rwm
DeviceAllow=/dev/char/1:8 rwm
DeviceAllow=/dev/char/1:7 rwm
DeviceAllow=/dev/char/1:5 rwm
DeviceAllow=/dev/char/1:3 rwm
DeviceAllow=char-* m
DeviceAllow=block-* m

@kolyshkin
Copy link
Contributor

@vsxen I don't see you explicitly granting access to any nvidia devices.

Also, as I said earlier, please file a separate bug report if you feel something is broken.

wenjianhn added a commit to wenjianhn/runc that referenced this issue Dec 23, 2024
Every unit created by runc need daemon reload since systemd v230.
This breaks support for NVIDIA GPUs, see
opencontainers#3708 (comment)

Add a workaround for the below systemd issue.
systemd/systemd#35710

Instead of filling the empty DeviceAllow array, a new array is created
with allowed devices. Remove the comment about it, since it's misleading.

Closes opencontainers#4568

Signed-off-by: Jian Wen <[email protected]>
@wenjianhn
Copy link

@vsxen please check if this PR fixes the issue.
#4569

@kolyshkin
Copy link
Contributor

OK, the systemd restart issue is definitely not the one which was originally described here (and fixed by #3842 / #3845). Also, in #3708 (comment) the runc binary is not even runc.

@vsxen
Copy link

vsxen commented Dec 23, 2024

@wenjianhn I need a precompiled runc binary for testing (I haven't compiled runc myself yet).

@kolyshkin

the runc binary is not even runc.

This is a requirement from NVIDIA.

The NVIDIA Container Runtime is a shim for OCI-compliant low-level runtimes such as runc. When a create command is detected, the incoming OCI runtime specification is modified in place and the command is forwarded to the low-level runtime.

https://github.com/NVIDIA/nvidia-container-toolkit/blob/main/cmd/nvidia-container-runtime/README.md#the-nvidia-container-runtime

@wenjianhn
Copy link

@vsxen it's easy to build one

make && sudo cp runc $(which runc)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants