cgroup/systemd: fix making CharDevice path in systemdProperties #3568

Closed


yangfeiyu20102011

cgroup/systemd: systemdProperties sets the character device path in the form /dev/char/0:0,

but NVIDIA devices with major 195:* and minor 507:* cannot be found under /dev/char/x:x.
getNVIDIAEntryPath fixes this problem.

Signed-off-by: yangfeiyu20102011 [email protected]
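
To make the failure mode concrete, here is a minimal Go sketch of the path construction described above. It is not runc's code: charDevPath and entryExists are hypothetical helper names, and the 195:0 and 195:255 numbers are simply the ones quoted in this thread. It builds the /dev/char/MAJOR:MINOR path and checks whether such an entry exists on the host.

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
	"os"
)

// charDevPath builds the /dev/char/MAJOR:MINOR path that would be used to
// name a character device in a systemd DeviceAllow= entry.
// (Hypothetical helper, for illustration only.)
func charDevPath(major, minor int64) string {
	return fmt.Sprintf("/dev/char/%d:%d", major, minor)
}

// entryExists reports whether path can be resolved on the host.
// A missing entry is reported as (false, nil); other errors are returned.
func entryExists(path string) (bool, error) {
	_, err := os.Stat(path)
	if err == nil {
		return true, nil
	}
	if errors.Is(err, fs.ErrNotExist) {
		return false, nil
	}
	return false, err
}

func main() {
	// Device numbers taken from the DeviceAllow.conf examples in this thread.
	for _, minor := range []int64{0, 255} {
		p := charDevPath(195, minor)
		ok, err := entryExists(p)
		if err != nil {
			fmt.Println(p, "error:", err)
			continue
		}
		fmt.Printf("%s exists: %v\n", p, ok)
	}
}
```

On a host where the NVIDIA device nodes have no /dev/char symlinks, both lookups would be expected to report false, which is the situation this PR is about.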

@yangfeiyu20102011
Author

yangfeiyu20102011 commented Aug 23, 2022

PTAL, thanks! cc @AkihiroSuda @thaJeztah
#3567

Member

@thaJeztah thaJeztah left a comment


Left a comment about the implementation, but in general, this feels rather odd to have this exception for these devices, and I wonder if this should be included in runc, being the reference implementation of the OCI runtime spec; does the spec describe anything about this special case?

@yangfeiyu20102011
Author

yangfeiyu20102011 commented Aug 23, 2022

Left a comment about the implementation, but in general, this feels rather odd to have this exception for these devices, and I wonder if this should be included in runc, being the reference implementation of the OCI runtime spec; does the spec describe anything about this special case?

Thanks, I have modified the code.
When running runc update, it now skips setting the cgroup device files.
If the spec contains devices such as /dev/nvidia*, the generated DeviceAllow.conf looks as follows.

cat /run/systemd/system/cri-containerd-74e9a65ee73edfecdf345f477e4bfb44a39428243d4a64519cd860fcc0f6901b.scope.d/50-DeviceAllow.conf
[Scope]
DeviceAllow=
DeviceAllow=char-pts rwm
DeviceAllow=/dev/char/10:200 rwm
DeviceAllow=/dev/char/5:2 rwm
DeviceAllow=/dev/char/5:0 rwm
DeviceAllow=/dev/char/1:9 rwm
DeviceAllow=/dev/char/195:0 rw
DeviceAllow=/dev/char/195:1 rw

The DeviceAllow=/dev/char/195:0 rw entry will not work, because no /dev/char/195:0 entry exists on the host.

And if DevicePolicy.conf sets DevicePolicy=strict, devices.list may end up containing
c 195:* m
after
setUnitProperties(m.dbus, unitName, properties...)

cat /run/systemd/system/cri-containerd-74e9a65ee73edfecdf345f477e4bfb44a39428243d4a64519cd860fcc0f6901b.scope.d/50-DevicePolicy.conf
[Scope]
DevicePolicy=strict

cat /sys/fs/cgroup/devices/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-podeeccd2f8_7bef_4054_a659_6554b908432a.slice/cri-containerd-74e9a65ee73edfecdf345f477e4bfb44a39428243d4a64519cd860fcc0f6901b.scope/devices.list
c 136:* rwm
c 5:2 rwm
c 195:* m
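
The listing above can be checked mechanically. The following Go sketch is an illustration only (allows and allowedByList are hypothetical names, not runc or kernel APIs). It assumes each devices.list line has the form `type major:minor access` and that a single line must cover every requested permission, and it shows that `c 195:* m` grants mknod but not read/write.

```go
package main

import (
	"fmt"
	"strings"
)

// allows reports whether one devices.list line (e.g. "c 195:* m") grants
// every permission in want ("r", "w", "m") for the given device.
func allows(line, devType, major, minor, want string) bool {
	f := strings.Fields(line)
	if len(f) != 3 {
		return false
	}
	// "a" matches every device type.
	if f[0] != "a" && f[0] != devType {
		return false
	}
	nums := strings.SplitN(f[1], ":", 2)
	if len(nums) != 2 {
		return false
	}
	// "*" is a wildcard for the major or minor number.
	if nums[0] != "*" && nums[0] != major {
		return false
	}
	if nums[1] != "*" && nums[1] != minor {
		return false
	}
	for _, p := range want {
		if !strings.ContainsRune(f[2], p) {
			return false
		}
	}
	return true
}

// allowedByList reports whether any line of a devices.list dump grants want.
func allowedByList(list, devType, major, minor, want string) bool {
	for _, line := range strings.Split(list, "\n") {
		if allows(line, devType, major, minor, want) {
			return true
		}
	}
	return false
}

func main() {
	// Contents of devices.list as quoted in the comment above.
	list := "c 136:* rwm\nc 5:2 rwm\nc 195:* m"
	fmt.Println("195:0 rw:", allowedByList(list, "c", "195", "0", "rw"))
	fmt.Println("195:0 m: ", allowedByList(list, "c", "195", "0", "m"))
}
```

With that input the container may create a 195:0 node but cannot open it for reading or writing, which matches the GPU access failures described here.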

@kolyshkin
Contributor

Indeed, not all character devices have a /dev/char/MM:mm equivalent for some reason. Here's what I found on my machine (Fedora 36 laptop running kernel 5.17.14-300.fc36.x86_64):

[kir@kir-rhat linux]$ ls -lR /dev | grep ^c | awk '{print $10, $5, $6}' | sed -e 's|, |:|' -e 's| | /dev/char/|' | awk '{printf "/dev/" $1 "\t"; system("ls -l " $2);}' 2>&1 | grep cannot
/dev/cuse	ls: cannot access '/dev/char/10:203': No such file or directory
/dev/lp0	ls: cannot access '/dev/char/6:0': No such file or directory
/dev/lp1	ls: cannot access '/dev/char/6:1': No such file or directory
/dev/lp2	ls: cannot access '/dev/char/6:2': No such file or directory
/dev/lp3	ls: cannot access '/dev/char/6:3': No such file or directory
/dev/ppp	ls: cannot access '/dev/char/108:0': No such file or directory
/dev/uhid	ls: cannot access '/dev/char/10:239': No such file or directory
/dev/uinput	ls: cannot access '/dev/char/10:223': No such file or directory
/dev/vhci	ls: cannot access '/dev/char/10:137': No such file or directory
/dev/vhost-vsock	ls: cannot access '/dev/char/10:241': No such file or directory

(there might be some more char devices in subdirectories of /dev)
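
For readers who prefer not to decode the shell pipeline, roughly the same check can be sketched in Go. This is an illustration under stated assumptions: devNumbers and missingCharLinks are hypothetical names, the program is Linux-only, and the major/minor split follows the glibc dev_t encoding. Unlike the pipeline above, it also descends into subdirectories of /dev.

```go
package main

import (
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
	"syscall"
)

// devNumbers splits a Linux dev_t into major and minor numbers
// (same bit layout as glibc's major()/minor()).
func devNumbers(rdev uint64) (major, minor uint64) {
	major = (rdev&0x00000000000fff00)>>8 | (rdev&0xfffff00000000000)>>32
	minor = (rdev & 0x00000000000000ff) | (rdev&0x00000ffffff00000)>>12
	return major, minor
}

// missingCharLinks walks root and lists character devices that have no
// corresponding /dev/char/MAJOR:MINOR entry.
func missingCharLinks(root string) ([]string, error) {
	var missing []string
	err := filepath.WalkDir(root, func(path string, d fs.DirEntry, err error) error {
		if err != nil {
			return nil // skip entries we cannot read
		}
		info, err := d.Info()
		if err != nil || info.Mode()&fs.ModeCharDevice == 0 {
			return nil // not a character device node
		}
		st, ok := info.Sys().(*syscall.Stat_t)
		if !ok {
			return nil
		}
		major, minor := devNumbers(uint64(st.Rdev))
		link := fmt.Sprintf("/dev/char/%d:%d", major, minor)
		if _, err := os.Stat(link); os.IsNotExist(err) {
			missing = append(missing, path+"\t"+link)
		}
		return nil
	})
	return missing, err
}

func main() {
	missing, err := missingCharLinks("/dev")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	for _, m := range missing {
		fmt.Println(m)
	}
}
```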

For block devices, I haven't found any that do not have a symlink in /dev/block.

Perhaps what we should do is to try using device path set in spec, in case /dev/char/MM:mm is not found. WDYT @cyphar

@yangfeiyu20102011
Author

Currently the NVIDIA devices in DeviceAllow.conf are not as expected. This PR solves some NVIDIA GPU rw problems and is at least an improvement.
We can solve the character device problem completely, in a better way, in the future.
cc @thaJeztah @cyphar

@kolyshkin
Contributor

@yangfeiyu20102011 can you please provide OCI spec example with NVidia devices added?

@yangfeiyu20102011
Author

yangfeiyu20102011 commented Sep 1, 2022

@yangfeiyu20102011 can you please provide OCI spec example with NVidia devices added?

cc @kolyshkin
OK, here are the spec and the resulting DeviceAllow.conf.
OCI spec:

{
    "ociVersion": "1.0.2-dev",
    "process":
    {
        "user":
        {
            "uid": 0,
            "gid": 0
        },
        "args":
        [
            "sleep",
            "36000"
        ],
        "env":
        [
            "PATH=/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
            "HOSTNAME=gpu-operator-test",
            "NVARCH=x86_64",
            "NVIDIA_REQUIRE_CUDA=cuda>=11.0 brand=tesla,driver>=418,driver<419",
            "NV_CUDA_CUDART_VERSION=11.0.221-1",
            "NV_CUDA_COMPAT_PACKAGE=cuda-compat-11-0",
            "CUDA_VERSION=11.0.3",
            "LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64",
            "NVIDIA_VISIBLE_DEVICES=all",
            "NVIDIA_DRIVER_CAPABILITIES=compute,utility",
            "NVIDIA_VISIBLE_DEVICES=GPU-44f83262-58b8-2db0-7960-01d193fcf7b5",
            "NOT_HOST_NETWORK=true",
            "KUBERNETES_SERVICE_PORT=443",
            "KUBERNETES_SERVICE_PORT_HTTPS=443"
        ],
        "cwd": "/",
        "capabilities":
        {
            "bounding":
            [
                "CAP_CHOWN",
                "CAP_DAC_OVERRIDE",
                "CAP_FSETID",
                "CAP_FOWNER",
                "CAP_MKNOD",
                "CAP_NET_RAW",
                "CAP_SETGID",
                "CAP_SETUID",
                "CAP_SETFCAP",
                "CAP_SETPCAP",
                "CAP_NET_BIND_SERVICE",
                "CAP_SYS_CHROOT",
                "CAP_KILL",
                "CAP_AUDIT_WRITE"
            ],
            "effective":
            [
                "CAP_CHOWN",
                "CAP_DAC_OVERRIDE",
                "CAP_FSETID",
                "CAP_FOWNER",
                "CAP_MKNOD",
                "CAP_NET_RAW",
                "CAP_SETGID",
                "CAP_SETUID",
                "CAP_SETFCAP",
                "CAP_SETPCAP",
                "CAP_NET_BIND_SERVICE",
                "CAP_SYS_CHROOT",
                "CAP_KILL",
                "CAP_AUDIT_WRITE"
            ],
            "permitted":
            [
                "CAP_CHOWN",
                "CAP_DAC_OVERRIDE",
                "CAP_FSETID",
                "CAP_FOWNER",
                "CAP_MKNOD",
                "CAP_NET_RAW",
                "CAP_SETGID",
                "CAP_SETUID",
                "CAP_SETFCAP",
                "CAP_SETPCAP",
                "CAP_NET_BIND_SERVICE",
                "CAP_SYS_CHROOT",
                "CAP_KILL",
                "CAP_AUDIT_WRITE"
            ]
        },
        "oomScoreAdj": 1000
    },
    "root":
    {
        "path": "rootfs"
    },
    "mounts":
    [
        {
            "destination": "/proc",
            "type": "proc",
            "source": "proc",
            "options":
            [
                "nosuid",
                "noexec",
                "nodev"
            ]
        },
        {
            "destination": "/dev",
            "type": "tmpfs",
            "source": "tmpfs",
            "options":
            [
                "nosuid",
                "strictatime",
                "mode=755",
                "size=65536k"
            ]
        },
        {
            "destination": "/dev/pts",
            "type": "devpts",
            "source": "devpts",
            "options":
            [
                "nosuid",
                "noexec",
                "newinstance",
                "ptmxmode=0666",
                "mode=0620",
                "gid=5"
            ]
        },
        {
            "destination": "/dev/mqueue",
            "type": "mqueue",
            "source": "mqueue",
            "options":
            [
                "nosuid",
                "noexec",
                "nodev"
            ]
        },
        {
            "destination": "/sys",
            "type": "sysfs",
            "source": "sysfs",
            "options":
            [
                "nosuid",
                "noexec",
                "nodev",
                "ro"
            ]
        },
        {
            "destination": "/sys/fs/cgroup",
            "type": "cgroup",
            "source": "cgroup",
            "options":
            [
                "nosuid",
                "noexec",
                "nodev",
                "relatime",
                "ro"
            ]
        },
        {
            "destination": "/dev/nvidiactl",
            "type": "bind",
            "source": "/dev/nvidiactl",
            "options":
            [
                "rbind",
                "rprivate",
                "rw"
            ]
        },
        {
            "destination": "/dev/nvidia0",
            "type": "bind",
            "source": "/dev/nvidia0",
            "options":
            [
                "rbind",
                "rprivate",
                "rw"
            ]
        },
        {
            "destination": "/dev/nvidia-uvm",
            "type": "bind",
            "source": "/dev/nvidia-uvm",
            "options":
            [
                "rbind",
                "rprivate",
                "rw"
            ]
        },
        {
            "destination": "/etc/hosts",
            "type": "bind",
            "source": "/data/kubelet/pods/c043b554-0f9c-4db6-874b-6977ee24fa96/etc-hosts",
            "options":
            [
                "rbind",
                "rprivate",
                "rw"
            ]
        },
        {
            "destination": "/dev/termination-log",
            "type": "bind",
            "source": "/data/kubelet/pods/c043b554-0f9c-4db6-874b-6977ee24fa96/containers/cuda-vector-add/e8ed181c",
            "options":
            [
                "rbind",
                "rprivate",
                "rw"
            ]
        },
        {
            "destination": "/etc/hostname",
            "type": "bind",
            "source": "/media/disk1/containerd/io.containerd.grpc.v1.cri/sandboxes/c2141dbe87f259715bbfb6f7923cb7d85a484f9d4a809f45555234ecbbf9d7bd/hostname",
            "options":
            [
                "rbind",
                "rprivate",
                "rw"
            ]
        },
        {
            "destination": "/etc/resolv.conf",
            "type": "bind",
            "source": "/media/disk1/containerd/io.containerd.grpc.v1.cri/sandboxes/c2141dbe87f259715bbfb6f7923cb7d85a484f9d4a809f45555234ecbbf9d7bd/resolv.conf",
            "options":
            [
                "rbind",
                "rprivate",
                "rw"
            ]
        },
        {
            "destination": "/dev/shm",
            "type": "bind",
            "source": "/run/containerd/io.containerd.grpc.v1.cri/sandboxes/c2141dbe87f259715bbfb6f7923cb7d85a484f9d4a809f45555234ecbbf9d7bd/shm",
            "options":
            [
                "rbind",
                "rprivate",
                "rw"
            ]
        },
        {
            "destination": "/var/run/secrets/kubernetes.io/serviceaccount",
            "type": "bind",
            "source": "/data/kubelet/pods/c043b554-0f9c-4db6-874b-6977ee24fa96/volumes/kubernetes.io~secret/default-token",
            "options":
            [
                "rbind",
                "rprivate",
                "ro"
            ]
        }
    ],
    "hooks":
    {
        "prestart":
        [
            {
                "path": "/usr/bin/nvidia-container-runtime-hook",
                "args":
                [
                    "/usr/bin/nvidia-container-runtime-hook",
                    "prestart"
                ]
            }
        ]
    },
    "annotations":
    {
        "io.kubernetes.cri.container-name": "cuda-vector-add",
        "io.kubernetes.cri.container-type": "container",
        "io.kubernetes.cri.image-name": "docker.io/nvidia/cuda:11.0.3-base-ubuntu20.04",
        "io.kubernetes.cri.sandbox-id": "c2141dbe87f259715bbfb6f7923cb7d85a484f9d4a809f45555234ecbbf9d7bd",
        "io.kubernetes.cri.sandbox-name": "gpu-operator-test",
        "io.kubernetes.cri.sandbox-namespace": "default"
    },
    "linux":
    {
        "resources":
        {
            "devices":
            [
                {
                    "allow": false,
                    "access": "rwm"
                },
                {
                    "allow": true,
                    "type": "c",
                    "major": 195,
                    "minor": 255,
                    "access": "rw"
                },
                {
                    "allow": true,
                    "type": "c",
                    "major": 195,
                    "minor": 3,
                    "access": "rw"
                }
            ],
            "memory":
            {},
            "cpu":
            {
                "shares": 2,
                "period": 100000,
                "cpus": "0-79"
            }
        },
        "cgroupsPath": "kubepods-besteffort-podc043b554_0f9c_4db6_874b_6977ee24fa96.slice:cri-containerd:73376298ce204adb73424bc020366b89281562a5560cbfaeaee0af0f39071511",
        "namespaces":
        [
            {
                "type": "pid"
            },
            {
                "type": "ipc",
                "path": "/proc/340434/ns/ipc"
            },
            {
                "type": "uts",
                "path": "/proc/340434/ns/uts"
            },
            {
                "type": "mount"
            },
            {
                "type": "network",
                "path": "/proc/340434/ns/net"
            }
        ],
        "devices":
        [
            {
                "path": "/dev/nvidiactl",
                "type": "c",
                "major": 195,
                "minor": 255,
                "fileMode": 438,
                "uid": 0,
                "gid": 0
            },
            {
                "path": "/dev/nvidia3",
                "type": "c",
                "major": 195,
                "minor": 3,
                "fileMode": 438,
                "uid": 0,
                "gid": 0
            }
        ],
        "maskedPaths":
        [
            "/proc/acpi",
            "/proc/kcore",
            "/proc/keys",
            "/proc/latency_stats",
            "/proc/timer_list",
            "/proc/timer_stats",
            "/proc/sched_debug",
            "/proc/scsi",
            "/sys/firmware"
        ],
        "readonlyPaths":
        [
            "/proc/asound",
            "/proc/bus",
            "/proc/fs",
            "/proc/irq",
            "/proc/sys",
            "/proc/sysrq-trigger"
        ]
    }
}

systemd conf:
cat /run/systemd/system/cri-containerd-73376298ce204adb73424bc020366b89281562a5560cbfaeaee0af0f39071511.scope.d/50-DeviceAllow.conf
[Scope]
DeviceAllow=
DeviceAllow=/dev/char/195:255 rw
DeviceAllow=/dev/char/195:3 rw
DeviceAllow=char-pts rwm
DeviceAllow=/dev/char/10:200 rwm
DeviceAllow=/dev/char/5:2 rwm
DeviceAllow=/dev/char/5:0 rwm
DeviceAllow=/dev/char/1:9 rwm
DeviceAllow=/dev/char/1:8 rwm
DeviceAllow=/dev/char/1:7 rwm
DeviceAllow=/dev/char/1:5 rwm
DeviceAllow=/dev/char/1:3 rwm
DeviceAllow=char-* m
DeviceAllow=block-* m

@yangfeiyu20102011
Author

@thaJeztah @kolyshkin Hi, is there a better solution to this problem?

Member

@cyphar cyphar left a comment


In addition to being hardcoded, there is no reason to assume that these major and minor numbers will always be associated with NVIDIA devices. The kernel provides basically no guarantees as to what major and minor numbers will be associated with which driver.

At the very least this should be double-checked against /proc/devices -- but even then, I'm not convinced. This also has the potential to be a security issue -- a container might allow 195:254 access for another driver but then this code would cause runc to allow access to the nvidia-related device files because of this hardcoded check.
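
As a rough sketch of the /proc/devices double-check mentioned above (an illustration only, not a proposed patch): the Go code below parses the "Character devices:" section into a map from major number to driver names. parseCharMajors and majorOwnedBy are hypothetical names, and matching on the "nvidia" name prefix is an assumption about how the driver registers itself.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// parseCharMajors extracts the "Character devices:" section of
// /proc/devices into a map from major number to registered names.
func parseCharMajors(content string) map[int][]string {
	majors := make(map[int][]string)
	inChar := false
	for _, line := range strings.Split(content, "\n") {
		line = strings.TrimSpace(line)
		switch {
		case line == "Character devices:":
			inChar = true
		case line == "Block devices:":
			inChar = false
		case inChar && line != "":
			f := strings.Fields(line)
			if len(f) != 2 {
				continue
			}
			n, err := strconv.Atoi(f[0])
			if err != nil {
				continue
			}
			majors[n] = append(majors[n], f[1])
		}
	}
	return majors
}

// majorOwnedBy reports whether any name registered for major starts with prefix.
func majorOwnedBy(majors map[int][]string, major int, prefix string) bool {
	for _, name := range majors[major] {
		if strings.HasPrefix(name, prefix) {
			return true
		}
	}
	return false
}

func main() {
	data, err := os.ReadFile("/proc/devices")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	majors := parseCharMajors(string(data))
	// 195 is the major number discussed in this thread; the "nvidia"
	// prefix is an assumption about the registered driver name.
	fmt.Println("195 registered to nvidia*:", majorOwnedBy(majors, 195, "nvidia"))
}
```

Even with such a check, the concern raised above still applies: a name match in /proc/devices says which driver currently owns the major number, not that the container was meant to reach that driver's devices.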

Unfortunately the correct way to implement this would've been to have the systemd-style semantics (or some other non-major/minor-based semantics) in the runtime-spec so we could've avoided this whole issue. But obviously it's a bit late for that discussion...

@cyphar
Member

cyphar commented Sep 19, 2022

Perhaps what we should do is to try using device path set in spec, in case /dev/char/MM:mm is not found.

Hmmm. This would work for most device configurations (though not all of them), but we should absolutely double-check that the path on the host has the same major/minor numbers as the rule that references it (otherwise we may end up with a security issue).
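
A minimal Go sketch of that double-check, under stated assumptions: pathMatchesRule and deviceAllowPath are hypothetical names, the code is Linux-only, and specPath is assumed to come from a matching linux.devices entry in the spec. The fallback path is used only if it is a character device whose major/minor numbers equal those of the rule.

```go
package main

import (
	"fmt"
	"os"
	"syscall"
)

// devNumbers splits a Linux dev_t into major and minor numbers
// (same bit layout as glibc's major()/minor()).
func devNumbers(rdev uint64) (major, minor uint64) {
	major = (rdev&0x00000000000fff00)>>8 | (rdev&0xfffff00000000000)>>32
	minor = (rdev & 0x00000000000000ff) | (rdev&0x00000ffffff00000)>>12
	return major, minor
}

// pathMatchesRule reports whether path is a character device on the host
// with exactly the given major and minor numbers.
func pathMatchesRule(path string, major, minor uint64) (bool, error) {
	var st syscall.Stat_t
	if err := syscall.Stat(path, &st); err != nil {
		return false, err
	}
	if st.Mode&syscall.S_IFMT != syscall.S_IFCHR {
		return false, nil
	}
	gotMajor, gotMinor := devNumbers(uint64(st.Rdev))
	return gotMajor == major && gotMinor == minor, nil
}

// deviceAllowPath prefers /dev/char/MAJOR:MINOR and falls back to specPath
// only after verifying that it refers to the same device numbers.
func deviceAllowPath(specPath string, major, minor uint64) (string, error) {
	link := fmt.Sprintf("/dev/char/%d:%d", major, minor)
	if _, err := os.Stat(link); err == nil {
		return link, nil
	}
	ok, err := pathMatchesRule(specPath, major, minor)
	if err != nil {
		return "", err
	}
	if !ok {
		return "", fmt.Errorf("%s is not character device %d:%d", specPath, major, minor)
	}
	return specPath, nil
}

func main() {
	// /dev/null is character device 1:3 on Linux; used here as a harmless example.
	p, err := deviceAllowPath("/dev/null", 1, 3)
	fmt.Println(p, err)
}
```

Rejecting a mismatching path, rather than silently using it, is what keeps a stale or replaced device node from widening the allow list.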

@yangfeiyu20102011
Author

Perhaps what we should do is to try using device path set in spec, in case /dev/char/MM:mm is not found.

Hmmm. This would work for most device configurations (though not all of them), but we should absolutely double-check that the path on the host has the same major/minor numbers as the rule that references it (otherwise we may end up with a security issue).

@cyphar Thanks. Is there a plan for solving this issue? I can use this patch in my personal project, but I still hope this problem can be fixed in the latest runc.

@kolyshkin
Contributor

Perhaps what we should do is to try using device path set in spec, in case /dev/char/MM:mm is not found.

Hmmm. This would work for most device configurations (though not all of them), but we should absolutely double-check that the path on the host has the same major/minor numbers as the rule that references it (otherwise we may end up with a security issue).

Checking is not an issue. The fact that LinuxDeviceCgroup in the OCI runtime spec doesn't have a Path field is.

Now I'm thinking about creating a device file and passing it to systemd; this might be easier and less error prone.

@zvier
Contributor

zvier commented Dec 14, 2022

Is there any better solution to this issue?

@zvier
Contributor

zvier commented Feb 27, 2023

The same problem is discussed in NVIDIA/nvidia-docker#1730, and a fix will be present in the next patch release of all supported NVIDIA GPU drivers.

@kolyshkin
Contributor

This is now being fixed by #3842.

@kolyshkin kolyshkin closed this Apr 25, 2023