Updating cpu-manager-policy=static causes NVML unknown error #966
Comments
Unfortunately, this is a known issue. It was first reported here:

The underlying issue is that libnvidia-container injects the GPU devices into containers without docker being aware of them. For example, if you do a docker inspect on a functioning GPU-enabled container today, you will see that its device list is empty, even though it clearly has the NVIDIA devices injected into it and the cgroup access to those devices is set up properly. This has not been an issue until now because everything works fine at initial container creation time; these settings are modified by libnvidia-container out-of-band, without docker's knowledge.

The problem comes when some external entity hits docker's ContainerUpdate API (whether directly via the CLI or through an API call, as the CPUManager in Kubernetes does). When this API is invoked, docker resolves its empty device list to disk, essentially "undoing" what libnvidia-container had set up in regards to these devices.

We need to come up with a solution that allows docker to be made aware of the devices that libnvidia-container injects, so that a container update does not undo those settings.
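To make the above concrete, here is a minimal sketch of how to observe this from a node. It assumes cgroup v1, direct access to the docker CLI, and a hypothetical GPU-enabled container named gpu-test:

```sh
# 1. Docker's own view of the container's devices is empty, even though
#    libnvidia-container has injected the NVIDIA devices:
docker inspect --format '{{json .HostConfig.Devices}}' gpu-test

# 2. The device cgroup nevertheless grants access to the NVIDIA devices
#    (character devices with major number 195) at this point:
docker exec gpu-test cat /sys/fs/cgroup/devices/devices.list

# 3. Any call to docker's ContainerUpdate API (here via the CLI) makes
#    docker re-resolve its empty device list to disk, dropping the
#    cgroup entries that libnvidia-container had set up:
docker update --cpuset-cpus 0-1 gpu-test

# 4. nvidia-smi inside the container now fails with the NVML error:
docker exec gpu-test nvidia-smi
```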
If your setup is constrained such that GPUs will only ever be used by containers that have Guaranteed QoS (i.e. integer CPU requests equal to their limits), then there is a workaround: the code snippet referenced in the next comment. A sketch of what such a Guaranteed-QoS pod spec looks like follows below.
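For reference, this is a minimal sketch of a Guaranteed-QoS GPU pod. The pod name and image are placeholders; the key point is that requests equal limits for every resource and the CPU count is an integer, so the CPUManager gives the container exclusive CPUs:

```sh
# Sketch of a Guaranteed-QoS GPU pod (name and image are placeholders).
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-guaranteed
spec:
  containers:
  - name: cuda
    image: nvidia/cuda:11.0-base
    command: ["sh", "-c", "nvidia-smi && sleep 3600"]
    resources:
      requests:
        cpu: "2"
        memory: 4Gi
        nvidia.com/gpu: 1
      limits:
        cpu: "2"
        memory: 4Gi
        nvidia.com/gpu: 1
EOF
```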
@klueska Thanks so much! It works when we use your code snippet with the CPU manager policy set to static and the container's QoS class set to Guaranteed. However, is there any way to get GPUs working for containers that are not Guaranteed while cpu-manager-policy is set to static? Is this something you intend to develop, or is there a way for me to work around it myself?
@klueska As we know, LXD 3.0.0 already supports NVIDIA runtime passthrough.
Any change made to Kubernetes is always going to be a workaround. The real fix needs to come in the container stack itself, i.e. in how docker and libnvidia-container interact.

I've never tried LXD with Kubernetes, so I'm not in a position to say how well it would work or not. I do know that LXD still uses libnvidia-container under the hood, though.

Again, the underlying problem is that docker is not told about the devices that libnvidia-container injects into its containers.
It is not strictly necessary to return the list of device nodes in order to trigger the NVIDIA container stack to inject a set of GPUs into a container. However, without this list, the container runtime (i.e. docker) will not be aware of the device nodes injected into the container by the NVIDIA container stack.

This normally has little to no consequence for containers relying on the NVIDIA container stack to allocate GPUs. However, problems arise when using GPUs in conjunction with the Kubernetes CPUManager. The following issue summarizes the problem well: NVIDIA/nvidia-docker#966

With this patch, we add a flag to optionally pass back the list of device nodes that the NVIDIA container stack will inject into the container, so that the kubelet can forward it to the container runtime. With this small change, the above issue no longer gets triggered.

Signed-off-by: Kevin Klues <[email protected]>
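Once a plugin build containing this change is deployed, its effect can be checked from the node with the same commands as before; a sketch (container name hypothetical, cgroup v1 assumed):

```sh
# With a plugin that passes device specs back to the kubelet, docker's
# view of the container's devices is no longer empty, so an update no
# longer wipes the device-cgroup entries.
docker inspect --format '{{json .HostConfig.Devices}}' gpu-test
# expected: entries for /dev/nvidiactl, /dev/nvidia-uvm, /dev/nvidia0, ...

# nvidia-smi keeps working even after a ContainerUpdate:
docker update --cpuset-cpus 0-1 gpu-test && docker exec gpu-test nvidia-smi
```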
@klueska fixed this in the latest NVIDIA device-plugin (beta5) release, though it is still a workaround.
Note: to use this workaround you will need to use the new daemonset spec, nvidia-device-plugin-compat-with-cpumanager.yml. This spec does two things differently from the default one: it starts the plugin with the --pass-device-specs flag (so the list of NVIDIA device nodes is passed back to the kubelet), and it runs the plugin container with elevated (privileged) permissions.
If you don't want to use the new spec wholesale, you can add those same settings to the daemonset you already have; a sketch of the relevant fragment is shown below.
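The fragment below is a sketch of what that amounts to, based on the --pass-device-specs flag and the privileged security context used by the compat spec; the daemonset name is assumed and the exact upstream spec may differ:

```sh
# Sketch: edit the existing device-plugin daemonset and add the two
# settings that the compat spec uses.
kubectl -n kube-system edit daemonset nvidia-device-plugin-daemonset   # daemonset name assumed

# In the plugin container entry, add:
#
#   args: ["--pass-device-specs"]   # pass device nodes back to the kubelet
#   securityContext:
#     privileged: true              # the compat spec runs the plugin privileged
```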
@klueska Is there a device-plugin PR for this issue? I would like to learn about the fix.
@klueska nvidia-device-plugin-compat-with-cpumanager.yml uses nvidia/k8s-device-plugin:v0.7.1, while my current cluster is using nvidia/k8s-device-plugin:1.11. Is k8s-device-plugin:v0.7.1 a newer version than k8s-device-plugin:1.11? Can I upgrade the daemonset directly without breaking things in production?
@zionwu Yes. Please see https://github.com/NVIDIA/k8s-device-plugin#versioning for info on versioning / upgrading. Also, keep in mind that the semantics around deploying the plugin on nodes that do not have GPUs have changed slightly: you may need to set the --fail-on-init-error flag to false so the plugin does not fail on GPU-less nodes.
Got it. Thank you, @klueska!
What happened:
After setting cpu-manager-policy=static for the kubelet, running nvidia-smi in a GPU pod reports an error.
Setting cpu-manager-policy=none does not cause this error.
Sometimes nvidia-smi does not fail when the pod first runs; about 10 seconds later, running nvidia-smi gives an error.
Checking the cause of the error shows that reading /dev/nvidiactl fails with "Operation not permitted".
Set cpu-manager-policy to none and static respectively and create one test pod under each: test-gpu (nvidia-smi runs fine) and test-gpu-err (running nvidia-smi reports an error).
Compare the pods' /sys/fs/cgroup/devices/devices.list; the two lists differ (a sketch of the commands follows the two cases below).
test-gpu (nvidia-smi can be run)
test-gpu-err (running nvidia-smi reports an error)
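A sketch of how these two lists can be compared (pod names from the description above; cgroup v1 assumed):

```sh
# Compare the device cgroup of the working and the broken pod.
# The entries for the NVIDIA character devices (major number 195) are
# present for test-gpu but missing for test-gpu-err after the update.
kubectl exec test-gpu     -- cat /sys/fs/cgroup/devices/devices.list
kubectl exec test-gpu-err -- cat /sys/fs/cgroup/devices/devices.list

# The failing call that surfaces as the NVML error:
kubectl exec test-gpu-err -- nvidia-smi
```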
So, after setting cpu-manager-policy=static for the kubelet, a GPU pod can run nvidia-smi for a short time, but a periodic task (apparently the CPUManager reconcile loop, which runs about every 10 seconds) modifies /sys/fs/cgroup/devices/devices.list so that the container loses read and write access to /dev/nvidiactl (and presumably other device files), which then causes the nvidia-smi error.