Updating cpu-manager-policy=static causes NVML unknown error #966
Comments
Unfortunately, this is a known issue. It was first reported here:

The underlying issue is that libnvidia-container injects the GPU devices into containers without docker being aware of them. For example, if you do a docker inspect on a functioning GPU-enabled container today, you will see that its device list is empty, even though it clearly has the NVIDIA devices injected into it and the cgroup access to those devices is set up properly. This has not been an issue until now because everything works fine at initial container creation time; these settings are modified by libnvidia-container out-of-band, without docker's knowledge.

The problem comes when some external entity hits docker's ContainerUpdate API (whether directly via the CLI or through an API call, as the CPUManager in Kubernetes does). When this API is invoked, docker resolves its empty device list to disk, essentially "undoing" what libnvidia-container had set up in regards to these devices.

We need to come up with a solution that allows docker to be made aware of the devices that libnvidia-container injects, so that a container update does not undo those settings.
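To make the above concrete, here is a minimal sketch of how to observe this from a node. It assumes cgroup v1, direct access to the docker CLI, and a hypothetical GPU-enabled container named gpu-test:

```sh
# 1. Docker's own view of the container's devices is empty, even though
#    libnvidia-container has injected the NVIDIA devices:
docker inspect --format '{{json .HostConfig.Devices}}' gpu-test

# 2. The device cgroup nevertheless grants access to the NVIDIA devices
#    (character devices with major number 195) at this point:
docker exec gpu-test cat /sys/fs/cgroup/devices/devices.list

# 3. Any call to docker's ContainerUpdate API (here via the CLI) makes
#    docker re-resolve its empty device list to disk, dropping the
#    cgroup entries that libnvidia-container had set up:
docker update --cpuset-cpus 0-1 gpu-test

# 4. nvidia-smi inside the container now fails with the NVML error:
docker exec gpu-test nvidia-smi
```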
If your setup is constrained such that GPUs will only ever be used by containers that have Guaranteed QoS (i.e. integer CPU requests equal to their limits), then there is a workaround: the code snippet referenced in the next comment. A sketch of what such a Guaranteed-QoS pod spec looks like follows below.
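For reference, this is a minimal sketch of a Guaranteed-QoS GPU pod. The pod name and image are placeholders; the key point is that requests equal limits for every resource and the CPU count is an integer, so the CPUManager gives the container exclusive CPUs:

```sh
# Sketch of a Guaranteed-QoS GPU pod (name and image are placeholders).
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-guaranteed
spec:
  containers:
  - name: cuda
    image: nvidia/cuda:11.0-base
    command: ["sh", "-c", "nvidia-smi && sleep 3600"]
    resources:
      requests:
        cpu: "2"
        memory: 4Gi
        nvidia.com/gpu: 1
      limits:
        cpu: "2"
        memory: 4Gi
        nvidia.com/gpu: 1
EOF
```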
@klueska Thanks so much! It works when we use your code snippet with the CPU manager policy set to static and the container's QoS class set to Guaranteed. However, is there any way to get GPUs working for containers that are not Guaranteed while cpu-manager-policy is set to static? Is this something you intend to develop, or is there a way for me to work around it myself?
@klueska As we know, LXD 3.0.0 already supports NVIDIA runtime passthrough.
Any change made to Kubernetes is always going to be a workaround. The real fix needs to come in the container stack itself, i.e. in how docker and libnvidia-container interact.

I've never tried LXD with Kubernetes, so I'm not in a position to say how well it would work or not. I do know that LXD still uses libnvidia-container under the hood, though.

Again, the underlying problem is that docker is not told about the devices that libnvidia-container injects into its containers.
It is not strictly necessary to return the list of device nodes in order to trigger the NVIDIA container stack to inject a set of GPUs into a container. However, without this list, the container runtime (i.e. docker) will not be aware of the device nodes injected into the container by the NVIDIA container stack.

This normally has little to no consequence for containers relying on the NVIDIA container stack to allocate GPUs. However, problems arise when using GPUs in conjunction with the Kubernetes CPUManager. The following issue summarizes the problem well: NVIDIA/nvidia-docker#966

With this patch, we add a flag to optionally pass back the list of device nodes that the NVIDIA container stack will inject into the container, so that the kubelet can forward it to the container runtime. With this small change, the above issue no longer gets triggered.

Signed-off-by: Kevin Klues <[email protected]>
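Once a plugin build containing this change is deployed, its effect can be checked from the node with the same commands as before; a sketch (container name hypothetical, cgroup v1 assumed):

```sh
# With a plugin that passes device specs back to the kubelet, docker's
# view of the container's devices is no longer empty, so an update no
# longer wipes the device-cgroup entries.
docker inspect --format '{{json .HostConfig.Devices}}' gpu-test
# expected: entries for /dev/nvidiactl, /dev/nvidia-uvm, /dev/nvidia0, ...

# nvidia-smi keeps working even after a ContainerUpdate:
docker update --cpuset-cpus 0-1 gpu-test && docker exec gpu-test nvidia-smi
```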
@klueska fixed this in the latest NVIDIA device-plugin (beta5) release, though it is still a workaround.
Note: to use this workaround you will need to use the new daemonset spec, nvidia-device-plugin-compat-with-cpumanager.yml. This spec does two things differently from the default one: it starts the plugin with the --pass-device-specs flag (so the list of NVIDIA device nodes is passed back to the kubelet), and it runs the plugin container with elevated (privileged) permissions.
If you don't want to use the new spec wholesale, you can add those same settings to the daemonset you already have; a sketch of the relevant fragment is shown below.
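The fragment below is a sketch of what that amounts to, based on the --pass-device-specs flag and the privileged security context used by the compat spec; the daemonset name is assumed and the exact upstream spec may differ:

```sh
# Sketch: edit the existing device-plugin daemonset and add the two
# settings that the compat spec uses.
kubectl -n kube-system edit daemonset nvidia-device-plugin-daemonset   # daemonset name assumed

# In the plugin container entry, add:
#
#   args: ["--pass-device-specs"]   # pass device nodes back to the kubelet
#   securityContext:
#     privileged: true              # the compat spec runs the plugin privileged
```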
@klueska Is there a device-plugin PR for this issue? I would like to learn about the fix.
@klueska nvidia-device-plugin-compat-with-cpumanager.yml uses nvidia/k8s-device-plugin:v0.7.1, while my current cluster is using nvidia/k8s-device-plugin:1.11. Is k8s-device-plugin:v0.7.1 a newer version than k8s-device-plugin:1.11? Can I upgrade the daemonset directly without breaking things in production?
@zionwu Yes. Please see https://github.com/NVIDIA/k8s-device-plugin#versioning for info on versioning / upgrading. Also, keep in mind that the semantics around deploying the plugin on nodes that do not have GPUs have changed slightly: you may need to set the --fail-on-init-error flag to false so the plugin does not fail on GPU-less nodes.
Got it. Thank you, @klueska!
What happened:
After setting cpu-manager-policy=static for the kubelet, running nvidia-smi in a GPU pod reports an error.
Setting cpu-manager-policy=none does not cause this error.
Sometimes nvidia-smi does not fail when the pod first runs; about 10 seconds later, running nvidia-smi gives an error.
Checking the cause of the error shows that reading /dev/nvidiactl fails with "Operation not permitted".
Set cpu-manager-policy to none and static respectively and create one test pod under each: test-gpu (nvidia-smi runs fine) and test-gpu-err (running nvidia-smi reports an error).
Compare the pods' /sys/fs/cgroup/devices/devices.list; the two lists differ (a sketch of the commands follows the two cases below).
test-gpu (nvidia-smi can be run)
test-gpu-err (running nvidia-smi reports an error)
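A sketch of how these two lists can be compared (pod names from the description above; cgroup v1 assumed):

```sh
# Compare the device cgroup of the working and the broken pod.
# The entries for the NVIDIA character devices (major number 195) are
# present for test-gpu but missing for test-gpu-err after the update.
kubectl exec test-gpu     -- cat /sys/fs/cgroup/devices/devices.list
kubectl exec test-gpu-err -- cat /sys/fs/cgroup/devices/devices.list

# The failing call that surfaces as the NVML error:
kubectl exec test-gpu-err -- nvidia-smi
```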
So, after setting cpu-manager-policy=static for the kubelet, a GPU pod can run nvidia-smi for a short time, but a periodic task (apparently the CPUManager reconcile loop, which runs about every 10 seconds) modifies /sys/fs/cgroup/devices/devices.list so that the container loses read and write access to /dev/nvidiactl (and presumably other device files), which then causes the nvidia-smi error.