"Failed to initialize NVML: Unknown Error" after random amount of time #1671
Hey, I have the same problem.
2. Steps to reproduce the issue
The container works until you run systemctl daemon-reload on the host; after that, nvidia-smi inside the same running container fails with "Failed to initialize NVML: Unknown Error". Running the container again will work fine until you do another daemon-reload.
3. Information to attach (optional if deemed irrelevant)
Other open issues: NVIDIA/nvidia-container-toolkit#251, but that one is using cgroup v1.
Important notes / workaround: the error still occurs with containerd.io v1.6.7 or v1.6.8 even with the workaround applied. Downgrading containerd.io to 1.6.6 works, as long as the cgroup driver is specified explicitly (see the discussion of cgroup drivers further down).
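A minimal sketch of this reproduction, assuming Docker and a CUDA-capable image (the container name is a placeholder; the image is the one used elsewhere in this thread):
# start a GPU container that keeps running
docker run -d --gpus all --name gputest tensorflow/tensorflow:latest-gpu sleep infinity
# works at first
docker exec gputest nvidia-smi
# on the host: trigger the failure
sudo systemctl daemon-reload
# inside the same running container: now fails
docker exec gputest nvidia-smi
# Failed to initialize NVML: Unknown Error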
@elezar Previously persistence mode was off, so this happens either way. Also, on k8s-device-plugin/issues/289 @klueska said:
@kevin-bockman the experimental mode is still a work in progress and we don't have a concrete timeline on when this will be available for testing. I will update the issue here as soon as I have more information.
The other option is to move to …
@klueska Sorry, with all of the information, it wasn't really clear. The problem is that it's already on cgroup v2 AFAIK. I started from a fresh install of Ubuntu 22.04.1. The only way I could get this to work after a systemctl daemon-reload was to downgrade containerd.io to 1.6.6.
@kevin-bockman I had a similar experience. In my case, … This setting works on some machines but not on others. I tried downgrading and upgrading containerd.io to check whether this strategy works.
The above is not the answer... It prevents the NVML error caused by a docker resource update, but the NVML error still occurs after a random amount of time.
Same issue. Ubuntu 22, docker-ce. I will probably just end up writing a cron job script to check for the error and restart the container.
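Such a watchdog could be as simple as the sketch below (container selection, log path, and cron schedule are assumptions):
#!/usr/bin/env bash
# restart any running container whose nvidia-smi call fails with the NVML error
for c in $(docker ps --format '{{.Names}}'); do
    if docker exec "$c" nvidia-smi 2>&1 | grep -q "Failed to initialize NVML"; then
        echo "$(date): NVML error in $c, restarting" >> /var/log/nvml-watchdog.log
        docker restart "$c"
    fi
done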
The solution proposed by @kevin-bockman has been working without any problem for more than a month now.
I am using docker-ce on Ubuntu 22, so I opted for this approach; it has been working fine so far.
Same issue on an NVIDIA 3090.
Hello there. I'm hitting the same issue here, but with containerd and the NVIDIA GPU Operator on Kubernetes. Here's my configuration:
Note that the NVIDIA Container Toolkit has been installed by the NVIDIA GPU Operator on Kubernetes (v1.25.3). I attached the containerd configuration file and the nvidia-container-runtime configuration file to my comment.
How I reproduce this bug: I run the following command on my host:
# nerdctl run -n k8s.io --runtime=/usr/local/nvidia/toolkit/nvidia-container-runtime --network=host --rm -ti --name ubuntu --gpus all -v /run/nvidia/driver/usr/bin:/tmp/nvidia-bin docker.io/library/ubuntu:latest bash
After some time, the nvidia-smi command inside the container starts failing with "Failed to initialize NVML: Unknown Error".
Traces, logs, etc.:
Thank you very much for your help. 🙏
Here I wrote up the detailed steps for how I fixed this issue in our environment with cgroup v2. Let me know if it works in your env. https://gist.github.com/gengwg/55b3eb2bc22bcbd484fccbc0978484fc
@gengwg Can you check whether your solution still works after calling systemctl daemon-reload?
Yes, that's actually the first thing I tested when I upgraded v1 --> v2. It's easy to test, because you don't need to wait a few hours/days. To double check, I just tested it again right now. Before:
Do the reload on that node itself:
After:
I will update the note to reflect this test too.
And I can also confirm that's what I saw on our cgroup v1 nodes too.
Hi, what's your cgroup driver for kubelet and containerd? We hit the same problem on cgroup v2; our cgroup driver is systemd. Also, if we switch the cgroup driver of docker to cgroupfs, it also solves the problem.
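For reference, a sketch of switching Docker's cgroup driver to cgroupfs (these are standard Docker daemon.json/exec-opts settings; merge the key into your existing file if you already have other options, such as the nvidia runtime entry):
# note: this overwrites /etc/docker/daemon.json -- back it up and merge manually if needed
sudo cp /etc/docker/daemon.json /etc/docker/daemon.json.bak 2>/dev/null || true
cat <<'EOF' | sudo tee /etc/docker/daemon.json
{
    "exec-opts": ["native.cgroupdriver=cgroupfs"]
}
EOF
sudo systemctl restart docker
docker info | grep -i cgroup   # should now report the cgroupfs driver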
I've also tried this approach. The reason containerd 1.6.7 doesn't work is that runc was updated to 1.1.3; in this version runc will ignore some char devices that can't be found on the host.
@gengwg Thanks for sharing your document. As I run my Kubernetes cluster on Ubuntu 22.04, cgroup v2 is the default cgroup subsystem used. I deployed two environments to help me make some comparisons:
Interestingly, I never faced this issue on the second environment; everything is running perfectly well. The first environment, though, runs into this issue after some time. That probably means that NVIDIA's container runtime isn't the faulty component here, but it needs more investigation on my side to be sure that I'm not missing anything. I'll have a look at the cgroup driver as @panli889 mentioned. Thanks again for your help
The cgroup drivers for kubelet, docker and containerd are all …
We are in the middle of migrating from docker to containerd, so we have both docker and containerd nodes. This seems to have fixed it for BOTH. Docker nodes:
Containerd nodes:
Here is our k8s version:
I think ours is similar to your 2nd env, i.e. containerd & nvidia-container-toolkit. We are on k8s v1.22.9.
I posted the cgroup driver info above.
@gengwg thanks for your reply!
Hmm, that's interesting; it's quite different from my situation. Would you please share your systemd version? I can share the problems we're hitting: if we create a pod with a GPU, a related systemd scope is created at the same time, like …
And if we check the content of that file, there is no reference to the NVIDIA devices. So would you please also take a look at the content of that file on your nodes?
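A sketch of how to inspect those device rules on a node (the scope name pattern and the transient drop-in path are assumptions about a systemd-managed setup):
# pick a container scope (cri-containerd-<id>.scope with containerd, docker-<id>.scope with docker)
SCOPE=$(systemctl list-units --type=scope --no-legend | awk '/cri-containerd-|docker-/{print $1; exit}')
# device rules systemd currently tracks for that scope
systemctl show "$SCOPE" -p DeviceAllow
# transient drop-in that records them on disk
cat "/run/systemd/transient/${SCOPE}.d/50-DeviceAllow.conf"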
Same issue with 2 x Nvidia 3090 Ti, Ubuntu 22.04.1 LTS, Driver Version: 510.85.02, CUDA Version: 11.6
@panli889 I checked the scope unit. After running it and looking at the content of the file, there's indeed no reference to nvidia's devices there.
@fradsj thanks for your reply, seems like the same problem as ours. Here is how we solved it; hope it helps:
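For reference, a sketch of the /dev/char symlink approach mentioned further down in this thread, assuming nvidia-ctk from NVIDIA Container Toolkit v1.12 or newer is installed on the host:
# create /dev/char/<major>:<minor> symlinks for the NVIDIA device nodes so that
# the device rules survive a systemd daemon-reload (re-run after each reboot, e.g. from a systemd unit or udev rule)
sudo nvidia-ctk system create-dev-char-symlinks --create-all
ls -l /dev/char | grep -i nvidia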
Hi, is there any official way to fix this error?
See https://github.com/NVIDIA/k8s-device-plugin#setting-other-helm-chart-values (which needs an update for a discussion on the options and setting up privileged). Privileged mode is required when passing the device specs so that the device plugin can see all the required device nodes. Otherwise it would not have the required access (even though this is also provided by the nvidia container toolkit).
Using privileged mode for the DP (device plugin) didn't work, but using privileged mode for the user workload Pod did. Also, it seems that as long as the user workload Pod is privileged, there aren't any problems -- the DP doesn't need to be privileged, and no symlinks for the char devices need to be created.
That is true, but most users don't want to run their user pods as privileged (and they shouldn't have to if everything else is set up properly).
With device-plugin version 1.0.0-beta, runc will also write the cgroup fs if it has the device list, so …
Some relevant comments & solutions from @cdesiniotis at nvidia on the matter: #1730
Thanks @breakingflower, that's very useful. FYI: From the Notice:
I can confirm that using the new version of the GPU Operator resolves the issue when CDI is enabled in …
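A sketch of enabling CDI through the GPU Operator helm chart (the value names are assumptions based on the operator's CDI support):
# enable CDI and make it the default runtime mode for the operator-managed toolkit
helm upgrade --install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator --create-namespace \
  --set cdi.enabled=true \
  --set cdi.default=true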
However, I am facing an issue where …
Any update on this? |
Please see this notice from February: |
Is there any timeline for a solution besides the workarounds described in #1730?
I tried the suggested approach in #6380, but it didn't solve the problem. It is quite frustrating as I cannot rely on AKS at the moment. I hope this issue is solved soon.
@rogelioamancisidor we've heard that AKS ships with a really old version of the k8s-device-plugin (from 2019!) which doesn't support the PASS_DEVICE_SPECS flag. You will need to update the plugin to a newer one and pass this flag for things to work on AKS.
The plugin is available here: https://github.com/NVIDIA/k8s-device-plugin. The README should cover a variety of deployment options; helm is recommended. The latest version of the plugin is …
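If the flag still needs to be set on an existing static-manifest deployment, a sketch using kubectl (the DaemonSet name and namespace are assumptions; note this replaces any env list already on the container):
# pass the device specs and run the plugin privileged, then wait for the rollout
kubectl -n kube-system patch daemonset nvidia-device-plugin-daemonset --type=json -p='[
  {"op":"add","path":"/spec/template/spec/containers/0/env","value":[{"name":"PASS_DEVICE_SPECS","value":"true"}]},
  {"op":"add","path":"/spec/template/spec/containers/0/securityContext","value":{"privileged":true}}
]'
kubectl -n kube-system rollout status daemonset nvidia-device-plugin-daemonset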
I deployed a DaemonSet for the NVIDIA device plugin using the yaml manifest in the link that I posted. The manifest in the link includes this line: …
Here is the official solution: modify …
It is working.
Isn't it …?
@homjay I don't think that solution works on K8s.
This is an issue as described in NVIDIA/nvidia-container-toolkit#48. Since this issue has a number of different failure modes discussed, I'm going to close it and ask that those still having a problem open new issues in the respective repositories.
We are looking to migrate all issues in this repo to https://github.com/NVIDIA/nvidia-container-toolkit in the near term.
1. Issue or feature description
After a random amount of time (it could be hours or days) the GPUs become unavailable inside all the running containers, and nvidia-smi returns "Failed to initialize NVML: Unknown Error". A restart of all the containers fixes the issue and the GPUs become available again.
Outside the containers the GPUs are still working correctly.
I tried searching in the open/closed issues but I could not find any solution.
2. Steps to reproduce the issue
All the containers are run with:
docker run --gpus all -it tensorflow/tensorflow:latest-gpu /bin/bash
3. Information to attach
nvidia-container-cli -k -d /dev/tty info
uname -a
nvidia-smi -a
docker version
dpkg -l '*nvidia*'
or rpm -qa '*nvidia*'
nvidia-container-cli -V