GPU becomes unavailable after some time in Kubernetes environment
1. Issue or feature description
We have a problem where GPUs become unavailable inside a Kubernetes pod. Some time after the pod is created, running the nvidia-smi command inside the pod fails with the error message: Failed to initialize NVML: Unknown Error.
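For reference, this is roughly how we observe the failure from outside the pod; the namespace and pod names below are placeholders, not our actual workload names.

```sh
# Check GPU visibility inside a running GPU pod (namespace/pod names are placeholders).
kubectl exec -n gpu-workloads my-gpu-pod -- nvidia-smi

# Shortly after the pod starts this prints the normal GPU table; after some time
# the same command only returns:
#   Failed to initialize NVML: Unknown Error
```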
2. Steps to reproduce the issue
E.g., create a Kubernetes pod on an Ubuntu 20.04 system with the NVIDIA driver installed and run watch -n 1 nvidia-smi inside the pod (the failure may take minutes to several hours to appear). A sketch of such a pod follows.
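A minimal sketch of the kind of pod used for the reproduction; the pod name, image tag, and GPU count are examples only and assume the NVIDIA device plugin exposes the nvidia.com/gpu resource on the node.

```sh
# Create a throwaway GPU pod (names and image tag are examples only).
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: nvml-repro
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvidia/cuda:11.4.3-base-ubuntu20.04
    command: ["sleep", "infinity"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF

# Poll nvidia-smi inside the pod; the NVML error can take minutes to hours to appear.
kubectl exec -it nvml-repro -- watch -n 1 nvidia-smi
```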
3. Information to attach (optional if deemed irrelevant)
Some nvidia-container information: nvidia-container-cli -k -d /dev/tty info
Kernel version from uname -a
Any relevant kernel output lines from dmesg
Driver information from nvidia-smi -a
Docker version from docker version
NVIDIA packages version from dpkg -l '*nvidia*' or rpm -qa '*nvidia*'
NVIDIA container library version from nvidia-container-cli -V
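A small sketch, assuming a Debian/Ubuntu or RPM-based node, of collecting the items above in one pass; the output directory name is arbitrary.

```sh
# Gather the diagnostics listed above on the affected node (output path is arbitrary).
OUT=nvml-debug
mkdir -p "$OUT"
nvidia-container-cli -k -d /dev/tty info > "$OUT/nvidia-container-info.log" 2>&1
uname -a        > "$OUT/uname.log"
dmesg           > "$OUT/dmesg.log"
nvidia-smi -a   > "$OUT/nvidia-smi.log"
docker version  > "$OUT/docker-version.log"
# Debian/Ubuntu nodes use dpkg; RPM-based nodes use the rpm query instead.
dpkg -l '*nvidia*' > "$OUT/nvidia-packages.log" 2>/dev/null \
  || rpm -qa '*nvidia*' > "$OUT/nvidia-packages.log"
nvidia-container-cli -V > "$OUT/nvidia-container-cli-version.log"
```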