
GPU becomes unavailable after some time in Kubernetes environment #1661

Closed

eason-jiang-intel opened this issue Aug 9, 2022 · 1 comment

Comments

@eason-jiang-intel

1. Issue or feature description

GPU becomes unavailable after some time in Kubernetes environment

We have a problem where GPUs become unavailable inside a Kubernetes pod. Some time after the pod is created, running the `nvidia-smi` command in the pod fails with the error message `Failed to initialize NVML: Unknown Error`.

2. Steps to reproduce the issue

E.g. create a GPU pod on an Ubuntu 20.04 system with the NVIDIA driver installed and run `watch -n 1 nvidia-smi` inside the pod (the failure might take minutes to several hours to appear).
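A minimal reproduction sketch, assuming the NVIDIA device plugin is already deployed in the cluster; the pod name and CUDA base image tag are illustrative assumptions, not taken from the report:

```sh
# Create a pod that requests one GPU (pod name and image tag are assumptions).
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smi-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvidia/cuda:11.4.3-base-ubuntu20.04
    command: ["sleep", "infinity"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF

# Poll nvidia-smi once per second (a shell loop avoids depending on `watch`
# being present in the image); on affected systems the output eventually
# changes to "Failed to initialize NVML: Unknown Error".
kubectl exec -it gpu-smi-test -- sh -c 'while true; do nvidia-smi; sleep 1; done'
```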

3. Information to attach (optional if deemed irrelevant)

  • Some nvidia-container information: `nvidia-container-cli -k -d /dev/tty info`
  • Kernel version from `uname -a`
  • Any relevant kernel output lines from `dmesg`
  • Driver information from `nvidia-smi -a`
  • Docker version from `docker version`
  • NVIDIA packages version from `dpkg -l '*nvidia*'` or `rpm -qa '*nvidia*'`
  • NVIDIA container library version from `nvidia-container-cli -V`
  • NVIDIA container library logs (see troubleshooting)
  • Docker command, image and tag used
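A small shell sketch that gathers the items listed above into a single report file on the node; the output path is arbitrary, and `dpkg` or `rpm` is used depending on the distribution:

```sh
# Collect the requested diagnostics into one file (path is arbitrary).
out=nvidia-issue-report.txt
{
  echo '== nvidia-container-cli info ==';    nvidia-container-cli -k -d /dev/tty info
  echo '== kernel ==';                       uname -a
  echo '== dmesg (NVIDIA lines) ==';         dmesg | grep -i nvidia
  echo '== nvidia-smi -a ==';                nvidia-smi -a
  echo '== docker version ==';               docker version
  echo '== NVIDIA packages ==';              dpkg -l '*nvidia*' 2>/dev/null || rpm -qa '*nvidia*'
  echo '== nvidia-container-cli version =='; nvidia-container-cli -V
} > "$out" 2>&1
```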
@elezar
Member

elezar commented Nov 27, 2023

This is a duplicate of a known issue, NVIDIA/nvidia-container-toolkit#48, which occurs with certain combinations of runc and systemd versions.

Please follow the steps described there and open a new issue against https://github.com/NVIDIA/nvidia-container-toolkit if you are still having problems.
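For completeness, a quick sketch for checking the component versions the comment points at; treating `systemctl daemon-reload` as the trigger is based on the linked upstream issue and is an assumption here, not something stated in this report:

```sh
# Versions of the two components involved in the known issue.
runc --version
systemctl --version

# The linked issue describes host-side systemd unit reloads as a trigger
# that makes GPU device access disappear from running containers.
sudo systemctl daemon-reload

# Re-check inside the pod (name from the reproduction sketch above); on
# affected runc/systemd combinations this now fails with
# "Failed to initialize NVML: Unknown Error".
kubectl exec -it gpu-smi-test -- nvidia-smi
```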

elezar closed this as completed on Nov 27, 2023