
GPU becomes unavailable after some time in Kubernetes environment #1661

Closed

eason-jiang-intel opened this issue Aug 9, 2022 · 1 comment

Comments

@eason-jiang-intel

1. Issue or feature description

GPU becomes unavailable after some time in Kubernetes environment

We have a problem where GPUs become unavailable inside a Kubernetes pod. Some time after the pod is created, running the `nvidia-smi` command in the pod fails with the error message `Failed to initialize NVML: Unknown Error`.

2. Steps to reproduce the issue

E.g. create a GPU pod on an Ubuntu 20.04 system with the NVIDIA driver installed and run `watch -n 1 nvidia-smi` inside the pod (the failure might take minutes to several hours to appear).
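A minimal reproduction sketch, assuming the NVIDIA device plugin is already deployed in the cluster; the pod name and CUDA base image tag are illustrative assumptions, not taken from the report:

```sh
# Create a pod that requests one GPU (pod name and image tag are assumptions).
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smi-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvidia/cuda:11.4.3-base-ubuntu20.04
    command: ["sleep", "infinity"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF

# Poll nvidia-smi once per second (a shell loop avoids depending on `watch`
# being present in the image); on affected systems the output eventually
# changes to "Failed to initialize NVML: Unknown Error".
kubectl exec -it gpu-smi-test -- sh -c 'while true; do nvidia-smi; sleep 1; done'
```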

3. Information to attach (optional if deemed irrelevant)

  • Some nvidia-container information: `nvidia-container-cli -k -d /dev/tty info`
  • Kernel version from `uname -a`
  • Any relevant kernel output lines from `dmesg`
  • Driver information from `nvidia-smi -a`
  • Docker version from `docker version`
  • NVIDIA packages version from `dpkg -l '*nvidia*'` or `rpm -qa '*nvidia*'`
  • NVIDIA container library version from `nvidia-container-cli -V`
  • NVIDIA container library logs (see troubleshooting)
  • Docker command, image and tag used
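A small shell sketch that gathers the items listed above into a single report file on the node; the output path is arbitrary, and `dpkg` or `rpm` is used depending on the distribution:

```sh
# Collect the requested diagnostics into one file (path is arbitrary).
out=nvidia-issue-report.txt
{
  echo '== nvidia-container-cli info ==';    nvidia-container-cli -k -d /dev/tty info
  echo '== kernel ==';                       uname -a
  echo '== dmesg (NVIDIA lines) ==';         dmesg | grep -i nvidia
  echo '== nvidia-smi -a ==';                nvidia-smi -a
  echo '== docker version ==';               docker version
  echo '== NVIDIA packages ==';              dpkg -l '*nvidia*' 2>/dev/null || rpm -qa '*nvidia*'
  echo '== nvidia-container-cli version =='; nvidia-container-cli -V
} > "$out" 2>&1
```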
@elezar
Member

elezar commented Nov 27, 2023

This is a duplicate of a known issue, NVIDIA/nvidia-container-toolkit#48, which occurs with certain combinations of runc and systemd versions.

Please follow the steps described there and open a new issue against https://github.com/NVIDIA/nvidia-container-toolkit if you are still having problems.
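For completeness, a quick sketch for checking the component versions the comment points at; treating `systemctl daemon-reload` as the trigger is based on the linked upstream issue and is an assumption here, not something stated in this report:

```sh
# Versions of the two components involved in the known issue.
runc --version
systemctl --version

# The linked issue describes host-side systemd unit reloads as a trigger
# that makes GPU device access disappear from running containers.
sudo systemctl daemon-reload

# Re-check inside the pod (name from the reproduction sketch above); on
# affected runc/systemd combinations this now fails with
# "Failed to initialize NVML: Unknown Error".
kubectl exec -it gpu-smi-test -- nvidia-smi
```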

elezar closed this as completed on Nov 27, 2023