Closed as duplicate of #1766
I use the NVIDIA k8s-device-plugin to assign GPUs on Kubernetes, and I hit an issue that depends on the crun version: with crun v1.18 or later, the plugin fails to initialize NVML, and when this error occurs it cannot assign any GPUs to Pods.
(I don't use the NVIDIA GPU Operator; I installed the raw NVIDIA k8s-device-plugin as a DaemonSet.)
Could you investigate this issue? If you need any further logs, please let me know.
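For reference, the symptom can presumably also be reproduced outside the device plugin with a pod that relies on the toolkit's NVIDIA_VISIBLE_DEVICES injection (the same path the plugin DaemonSet uses). This is only a sketch; the CUDA image tag is an assumption, and any base image compatible with driver 575 should do.
$ kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: nvml-check
spec:
  restartPolicy: Never
  containers:
  - name: nvml-check
    image: nvcr.io/nvidia/cuda:12.6.2-base-ubuntu22.04   # image tag is an assumption
    command: ["nvidia-smi"]
    env:
    - name: NVIDIA_VISIBLE_DEVICES
      value: all
    - name: NVIDIA_DRIVER_CAPABILITIES
      value: utility
EOF
$ kubectl logs nvml-check
With crun v1.17 I would expect the usual nvidia-smi table; with v1.18+ I would expect it to fail the same way the plugin does.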
Environment
- OS: Ubuntu 22.04
- Kernel: 5.15.0-142-generic
- Kubernetes: v1.30.14
- CRI-O: v1.30.14
- crun: static binaries built with nix (see the build sketch after this list)
  - 1.16.1 through 1.18.2, and 1.21
- NVIDIA k8s-device-plugin: v0.16.1
- NVIDIA Driver: 575.57.08
- NVIDIA Container Toolkit: 1.17.8-1
- NVIDIA CUDA: 12.6
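For reference, the static crun binaries were built and installed roughly like this. This is a sketch: the nixpkgs pkgsStatic.crun attribute is an assumption, while the /usr/libexec/crio/crun-<version> install paths match the ones referenced in the configs below.
$ nix build nixpkgs#pkgsStatic.crun   # assumption: flakes enabled and a static crun attribute available
$ sudo install -m 0755 result/bin/crun /usr/libexec/crio/crun-1.18.2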
I set nvidia as the default container runtime in the CRI-O configuration.
$ cat /etc/crio/crio.conf.d/99-nvidia.conf
[crio]
[crio.runtime]
default_runtime = "nvidia"
[crio.runtime.runtimes]
[crio.runtime.runtimes.nvidia]
monitor_path = "/usr/libexec/crio/conmon"
runtime_path = "/usr/bin/nvidia-container-runtime"
runtime_type = "oci"
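To apply and double-check the drop-in, something like the following works (a sketch; it assumes crio config reflects the files under /etc/crio/crio.conf.d):
$ sudo systemctl restart crio
$ sudo crio config | grep default_runtime   # should print: default_runtime = "nvidia"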
And I set crun as the low-level container runtime for nvidia-container-runtime to use.
$ cat /etc/nvidia-container-runtime/config.toml
#accept-nvidia-visible-devices-as-volume-mounts = false
#accept-nvidia-visible-devices-envvar-when-unprivileged = true
disable-require = false
supported-driver-capabilities = "compat32,compute,display,graphics,ngx,utility,video"
#swarm-resource = "DOCKER_RESOURCE_GPU"
[nvidia-container-cli]
debug = "/var/log/nvidia-container-toolkit.log"
environment = []
#ldcache = "/etc/ld.so.cache"
ldconfig = "@/sbin/ldconfig.real"
load-kmods = true
#no-cgroups = false
#path = "/usr/bin/nvidia-container-cli"
#root = "/run/nvidia/driver"
#user = "root:video"
[nvidia-container-runtime]
debug = "/var/log/nvidia-container-runtime.log"
#log-level = "info"
log-level = "debug"
mode = "auto"
#runtimes = ["/usr/libexec/crio/crun", "docker-runc", "runc"]
#runtimes = ["/usr/libexec/crio/runc", "docker-runc", "runc"]
#runtimes = ["/usr/libexec/crio/crun-1.16.1", "docker-runc", "runc"]
runtimes = ["/usr/libexec/crio/crun-1.17", "docker-runc", "runc"]
#runtimes = ["/usr/libexec/crio/crun-1.18", "docker-runc", "runc"]
#runtimes = ["/usr/libexec/crio/crun-1.18.1", "docker-runc", "runc"]
#runtimes = ["/usr/libexec/crio/crun-1.18.2", "docker-runc", "runc"]
#runtimes = ["/usr/libexec/crio/crun-1.21", "docker-runc", "runc"]
[nvidia-container-runtime.modes]
[nvidia-container-runtime.modes.cdi]
annotation-prefixes = ["cdi.k8s.io/"]
default-kind = "nvidia.com/gpu"
spec-dirs = ["/etc/cdi", "/var/run/cdi"]
[nvidia-container-runtime.modes.csv]
mount-spec-path = "/etc/nvidia-container-runtime/host-files-for-container.d"
[nvidia-container-runtime.modes.legacy]
cuda-compat-mode = "ldconfig"
[nvidia-container-runtime-hook]
path = "nvidia-container-runtime-hook"
skip-mode-detection = false
[nvidia-ctk]
path = "nvidia-ctk"
Issue details
I tried to pin down the crun version at which the behavior changes:
- v1.16.1: No error
- v1.17: No error
- v1.18: Error occurred
- v1.18.1: Error occurred
- v1.18.2: Error occurred
- v1.21: Error occurred
So I think the change was introduced between v1.17 and v1.18.
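Each data point was collected by editing the runtimes entry in config.toml (shown above) and re-rolling the plugin, roughly like this. A sketch: the DaemonSet name is inferred from the pod names in the logs below.
$ sudo sed -i 's|crun-1.17|crun-1.18|' /etc/nvidia-container-runtime/config.toml   # or edit the runtimes line by hand
$ sudo systemctl restart crio
$ kubectl -n kube-system rollout restart daemonset nvidia-device-plugin-daemonset   # DaemonSet name is an assumption
$ kubectl -n kube-system rollout status daemonset nvidia-device-plugin-daemonset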
The error shows up in the logs of the NVIDIA k8s-device-plugin Pod.
- Failing case
$ k logs -n kube-system nvidia-device-plugin-daemonset-vpxv9
I0627 10:12:47.514381 1 main.go:199] Starting FS watcher.
I0627 10:12:47.514475 1 main.go:206] Starting OS watcher.
I0627 10:12:47.514660 1 main.go:221] Starting Plugins.
I0627 10:12:47.514687 1 main.go:278] Loading configuration.
I0627 10:12:47.515500 1 main.go:303] Updating config with default resource matching patterns.
I0627 10:12:47.515761 1 main.go:314]
Running with config:
{
"version": "v1",
"flags": {
"migStrategy": "none",
"failOnInitError": false,
"mpsRoot": "",
"nvidiaDriverRoot": "/",
"nvidiaDevRoot": "/",
"gdsEnabled": false,
"mofedEnabled": false,
"useNodeFeatureAPI": null,
"deviceDiscoveryStrategy": "auto",
"plugin": {
"passDeviceSpecs": false,
"deviceListStrategy": [
"envvar"
],
"deviceIDStrategy": "uuid",
"cdiAnnotationPrefix": "cdi.k8s.io/",
"nvidiaCTKPath": "/usr/bin/nvidia-ctk",
"containerDriverRoot": "/driver-root"
}
},
"resources": {
"gpus": [
{
"pattern": "*",
"name": "nvidia.com/gpu"
}
]
},
"sharing": {
"timeSlicing": {}
}
}
I0627 10:12:47.515775 1 main.go:317] Retrieving plugins.
E0627 10:12:47.588061 1 factory.go:68] Failed to initialize NVML: Unknown Error.
E0627 10:12:47.588076 1 factory.go:69] If this is a GPU node, did you set the docker default runtime to `nvidia`?
E0627 10:12:47.588081 1 factory.go:70] You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
E0627 10:12:47.588085 1 factory.go:71] You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
E0627 10:12:47.588090 1 factory.go:72] If this is not a GPU node, you should set up a toleration or nodeSelector to only deploy this plugin on GPU nodes
W0627 10:12:47.588095 1 factory.go:76] nvml init failed: Unknown Error
I0627 10:12:47.588104 1 main.go:346] No devices found. Waiting indefinitely.
- Working case
$ k logs -n kube-system nvidia-device-plugin-daemonset-9b99j
I0627 10:15:39.141869 1 main.go:199] Starting FS watcher.
I0627 10:15:39.141951 1 main.go:206] Starting OS watcher.
I0627 10:15:39.142311 1 main.go:221] Starting Plugins.
I0627 10:15:39.142327 1 main.go:278] Loading configuration.
I0627 10:15:39.143124 1 main.go:303] Updating config with default resource matching patterns.
I0627 10:15:39.143382 1 main.go:314]
Running with config:
{
"version": "v1",
"flags": {
"migStrategy": "none",
"failOnInitError": false,
"mpsRoot": "",
"nvidiaDriverRoot": "/",
"nvidiaDevRoot": "/",
"gdsEnabled": false,
"mofedEnabled": false,
"useNodeFeatureAPI": null,
"deviceDiscoveryStrategy": "auto",
"plugin": {
"passDeviceSpecs": false,
"deviceListStrategy": [
"envvar"
],
"deviceIDStrategy": "uuid",
"cdiAnnotationPrefix": "cdi.k8s.io/",
"nvidiaCTKPath": "/usr/bin/nvidia-ctk",
"containerDriverRoot": "/driver-root"
}
},
"resources": {
"gpus": [
{
"pattern": "*",
"name": "nvidia.com/gpu"
}
]
},
"sharing": {
"timeSlicing": {}
}
}
I0627 10:15:39.143395 1 main.go:317] Retrieving plugins.
I0627 10:15:39.225905 1 server.go:216] Starting GRPC server for 'nvidia.com/gpu'
I0627 10:15:39.226655 1 server.go:147] Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
I0627 10:15:39.229163 1 server.go:154] Registered device plugin for 'nvidia.com/gpu' with Kubelet
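For comparison, one more datapoint I can collect if it helps: whether the GPU device nodes are visible inside the plugin container in each run. A sketch; it assumes the plugin image ships a shell, and the pod names are the ones from the logs above (each from its own run).
$ kubectl -n kube-system exec nvidia-device-plugin-daemonset-vpxv9 -- sh -c 'ls -l /dev/nvidia*'   # failing run
$ kubectl -n kube-system exec nvidia-device-plugin-daemonset-9b99j -- sh -c 'ls -l /dev/nvidia*'   # working run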