
Failed to init NVML in NVIDIA k8s-device-plugin after updating crun from v1.17 to v1.18+ #1802

@kanlkan

Description


I use the NVIDIA k8s-device-plugin to assign GPUs on Kubernetes, and I found an issue that depends on the crun version: with crun v1.18 or later the plugin fails to initialize NVML, and when this error occurs it cannot assign GPUs to Pods.
(I don't use the NVIDIA GPU Operator; I installed the raw NVIDIA k8s-device-plugin as a DaemonSet, see the sketch below.)

Could you investigate this issue? If you need any further logs, please let me know.
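
For reference, I deploy the plugin from the static manifest in the k8s-device-plugin repo, roughly like this (the exact manifest path for v0.16.1 is assumed to follow the repo's usual layout, so please adjust if it moved):

$ kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.16.1/deployments/static/nvidia-device-plugin.yml
$ kubectl -n kube-system get daemonset nvidia-device-plugin-daemonset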

Environment

  • OS: Ubuntu 22.04
  • Kernel: 5.15.0-142-generic
  • Kubernetes: v1.30.14
  • CRI-O: v1.30.14
  • crun: static binaries built with nix (build sketch below)
    • versions tested: 1.16.1, 1.17, 1.18, 1.18.1, 1.18.2, 1.21
  • NVIDIA k8s-device-plugin: v0.16.1
  • NVIDIA Driver: 575.57.08
  • NVIDIA Container Toolkit: 1.17.8-1
  • NVIDIA CUDA: 12.6

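The crun entries above are static builds I made myself. Roughly how I build them, assuming the nix expression shipped in the upstream crun repo (the exact invocation and output path may differ between crun versions, so treat this as a sketch):

$ git clone https://github.com/containers/crun.git && cd crun
$ git checkout 1.17            # or 1.18, 1.18.1, 1.18.2, 1.21, ...
$ nix build -f nix/            # static build via the repo's nix/ expression (assumed layout)
$ sudo install -m 0755 result/bin/crun /usr/libexec/crio/crun-1.17
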
I set nvidia as the default container runtime in the CRI-O configuration.

$ cat /etc/crio/crio.conf.d/99-nvidia.conf

[crio]

  [crio.runtime]
    default_runtime = "nvidia"

    [crio.runtime.runtimes]

      [crio.runtime.runtimes.nvidia]
        monitor_path = "/usr/libexec/crio/conmon"
        runtime_path = "/usr/bin/nvidia-container-runtime"
        runtime_type = "oci"
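
After changing this file I restart CRI-O and check that the nvidia handler was picked up. My CRI-O build exposes the dump as `crio status config`; older builds ship it as a separate crio-status binary, so adjust as needed:

$ sudo systemctl restart crio
$ sudo crio status config | grep -A 3 'crio.runtime.runtimes.nvidia'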

And I set crun as the low-level container runtime used by nvidia-container-runtime.

$ cat /etc/nvidia-container-runtime/config.toml
#accept-nvidia-visible-devices-as-volume-mounts = false
#accept-nvidia-visible-devices-envvar-when-unprivileged = true
disable-require = false
supported-driver-capabilities = "compat32,compute,display,graphics,ngx,utility,video"
#swarm-resource = "DOCKER_RESOURCE_GPU"

[nvidia-container-cli]
debug = "/var/log/nvidia-container-toolkit.log"
environment = []
#ldcache = "/etc/ld.so.cache"
ldconfig = "@/sbin/ldconfig.real"
load-kmods = true
#no-cgroups = false
#path = "/usr/bin/nvidia-container-cli"
#root = "/run/nvidia/driver"
#user = "root:video"

[nvidia-container-runtime]
debug = "/var/log/nvidia-container-runtime.log"
#log-level = "info"
log-level = "debug"
mode = "auto"
#runtimes = ["/usr/libexec/crio/crun", "docker-runc", "runc"]
#runtimes = ["/usr/libexec/crio/runc", "docker-runc", "runc"]
#runtimes = ["/usr/libexec/crio/crun-1.16.1", "docker-runc", "runc"]
runtimes = ["/usr/libexec/crio/crun-1.17", "docker-runc", "runc"]
#runtimes = ["/usr/libexec/crio/crun-1.18", "docker-runc", "runc"]
#runtimes = ["/usr/libexec/crio/crun-1.18.1", "docker-runc", "runc"]
#runtimes = ["/usr/libexec/crio/crun-1.18.2", "docker-runc", "runc"]
#runtimes = ["/usr/libexec/crio/crun-1.21", "docker-runc", "runc"]

[nvidia-container-runtime.modes]

[nvidia-container-runtime.modes.cdi]
annotation-prefixes = ["cdi.k8s.io/"]
default-kind = "nvidia.com/gpu"
spec-dirs = ["/etc/cdi", "/var/run/cdi"]

[nvidia-container-runtime.modes.csv]
mount-spec-path = "/etc/nvidia-container-runtime/host-files-for-container.d"

[nvidia-container-runtime.modes.legacy]
cuda-compat-mode = "ldconfig"

[nvidia-container-runtime-hook]
path = "nvidia-container-runtime-hook"
skip-mode-detection = false

[nvidia-ctk]
path = "nvidia-ctk"
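
To double-check which low-level runtime the wrapper will hand off to, I verify the binary from the runtimes= line directly:

$ grep '^runtimes' /etc/nvidia-container-runtime/config.toml
runtimes = ["/usr/libexec/crio/crun-1.17", "docker-runc", "runc"]
$ /usr/libexec/crio/crun-1.17 --version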

Issue details

I tried to pin down the crun version at which this issue first appears.

  • v1.16.1: No error
  • v1.17: No error
  • v1.18: Error occurred
  • v1.18.1: Error occurred
  • v1.18.2: Error occurred
  • v1.21: Error occurred

So I think the behavior changed between v1.17 and v1.18.
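
For reference, this is roughly how I switch between crun builds for each test run (the DaemonSet name is taken from my cluster, and the sed pattern assumes the runtimes= line shown in the config above):

$ ver=1.18
$ sudo sed -i "s|^runtimes = .*|runtimes = [\"/usr/libexec/crio/crun-${ver}\", \"docker-runc\", \"runc\"]|" /etc/nvidia-container-runtime/config.toml
$ kubectl -n kube-system rollout restart daemonset nvidia-device-plugin-daemonset
$ kubectl -n kube-system rollout status daemonset nvidia-device-plugin-daemonset
$ kubectl -n kube-system logs daemonset/nvidia-device-plugin-daemonset --tail=50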

The error appears in the logs of the NVIDIA k8s-device-plugin Pod.

  • Error case (crun v1.18+)
$ k logs -n kube-system nvidia-device-plugin-daemonset-vpxv9
I0627 10:12:47.514381       1 main.go:199] Starting FS watcher.
I0627 10:12:47.514475       1 main.go:206] Starting OS watcher.
I0627 10:12:47.514660       1 main.go:221] Starting Plugins.
I0627 10:12:47.514687       1 main.go:278] Loading configuration.
I0627 10:12:47.515500       1 main.go:303] Updating config with default resource matching patterns.
I0627 10:12:47.515761       1 main.go:314]
Running with config:
{
  "version": "v1",
  "flags": {
    "migStrategy": "none",
    "failOnInitError": false,
    "mpsRoot": "",
    "nvidiaDriverRoot": "/",
    "nvidiaDevRoot": "/",
    "gdsEnabled": false,
    "mofedEnabled": false,
    "useNodeFeatureAPI": null,
    "deviceDiscoveryStrategy": "auto",
    "plugin": {
      "passDeviceSpecs": false,
      "deviceListStrategy": [
        "envvar"
      ],
      "deviceIDStrategy": "uuid",
      "cdiAnnotationPrefix": "cdi.k8s.io/",
      "nvidiaCTKPath": "/usr/bin/nvidia-ctk",
      "containerDriverRoot": "/driver-root"
    }
  },
  "resources": {
    "gpus": [
      {
        "pattern": "*",
        "name": "nvidia.com/gpu"
      }
    ]
  },
  "sharing": {
    "timeSlicing": {}
  }
}
I0627 10:12:47.515775       1 main.go:317] Retrieving plugins.
E0627 10:12:47.588061       1 factory.go:68] Failed to initialize NVML: Unknown Error.
E0627 10:12:47.588076       1 factory.go:69] If this is a GPU node, did you set the docker default runtime to `nvidia`?
E0627 10:12:47.588081       1 factory.go:70] You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
E0627 10:12:47.588085       1 factory.go:71] You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
E0627 10:12:47.588090       1 factory.go:72] If this is not a GPU node, you should set up a toleration or nodeSelector to only deploy this plugin on GPU nodes
W0627 10:12:47.588095       1 factory.go:76] nvml init failed: Unknown Error
I0627 10:12:47.588104       1 main.go:346] No devices found. Waiting indefinitely.
  • No-error case (crun v1.17)
$ k logs -n kube-system nvidia-device-plugin-daemonset-9b99j
I0627 10:15:39.141869       1 main.go:199] Starting FS watcher.
I0627 10:15:39.141951       1 main.go:206] Starting OS watcher.
I0627 10:15:39.142311       1 main.go:221] Starting Plugins.
I0627 10:15:39.142327       1 main.go:278] Loading configuration.
I0627 10:15:39.143124       1 main.go:303] Updating config with default resource matching patterns.
I0627 10:15:39.143382       1 main.go:314]
Running with config:
{
  "version": "v1",
  "flags": {
    "migStrategy": "none",
    "failOnInitError": false,
    "mpsRoot": "",
    "nvidiaDriverRoot": "/",
    "nvidiaDevRoot": "/",
    "gdsEnabled": false,
    "mofedEnabled": false,
    "useNodeFeatureAPI": null,
    "deviceDiscoveryStrategy": "auto",
    "plugin": {
      "passDeviceSpecs": false,
      "deviceListStrategy": [
        "envvar"
      ],
      "deviceIDStrategy": "uuid",
      "cdiAnnotationPrefix": "cdi.k8s.io/",
      "nvidiaCTKPath": "/usr/bin/nvidia-ctk",
      "containerDriverRoot": "/driver-root"
    }
  },
  "resources": {
    "gpus": [
      {
        "pattern": "*",
        "name": "nvidia.com/gpu"
      }
    ]
  },
  "sharing": {
    "timeSlicing": {}
  }
}
I0627 10:15:39.143395       1 main.go:317] Retrieving plugins.
I0627 10:15:39.225905       1 server.go:216] Starting GRPC server for 'nvidia.com/gpu'
I0627 10:15:39.226655       1 server.go:147] Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
I0627 10:15:39.229163       1 server.go:154] Registered device plugin for 'nvidia.com/gpu' with Kubelet
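
When crun v1.17 is in place (no-error case) I can also confirm end-to-end GPU assignment with a throwaway Pod like this (the CUDA image tag is just an example, any image matching the driver should work):

$ kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvcr.io/nvidia/cuda:12.6.0-base-ubuntu22.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF
$ kubectl logs pod/gpu-smoke-test    # should print the nvidia-smi table once the Pod has run

With crun v1.18+ the plugin advertises no devices (see the error log above), so the same Pod just stays Pending.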
