
Failed to init NVML in NVIDIA k8s-device-plugin after updating crun from v1.17 to v1.18+ #1802

@kanlkan

Description


I use the NVIDIA k8s-device-plugin to assign GPUs on Kubernetes, and I found an issue that depends on the crun version: with crun v1.18 or later the plugin fails to initialize NVML, and when this error occurs it cannot assign GPUs to Pods.
(I don't use the NVIDIA GPU Operator; I installed the raw NVIDIA k8s-device-plugin as a DaemonSet, see the sketch below.)

Could you investigate this issue? If you need any further logs, please let me know.
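
For reference, I deploy the plugin from the static manifest in the k8s-device-plugin repo, roughly like this (the exact manifest path for v0.16.1 is assumed to follow the repo's usual layout, so please adjust if it moved):

$ kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.16.1/deployments/static/nvidia-device-plugin.yml
$ kubectl -n kube-system get daemonset nvidia-device-plugin-daemonset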

Environment

  • OS: Ubuntu 22.04
  • Kernel: 5.15.0-142-generic
  • Kubernetes: v1.30.14
  • CRI-O: v1.30.14
  • crun: static binaries built with nix (build sketch below)
    • versions tested: 1.16.1, 1.17, 1.18, 1.18.1, 1.18.2, 1.21
  • NVIDIA k8s-device-plugin: v0.16.1
  • NVIDIA Driver: 575.57.08
  • NVIDIA Container Toolkit: 1.17.8-1
  • NVIDIA CUDA: 12.6

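The crun entries above are static builds I made myself. Roughly how I build them, assuming the nix expression shipped in the upstream crun repo (the exact invocation and output path may differ between crun versions, so treat this as a sketch):

$ git clone https://github.com/containers/crun.git && cd crun
$ git checkout 1.17            # or 1.18, 1.18.1, 1.18.2, 1.21, ...
$ nix build -f nix/            # static build via the repo's nix/ expression (assumed layout)
$ sudo install -m 0755 result/bin/crun /usr/libexec/crio/crun-1.17
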
I set nvidia as the default container runtime in the CRI-O configuration.

$ cat /etc/crio/crio.conf.d/99-nvidia.conf

[crio]

  [crio.runtime]
    default_runtime = "nvidia"

    [crio.runtime.runtimes]

      [crio.runtime.runtimes.nvidia]
        monitor_path = "/usr/libexec/crio/conmon"
        runtime_path = "/usr/bin/nvidia-container-runtime"
        runtime_type = "oci"
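
After changing this file I restart CRI-O and check that the nvidia handler was picked up. My CRI-O build exposes the dump as `crio status config`; older builds ship it as a separate crio-status binary, so adjust as needed:

$ sudo systemctl restart crio
$ sudo crio status config | grep -A 3 'crio.runtime.runtimes.nvidia'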

And I set crun as the low-level container runtime used by nvidia-container-runtime.

$ cat /etc/nvidia-container-runtime/config.toml
#accept-nvidia-visible-devices-as-volume-mounts = false
#accept-nvidia-visible-devices-envvar-when-unprivileged = true
disable-require = false
supported-driver-capabilities = "compat32,compute,display,graphics,ngx,utility,video"
#swarm-resource = "DOCKER_RESOURCE_GPU"

[nvidia-container-cli]
debug = "/var/log/nvidia-container-toolkit.log"
environment = []
#ldcache = "/etc/ld.so.cache"
ldconfig = "@/sbin/ldconfig.real"
load-kmods = true
#no-cgroups = false
#path = "/usr/bin/nvidia-container-cli"
#root = "/run/nvidia/driver"
#user = "root:video"

[nvidia-container-runtime]
debug = "/var/log/nvidia-container-runtime.log"
#log-level = "info"
log-level = "debug"
mode = "auto"
#runtimes = ["/usr/libexec/crio/crun", "docker-runc", "runc"]
#runtimes = ["/usr/libexec/crio/runc", "docker-runc", "runc"]
#runtimes = ["/usr/libexec/crio/crun-1.16.1", "docker-runc", "runc"]
runtimes = ["/usr/libexec/crio/crun-1.17", "docker-runc", "runc"]
#runtimes = ["/usr/libexec/crio/crun-1.18", "docker-runc", "runc"]
#runtimes = ["/usr/libexec/crio/crun-1.18.1", "docker-runc", "runc"]
#runtimes = ["/usr/libexec/crio/crun-1.18.2", "docker-runc", "runc"]
#runtimes = ["/usr/libexec/crio/crun-1.21", "docker-runc", "runc"]

[nvidia-container-runtime.modes]

[nvidia-container-runtime.modes.cdi]
annotation-prefixes = ["cdi.k8s.io/"]
default-kind = "nvidia.com/gpu"
spec-dirs = ["/etc/cdi", "/var/run/cdi"]

[nvidia-container-runtime.modes.csv]
mount-spec-path = "/etc/nvidia-container-runtime/host-files-for-container.d"

[nvidia-container-runtime.modes.legacy]
cuda-compat-mode = "ldconfig"

[nvidia-container-runtime-hook]
path = "nvidia-container-runtime-hook"
skip-mode-detection = false

[nvidia-ctk]
path = "nvidia-ctk"
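
To double-check which low-level runtime the wrapper will hand off to, I verify the binary from the runtimes= line directly:

$ grep '^runtimes' /etc/nvidia-container-runtime/config.toml
runtimes = ["/usr/libexec/crio/crun-1.17", "docker-runc", "runc"]
$ /usr/libexec/crio/crun-1.17 --version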

Issue details

I tried to pin down the crun version at which this issue first appears.

  • v1.16.1: No error
  • v1.17: No error
  • v1.18: Error occurred
  • v1.18.1: Error occurred
  • v1.18.2: Error occurred
  • v1.21: Error occurred

So I think the behavior changed between v1.17 and v1.18.
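
For reference, this is roughly how I switch between crun builds for each test run (the DaemonSet name is taken from my cluster, and the sed pattern assumes the runtimes= line shown in the config above):

$ ver=1.18
$ sudo sed -i "s|^runtimes = .*|runtimes = [\"/usr/libexec/crio/crun-${ver}\", \"docker-runc\", \"runc\"]|" /etc/nvidia-container-runtime/config.toml
$ kubectl -n kube-system rollout restart daemonset nvidia-device-plugin-daemonset
$ kubectl -n kube-system rollout status daemonset nvidia-device-plugin-daemonset
$ kubectl -n kube-system logs daemonset/nvidia-device-plugin-daemonset --tail=50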

The error appears in the logs of the NVIDIA k8s-device-plugin Pod.

  • Error case (crun v1.18+)
$ k logs -n kube-system nvidia-device-plugin-daemonset-vpxv9
I0627 10:12:47.514381       1 main.go:199] Starting FS watcher.
I0627 10:12:47.514475       1 main.go:206] Starting OS watcher.
I0627 10:12:47.514660       1 main.go:221] Starting Plugins.
I0627 10:12:47.514687       1 main.go:278] Loading configuration.
I0627 10:12:47.515500       1 main.go:303] Updating config with default resource matching patterns.
I0627 10:12:47.515761       1 main.go:314]
Running with config:
{
  "version": "v1",
  "flags": {
    "migStrategy": "none",
    "failOnInitError": false,
    "mpsRoot": "",
    "nvidiaDriverRoot": "/",
    "nvidiaDevRoot": "/",
    "gdsEnabled": false,
    "mofedEnabled": false,
    "useNodeFeatureAPI": null,
    "deviceDiscoveryStrategy": "auto",
    "plugin": {
      "passDeviceSpecs": false,
      "deviceListStrategy": [
        "envvar"
      ],
      "deviceIDStrategy": "uuid",
      "cdiAnnotationPrefix": "cdi.k8s.io/",
      "nvidiaCTKPath": "/usr/bin/nvidia-ctk",
      "containerDriverRoot": "/driver-root"
    }
  },
  "resources": {
    "gpus": [
      {
        "pattern": "*",
        "name": "nvidia.com/gpu"
      }
    ]
  },
  "sharing": {
    "timeSlicing": {}
  }
}
I0627 10:12:47.515775       1 main.go:317] Retrieving plugins.
E0627 10:12:47.588061       1 factory.go:68] Failed to initialize NVML: Unknown Error.
E0627 10:12:47.588076       1 factory.go:69] If this is a GPU node, did you set the docker default runtime to `nvidia`?
E0627 10:12:47.588081       1 factory.go:70] You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
E0627 10:12:47.588085       1 factory.go:71] You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
E0627 10:12:47.588090       1 factory.go:72] If this is not a GPU node, you should set up a toleration or nodeSelector to only deploy this plugin on GPU nodes
W0627 10:12:47.588095       1 factory.go:76] nvml init failed: Unknown Error
I0627 10:12:47.588104       1 main.go:346] No devices found. Waiting indefinitely.
  • No-error case (crun v1.17)
$ k logs -n kube-system nvidia-device-plugin-daemonset-9b99j
I0627 10:15:39.141869       1 main.go:199] Starting FS watcher.
I0627 10:15:39.141951       1 main.go:206] Starting OS watcher.
I0627 10:15:39.142311       1 main.go:221] Starting Plugins.
I0627 10:15:39.142327       1 main.go:278] Loading configuration.
I0627 10:15:39.143124       1 main.go:303] Updating config with default resource matching patterns.
I0627 10:15:39.143382       1 main.go:314]
Running with config:
{
  "version": "v1",
  "flags": {
    "migStrategy": "none",
    "failOnInitError": false,
    "mpsRoot": "",
    "nvidiaDriverRoot": "/",
    "nvidiaDevRoot": "/",
    "gdsEnabled": false,
    "mofedEnabled": false,
    "useNodeFeatureAPI": null,
    "deviceDiscoveryStrategy": "auto",
    "plugin": {
      "passDeviceSpecs": false,
      "deviceListStrategy": [
        "envvar"
      ],
      "deviceIDStrategy": "uuid",
      "cdiAnnotationPrefix": "cdi.k8s.io/",
      "nvidiaCTKPath": "/usr/bin/nvidia-ctk",
      "containerDriverRoot": "/driver-root"
    }
  },
  "resources": {
    "gpus": [
      {
        "pattern": "*",
        "name": "nvidia.com/gpu"
      }
    ]
  },
  "sharing": {
    "timeSlicing": {}
  }
}
I0627 10:15:39.143395       1 main.go:317] Retrieving plugins.
I0627 10:15:39.225905       1 server.go:216] Starting GRPC server for 'nvidia.com/gpu'
I0627 10:15:39.226655       1 server.go:147] Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
I0627 10:15:39.229163       1 server.go:154] Registered device plugin for 'nvidia.com/gpu' with Kubelet
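
When crun v1.17 is in place (no-error case) I can also confirm end-to-end GPU assignment with a throwaway Pod like this (the CUDA image tag is just an example, any image matching the driver should work):

$ kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvcr.io/nvidia/cuda:12.6.0-base-ubuntu22.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF
$ kubectl logs pod/gpu-smoke-test    # should print the nvidia-smi table once the Pod has run

With crun v1.18+ the plugin advertises no devices (see the error log above), so the same Pod just stays Pending.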
