
EGL Initialization Failure with K8s Device Plugin #1140

Open
ryan-brigden-ai opened this issue Jan 28, 2025 · 2 comments


@ryan-brigden-ai

Overview

We're running an application that uses Nvidia graphics capabilities and are trying to get it running in K8s. With version 0.17 of the device plugin, our application cannot initialize EGL, which it relies on to access the Nvidia device.

Test cases:

  • In K8s pod with GPU (nvidia-smi works, but eglinfo fails).
  • In container created on same host with docker (nvidia-smi works and eglinfo works).

Expected behavior:

EGL initializes and can use the Nvidia GPU in the K8s pod; eglinfo should return information about the Nvidia device.
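For comparison, the working Docker case on the same host can be sketched roughly as follows (a sketch, not our exact invocation; the mesa-utils-extra package name for eglinfo is an assumption about the Ubuntu-based image):

```shell
# Sketch: same image, same host, but run via Docker with the
# nvidia-container-toolkit runtime. Here eglinfo succeeds.
docker run --rm --gpus all \
  -e NVIDIA_DRIVER_CAPABILITIES=all \
  nvidia/opengl:1.2-glvnd-runtime \
  bash -c "apt-get update && apt-get install -y mesa-utils-extra && eglinfo"
```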

Reproduction

Prerequisites

  • K8s cluster with Nvidia GPUs and latest Nvidia device plugin (0.17)

Steps

  1. Create a pod
apiVersion: v1
kind: Pod
metadata:
  name: nvidia-test-pod
  namespace: default
spec:
  tolerations:
    - key: "nvidia.com/gpu"
      operator: "Exists"
      effect: "NoSchedule"
  containers:
    - name: nvidia-test-container
      image: nvidia/opengl:1.2-glvnd-runtime
      resources:
        limits:
          nvidia.com/gpu: 1
      env:
        - name: NVIDIA_DRIVER_CAPABILITIES
          value: "all"
        - name: NVIDIA_VISIBLE_DEVICES
          value: "all"
      command:
        - /bin/bash
        - "-c"
        - "sleep infinity"
  2. Get a shell in the pod and install eglinfo.
  3. Run eglinfo and note the output:
Device platform:
eglinfo: eglInitialize failed

We would expect the output to be:

Device platform:
EGL API version: 1.5
EGL vendor string: NVIDIA
EGL version string: 1.5
EGL client APIs: OpenGL_ES OpenGL
EGL extensions string:
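Steps 2 and 3 above can be sketched as follows (assuming the pod spec above and the Ubuntu-based nvidia/opengl image; the package providing eglinfo, mesa-utils-extra, is an assumption):

```shell
# Get a shell in the test pod.
kubectl exec -it nvidia-test-pod -- /bin/bash

# Inside the pod: install and run eglinfo.
apt-get update && apt-get install -y mesa-utils-extra
eglinfo
```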

My notes

I think it is likely that this code path, which is exercised by the nvidia-container-toolkit in the Docker case, is not being exercised by the plugin: https://github.com/NVIDIA/nvidia-container-toolkit/blob/main/internal/discover/graphics.go#L52
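A quick way to check this from inside the pod: glvnd discovers vendor EGL drivers through ICD JSON files, so if the Nvidia ICD or driver library is not mounted, eglInitialize fails. This is a sketch; the paths assume an x86_64 Ubuntu-based image:

```shell
# Check whether the Nvidia EGL ICD and driver library were mounted
# into the container (paths are assumptions for x86_64 Ubuntu).
for f in /usr/share/glvnd/egl_vendor.d/10_nvidia.json \
         /usr/lib/x86_64-linux-gnu/libEGL_nvidia.so.0; do
  if [ -e "$f" ]; then
    echo "present: $f"
  else
    echo "MISSING: $f"
  fi
done
```

In the Docker container both files should show as present; in the K8s pod I would expect at least one to be missing if the graphics code path is skipped.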


elezar commented Jan 28, 2025

Could you provide information on how the device plugin is configured?

What is the container runtime used on your K8s cluster?


ryan-brigden-ai commented Jan 28, 2025

Could you provide information on how the device plugin is configured?

kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.17.0/deployments/static/nvidia-device-plugin.yml

What is the container runtime used on your K8s cluster?

We have seen the issue with both containerd and cri-o. We are primarily interested in cri-o.
