nvidia-driver-installer fails to install drivers for G2 instance type with L4 #302

Open
christidis opened this issue Jul 11, 2023 · 3 comments

christidis commented Jul 11, 2023

Description

I am trying to use a G2 machine type with an L4 GPU in GKE.

GKE control plane version: v1.24.12-gke.500
Node pool version: v1.24.9-gke.3200

According to the documentation, these are the requirements for using L4 GPUs in GKE:

Requirements

L4 GPUs:

  • You must use GKE version 1.22.17-gke.5400 or later.
  • You must ensure that you have enough quota for the underlying G2 Compute Engine machine type to use L4 GPUs.
  • The GKE version that you choose must include NVIDIA driver version 525 or later in Container-Optimized OS. If driver version 525 or later isn't the default or the latest version in your GKE version, you must manually install a supported driver on your nodes.
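A quick way to check the first requirement is to compare version strings numerically. The sketch below is only an illustration: the helper names are hypothetical, and it assumes GKE versions always look like `1.24.9-gke.3200` and are ordered field by field, which the quoted documentation does not spell out.

```python
import re

# Minimum GKE version for L4 GPUs, quoted from the requirements above.
MIN_L4_GKE_VERSION = "1.22.17-gke.5400"


def parse_gke_version(version):
    """Turn 'v1.24.9-gke.3200' into the tuple (1, 24, 9, 3200)."""
    match = re.fullmatch(r"v?(\d+)\.(\d+)\.(\d+)-gke\.(\d+)", version.strip())
    if match is None:
        raise ValueError(f"unrecognized GKE version: {version!r}")
    return tuple(int(part) for part in match.groups())


def meets_l4_minimum(version):
    """Hypothetical helper: True if `version` is at or above the L4 minimum.

    Assumption: versions are ordered by numeric comparison of each field.
    """
    return parse_gke_version(version) >= parse_gke_version(MIN_L4_GKE_VERSION)


# The two versions reported in this issue.
CONTROL_PLANE_VERSION = "v1.24.12-gke.500"
NODE_POOL_VERSION = "v1.24.9-gke.3200"
```

Under that assumption, both versions reported in this issue pass the check, so the version requirement alone does not seem to explain the failure. The driver requirement (525 or later) still has to be checked separately.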

Daemonset drivers

For this, I configured a COS-based g2-standard-12 node pool, which includes an L4 GPU by default, and deployed it in my cluster.

I installed the drivers as described in the documentation:

>kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml

Then I noticed that the installer pods are in a CrashLoopBackOff state:

>kubectl -n kube-system get pods  -l k8s-app=nvidia-driver-installer 
NAME                            READY   STATUS                  RESTARTS      AGE
nvidia-driver-installer-8s57b   0/1     Init:CrashLoopBackOff   5 (49s ago)   6m19s
nvidia-driver-installer-g55lh   0/1     Init:CrashLoopBackOff   5 (52s ago)   6m24s

Logs

>kubectl -n kube-system logs -f -c nvidia-driver-installer -l k8s-app=nvidia-driver-installer  
E0710 16:10:43.531586   12050 utils.go:355] 
E0710 16:10:43.552776   12050 utils.go:355] 
E0710 16:10:52.897020   11343 utils.go:355] 
E0710 16:10:52.917705   11343 utils.go:355] 
E0710 16:10:52.917725   11343 utils.go:355] ERROR: Installation has failed.  Please see the file '/usr/local/nvidia/nvidia-installer.log' for details.  You may find suggestions on fixing installation problems in the README available on the Linux driver download page at www.nvidia.com.
E0710 16:10:43.552800   12050 utils.go:355] ERROR: Installation has failed.  Please see the file '/usr/local/nvidia/nvidia-installer.log' for details.  You may find suggestions on fixing installation problems in the README available on the Linux driver download page at www.nvidia.com.
E0710 16:10:43.552804   12050 utils.go:355] 
I0710 16:10:43.553060   12050 installer.go:272] Done linking drivers
I0710 16:10:43.721063   12050 modules.go:71] Loading gpu-key to keyring %keyring:.secondary_trusted_keys
I0710 16:10:43.722880   12050 modules.go:83] Successfully load key gpu-key into keyring %keyring:.secondary_trusted_keys.
I0710 16:10:43.722904   12050 modules.go:71] Loading gpu-key to keyring %keyring:.ima
I0710 16:10:43.724544   12050 modules.go:83] Successfully load key gpu-key into keyring %keyring:.ima.
E0710 16:10:44.054791   12050 install.go:356] failed to run GPU driver installer: failed to load GPU drivers: failed to load module /usr/local/nvidia/drivers/nvidia.ko: failed to load module nvidia (/usr/local/nvidia/drivers/nvidia.ko): failed to run command `insmod /usr/local/nvidia/drivers/nvidia.ko`: exit status 1
E0710 16:10:52.917729   11343 utils.go:355] 
I0710 16:10:52.917962   11343 installer.go:272] Done linking drivers
I0710 16:10:53.094929   11343 modules.go:71] Loading gpu-key to keyring %keyring:.secondary_trusted_keys
I0710 16:10:53.096824   11343 modules.go:83] Successfully load key gpu-key into keyring %keyring:.secondary_trusted_keys.
I0710 16:10:53.096845   11343 modules.go:71] Loading gpu-key to keyring %keyring:.ima
I0710 16:10:53.098248   11343 modules.go:83] Successfully load key gpu-key into keyring %keyring:.ima.
E0710 16:10:53.419759   11343 install.go:356] failed to run GPU driver installer: failed to load GPU drivers: failed to load module /usr/local/nvidia/drivers/nvidia.ko: failed to load module nvidia (/usr/local/nvidia/drivers/nvidia.ko): failed to run command `insmod /usr/local/nvidia/drivers/nvidia.ko`: exit status 1

Latest Daemonset drivers

I then installed the latest daemonset:

>kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded-latest.yaml

only to see more or less the same errors in the driver installer:

>kubectl -n kube-system logs -f -c nvidia-driver-installer -l k8s-app=nvidia-driver-installer  
I0710 16:15:18.143027   12652 installer.go:175] Linking drivers...
I0710 16:15:18.143081   12652 installer.go:200] Running link command: [/build/cos-tools/bin/ld.lld -T /build/cos-tools/usr/src/linux-headers-5.10.161+/scripts/module.lds -r -o /tmp/extract/kernel/precompiled/nvidia.ko /tmp/extract/kernel/precompiled/nv-linux.o /tmp/extract/kernel/nvidia/nv-kernel.o_binary]
I0710 16:15:18.067520   13595 installer.go:175] Linking drivers...
I0710 16:15:18.067586   13595 installer.go:200] Running link command: [/build/cos-tools/bin/ld.lld -T /build/cos-tools/usr/src/linux-headers-5.10.161+/scripts/module.lds -r -o /tmp/extract/kernel/precompiled/nvidia.ko /tmp/extract/kernel/precompiled/nv-linux.o /tmp/extract/kernel/nvidia/nv-kernel.o_binary]
I0710 16:15:18.255090   12652 installer.go:211] Running link command: [/build/cos-tools/bin/ld.lld -T /build/cos-tools/usr/src/linux-headers-5.10.161+/scripts/module.lds -r -o /tmp/extract/kernel/precompiled/nvidia-modeset.ko /tmp/extract/kernel/precompiled/nv-modeset-linux.o /tmp/extract/kernel/nvidia-modeset/nv-modeset-kernel.o_binary]
I0710 16:15:18.270294   12652 installer.go:234] Done linking drivers
I0710 16:15:18.181461   13595 installer.go:211] Running link command: [/build/cos-tools/bin/ld.lld -T /build/cos-tools/usr/src/linux-headers-5.10.161+/scripts/module.lds -r -o /tmp/extract/kernel/precompiled/nvidia-modeset.ko /tmp/extract/kernel/precompiled/nv-modeset-linux.o /tmp/extract/kernel/nvidia-modeset/nv-modeset-kernel.o_binary]
I0710 16:15:18.196234   13595 installer.go:234] Done linking drivers
I0710 16:15:18.366874   13595 modules.go:71] Loading gpu-key to keyring %keyring:.secondary_trusted_keys
I0710 16:15:18.368433   13595 modules.go:83] Successfully load key gpu-key into keyring %keyring:.secondary_trusted_keys.
I0710 16:15:18.368451   13595 modules.go:71] Loading gpu-key to keyring %keyring:.ima
I0710 16:15:18.369846   13595 modules.go:83] Successfully load key gpu-key into keyring %keyring:.ima.
I0710 16:15:18.889544   13595 install.go:329] Failed to load kernel module, err: failed to load GPU drivers: failed to load module /usr/local/nvidia/drivers/nvidia.ko: failed to load module nvidia (/usr/local/nvidia/drivers/nvidia.ko): failed to run command `insmod /usr/local/nvidia/drivers/nvidia.ko`: exit status 1. Retrying driver installation with legacy linking
I0710 16:15:18.889577   13595 installer.go:303] Running GPU driver installer
I0710 16:15:18.440314   12652 modules.go:71] Loading gpu-key to keyring %keyring:.secondary_trusted_keys
I0710 16:15:18.441901   12652 modules.go:83] Successfully load key gpu-key into keyring %keyring:.secondary_trusted_keys.
I0710 16:15:18.441918   12652 modules.go:71] Loading gpu-key to keyring %keyring:.ima
I0710 16:15:18.443135   12652 modules.go:83] Successfully load key gpu-key into keyring %keyring:.ima.
I0710 16:15:18.953207   12652 install.go:329] Failed to load kernel module, err: failed to load GPU drivers: failed to load module /usr/local/nvidia/drivers/nvidia.ko: failed to load module nvidia (/usr/local/nvidia/drivers/nvidia.ko): failed to run command `insmod /usr/local/nvidia/drivers/nvidia.ko`: exit status 1. Retrying driver installation with legacy linking
I0710 16:15:18.953238   12652 installer.go:303] Running GPU driver installer
I0710 16:15:28.274192   13595 installer.go:143] Extracting precompiled artifacts...
I0710 16:15:28.313699   12652 installer.go:143] Extracting precompiled artifacts...
I0710 16:15:28.466472   13595 installer.go:170] Done extracting precompiled artifacts
I0710 16:15:28.466501   13595 installer.go:239] Linking drivers using legacy method...
I0710 16:15:28.466554   13595 installer.go:270] Installer arguments:
[/tmp/extract/nvidia-installer --utility-prefix=/usr/local/nvidia --opengl-prefix=/usr/local/nvidia --x-prefix=/usr/local/nvidia --install-libglvnd --no-install-compat32-libs --log-file-name=/usr/local/nvidia/nvidia-installer.log --silent --accept-license]
I0710 16:15:28.505712   12652 installer.go:170] Done extracting precompiled artifacts
I0710 16:15:28.505737   12652 installer.go:239] Linking drivers using legacy method...
I0710 16:15:28.505774   12652 installer.go:270] Installer arguments:
[/tmp/extract/nvidia-installer --utility-prefix=/usr/local/nvidia --opengl-prefix=/usr/local/nvidia --x-prefix=/usr/local/nvidia --install-libglvnd --no-install-compat32-libs --log-file-name=/usr/local/nvidia/nvidia-installer.log --silent --accept-license]
E0710 16:15:29.007506   13595 utils.go:355] 
E0710 16:15:29.008106   13595 utils.go:355] ERROR: Unable to load the kernel module 'nvidia.ko'.  This happens most frequently when this kernel module was built against the wrong or improperly configured kernel sources, with a version of gcc that differs from the one used to build the target kernel, or if another driver, such as nouveau, is present and prevents the NVIDIA kernel module from obtaining ownership of the NVIDIA device(s), or no NVIDIA device installed in this system is supported by this NVIDIA Linux graphics driver release.
E0710 16:15:29.008128   13595 utils.go:355] 
E0710 16:15:29.008136   13595 utils.go:355] Please see the log entries 'Kernel module load error' and 'Kernel messages' at the end of the file '/usr/local/nvidia/nvidia-installer.log' for more information.
E0710 16:15:29.008143   13595 utils.go:355] 
E0710 16:15:29.030758   13595 utils.go:355] 
E0710 16:15:29.030788   13595 utils.go:355] ERROR: Installation has failed.  Please see the file '/usr/local/nvidia/nvidia-installer.log' for details.  You may find suggestions on fixing installation problems in the README available on the Linux driver download page at www.nvidia.com.
E0710 16:15:29.030792   13595 utils.go:355] 
I0710 16:15:29.031035   13595 installer.go:272] Done linking drivers
E0710 16:15:29.034164   12652 utils.go:355] 
E0710 16:15:29.034910   12652 utils.go:355] ERROR: Unable to load the kernel module 'nvidia.ko'.  This happens most frequently when this kernel module was built against the wrong or improperly configured kernel sources, with a version of gcc that differs from the one used to build the target kernel, or if another driver, such as nouveau, is present and prevents the NVIDIA kernel module from obtaining ownership of the NVIDIA device(s), or no NVIDIA device installed in this system is supported by this NVIDIA Linux graphics driver release.
E0710 16:15:29.034929   12652 utils.go:355] 
E0710 16:15:29.034937   12652 utils.go:355] Please see the log entries 'Kernel module load error' and 'Kernel messages' at the end of the file '/usr/local/nvidia/nvidia-installer.log' for more information.
E0710 16:15:29.034942   12652 utils.go:355] 
E0710 16:15:29.056397   12652 utils.go:355] 
E0710 16:15:29.056422   12652 utils.go:355] ERROR: Installation has failed.  Please see the file '/usr/local/nvidia/nvidia-installer.log' for details.  You may find suggestions on fixing installation problems in the README available on the Linux driver download page at www.nvidia.com.
E0710 16:15:29.056427   12652 utils.go:355] 
I0710 16:15:29.056712   12652 installer.go:272] Done linking drivers
I0710 16:15:29.227756   13595 modules.go:71] Loading gpu-key to keyring %keyring:.secondary_trusted_keys
I0710 16:15:29.229349   13595 modules.go:83] Successfully load key gpu-key into keyring %keyring:.secondary_trusted_keys.
I0710 16:15:29.229375   13595 modules.go:71] Loading gpu-key to keyring %keyring:.ima
I0710 16:15:29.230874   13595 modules.go:83] Successfully load key gpu-key into keyring %keyring:.ima.
I0710 16:15:29.230297   12652 modules.go:71] Loading gpu-key to keyring %keyring:.secondary_trusted_keys
I0710 16:15:29.231998   12652 modules.go:83] Successfully load key gpu-key into keyring %keyring:.secondary_trusted_keys.
I0710 16:15:29.232023   12652 modules.go:71] Loading gpu-key to keyring %keyring:.ima
I0710 16:15:29.233532   12652 modules.go:83] Successfully load key gpu-key into keyring %keyring:.ima.
E0710 16:15:29.575466   13595 install.go:356] failed to run GPU driver installer: failed to load GPU drivers: failed to load module /usr/local/nvidia/drivers/nvidia.ko: failed to load module nvidia (/usr/local/nvidia/drivers/nvidia.ko): failed to run command `insmod /usr/local/nvidia/drivers/nvidia.ko`: exit status 1
E0710 16:15:29.580685   12652 install.go:356] failed to run GPU driver installer: failed to load GPU drivers: failed to load module /usr/local/nvidia/drivers/nvidia.ko: failed to load module nvidia (/usr/local/nvidia/drivers/nvidia.ko): failed to run command `insmod /usr/local/nvidia/drivers/nvidia.ko`: exit status 1

(The log lines are duplicated because there are two nodes and therefore two nvidia-driver-installer pods.)
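Because the two pods write to the same stream, the output above is interleaved. A minimal sketch for separating it, assuming the standard klog line header (severity letter, `mmdd`, time, process/thread id, `file:line]`); the function name is hypothetical and the sample lines are copied from the first log above:

```python
import re
from collections import defaultdict

# klog header: severity, mmdd, hh:mm:ss.uuuuuu, process/thread id, file:line], message
KLOG_LINE = re.compile(
    r"^(?P<severity>[IWEF])(?P<date>\d{4}) (?P<time>\d{2}:\d{2}:\d{2}\.\d{6})\s+"
    r"(?P<pid>\d+) (?P<source>[^ \]]+:\d+)\] ?(?P<message>.*)$"
)


def group_by_pid(lines):
    """Return {pid: [(time, severity, message), ...]}, ordered by time per pid."""
    groups = defaultdict(list)
    for line in lines:
        match = KLOG_LINE.match(line)
        if match is None:
            # Lines without a klog header (for example the bracketed
            # installer argument list) are skipped.
            continue
        groups[match["pid"]].append(
            (match["time"], match["severity"], match["message"])
        )
    for entries in groups.values():
        # Stable sort: entries with equal timestamps keep their input order.
        entries.sort(key=lambda entry: entry[0])
    return dict(groups)


SAMPLE_LINES = [
    "I0710 16:10:43.553060   12050 installer.go:272] Done linking drivers",
    "I0710 16:10:43.721063   12050 modules.go:71] Loading gpu-key to keyring %keyring:.secondary_trusted_keys",
    "I0710 16:10:52.917962   11343 installer.go:272] Done linking drivers",
]

if __name__ == "__main__":
    for pid, entries in group_by_pid(SAMPLE_LINES).items():
        print(pid, len(entries))
```

Sorting on the time field only is enough here because every line carries the same date. Logs that span several days would need the date in the sort key as well.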

Daemonset 525 drivers (installed manually)

I also tried downloading the latest daemonset and editing it to install a specific 525 driver version (I tried all of them, and they all failed with the same error):

        command: ['/cos-gpu-installer', 'install', '--allow-unsigned-driver', '--nvidia-installer-url=https://storage.googleapis.com/nvidia-drivers-us-public/nvidia-cos-project/93/tesla/525_00/525.85.12/NVIDIA-Linux-x86_64-525.85.12_93-16623-341-8.cos']

and the driver installation failed again:

>kubectl -n kube-system logs -f -c nvidia-driver-installer -l k8s-app=nvidia-driver-installer  
I0710 17:11:04.007041   37416 install.go:205] Installing GPU driver from "https://storage.googleapis.com/nvidia-drivers-us-public/nvidia-cos-project/97/tesla/525_00/525.105.17/NVIDIA-Linux-x86_64-525.105.17_97-16919-294-48.cos"
I0710 17:11:04.007442   37416 cos.go:31] Checking kernel module signing.
I0710 17:11:04.007462   37416 installer.go:106] Configuring driver installation directories
I0710 17:11:04.033106   37416 utils.go:88] Downloading toolchain_env from https://storage.googleapis.com/cos-tools/16919.235.1/toolchain_env
I0710 17:11:04.106217   37416 cos.go:73] Installing the toolchain
I0710 17:11:04.106338   37416 cos.go:89] Found existing toolchain. Skipping download and installation.
I0710 17:11:04.106353   37416 cos.go:102] Found existing kernel headers. Skipping download and installation.
I0710 17:11:04.106377   37416 utils.go:88] Downloading Unofficial GPU driver installer from https://storage.googleapis.com/nvidia-drivers-us-public/nvidia-cos-project/97/tesla/525_00/525.105.17/NVIDIA-Linux-x86_64-525.105.17_97-16919-294-48.cos
I0710 17:11:08.647335   37416 installer.go:303] Running GPU driver installer


I0710 17:11:20.902272   37416 installer.go:143] Extracting precompiled artifacts...
I0710 17:11:21.131986   37416 installer.go:170] Done extracting precompiled artifacts
I0710 17:11:21.132014   37416 installer.go:175] Linking drivers...
I0710 17:11:21.132068   37416 installer.go:200] Running link command: [/build/cos-tools/bin/ld.lld -T /build/cos-tools/usr/src/linux-headers-5.10.161+/scripts/module.lds -r -o /tmp/extract/kernel/precompiled/nvidia.ko /tmp/extract/kernel/precompiled/nv-linux.o /tmp/extract/kernel/nvidia/nv-kernel.o_binary]
I0710 17:11:21.282000   37416 installer.go:211] Running link command: [/build/cos-tools/bin/ld.lld -T /build/cos-tools/usr/src/linux-headers-5.10.161+/scripts/module.lds -r -o /tmp/extract/kernel/precompiled/nvidia-modeset.ko /tmp/extract/kernel/precompiled/nv-modeset-linux.o /tmp/extract/kernel/nvidia-modeset/nv-modeset-kernel.o_binary]
I0710 17:11:21.297718   37416 installer.go:234] Done linking drivers
I0710 17:11:21.485882   37416 install.go:329] Failed to load kernel module, err: failed to load GPU drivers: failed to load module /usr/local/nvidia/drivers/nvidia.ko: failed to load module nvidia (/usr/local/nvidia/drivers/nvidia.ko): failed to run command `insmod /usr/local/nvidia/drivers/nvidia.ko`: exit status 1. Retrying driver installation with legacy linking
I0710 17:11:21.485910   37416 installer.go:303] Running GPU driver installer


I0710 17:11:33.823939   37416 installer.go:143] Extracting precompiled artifacts...
I0710 17:11:34.042251   37416 installer.go:170] Done extracting precompiled artifacts
I0710 17:11:34.042281   37416 installer.go:239] Linking drivers using legacy method...
I0710 17:11:34.042330   37416 installer.go:270] Installer arguments:
[/tmp/extract/nvidia-installer --utility-prefix=/usr/local/nvidia --opengl-prefix=/usr/local/nvidia --x-prefix=/usr/local/nvidia --install-libglvnd --no-install-compat32-libs --log-file-name=/usr/local/nvidia/nvidia-installer.log --silent --accept-license]
E0710 17:11:34.155440   37416 utils.go:355] 
E0710 17:11:34.155925   37416 utils.go:355] ERROR: Unable to find the kernel source tree for the currently running kernel.  Please make sure you have installed the kernel source files for your kernel and that they are properly configured; on Red Hat Linux systems, for example, be sure you have the 'kernel-source' or 'kernel-devel' RPM installed.  If you know the correct kernel source files are installed, you may specify the kernel source path with the '--kernel-source-path' command line option.
E0710 17:11:34.155940   37416 utils.go:355] 
E0710 17:11:34.155944   37416 utils.go:355] 
E0710 17:11:34.155948   37416 utils.go:355] ERROR: Installation has failed.  Please see the file '/usr/local/nvidia/nvidia-installer.log' for details.  You may find suggestions on fixing installation problems in the README available on the Linux driver download page at www.nvidia.com.
E0710 17:11:34.155953   37416 utils.go:355] 
I0710 17:11:34.155986   37416 installer.go:272] Done linking drivers
E0710 17:11:34.209832   37416 install.go:356] failed to run GPU driver installer: failed to load GPU drivers: failed to load module /usr/local/nvidia/drivers/nvidia.ko: failed to load module nvidia (/usr/local/nvidia/drivers/nvidia.ko): failed to run command `insmod /usr/local/nvidia/drivers/nvidia.ko`: exit status 1
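One thing that can be read off this log is which COS build the driver package targets versus which build the toolchain was fetched for. The sketch below rests on an assumption not confirmed in this thread: that the `.cos` file name has the form `<driver>_<milestone>-<build with dashes>` and that the `cos-tools/<build>/` path segment is the node's COS build. The function names are hypothetical; the URLs are copied from the log above.

```python
import re

# Assumed file name layout: NVIDIA-Linux-x86_64-<driver>_<milestone>-<build>.cos
DRIVER_FILE = re.compile(
    r"NVIDIA-Linux-x86_64-(?P<driver>\d+(?:\.\d+)+)"
    r"_(?P<milestone>\d+)-(?P<build>\d+(?:-\d+)+)\.cos$"
)
# Assumed toolchain layout: .../cos-tools/<build>/toolchain_env
TOOLCHAIN_URL = re.compile(r"/cos-tools/(?P<build>\d+(?:\.\d+)+)/")

LOGGED_DRIVER_URL = (
    "https://storage.googleapis.com/nvidia-drivers-us-public/nvidia-cos-project/"
    "97/tesla/525_00/525.105.17/NVIDIA-Linux-x86_64-525.105.17_97-16919-294-48.cos"
)
LOGGED_TOOLCHAIN_URL = (
    "https://storage.googleapis.com/cos-tools/16919.235.1/toolchain_env"
)


def parse_driver_url(url):
    """Extract driver version, COS milestone and COS build from a driver URL."""
    match = DRIVER_FILE.search(url)
    if match is None:
        raise ValueError(f"unrecognized driver URL: {url!r}")
    return {
        "driver": match["driver"],
        "driver_major": int(match["driver"].split(".")[0]),
        "milestone": int(match["milestone"]),
        "cos_build": match["build"].replace("-", "."),
    }


def parse_toolchain_build(url):
    """Extract the COS build number from a cos-tools toolchain URL."""
    match = TOOLCHAIN_URL.search(url)
    if match is None:
        raise ValueError(f"unrecognized toolchain URL: {url!r}")
    return match["build"]


if __name__ == "__main__":
    print(parse_driver_url(LOGGED_DRIVER_URL))
    print(parse_toolchain_build(LOGGED_TOOLCHAIN_URL))
```

If the assumption holds, the two build numbers in this log differ, which might be worth ruling out. Whether that difference has anything to do with the `insmod` failure is not established here.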

Conclusion

I haven't had such issues with other GPU types in the past. I have now switched to P100 on Ubuntu, but I am really interested in using G2 with L4 because it is a better fit for our use case.

Is there any way to get a working driver for G2 with an L4 GPU in GKE, with either the Ubuntu or the COS image type?

@JulesBelveze

+1

@MinaMaher0

It worked for me after adding this block to the node pool Terraform code:

guest_accelerator {
  type  = "nvidia-l4"
  count = 1
}

and installing the new version of the GPU driver:

kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded-latest.yaml

@mpagnucco

+1
