You must use GKE version 1.22.17-gke.5400 or later.
You must ensure that you have enough quota for the underlying G2 Compute Engine machine type to use L4 GPUs.
The GKE version that you choose must include NVIDIA driver version 525 or later in Container-Optimized OS. If driver version 525 or later isn't the default or the latest version in your GKE version, you must manually install a supported driver on your nodes.
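As a quick sanity check of the first requirement, the versions reported in this issue can be compared against the quoted minimum. The sketch below assumes that GKE versions order as a plain (major, minor, patch, gke-build) tuple; the documentation may list separate minimums per minor release, which this does not model. The helper names `parse_gke_version` and `meets_minimum` are made up for illustration.

```python
import re
from typing import Tuple

# Minimum GKE version quoted in the requirement above.
MIN_L4_VERSION = "1.22.17-gke.5400"


def parse_gke_version(version: str) -> Tuple[int, ...]:
    """Split a string such as 'v1.24.9-gke.3200' into (1, 24, 9, 3200)."""
    match = re.fullmatch(r"v?(\d+)\.(\d+)\.(\d+)-gke\.(\d+)", version.strip())
    if match is None:
        raise ValueError(f"unrecognized GKE version: {version!r}")
    return tuple(int(part) for part in match.groups())


def meets_minimum(version: str, minimum: str = MIN_L4_VERSION) -> bool:
    """Assumption: simple tuple ordering is enough to compare versions."""
    return parse_gke_version(version) >= parse_gke_version(minimum)


if __name__ == "__main__":
    reported = [
        ("control plane", "v1.24.12-gke.500"),
        ("node pool", "v1.24.9-gke.3200"),
    ]
    for name, version in reported:
        print(f"{name}: {version} meets minimum: {meets_minimum(version)}")
```

Under that ordering assumption both reported versions pass, so the version requirement alone would not explain the failure.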
Daemonset drivers
For this I configured a COS-based g2-standard-12 node pool, which includes an L4 GPU by default, and deployed it in my cluster.
I made sure to install the drivers mentioned in the documentation, and later the latest daemonset,
only to see more or less the same errors in the driver installer:
>kubectl -n kube-system logs -f -c nvidia-driver-installer -l k8s-app=nvidia-driver-installer
I0710 16:15:18.143027 12652 installer.go:175] Linking drivers...
I0710 16:15:18.143081 12652 installer.go:200] Running link command: [/build/cos-tools/bin/ld.lld -T /build/cos-tools/usr/src/linux-headers-5.10.161+/scripts/module.lds -r -o /tmp/extract/kernel/precompiled/nvidia.ko /tmp/extract/kernel/precompiled/nv-linux.o /tmp/extract/kernel/nvidia/nv-kernel.o_binary]
I0710 16:15:18.067520 13595 installer.go:175] Linking drivers...
I0710 16:15:18.067586 13595 installer.go:200] Running link command: [/build/cos-tools/bin/ld.lld -T /build/cos-tools/usr/src/linux-headers-5.10.161+/scripts/module.lds -r -o /tmp/extract/kernel/precompiled/nvidia.ko /tmp/extract/kernel/precompiled/nv-linux.o /tmp/extract/kernel/nvidia/nv-kernel.o_binary]
I0710 16:15:18.255090 12652 installer.go:211] Running link command: [/build/cos-tools/bin/ld.lld -T /build/cos-tools/usr/src/linux-headers-5.10.161+/scripts/module.lds -r -o /tmp/extract/kernel/precompiled/nvidia-modeset.ko /tmp/extract/kernel/precompiled/nv-modeset-linux.o /tmp/extract/kernel/nvidia-modeset/nv-modeset-kernel.o_binary]
I0710 16:15:18.270294 12652 installer.go:234] Done linking drivers
I0710 16:15:18.181461 13595 installer.go:211] Running link command: [/build/cos-tools/bin/ld.lld -T /build/cos-tools/usr/src/linux-headers-5.10.161+/scripts/module.lds -r -o /tmp/extract/kernel/precompiled/nvidia-modeset.ko /tmp/extract/kernel/precompiled/nv-modeset-linux.o /tmp/extract/kernel/nvidia-modeset/nv-modeset-kernel.o_binary]
I0710 16:15:18.196234 13595 installer.go:234] Done linking drivers
I0710 16:15:18.366874 13595 modules.go:71] Loading gpu-key to keyring %keyring:.secondary_trusted_keys
I0710 16:15:18.368433 13595 modules.go:83] Successfully load key gpu-key into keyring %keyring:.secondary_trusted_keys.
I0710 16:15:18.368451 13595 modules.go:71] Loading gpu-key to keyring %keyring:.ima
I0710 16:15:18.369846 13595 modules.go:83] Successfully load key gpu-key into keyring %keyring:.ima.
I0710 16:15:18.889544 13595 install.go:329] Failed to load kernel module, err: failed to load GPU drivers: failed to load module /usr/local/nvidia/drivers/nvidia.ko: failed to load module nvidia (/usr/local/nvidia/drivers/nvidia.ko): failed to run command `insmod /usr/local/nvidia/drivers/nvidia.ko`: exit status 1. Retrying driver installation with legacy linking
I0710 16:15:18.889577 13595 installer.go:303] Running GPU driver installer
I0710 16:15:18.440314 12652 modules.go:71] Loading gpu-key to keyring %keyring:.secondary_trusted_keys
I0710 16:15:18.441901 12652 modules.go:83] Successfully load key gpu-key into keyring %keyring:.secondary_trusted_keys.
I0710 16:15:18.441918 12652 modules.go:71] Loading gpu-key to keyring %keyring:.ima
I0710 16:15:18.443135 12652 modules.go:83] Successfully load key gpu-key into keyring %keyring:.ima.
I0710 16:15:18.953207 12652 install.go:329] Failed to load kernel module, err: failed to load GPU drivers: failed to load module /usr/local/nvidia/drivers/nvidia.ko: failed to load module nvidia (/usr/local/nvidia/drivers/nvidia.ko): failed to run command `insmod /usr/local/nvidia/drivers/nvidia.ko`: exit status 1. Retrying driver installation with legacy linking
I0710 16:15:18.953238 12652 installer.go:303] Running GPU driver installer
I0710 16:15:28.274192 13595 installer.go:143] Extracting precompiled artifacts...
I0710 16:15:28.313699 12652 installer.go:143] Extracting precompiled artifacts...
I0710 16:15:28.466472 13595 installer.go:170] Done extracting precompiled artifacts
I0710 16:15:28.466501 13595 installer.go:239] Linking drivers using legacy method...
I0710 16:15:28.466554 13595 installer.go:270] Installer arguments:
[/tmp/extract/nvidia-installer --utility-prefix=/usr/local/nvidia --opengl-prefix=/usr/local/nvidia --x-prefix=/usr/local/nvidia --install-libglvnd --no-install-compat32-libs --log-file-name=/usr/local/nvidia/nvidia-installer.log --silent --accept-license]
I0710 16:15:28.505712 12652 installer.go:170] Done extracting precompiled artifacts
I0710 16:15:28.505737 12652 installer.go:239] Linking drivers using legacy method...
I0710 16:15:28.505774 12652 installer.go:270] Installer arguments:
[/tmp/extract/nvidia-installer --utility-prefix=/usr/local/nvidia --opengl-prefix=/usr/local/nvidia --x-prefix=/usr/local/nvidia --install-libglvnd --no-install-compat32-libs --log-file-name=/usr/local/nvidia/nvidia-installer.log --silent --accept-license]
E0710 16:15:29.007506 13595 utils.go:355]
E0710 16:15:29.008106 13595 utils.go:355] ERROR: Unable to load the kernel module 'nvidia.ko'. This happens most frequently when this kernel module was built against the wrong or improperly configured kernel sources, with a version of gcc that differs from the one used to build the target kernel, or if another driver, such as nouveau, is present and prevents the NVIDIA kernel module from obtaining ownership of the NVIDIA device(s), or no NVIDIA device installed in this system is supported by this NVIDIA Linux graphics driver release.
E0710 16:15:29.008128 13595 utils.go:355]
E0710 16:15:29.008136 13595 utils.go:355] Please see the log entries 'Kernel module load error' and 'Kernel messages' at the end of the file '/usr/local/nvidia/nvidia-installer.log' for more information.
E0710 16:15:29.008143 13595 utils.go:355]
E0710 16:15:29.030758 13595 utils.go:355]
E0710 16:15:29.030788 13595 utils.go:355] ERROR: Installation has failed. Please see the file '/usr/local/nvidia/nvidia-installer.log' for details. You may find suggestions on fixing installation problems in the README available on the Linux driver download page at www.nvidia.com.
E0710 16:15:29.030792 13595 utils.go:355]
I0710 16:15:29.031035 13595 installer.go:272] Done linking drivers
E0710 16:15:29.034164 12652 utils.go:355]
E0710 16:15:29.034910 12652 utils.go:355] ERROR: Unable to load the kernel module 'nvidia.ko'. This happens most frequently when this kernel module was built against the wrong or improperly configured kernel sources, with a version of gcc that differs from the one used to build the target kernel, or if another driver, such as nouveau, is present and prevents the NVIDIA kernel module from obtaining ownership of the NVIDIA device(s), or no NVIDIA device installed in this system is supported by this NVIDIA Linux graphics driver release.
E0710 16:15:29.034929 12652 utils.go:355]
E0710 16:15:29.034937 12652 utils.go:355] Please see the log entries 'Kernel module load error' and 'Kernel messages' at the end of the file '/usr/local/nvidia/nvidia-installer.log' for more information.
E0710 16:15:29.034942 12652 utils.go:355]
E0710 16:15:29.056397 12652 utils.go:355]
E0710 16:15:29.056422 12652 utils.go:355] ERROR: Installation has failed. Please see the file '/usr/local/nvidia/nvidia-installer.log' for details. You may find suggestions on fixing installation problems in the README available on the Linux driver download page at www.nvidia.com.
E0710 16:15:29.056427 12652 utils.go:355]
I0710 16:15:29.056712 12652 installer.go:272] Done linking drivers
I0710 16:15:29.227756 13595 modules.go:71] Loading gpu-key to keyring %keyring:.secondary_trusted_keys
I0710 16:15:29.229349 13595 modules.go:83] Successfully load key gpu-key into keyring %keyring:.secondary_trusted_keys.
I0710 16:15:29.229375 13595 modules.go:71] Loading gpu-key to keyring %keyring:.ima
I0710 16:15:29.230874 13595 modules.go:83] Successfully load key gpu-key into keyring %keyring:.ima.
I0710 16:15:29.230297 12652 modules.go:71] Loading gpu-key to keyring %keyring:.secondary_trusted_keys
I0710 16:15:29.231998 12652 modules.go:83] Successfully load key gpu-key into keyring %keyring:.secondary_trusted_keys.
I0710 16:15:29.232023 12652 modules.go:71] Loading gpu-key to keyring %keyring:.ima
I0710 16:15:29.233532 12652 modules.go:83] Successfully load key gpu-key into keyring %keyring:.ima.
E0710 16:15:29.575466 13595 install.go:356] failed to run GPU driver installer: failed to load GPU drivers: failed to load module /usr/local/nvidia/drivers/nvidia.ko: failed to load module nvidia (/usr/local/nvidia/drivers/nvidia.ko): failed to run command `insmod /usr/local/nvidia/drivers/nvidia.ko`: exit status 1
E0710 16:15:29.580685 12652 install.go:356] failed to run GPU driver installer: failed to load GPU drivers: failed to load module /usr/local/nvidia/drivers/nvidia.ko: failed to load module nvidia (/usr/local/nvidia/drivers/nvidia.ko): failed to run command `insmod /usr/local/nvidia/drivers/nvidia.ko`: exit status 1
(The logs are duplicated because there are 2 nodes and therefore 2 nvidia-driver-installer pods.)
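Since the two installer pods write interleaved lines, it can help to split the output per process before reading it. The lines follow the klog header layout (severity letter, MMDD, timestamp, process ID, `file:line]`, message), so a minimal sketch can group them by the process ID field. Attaching unprefixed lines (such as the bracketed installer arguments) to the most recent header line is a heuristic of this sketch, and `split_by_pid` is a hypothetical helper name.

```python
import re
from collections import defaultdict

# klog header: severity, MMDD, HH:MM:SS.micros, process id, file:line, "]", message.
KLOG_LINE = re.compile(
    r"^(?P<severity>[IWEF])(?P<date>\d{4}) "
    r"(?P<time>\d{2}:\d{2}:\d{2}\.\d{6})\s+"
    r"(?P<pid>\d+) (?P<source>[^\s\]]+:\d+)\] ?(?P<message>.*)$"
)


def split_by_pid(lines):
    """Group log lines by the process ID found in the klog header."""
    streams = defaultdict(list)
    current_pid = None
    for line in lines:
        match = KLOG_LINE.match(line)
        if match:
            current_pid = match.group("pid")
            streams[current_pid].append(line)
        elif current_pid is not None:
            # Heuristic: a line without a header continues the previous entry.
            streams[current_pid].append(line)
    return dict(streams)


if __name__ == "__main__":
    sample = [
        "I0710 16:15:18.143027 12652 installer.go:175] Linking drivers...",
        "I0710 16:15:18.067520 13595 installer.go:175] Linking drivers...",
        "I0710 16:15:28.466554 13595 installer.go:270] Installer arguments:",
        "[/tmp/extract/nvidia-installer --silent --accept-license]",
        "E0710 16:15:29.034164 12652 utils.go:355]",
    ]
    for pid, entries in split_by_pid(sample).items():
        print(f"--- process {pid} ---")
        print("\n".join(entries))
```

Saving the `kubectl logs` output to a file and passing its lines to `split_by_pid` yields one readable stream per installer pod.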
Daemonset 525 drivers (installed manually)
I also tried fetching the latest daemonset locally and editing it to install a specific version of the 525 driver (I tried all of them; they all failed with the same error):
>kubectl -n kube-system logs -f -c nvidia-driver-installer -l k8s-app=nvidia-driver-installer
I0710 17:11:04.007041 37416 install.go:205] Installing GPU driver from "https://storage.googleapis.com/nvidia-drivers-us-public/nvidia-cos-project/97/tesla/525_00/525.105.17/NVIDIA-Linux-x86_64-525.105.17_97-16919-294-48.cos"
I0710 17:11:04.007442 37416 cos.go:31] Checking kernel module signing.
I0710 17:11:04.007462 37416 installer.go:106] Configuring driver installation directories
I0710 17:11:04.033106 37416 utils.go:88] Downloading toolchain_env from https://storage.googleapis.com/cos-tools/16919.235.1/toolchain_env
I0710 17:11:04.106217 37416 cos.go:73] Installing the toolchain
I0710 17:11:04.106338 37416 cos.go:89] Found existing toolchain. Skipping download and installation.
I0710 17:11:04.106353 37416 cos.go:102] Found existing kernel headers. Skipping download and installation.
I0710 17:11:04.106377 37416 utils.go:88] Downloading Unofficial GPU driver installer from https://storage.googleapis.com/nvidia-drivers-us-public/nvidia-cos-project/97/tesla/525_00/525.105.17/NVIDIA-Linux-x86_64-525.105.17_97-16919-294-48.cos
I0710 17:11:08.647335 37416 installer.go:303] Running GPU driver installer
I0710 17:11:20.902272 37416 installer.go:143] Extracting precompiled artifacts...
I0710 17:11:21.131986 37416 installer.go:170] Done extracting precompiled artifacts
I0710 17:11:21.132014 37416 installer.go:175] Linking drivers...
I0710 17:11:21.132068 37416 installer.go:200] Running link command: [/build/cos-tools/bin/ld.lld -T /build/cos-tools/usr/src/linux-headers-5.10.161+/scripts/module.lds -r -o /tmp/extract/kernel/precompiled/nvidia.ko /tmp/extract/kernel/precompiled/nv-linux.o /tmp/extract/kernel/nvidia/nv-kernel.o_binary]
I0710 17:11:21.282000 37416 installer.go:211] Running link command: [/build/cos-tools/bin/ld.lld -T /build/cos-tools/usr/src/linux-headers-5.10.161+/scripts/module.lds -r -o /tmp/extract/kernel/precompiled/nvidia-modeset.ko /tmp/extract/kernel/precompiled/nv-modeset-linux.o /tmp/extract/kernel/nvidia-modeset/nv-modeset-kernel.o_binary]
I0710 17:11:21.297718 37416 installer.go:234] Done linking drivers
I0710 17:11:21.485882 37416 install.go:329] Failed to load kernel module, err: failed to load GPU drivers: failed to load module /usr/local/nvidia/drivers/nvidia.ko: failed to load module nvidia (/usr/local/nvidia/drivers/nvidia.ko): failed to run command `insmod /usr/local/nvidia/drivers/nvidia.ko`: exit status 1. Retrying driver installation with legacy linking
I0710 17:11:21.485910 37416 installer.go:303] Running GPU driver installer
I0710 17:11:33.823939 37416 installer.go:143] Extracting precompiled artifacts...
I0710 17:11:34.042251 37416 installer.go:170] Done extracting precompiled artifacts
I0710 17:11:34.042281 37416 installer.go:239] Linking drivers using legacy method...
I0710 17:11:34.042330 37416 installer.go:270] Installer arguments:
[/tmp/extract/nvidia-installer --utility-prefix=/usr/local/nvidia --opengl-prefix=/usr/local/nvidia --x-prefix=/usr/local/nvidia --install-libglvnd --no-install-compat32-libs --log-file-name=/usr/local/nvidia/nvidia-installer.log --silent --accept-license]
E0710 17:11:34.155440 37416 utils.go:355]
E0710 17:11:34.155925 37416 utils.go:355] ERROR: Unable to find the kernel source tree for the currently running kernel. Please make sure you have installed the kernel source files for your kernel and that they are properly configured; on Red Hat Linux systems, for example, be sure you have the 'kernel-source' or 'kernel-devel' RPM installed. If you know the correct kernel source files are installed, you may specify the kernel source path with the '--kernel-source-path' command line option.
E0710 17:11:34.155940 37416 utils.go:355]
E0710 17:11:34.155944 37416 utils.go:355]
E0710 17:11:34.155948 37416 utils.go:355] ERROR: Installation has failed. Please see the file '/usr/local/nvidia/nvidia-installer.log' for details. You may find suggestions on fixing installation problems in the README available on the Linux driver download page at www.nvidia.com.
E0710 17:11:34.155953 37416 utils.go:355]
I0710 17:11:34.155986 37416 installer.go:272] Done linking drivers
E0710 17:11:34.209832 37416 install.go:356] failed to run GPU driver installer: failed to load GPU drivers: failed to load module /usr/local/nvidia/drivers/nvidia.ko: failed to load module nvidia (/usr/local/nvidia/drivers/nvidia.ko): failed to run command `insmod /usr/local/nvidia/drivers/nvidia.ko`: exit status 1
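One detail in the log above that might be worth checking: the driver package name ends in `_97-16919-294-48`, while the toolchain is downloaded from a `cos-tools/16919.235.1` path. The sketch below extracts both and compares them. It is an assumption on my part that the package suffix encodes the COS milestone and build the driver was prepared for, and whether a mismatch actually causes the `insmod` failure is not established here. The function names are hypothetical.

```python
import re

DRIVER_URL = (
    "https://storage.googleapis.com/nvidia-drivers-us-public/nvidia-cos-project/"
    "97/tesla/525_00/525.105.17/NVIDIA-Linux-x86_64-525.105.17_97-16919-294-48.cos"
)
TOOLCHAIN_URL = "https://storage.googleapis.com/cos-tools/16919.235.1/toolchain_env"


def driver_package_info(url: str):
    """Return (driver_version, cos_build) parsed from the package file name.

    Assumption: the suffix '<milestone>-<a>-<b>-<c>' encodes COS build 'a.b.c'.
    """
    match = re.search(
        r"NVIDIA-Linux-x86_64-(?P<driver>[\d.]+)_(?P<milestone>\d+)-"
        r"(?P<build>\d+-\d+-\d+)\.cos$",
        url,
    )
    if match is None:
        raise ValueError(f"unexpected driver URL: {url!r}")
    return match.group("driver"), match.group("build").replace("-", ".")


def toolchain_build(url: str) -> str:
    """Return the COS build embedded in the cos-tools download path."""
    match = re.search(r"/cos-tools/(?P<build>\d+\.\d+\.\d+)/", url)
    if match is None:
        raise ValueError(f"unexpected toolchain URL: {url!r}")
    return match.group("build")


if __name__ == "__main__":
    driver, driver_cos = driver_package_info(DRIVER_URL)
    node_cos = toolchain_build(TOOLCHAIN_URL)
    print(f"driver {driver} packaged for COS build {driver_cos}")
    print(f"toolchain fetched for COS build {node_cos}")
    print(f"builds match: {driver_cos == node_cos}")
```

If the two builds differ, comparing them against the image version actually running on the node would be a reasonable next step.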
Conclusion
I haven't had such issues with other GPU types in the past. I have now switched to P100 on Ubuntu, but I am really interested in using G2 with L4, as it is a better fit for our use case.
Is there any way to get G2 with an L4 GPU and a working driver in GKE, with either an Ubuntu or a COS image type?
Description
I am trying to use a G2 machine type with an L4 GPU in GKE.
GKE Control Plane version
v1.24.12-gke.500
Node pool version
v1.24.9-gke.3200
Based on the documentation, these are the requirements for using an L4 GPU in Kubernetes.
Requirements
L4 GPUs:
Daemonset drivers
For this I configured a COS-based g2-standard-12 node pool, which includes an L4 GPU by default, and deployed it in my cluster.
I made sure to install the drivers mentioned in the documentation.
Then I noticed that the pods are in a CrashLoopBackOff state.
Logs
Latest Daemonset drivers
I then installed the latest daemonset,
only to see more or less the same errors in the driver installer.
(The logs are duplicated because there are 2 nodes and therefore 2 nvidia-driver-installer pods.)
Daemonset 525 drivers (installed manually)
I also tried fetching the latest daemonset locally and editing it to install a specific version of the 525 driver (I tried all of them; they all failed with the same error),
and the driver installation failed again.