
Add kmod-5.15-nvidia sources #2455

Merged (1 commit) on Sep 30, 2022

Conversation

@arnaldo2792 arnaldo2792 commented Sep 26, 2022

Issue number:
Part of #2374

Description of changes:

packages: add kmod-5.15-nvidia sources

This change is required to release k8s-1.24-nvidia variants, since the 470 NVIDIA driver does not work with kernels > 5.10.

Testing done:
This is just a cherry-pick from #2286; the only difference is the driver version, and the same testing was applied. I ran a local variant that uses this driver and confirmed that pods can access the GPUs:

❯ kubectl exec gpu-tests-d5x9v -- nvidia-smi
Mon Sep 26 16:42:08 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.65.01    Driver Version: 515.65.01    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:00:1E.0 Off |                    0 |
| N/A   40C    P8    15W /  70W |      2MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
❯ kubectl exec gpu-tests-d5x9v -- uname -a
Linux gpu-tests-d5x9v 5.15.54 #1 SMP Mon Sep 26 16:12:59 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

For aarch64, I'm having problems getting a node to join a cluster (unrelated to this PR), but I can see the driver is working:

bash-5.1# /usr/libexec/nvidia/tesla/bin/515.65.01/nvidia-smi
Mon Sep 26 18:47:48 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.65.01    Driver Version: 515.65.01    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA T4G          Off  | 00000000:00:1F.0 Off |                    0 |
| N/A   36C    P8    14W /  70W |      2MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
bash-5.1# uname -a
Linux ip-192-168-36-119.us-west-2.compute.internal 5.15.54 #1 SMP Mon Sep 26 17:30:22 UTC 2022 aarch64 GNU/Linux

As part of #2286, I verified whether the GSP firmware was loaded on each supported architecture:

  • Instance Type: p2
  • GPU Model: Tesla K80
  • GPU Architecture: Kepler 2.0
  • The driver failed to load:
[   32.672564] NVRM: The NVIDIA Tesla K80 GPU installed in this system is
[   32.672564] NVRM:  supported through the NVIDIA 470.xx Legacy drivers
  • Instance Type: p3
  • GPU Model: NVIDIA V100
  • GPU Architecture: NVIDIA Volta
  • GSP wasn't used (N/A means the firmware binary wasn't loaded):
bash-5.1# /usr/libexec/nvidia/tesla/bin/515.48.07/nvidia-smi -q | grep GSP
GSP Firmware Version                  : N/A
  • Instance Type: g4dn
  • GPU Model: NVIDIA T4
  • GPU Architecture: NVIDIA Turing
  • GSP was used:
bash-5.1# /usr/libexec/nvidia/tesla/bin/515.48.07/nvidia-smi -q | grep GSP
GSP Firmware Version                  : 515.48.07
  • Instance Type: g5g
  • GPU Model: NVIDIA T4G
  • GPU Architecture: NVIDIA Turing
  • GSP was used:
bash-5.1# /usr/libexec/nvidia/tesla/bin/515.48.07/nvidia-smi -q | grep GSP
GSP Firmware Version                  : 515.48.07

Terms of contribution:

By submitting this pull request, I agree that this contribution is dual-licensed under the terms of both the Apache License, version 2.0, and the MIT license.

soname="$(%{_cross_target}-readelf -d "${lib}" | awk '/SONAME/{print $5}' | tr -d '[]')"
[ -n "${soname}" ] || continue
[ "${lib}" == "${soname}" ] && continue
[ -e %{buildroot}/%{tesla_515_libdir}/"${soname}" ] && continue
Contributor:

What file is this guard catching?

Contributor Author:

There are a few libraries in the binary for which the soname'd file is provided in the archive:

libnvidia-gtk3.so.515.65.01
libnvidia-eglcore.so.515.65.01
libnvidia-rtcore.so.515.65.01
libGLX.so.0
libnvidia-glvkspirv.so.515.65.01
libnvidia-tls.so.515.65.01
libnvidia-gtk2.so.515.65.01
libGLdispatch.so.0
libnvidia-wayland-client.so.515.65.01
libnvidia-compiler.so.515.65.01
libnvidia-glsi.so.515.65.01
libnvidia-glcore.so.515.65.01
libOpenGL.so.0

However, there are a few that still require the symlink:

libGLESv2_nvidia.so.515.65.01
libnvoptix.so.515.65.01
libnvidia-egl-wayland.so.1.1.9
libGL.so.1.7.0
libnvidia-allocator.so.515.65.01
libvdpau_nvidia.so.515.65.01
libnvidia-ngx.so.515.65.01
libEGL.so.1.1.0
libnvidia-nvvm.so.515.65.01
libnvidia-encode.so.515.65.01
libGLX_nvidia.so.515.65.01
libGLESv2.so.2.1.0
libnvidia-egl-gbm.so.1.1.0
libEGL.so.515.65.01
libOpenCL.so.1.0.0
libnvidia-fbc.so.515.65.01
libnvidia-opticalflow.so.515.65.01
libGLESv1_CM.so.1.2.0
libGLESv1_CM_nvidia.so.515.65.01
libnvidia-cfg.so.515.65.01
libnvidia-ptxjitcompiler.so.515.65.01
libnvidia-opencl.so.515.65.01
libnvidia-ml.so.515.65.01
libcuda.so.515.65.01
libEGL_nvidia.so.515.65.01
libnvcuvid.so.515.65.01

Contributor:

There are a few libraries in the binary for which the soname'd file is provided in the archive

The "file" or the "symlink"? If we test (via `[ -L ${link} ]`) that it is already a link, then that is OK. If it is a regular file with the same name as the library, then that is not really OK, because it creates ambiguity as to which one the dynamic loader will select.

I would tend to prefer testing that it's a link, then removing it and recreating our own. If it exists and it's not a link, do something else: diff it against the target and remove it if they're the same, then create our link. Otherwise it's an exceptional case and I need more details to advise.
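That preference could be sketched as a small shell helper. This is purely illustrative: `link_soname` and its `cmp`-based comparison are my assumptions, not code from this PR.

```shell
#!/usr/bin/env bash
# Illustrative sketch of the link-handling policy described above.
# "lib" is the real library file and "soname" the name the link should
# have; both stand in for values computed during packaging.
set -euo pipefail

link_soname() {
  local lib="$1" soname="$2"
  if [ -L "${soname}" ]; then
    # Already a symlink: remove it and recreate our own.
    rm "${soname}"
  elif [ -e "${soname}" ]; then
    # Regular file with the soname's name: remove it only if it is
    # byte-identical to the target; otherwise treat it as exceptional.
    if cmp -s "${soname}" "${lib}"; then
      rm "${soname}"
    else
      echo "unexpected regular file, needs review: ${soname}" >&2
      return 1
    fi
  fi
  ln -s "${lib}" "${soname}"
}
```

Running, say, `link_soname libcuda.so.515.65.01 libcuda.so.1` in the library directory would then leave a single unambiguous symlink for the dynamic loader to resolve.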

Contributor Author:

I made a mistake while I was evaluating the symlinks that existed for each library (fish-shell bit me 💢). Anyway, I confirmed that neither the symlinks nor the files exist in the NVIDIA `.run` archives for these libraries:

Missing 'libnvoptix.so.1' for 'libnvoptix.so.515.65.01'
Missing 'libnvidia-egl-wayland.so.1' for 'libnvidia-egl-wayland.so.1.1.9'
Missing 'libGL.so.1' for 'libGL.so.1.7.0'
Missing 'libnvidia-allocator.so.1' for 'libnvidia-allocator.so.515.65.01'
Missing 'libvdpau_nvidia.so.1' for 'libvdpau_nvidia.so.515.65.01'
Missing 'libnvidia-ngx.so.1' for 'libnvidia-ngx.so.515.65.01'
Missing 'libEGL.so.1' for 'libEGL.so.1.1.0'
Missing 'libnvidia-nvvm.so.4' for 'libnvidia-nvvm.so.515.65.01'
Missing 'libnvidia-encode.so.1' for 'libnvidia-encode.so.515.65.01'
Missing 'libGLX_nvidia.so.0' for 'libGLX_nvidia.so.515.65.01'
Missing 'libGLESv2.so.2' for 'libGLESv2.so.2.1.0'
Missing 'libnvidia-egl-gbm.so.1' for 'libnvidia-egl-gbm.so.1.1.0'
Missing 'libEGL.so.1' for 'libEGL.so.515.65.01'
Missing 'libOpenCL.so.1' for 'libOpenCL.so.1.0.0'
Missing 'libnvidia-fbc.so.1' for 'libnvidia-fbc.so.515.65.01'
Missing 'libnvidia-opticalflow.so.1' for 'libnvidia-opticalflow.so.515.65.01'
Missing 'libGLESv1_CM.so.1' for 'libGLESv1_CM.so.1.2.0'
Missing 'libGLESv1_CM_nvidia.so.1' for 'libGLESv1_CM_nvidia.so.515.65.01'
Missing 'libnvidia-cfg.so.1' for 'libnvidia-cfg.so.515.65.01'
Missing 'libnvidia-ptxjitcompiler.so.1' for 'libnvidia-ptxjitcompiler.so.515.65.01'
Missing 'libnvidia-opencl.so.1' for 'libnvidia-opencl.so.515.65.01'
Missing 'libnvidia-ml.so.1' for 'libnvidia-ml.so.515.65.01'
Missing 'libcuda.so.1' for 'libcuda.so.515.65.01'
Missing 'libEGL_nvidia.so.0' for 'libEGL_nvidia.so.515.65.01'
Missing 'libnvcuvid.so.1' for 'libnvcuvid.so.515.65.01'

This is the script that I used to verify which libraries are missing their SONAME symlink, and which don't need it:

#! /usr/bin/env bash

for lib in $(find . -maxdepth 1 -type f -name 'lib*.so.*' -printf '%P\n'); do
  soname="$(readelf -d "${lib}" | awk '/SONAME/{print $5}' | tr -d '[]')"
  [ -n "${soname}" ] || continue
  [ "${lib}" == "${soname}" ] && continue
  [ ! -e "${soname}" ] && echo "Missing '${soname}' for '${lib}'"
done

Comment on lines +1 to +4
%global tesla_515 515.65.01
%global tesla_515_libdir %{_cross_libdir}/nvidia/tesla/%{tesla_515}
%global tesla_515_bindir %{_cross_libexecdir}/nvidia/tesla/bin/%{tesla_515}
%global tesla_515_firmwaredir %{_cross_libdir}/firmware/nvidia/%{tesla_515}
Contributor:

Note for other reviewers: I checked out this branch and then ran:

git diff --no-index --word-diff packages/kmod-*-nvidia/kmod-*-nvidia.spec

There's a lot of churn in the diff that stems from changing %{tesla_470} to %{tesla_515}. That's somewhat unavoidable here but in the interests of simplifying future diffs, it might be good to go ahead and rename this macro to tesla_ver. That way the next diff will be easier to examine.
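If that rename were applied, the lines quoted above might become something like the following. This is only a sketch: the derived macro names such as `tesla_libdir` are illustrative, not taken from the PR.

```spec
%global tesla_ver 515.65.01
%global tesla_libdir %{_cross_libdir}/nvidia/tesla/%{tesla_ver}
%global tesla_bindir %{_cross_libexecdir}/nvidia/tesla/bin/%{tesla_ver}
%global tesla_firmwaredir %{_cross_libdir}/firmware/nvidia/%{tesla_ver}
```

A future driver bump would then change only the value on the first line, keeping the diff minimal.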

Contributor Author:

I can change this for both kmod packages in another PR 👍

@bcressey (Contributor):

From the commit message:

The driver will use the GPU System Processor (GSP) feature if the underlying hardware supports it by loading the binary file /lib/firmware/nvidia/<version>/gsp.bin.

Have you confirmed this works as expected?

@arnaldo2792 (Contributor Author):

Yes, I'll update the description with my testing

@arnaldo2792 arnaldo2792 marked this pull request as ready for review September 29, 2022 17:39
@arnaldo2792 (Contributor Author):

Forced push includes:

  • Remove extra NV_VERBOSE flag while compiling the kernel modules
  • Remove unnecessary guard while creating SONAME symlinks

This adds the sources to compile the 515 NVIDIA driver for the 5.15
kernel.  This version only supports the GPU architectures Maxwell,
Pascal, Volta, Turing, Ampere, and forward.  The driver will use the GPU
System Processor (GSP) feature if the underlying hardware supports it
by loading the binary file `/lib/firmware/nvidia/<version>/gsp.bin`.

Signed-off-by: Arnaldo Garcia Rincon <[email protected]>
@arnaldo2792 (Contributor Author):

(Forced push removed a stray commit)

@arnaldo2792 arnaldo2792 merged commit b122ea1 into bottlerocket-os:develop Sep 30, 2022
@arnaldo2792 arnaldo2792 deleted the kmod-5.15-nvidia branch October 26, 2022 18:43