kmod-5.10-nvidia: add remaining libraries #1928

Merged — 2 commits merged into bottlerocket-os:develop on Feb 4, 2022

Conversation

@arnaldo2792 (Contributor) commented Jan 26, 2022

Issue number:
Closes #1822

Description of changes:

kmod-5.10-nvidia: add remaining libraries
kmod-5.10-nvidia: add releases url

The NVIDIA sources provide user-space libraries that will be mounted into the containers, depending on the set of driver capabilities configured for the workload.
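For context, the capability set that decides which of these libraries get mounted is normally selected through the `NVIDIA_DRIVER_CAPABILITIES` environment variable. The invocation below is illustrative only (it needs a host with the NVIDIA runtime configured, so it cannot run here); the image tag is the one used in the test further down:

```shell
# Illustrative: "utility" pulls in libnvidia-ml.so and friends,
# "compute" pulls in the CUDA driver libraries. libnvidia-container
# mounts only the libraries matching the requested capabilities.
docker run --rm --runtime=nvidia \
  -e NVIDIA_DRIVER_CAPABILITIES=compute,utility \
  nvidia/cuda:11.4.3-devel-ubuntu20.04 nvidia-smi
```

In Kubernetes the same variable can be set in the pod spec's container `env`, which is how a DaemonSet like the one below ends up with the right subset of libraries.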

Testing done:

I ran a DaemonSet on a p3.2xlarge instance with the following image definition:

FROM nvidia/cuda:11.4.3-devel-ubuntu20.04 as cuda-samples

RUN apt update
RUN apt install git build-essential -y
RUN git clone https://github.com/NVIDIA/cuda-samples.git

# Compute samples
RUN mkdir -p /samples
RUN cd /cuda-samples/Samples/0_Introduction/vectorAdd && make -j && [ -f vectorAdd ] && cp vectorAdd /samples/
RUN cd /cuda-samples/Samples/1_Utilities/bandwidthTest && make -j && [ -f bandwidthTest ] && cp bandwidthTest /samples/
RUN cd /cuda-samples/Samples/1_Utilities/deviceQuery && make -j && [ -f deviceQuery ] && cp deviceQuery /samples/
RUN cd /cuda-samples/Samples/1_Utilities/topologyQuery && make -j && [ -f topologyQuery ] && cp topologyQuery /samples/

FROM alpine as builder
RUN apk update \
  && apk add --update git

FROM builder as benchmarks
RUN git clone https://github.com/tensorflow/benchmarks.git \
  && cd benchmarks \
  && git checkout cnn_tf_v1.15_compatible

FROM tensorflow/tensorflow:1.15.2-gpu
ENV SAMPLES="vectorAdd bandwidthTest deviceQuery topologyQuery"
COPY ./entrypoint.sh /
COPY --from=benchmarks /benchmarks /opt/benchmarks
COPY --from=cuda-samples /samples/* /usr/bin/
RUN chmod +x /entrypoint.sh
ENTRYPOINT ["sh", "-c", "/entrypoint.sh"]

Entrypoint:

#! /usr/bin/env bash

# Cuda samples:

for sample in $SAMPLES; do
  $sample
done

# GPU benchmark:
python3 /opt/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py \
  --data_name=imagenet                                                 \
  --model=resnet50                                                     \
  --num_batches=100                                                    \
  --batch_size=4                                                       \
  --num_gpus=1                                                         \
  --gpu_memory_frac_for_testing=0.2

The containers ran successfully.

TODO:

  • Same test for aarch64

Terms of contribution:

By submitting this pull request, I agree that this contribution is dual-licensed under the terms of both the Apache License, version 2.0, and the MIT license.

@arnaldo2792 arnaldo2792 marked this pull request as ready for review January 28, 2022 01:43
Review thread on packages/kmod-5.10-nvidia/kmod-5.10-nvidia.spec:

# https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/user-guide.html#driver-capabilities

# Utility libs
install -m755 libnvidia-ml.so.%{nvidia_tesla_470_version} %{buildroot}%{_cross_libdir}/nvidia/tesla/%{nvidia_tesla_470_version}
A project member commented:
Suggested change
install -m755 libnvidia-ml.so.%{nvidia_tesla_470_version} %{buildroot}%{_cross_libdir}/nvidia/tesla/%{nvidia_tesla_470_version}
install -m 755 libnvidia-ml.so.%{nvidia_tesla_470_version} %{buildroot}%{_cross_libdir}/nvidia/tesla/%{nvidia_tesla_470_version}

Nit: you might want to add a space to be consistent with the preexisting code, though it seems to work either way.

@arnaldo2792 (Contributor Author):
Forced push includes:

  • Install all libraries, and explicitly include and exclude the libraries in the %files section
  • Short global variable nvidia_tesla_470_version
  • Create only the required symlinks, based on the output of readelf -a <lib> | grep SONAME
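That symlink step can be sketched roughly as follows. This is a hypothetical sketch, not the spec's actual code: the real version runs against the NVIDIA libraries in the build root, so here it is demonstrated on a scratch copy of libc to make it runnable, reading the SONAME with readelf as described above:

```shell
# Sketch: create the "libfoo.so.1 -> libfoo.so.1.2.3" symlink that a
# library's SONAME asks for. Paths are illustrative; we copy libc into
# a temp dir so the example is self-contained.
workdir=$(mktemp -d)
src=$(ldconfig -p | awk '/libc\.so\.6 / { print $NF; exit }')
cp "$src" "$workdir/libc.so.6.copy"

# Extract the SONAME from the dynamic section, e.g. [libc.so.6].
soname=$(readelf -d "$workdir/libc.so.6.copy" \
  | awk -F'[][]' '/SONAME/ { print $2; exit }')

# Link the SONAME to the versioned file, as the spec does per library.
ln -sf libc.so.6.copy "$workdir/$soname"
```

In the spec, the same loop would iterate over every installed `libnvidia-*.so.%{nvidia_tesla_470_version}` file instead of a single library.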

@etungsten (Contributor) left a comment:
I tested both aarch64 and x86_64 builds with benchmarks and samples and they pass.

@arnaldo2792 (Contributor Author):
Forced push includes:

  • Automated symlink creation
  • Explanation on why some libraries are excluded

@arnaldo2792 (Contributor Author):
Forced push includes:

  • Remove compat32 libs

@bcressey (Contributor) commented Feb 3, 2022

Still trying to come up with better advice for what to include, since our current method seems pretty high-touch and error-prone.

I'd like to err on the side of including everything and letting libnvidia-container sort it out, with the possible exception of the Gtk and Wayland stuff that we know is excluded.

It seems like the only problem with that plan is what to do about the libEGL.so.1 symlink, which should point to one of the two libraries with that SONAME, and perhaps one of them should be excluded.

@bcressey commented Feb 3, 2022

> It seems like the only problem with that plan is what to do about the libEGL.so.1 symlink, which should point to one of the two libraries with that SONAME, and perhaps one of them should be excluded.

Let's point libEGL.so.1 to libEGL.so.1.1.0 since that seems to be how the other libglvnd libraries are treated. We can still include both. On a running instance afterwards, you can check to see whether ldconfig --print-cache agrees with that resolution.
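A quick sketch of that `ldconfig` check, under the assumption it runs on the instance itself. The `resolve` helper is hypothetical, and since libEGL is only present on the GPU instance, the runnable demonstration uses libc.so.6:

```shell
# Ask the dynamic linker cache which file a given SONAME resolves to.
# On the instance you would call: resolve libEGL.so.1
resolve() {
  ldconfig --print-cache | awk -v so="$1" '$1 == so { print $NF; exit }'
}

# Demonstration with a library guaranteed to be in the cache.
resolve libc.so.6
```

If the printed path matches the symlink target chosen in the spec (libEGL.so.1.1.0 in this case), the linker and the package agree.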

If we could get the output from libnvidia-container when it's checking the compat libraries for inclusion, that might help determine whether either or both of them is OK. Or check how the other driver container images handle this.

Commit message:

The NVIDIA sources provide user-space libraries that will be mounted
into the containers, depending on the set of driver capabilities
configured for the workload.

Signed-off-by: Arnaldo Garcia Rincon <[email protected]>
@arnaldo2792 (Contributor Author):

> If we could get the output from libnvidia-container when it's checking the compat libraries for inclusion, that might help determine whether either or both of them is OK. Or check how the other driver container images handle this.

libnvidia-container will complain when a library is missing, with a message like this, visible in the journal:

missing <compat32> library: <library>

With these changes, I didn't see any complaints:

bash-5.0# uname -a
Linux ip-192-168-74-162.us-west-2.compute.internal 5.10.93 #1 SMP Wed Jan 26 19:56:51 UTC 2022 x86_64 GNU/Linux
bash-5.0# journalctl | grep missing
Feb 03 23:12:00 ip-192-168-74-162.us-west-2.compute.internal kernel: nvidia: module verification failed: signature and/or required key missing - tainting kernel
bash-5.0# uname -a
Linux ip-192-168-78-160.us-west-2.compute.internal 5.10.93 #1 SMP Wed Jan 26 19:54:26 UTC 2022 aarch64 GNU/Linux
bash-5.0# journalctl | grep missing
Feb 03 23:12:20 ip-192-168-78-160.us-west-2.compute.internal kernel: nvidia: module verification failed: signature and/or required key missing - tainting kernel

@arnaldo2792 arnaldo2792 merged commit 0440bcb into bottlerocket-os:develop Feb 4, 2022
@arnaldo2792 arnaldo2792 deleted the nvidia-integration branch March 31, 2022 20:55
Linked issue closed by this pull request: Add additional NVIDIA libraries to aws-k8s-1.21-nvidia variant