Fix how configuration files are created for the nvidia-container-toolkit #3790

Merged

arnaldo2792
Contributor

Issue number:

N / A

Description of changes:

In 911775f, we changed how the configuration files for the nvidia-container-toolkit were created and used %post install scripts to create hard links to the corresponding configuration for each variant. However, the hard links were never created because the Lua scripts used relative paths instead of absolute paths, and the builds didn't fail because %post install scripts fail silently.

With this commit, instead of creating hard links with %post install scripts, the configuration files for the nvidia-container-toolkit are copied over with tmpfiles.d configuration files.
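
For illustration only, a tmpfiles.d entry that copies a configuration file into place could look like the sketch below. The destination path is the one checked in the testing section; the source path under /usr/share/factory is an assumption made for this example and may not match the file names actually used in this change.

```
# Hypothetical tmpfiles.d entry (the source path in the Argument column is an assumption).
# "C" copies the source to the destination if the destination does not already exist.
# Type  Path                                       Mode  User  Group  Age  Argument
C       /etc/nvidia-container-runtime/config.toml  -     -     -      -    /usr/share/factory/nvidia-container-runtime/config.toml
```

Because systemd-tmpfiles performs the copy when it processes this entry at runtime rather than inside an RPM scriptlet, the result can be checked directly on a running host, as done in the testing below.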

Testing done:

  • In k8s 1.29 x86_64, I verified the containerized workload doesn't have access to all the GPUs when NVIDIA_VISIBLE_DEVICES=all:
~ on Fedora ❯ k describe pod gpu-tests-2-thz4w | rg NVIDIA_VISIBLE_DEVICES
      NVIDIA_VISIBLE_DEVICES:      all
# There are 4 GPUs in the instance, and only two were requested in the DaemonSet spec
~ on Fedora ❯ k exec gpu-tests-2-thz4w -- nvidia-smi
Wed Feb 21 02:30:31 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla T4                       Off | 00000000:00:1C.0 Off |                    0 |
| N/A   17C    P8               9W /  70W |      2MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  Tesla T4                       Off | 00000000:00:1E.0 Off |                    0 |
| N/A   18C    P8               8W /  70W |      2MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
  • In aws-ecs-2-nvidia x86_64, I confirmed the configuration file was copied over, which it wasn't when the %post install script was used:
bash-5.1# apiclient get os
{
  "os": {
    "arch": "x86_64",
    "build_id": "08dfbd2b",
    "pretty_name": "Bottlerocket OS 1.19.2 (aws-ecs-2-nvidia)",
    "variant_id": "aws-ecs-2-nvidia",
    "version_id": "1.19.2"
  }
}
bash-5.1# ls /etc/nvidia-container-runtime/config.toml
/etc/nvidia-container-runtime/config.toml

Terms of contribution:

By submitting this pull request, I agree that this contribution is dual-licensed under the terms of both the Apache License, version 2.0, and the MIT license.

In 911775f, we changed how the configuration files for the
nvidia-container-toolkit were created and used %post install scripts to
create hard links to the corresponding configuration for each variant.
However, the hard links were never created because the Lua scripts used
relative paths instead of absolute paths, and the builds didn't fail
because %post install scripts fail silently.

With this commit, instead of creating hard links with %post install
scripts, the configuration files for the nvidia-container-toolkit are
copied over with tmpfiles.d configuration files.

Signed-off-by: Arnaldo Garcia Rincon <[email protected]>
@arnaldo2792 arnaldo2792 merged commit 3898ba5 into bottlerocket-os:develop Feb 21, 2024
50 checks passed
@arnaldo2792 arnaldo2792 deleted the fix-nvidia-visible-devices branch February 21, 2024 21:56