
Fix NVIDIA tools dependencies #2090

Merged

Conversation

arnaldo2792
Contributor

Issue number:
N/A

Description of changes:
In preparation for #1074, we need to fix `nvidia-container-toolkit`'s dependencies so that it doesn't pull in `nvidia-k8s-device-plugin`.

aws-k8s-1.21-nvidia,aws-k8s-1.22-nvidia: fix NVIDIA tools dependencies

`nvidia-container-toolkit` doesn't depend on `nvidia-k8s-device-plugin`,
and vice versa; the two packages are independent of each other. Thus, they
should be included separately in aws-k8s-*-nvidia images.
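
As a quick sanity check that the rebuilt `nvidia-container-toolkit` package no longer pulls in the device plugin, the built RPM's requirements can be queried directly. A minimal sketch, assuming rpm is available on the build host; the package path is illustrative and depends on the local build output layout:

# Query the Requires of the freshly built package; the path below is an example only.
rpm -qpR build/rpms/*nvidia-container-toolkit*.rpm \
  | grep nvidia-k8s-device-plugin \
  || echo "no dependency on nvidia-k8s-device-plugin"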

Testing done:

  • Built aws-k8s-1.21-nvidia and aws-k8s-1.22-nvidia images
  • Ran a DaemonSet with the following configuration and validated that I could use nvidia-smi:
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: gpu-tests
spec:
  selector:
    matchLabels:
      name: gpu-tests
  template:
    metadata:
      labels:
        name: gpu-tests
    spec:
      containers:
        - name: gpu-tests
          image: amazonlinux:2
          command: ['sh', '-c', 'sleep infinity']
          resources:
            limits:
               nvidia.com/gpu: 1
          env:
            - name: NVIDIA_DRIVER_CAPABILITIES
              value: all
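
For reference, a minimal sketch of how the nvidia-smi output below can be collected, assuming the DaemonSet runs in the default namespace:

# Exec into every pod created by the DaemonSet and run nvidia-smi.
# The amazonlinux:2 image doesn't ship nvidia-smi; the binary is mounted into
# the container by the NVIDIA container runtime based on NVIDIA_DRIVER_CAPABILITIES.
for pod in $(kubectl get pods -l name=gpu-tests -o name); do
  echo "Pod: ${pod#pod/}"
  kubectl exec "$pod" -- nvidia-smi
done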

Calls to nvidia-smi:

Pod: gpu-tests-m8lzj
Wed Apr 20 23:48:42 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.82.01    Driver Version: 470.82.01    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  Off  | 00000000:00:1E.0 Off |                    0 |
| N/A   26C    P0    22W / 300W |      0MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
Pod: gpu-tests-w5sd5
Wed Apr 20 23:48:43 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.82.01    Driver Version: 470.82.01    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  Off  | 00000000:00:1E.0 Off |                    0 |
| N/A   27C    P0    22W / 300W |      0MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

I validated that nvidia-k8s-device-plugin.service is running on both variants:

bash-5.1# apiclient get os
{
  "os": {
    "arch": "x86_64",
    "build_id": "f3a9afc8",
    "pretty_name": "Bottlerocket OS 1.7.1 (aws-k8s-1.22-nvidia)",
    "variant_id": "aws-k8s-1.22-nvidia",
    "version_id": "1.7.1"
  }
}
bash-5.1# systemctl status nvidia-k8s-device-plugin.service | head -n 10
● nvidia-k8s-device-plugin.service - Start NVIDIA kubernetes device plugin
     Loaded: loaded (/x86_64-bottlerocket-linux-gnu/sys-root/usr/lib/systemd/system/nvidia-k8s-device-plugin.service; enabled; vendor preset: enabled)
     Active: active (running) since Wed 2022-04-20 23:24:53 UTC; 34min ago
   Main PID: 3017 (nvidia-device-p)
      Tasks: 10 (limit: 73527)
     Memory: 16.7M
     CGroup: /system.slice/nvidia-k8s-device-plugin.service
             └─3017 /usr/bin/nvidia-device-plugin

Apr 20 23:24:53 ip-192-168-38-175.us-west-2.compute.internal systemd[1]: Started Start NVIDIA kubernetes device plugin.
bash-5.1# apiclient get os
{
  "os": {
    "arch": "x86_64",
    "build_id": "f3a9afc8",
    "pretty_name": "Bottlerocket OS 1.7.1 (aws-k8s-1.21-nvidia)",
    "variant_id": "aws-k8s-1.21-nvidia",
    "version_id": "1.7.1"
  }
}
bash-5.1# systemctl status nvidia-k8s-device-plugin.service | head -n 10
● nvidia-k8s-device-plugin.service - Start NVIDIA kubernetes device plugin
     Loaded: loaded (/x86_64-bottlerocket-linux-gnu/sys-root/usr/lib/systemd/system/nvidia-k8s-device-plugin.service; enabled; vendor preset: enabled)
     Active: active (running) since Wed 2022-04-20 23:25:26 UTC; 34min ago
   Main PID: 3172 (nvidia-device-p)
      Tasks: 9 (limit: 73527)
     Memory: 15.8M
     CGroup: /system.slice/nvidia-k8s-device-plugin.service
             └─3172 /usr/bin/nvidia-device-plugin

Apr 20 23:25:26 ip-192-168-15-27.us-west-2.compute.internal systemd[1]: Started Start NVIDIA kubernetes device plugin.

Terms of contribution:

By submitting this pull request, I agree that this contribution is dual-licensed under the terms of both the Apache License, version 2.0, and the MIT license.

`nvidia-container-toolkit` doesn't depend on `nvidia-k8s-device-plugin`,
and vice versa; the two packages are independent of each other. Thus, they
should be included separately in aws-k8s-*-nvidia images.

Signed-off-by: Arnaldo Garcia Rincon <[email protected]>
@@ -20,7 +20,6 @@ Source3: nvidia-oci-hooks-json
 BuildRequires: %{_cross_os}glibc-devel
 Requires: %{_cross_os}libnvidia-container
 Requires: %{_cross_os}shimpei
-Requires: %{_cross_os}nvidia-k8s-device-plugin
Contributor

We're removing this dependency, but should it be reversed? Does nvidia-k8s-device-plugin work as expected if nvidia-container-toolkit is not installed?

Contributor Author

Yes, nvidia-k8s-device-plugin doesn't require nvidia-container-toolkit to work properly; it won't try to call any binary provided by nvidia-container-toolkit, and it only registers itself with the kubelet.
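
A quick way to confirm that registration with the kubelet (independent of the container toolkit) is to check that the node advertises the GPU resource. A minimal sketch; the column names are arbitrary:

# Nodes whose device plugin registered successfully report a non-zero nvidia.com/gpu capacity.
kubectl get nodes -o custom-columns='NAME:.metadata.name,GPU:.status.capacity.nvidia\.com/gpu'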

@arnaldo2792 arnaldo2792 merged commit 86d982b into bottlerocket-os:develop Apr 22, 2022
@arnaldo2792 arnaldo2792 deleted the fix-nvidia-dependencies branch April 22, 2022 21:31