Fix aws-ecs-1-nvidia configurations #2167

Merged

arnaldo2792 commented Jun 2, 2022

Issue number:
N / A

Description of changes:

aws-ecs-1-nvidia: add netdog configurations to cmdline

This adds new boot configurations for netdog to prepare the primary
network interface

docker-engine: fix daemon configuration for NVIDIA

The default runtime for all ECS variants should be `shimpie`, since the
ecs-agent knows when to switch runtimes depending on the task's
configurations
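
As a rough illustration of the second commit, the sketch below reads a Docker `daemon.json` document and reports which runtime is the default. The `default-runtime` and `runtimes` keys are standard Docker daemon options; the example document and the runtime paths in it are assumptions made for illustration, not the file shipped by this PR.

```python
import json


def default_runtime(daemon_json_text: str) -> str:
    """Return the default runtime named in a Docker daemon.json document.

    Docker falls back to "runc" when "default-runtime" is not set.
    """
    config = json.loads(daemon_json_text)
    return config.get("default-runtime", "runc")


# Hypothetical daemon.json for an NVIDIA ECS variant. The runtime names
# follow the PR description; the "path" values are assumptions.
EXAMPLE_DAEMON_JSON = """
{
  "default-runtime": "shimpie",
  "runtimes": {
    "shimpie": {"path": "shimpie"},
    "nvidia": {"path": "nvidia-container-runtime"}
  }
}
"""

# Usage: default_runtime(EXAMPLE_DAEMON_JSON) returns "shimpie".
```

With a configuration of this shape, the NVIDIA runtime stays registered but is only used when the ecs-agent asks for it, which is the behavior the commit message describes.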

Testing done:
Using the new variant, I scheduled two tasks, one of which requested 1 GPU. I verified that both tasks were scheduled and that only the task with the GPU had access to `nvidia-smi`:

bash-5.1# docker ps
CONTAINER ID   IMAGE                                                       COMMAND            CREATED              STATUS              PORTS     NAMES
2be3a1aa4a4d   fedora:35                                                   "sleep infinity"   About a minute ago   Up About a minute             ecs-fedora-2-fedora-b2b8dc9a8ee7fcc9e301
2edadda4084c   <>                                                          "sleep infinity"   2 minutes ago        Up 2 minutes                  ecs-nvidia-3-nvidia-cac8f5e5ea96bdc15f00
bash-5.1# docker exec 2be3a1aa4a4d nvidia-smi
OCI runtime exec failed: exec failed: unable to start container process: exec: "nvidia-smi": executable file not found in $PATH: unknown
bash-5.1# docker exec 2edadda4084c nvidia-smi
Thu Jun  2 01:42:54 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.82.01    Driver Version: 470.82.01    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  Off  | 00000000:00:1C.0 Off |                    0 |
| N/A   35C    P0    50W / 300W |      0MiB / 16160MiB |      2%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
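
The manual check above could be scripted along these lines. This is a minimal sketch, assuming the `docker` CLI is on the host's `PATH` and that a non-zero exit status from `docker exec <id> nvidia-smi` means the container has no GPU access; the function names are hypothetical.

```python
import subprocess
from typing import Callable, Dict, Iterable


def can_run_nvidia_smi(container_id: str, run: Callable = subprocess.run) -> bool:
    """Return True if `docker exec <container_id> nvidia-smi` exits with status 0."""
    result = run(
        ["docker", "exec", container_id, "nvidia-smi"],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def gpu_access_report(container_ids: Iterable[str], run: Callable = subprocess.run) -> Dict[str, bool]:
    """Map each container ID to whether nvidia-smi could be executed inside it."""
    return {cid: can_run_nvidia_smi(cid, run) for cid in container_ids}


# Usage with the container IDs from the transcript above:
#   gpu_access_report(["2be3a1aa4a4d", "2edadda4084c"])
# The `run` parameter exists only so the check can be exercised without Docker.
```

For the two tasks in this test, the expectation is that only the container started for the GPU task maps to `True`.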

Terms of contribution:

By submitting this pull request, I agree that this contribution is dual-licensed under the terms of both the Apache License, version 2.0, and the MIT license.

Signed-off-by: Arnaldo Garcia Rincon <[email protected]>
arnaldo2792 commented Jun 3, 2022

(forced push to pull down netdog changes that weren't part of the ECS variant)

arnaldo2792 changed the title from "Fix daemon configuration for NVIDIA" to "Fix aws-ecs-1-nvidia configurations" on Jun 3, 2022
This push adds a commit for the missing netdog configurations in the aws-ecs-1-nvidia variant.

@arnaldo2792 arnaldo2792 merged commit 2fb66bd into bottlerocket-os:develop Jun 3, 2022
@arnaldo2792 arnaldo2792 deleted the fix-ecs-nvidia-variant branch June 7, 2022 16:57