Add aws-ecs-1-nvidia variant #2128

Merged

Conversation

arnaldo2792
Contributor

@arnaldo2792 arnaldo2792 commented May 3, 2022

Issue number:
Closes #1074

Description of changes:

variants: add aws-ecs-1-nvidia

Testing done:

In my ECS cluster, I created two daemon services, as follows:

  • Service of a task with container.resourceRequirement=GPU
  • Service of a task without GPU requirements

I validated that both tasks were scheduled on the new variant, for both x86_64 and aarch64:

bash-5.1# apiclient get os
{
  "os": {
    "arch": "x86_64",
    "build_id": "06599241",
    "pretty_name": "Bottlerocket OS 1.7.2 (aws-ecs-1-nvidia)",
    "variant_id": "aws-ecs-1-nvidia",
    "version_id": "1.7.2"
  }
}
bash-5.1# docker ps
CONTAINER ID   IMAGE                                                       COMMAND            CREATED       STATUS       PORTS     NAMES
0b1362b65437   yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy  "sleep infinity"   4 hours ago   Up 4 hours
50ec3f736556   fedora:35                                                   "sleep infinity"   4 hours ago   Up 4 hours

# Container with GPU requirement
bash-5.1# docker exec 0b1362b65437 nvidia-smi
Tue May  3 21:20:09 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.82.01    Driver Version: 470.82.01    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  Off  | 00000000:00:1E.0 Off |                    0 |
| N/A   33C    P0    36W / 300W |      0MiB / 16160MiB |      1%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

# Container without GPU requirement fails to run `nvidia-smi`
bash-5.1# docker exec 50ec3f736556 nvidia-smi
OCI runtime exec failed: exec failed: unable to start container process: exec: "nvidia-smi": executable file not found in $PATH: unknown

bash-5.1# docker exec 50ec3f736556 uname -a
Linux 50ec3f736556 5.10.109 #1 SMP Tue May 3 04:06:34 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
bash-5.1# apiclient get os
{
  "os": {
    "arch": "aarch64",
    "build_id": "06599241",
    "pretty_name": "Bottlerocket OS 1.7.2 (aws-ecs-1-nvidia)",
    "variant_id": "aws-ecs-1-nvidia",
    "version_id": "1.7.2"
  }
}

bash-5.1# docker ps
CONTAINER ID   IMAGE                                                       COMMAND            CREATED       STATUS       PORTS     NAMES
9ce1fa7ca8df   yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy   "sleep infinity"   2 hours ago   Up 2 hours
c49549d07ce9   fedora:35                                                   "sleep infinity"   2 hours ago   Up 2 hours

# Container with GPU requirement
bash-5.1# docker exec 9ce1fa7ca8df nvidia-smi
Tue May  3 21:18:05 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.82.01    Driver Version: 470.82.01    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA T4G          Off  | 00000000:00:1F.0 Off |                    0 |
| N/A   52C    P0    16W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

# Container without GPU requirement
bash-5.1# docker exec c49549d07ce9 nvidia-smi
OCI runtime exec failed: exec failed: unable to start container process: exec: "nvidia-smi": executable file not found in $PATH: unknown
bash-5.1# docker exec c49549d07ce9 uname -a
Linux c49549d07ce9 5.10.109 #1 SMP Tue May 3 17:58:09 UTC 2022 aarch64 aarch64 aarch64 GNU/Linux

I validated that only the task without the GPU requirement was scheduled on the existing variant, for x86_64:

bash-5.1# apiclient get os
{
  "os": {
    "arch": "x86_64",
    "build_id": "06599241",
    "pretty_name": "Bottlerocket OS 1.7.2 (aws-ecs-1)",
    "variant_id": "aws-ecs-1",
    "version_id": "1.7.2"
  }
}
bash-5.1# docker ps
CONTAINER ID   IMAGE       COMMAND            CREATED         STATUS         PORTS     NAMES
cef9aaee0508   fedora:35   "sleep infinity"   6 minutes ago   Up 6 minutes             
bash-5.1# docker exec cef9aaee0508 uname -a
Linux cef9aaee0508 5.10.109 #1 SMP Tue May 3 04:06:34 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

Terms of contribution:

By submitting this pull request, I agree that this contribution is dual-licensed under the terms of both the Apache License, version 2.0, and the MIT license.

@arnaldo2792
Contributor Author

(Forced push adds documentation for the variant)

@arnaldo2792
Contributor Author

(Forced push fixes link and JSON blob in documentation)

Signed-off-by: Arnaldo Garcia Rincon <[email protected]>
@arnaldo2792
Contributor Author

(Forced push fixes JSON blob in documentation)

Contributor

@bcressey bcressey left a comment

Nice! LGTM.

@@ -0,0 +1 @@
../../../shared-defaults/docker-services.toml
Contributor

This file has four parts:

  1. docker service definition + restart command
  2. docker daemon config file
  3. container registry mirrors metadata
  4. container registry credentials metadata

We end up overriding (2) in 53-docker-daemon.toml, and (4) in 52-aws-ecs-1.toml.

It works in the sense that we're ending up with the right values, but I wonder if there's a way to refactor this that's easier to follow.

If nothing comes to mind, I'm OK with this going in. However, it's close to the ceiling of acceptable complexity because it makes migrations in this area even harder to reason about than usual.

Contributor Author

I'll try to propose something to safely improve these configuration files 👍

@@ -1,5 +1,6 @@
%global _cross_first_party 1
%global _is_k8s_variant %(if echo %{_cross_variant} | grep -Fqw "k8s"; then echo 1; else echo 0; fi)
%global _is_ecs_variant %(if echo %{_cross_variant} | grep -Fqw "ecs"; then echo 1; else echo 0; fi)
Contributor

@etungsten etungsten May 5, 2022

Non-blocking, but I'm slightly concerned about expanding this, since users who create custom variants with names that contain any of these substrings will accidentally include packages they might not need. I wonder if we should tokenize the variant tuple into fields with awk and then check the fields.

Contributor Author

Great suggestion! Maybe we should have shared macros with your idea so that we don't increase the verbosity here.
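
For illustration only, the field-based check discussed in this thread might look roughly like the sketch below. It is not the macro from the spec file: the helper name `variant_runtime_is` is hypothetical, and the sketch assumes variant names are hyphen-separated with the runtime family in the second field (as in `aws-ecs-1-nvidia`). The last line shows the word-match check from the diff for contrast.

```shell
#!/bin/sh
# Hypothetical helper (not part of the PR): prints 1 when the second
# hyphen-separated field of the variant name equals the given runtime,
# and 0 otherwise. Assumes names shaped like <platform>-<runtime>-<version>.
variant_runtime_is() {
  variant="$1"
  runtime="$2"
  echo "${variant}" | awk -F- -v want="${runtime}" '{ print (($2 == want) ? 1 : 0) }'
}

variant_runtime_is aws-ecs-1-nvidia ecs   # prints 1
variant_runtime_is aws-k8s-1.22 ecs       # prints 0
variant_runtime_is ecs-custom-1 ecs       # prints 0: "ecs" is not the runtime field

# The word-match check from the diff accepts "ecs" in any position:
if echo "ecs-custom-1" | grep -Fqw "ecs"; then echo 1; else echo 0; fi   # prints 1
```

Checking a fixed field means a custom variant only opts in to the ECS or Kubernetes packages when its runtime field says so, at the cost of committing to a naming scheme for variants.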

@arnaldo2792 arnaldo2792 merged commit 6a82bc5 into bottlerocket-os:develop May 5, 2022
@arnaldo2792 arnaldo2792 deleted the nvidia-ecs-variant branch June 7, 2022 16:58
Development

Successfully merging this pull request may close these issues.

[ECS] GPU support
3 participants