Add support for cgroup v2 to NVIDIA variants #2802
Conversation
Signed-off-by: Markus Boehme <[email protected]>
Will take a look at those aarch64 build failures. Converting to draft for now.
Uses the new libnvidia-container-go library. Since this is a Go library, set up proper environment variables for Go cross compilation that were not needed before. Signed-off-by: Markus Boehme <[email protected]>
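For readers unfamiliar with cgo cross compilation, the kind of environment this refers to looks roughly like the sketch below. The compiler name is a stand-in; the Bottlerocket SDK ships its own cross toolchain, so the variables actually set in the build differ:

```sh
# Illustrative only: cross-compile a cgo-based Go library (such as nvcgo) for
# aarch64 from an x86_64 build host. The compiler name below is an assumption.
export CGO_ENABLED=1            # the library uses cgo, so it must stay enabled
export GOOS=linux               # target operating system
export GOARCH=arm64             # target architecture
export CC=aarch64-linux-gnu-gcc # matching C cross compiler for cgo
go build ./...
```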
On a host using cgroup v2, access to the GPU is governed via cgroup devices eBPF programs, i.e. eBPF programs attached to a cgroup. To grant access to a GPU, libnvidia-container adds an eBPF program that allows processes inside a cgroup access to the desired GPU. If a cgroup devices program is already attached to a cgroup, or access to multiple GPUs is to be allowed, libnvidia-container reads the existing program from the kernel, modifies it, and loads the modified program.

Commit 0508365 ("release: Add several security-related sysctls") followed the recommendations of the Kernel Self Protection Project and set the `net.core.bpf_jit_harden` sysctl to a value of `2`, i.e. started applying JIT hardening to all eBPF programs being loaded. Constant blinding is one such hardening measure and modifies the eBPF byte code. Programs that underwent such a hardening pass cannot be read back by user space. As a consequence of the sysctl setting, the read/modify/write cycle performed by libnvidia-container fails on a Bottlerocket host. Containers with a GPU cannot be launched if the host uses cgroup v2.

Fix this by only applying eBPF JIT hardening to programs loaded by unprivileged users, i.e. setting the `net.core.bpf_jit_harden` sysctl to a value of `1`. Other sysctls may be tweaked to achieve a similar outcome (`kernel.kptr_restrict`, `kernel.perf_event_paranoid`) and allow reading back blinded eBPF programs, but would otherwise lead to a weaker security posture and hence are undesirable options.

This revised sysctl setting only applies to `nvidia` variants. All other variants retain the current behavior of applying JIT hardening for all users. Strictly speaking, it is only necessary to lower this sysctl's value on hosts with cgroup v2 enabled. However, cgroup v2 will become the default soon. It does not seem worthwhile to introduce additional complexity to `corndog` to apply the setting selectively.

Note that by default, Bottlerocket does not permit unprivileged users to load eBPF programs. Therefore, the change effectively disables JIT hardening for `nvidia` variants. The value is lowered to `1` instead of being disabled completely as a defense-in-depth mechanism and to guard against potential future changes.

Signed-off-by: Markus Boehme <[email protected]>
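To make the failure mode described above easier to picture, here is a small diagnostic sketch using standard tooling. The cgroup path and program ID are placeholders, and `bpftool` is not necessarily present on a Bottlerocket host, so treat this as illustrative only:

```sh
# Check the current eBPF JIT hardening level:
#   0 = no hardening, 1 = harden unprivileged loads only, 2 = harden all loads
sysctl net.core.bpf_jit_harden

# List the device programs attached to a container's cgroup (path is a placeholder)
bpftool cgroup show /sys/fs/cgroup/<container-cgroup>

# With net.core.bpf_jit_harden=2 the kernel does not hand the (blinded) instructions
# back to user space, so this dump yields nothing useful -- the same limitation that
# breaks libnvidia-container's read/modify/write of the cgroup devices program.
bpftool prog dump xlated id <prog-id>
```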
Force-pushed from d4b865c to f0c0458.
Was missing the proper setup to make cross compilation of the newly added Go library work, hence the aarch64 builds failed before on the x86_64 builders. Fixed this and successfully redid my testing with an x86_64 AMI that was built on an aarch64 machine (naturally, cross compilation failed in either direction, though in different ways).
One build got unlucky twice in a row: first another build job stole the show, then a request to fetch dependencies failed. Third time was a charm.
Looks good. Did some testing of aws-k8s-1.23-nvidia on g5g.2xlarge instances.
Built image with changes and published AMI. Created EKS cluster.
Created an NVIDIA workload test image. Ran with default booted nodes:
...
=========================================
Running sample vectorAdd
=========================================
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
=========================================
Running sample warpAggregatedAtomicsCG
=========================================
GPU Device 0: "Turing" with compute capability 7.5
CPU max matches GPU max
Warp Aggregated Atomics PASSED
$ k get pods
NAME READY STATUS RESTARTS AGE
nvsmoke-wj4cz 0/1 Completed 0 48s
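For anyone who wants to run a comparable smoke test, a minimal GPU job can be expressed as a Kubernetes Job that requests a GPU through the NVIDIA device plugin. The image name below is a placeholder, not the actual `nvsmoke` test image used above:

```sh
# Hypothetical GPU smoke-test job; replace the image with a CUDA samples image of your choice.
kubectl apply -f - <<'EOF'
apiVersion: batch/v1
kind: Job
metadata:
  name: nvsmoke
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: cuda-samples
        image: <cuda-samples-image>   # placeholder
        resources:
          limits:
            nvidia.com/gpu: 1         # scheduled onto a GPU via the NVIDIA device plugin
EOF

# The pod should reach Completed once the samples pass.
kubectl get pods
```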
Enabled cgroup v2 via `apiclient set -j '{"settings": {"boot": {"init": {"systemd.unified_cgroup_hierarchy": ["1"]}}}}'` and rebooted the node.
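As an extra sanity check that the reboot really switched the node to the unified hierarchy, something like the following can be run from a host shell (a sketch, not part of the test output above):

```sh
# cgroup v2 mounts a single cgroup2fs at /sys/fs/cgroup; cgroup v1 reports tmpfs here.
stat -fc %T /sys/fs/cgroup

# This file only exists on the unified (v2) hierarchy.
cat /sys/fs/cgroup/cgroup.controllers
```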
Created new job to rerun workload test:
...
=========================================
Running sample vectorAdd
=========================================
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
=========================================
Running sample warpAggregatedAtomicsCG
=========================================
GPU Device 0: "Turing" with compute capability 7.5
CPU max matches GPU max
Warp Aggregated Atomics PASSED
$ k get pods
NAME READY STATUS RESTARTS AGE
nvsmoke-whmjs 0/1 Completed 0 4m31s
Wonderful, thank you Sean! I need to play around with that workload test myself.
Very nice troubleshooting, and very insightful (I'm still learning eBPF!) 🎉
Issue number:
Closes #2504
Description of changes:
Make the NVIDIA variants of Bottlerocket work on hosts with cgroup v2 enabled ("unified cgroup hierarchy"). For this, build and ship `libnvidia-container` with its Go library that can handle both cgroup v1 and cgroup v2 (`libnvidia-container-go`, also known as `nvcgo`).

Testing done:
- Built an `aws-k8s-1.24-nvidia` AMI to be used in a Kubernetes cluster with a g3.16xlarge EC2 instance (x86_64, 4x Tesla M60 GPUs)
- Ran `kubectl exec test -- nvidia-smi` to verify the container is running, and verified all 4 GPUs show up and report data
- Switched to cgroup v2 via `apiclient set -j '{"settings": {"boot": {"init": {"systemd.unified_cgroup_hierarchy": ["1"]}}}}'` followed by `apiclient reboot`
- Ran `kubectl exec test -- nvidia-smi` again to verify the container is running, and verified all 4 GPUs show up and report data

Terms of contribution:
By submitting this pull request, I agree that this contribution is dual-licensed under the terms of both the Apache License, version 2.0, and the MIT license.