Add support for cgroup v2 to NVIDIA variants #2802

Merged

Conversation

markusboehme
Member

Issue number:

Closes #2504

Description of changes:

Make the NVIDIA variants of Bottlerocket work on hosts with cgroup v2 enabled ("unified cgroup hierarchy"). For this

  • build libnvidia-container with its Go library that can handle both cgroup v1 and cgroup v2 (libnvidia-container-go, also known as nvcgo)
  • fix up a previous downstream patch to also create symlinks for the new Go library (which wasn't around when the patch was introduced, and wasn't yet needed when the library became available with a dependency update); a quick spot check is sketched after this list
  • disable eBPF JIT hardening for privileged users on NVIDIA variants; see the commit message for an explanation why, and #2504 (Investigate what is required to support cgroups v2 in libnvidia-container) for debugging notes
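
Both libraries and their symlinks can be spot-checked on a built image. A minimal sketch, assuming the libraries live under /usr/lib64 (the directory and the version numbers shown are illustrative, not taken from this build):

# List the installed libnvidia-container libraries and their symlinks.
ls -l /usr/lib64/libnvidia-container*.so*
# Expected shape of the output (versions are placeholders):
#   libnvidia-container-go.so.1 -> libnvidia-container-go.so.1.x.y
#   libnvidia-container.so.1    -> libnvidia-container.so.1.x.y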

Testing done:

  • Build an aws-k8s-1.24-nvidia AMI to be used in a Kubernetes cluster with a g3.16xlarge EC2 instance (x86_64, 4x Tesla M60 GPUs)
  • Create a pod with 4 GPUs assigned (manifest see below)
  • Run kubectl exec test -- nvidia-smi to verify the container is running, and verify all 4 GPUs show up and report data
  • Enable cgroup v2 on the host (needs a reboot): apiclient set -j '{"settings": {"boot": {"init": {"systemd.unified_cgroup_hierarchy": ["1"]}}}}' followed by apiclient reboot
  • Wait for reboot
  • Run kubectl exec test -- nvidia-smi to verify the container is running, and verify all 4 GPUs show up and report data
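
The manifest itself isn't reproduced here; a minimal sketch of a pod spec requesting all four GPUs, assuming the NVIDIA device plugin exposes the nvidia.com/gpu resource (the pod name matches the test commands above; the image and command are placeholder assumptions):

kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: test
spec:
  containers:
    - name: cuda
      image: nvidia/cuda:11.8.0-base-ubuntu22.04  # placeholder image
      command: ["sleep", "infinity"]              # keep the pod running for kubectl exec
      resources:
        limits:
          nvidia.com/gpu: 4                       # assign all four GPUs on the node
EOF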

Terms of contribution:

By submitting this pull request, I agree that this contribution is dual-licensed under the terms of both the Apache License, version 2.0, and the MIT license.

@markusboehme
Member Author

Will take a look at those aarch64 build failures. Converting to draft for now.

Uses the new libnvidia-container-go library. Since this is a Go library,
set up proper environment variables for Go cross compilation that were
not needed before.

Signed-off-by: Markus Boehme <[email protected]>
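
Cross compiling a cgo-based library like nvcgo needs the target architecture and a matching C cross compiler spelled out explicitly. A minimal sketch of the kind of environment involved, assuming an aarch64 target built on an x86_64 machine (the compiler name is an assumption; the real values come from the Bottlerocket SDK and the upstream makefile):

# Illustrative environment for cross compiling the cgo-based nvcgo shared library.
export GOOS=linux
export GOARCH=arm64
export CGO_ENABLED=1                          # nvcgo uses cgo
export CC=aarch64-bottlerocket-linux-gnu-gcc  # illustrative toolchain name
go build -buildmode=c-shared -o libnvidia-container-go.so .
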
On a host using cgroup v2, access to the GPU is governed via cgroup
devices eBPF programs, i.e. eBPF programs attached to a cgroup. To grant
access to a GPU, libnvidia-container adds an eBPF program that allows
processes inside a cgroup access to the desired GPU.

If a cgroup devices program is already attached to a cgroup or access to
multiple GPUs is to be allowed, libnvidia-container reads the existing
program from the kernel, modifies it, and loads the modified program.

Commit 0508365 ("release: Add several security-related sysctls")
followed the recommendations by the Kernel Self Protection Project and
set the `net.core.bpf_jit_harden` sysctl to a value of `2`, i.e. started
applying JIT hardening to all eBPF programs being loaded. Constant
blinding is one such hardening measure and modifies the eBPF byte code.
Programs that underwent such a hardening pass cannot be read back by
user space.

As a consequence of the sysctl setting, the read/modify/write cycle
performed by libnvidia-container fails on a Bottlerocket host.
Containers with a GPU cannot be launched if the host uses cgroup v2.

Fix this by only applying eBPF JIT hardening to programs loaded by
unprivileged users, i.e. setting the `net.core.bpf_jit_harden` sysctl to
a value of `1`. Other sysctls may be tweaked to achieve a similar
outcome (`kernel.kptr_restrict`, `kernel.perf_event_paranoid`) and allow
reading back blinded eBPF programs, but would otherwise lead to a weaker
security posture and hence are undesirable options.

This revised sysctl setting only applies to `nvidia` variants. All other
variants retain the current behavior of applying JIT hardening for all
users. Strictly speaking, it is only necessary to lower this sysctl's
value on hosts with cgroup v2 enabled. However, cgroup v2 will become
the default soon. It does not seem worthwhile to introduce additional
complexity to `corndog` to apply the setting selectively.

Note that by default, Bottlerocket does not permit unprivileged users to
load eBPF programs. Therefore, the change effectively disables JIT
hardening for `nvidia` variants. The value is lowered to `1` instead of
being disabled completely as a defense-in-depth mechanism and to guard against
potential future changes.

Signed-off-by: Markus Boehme <[email protected]>
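
Both the failure mode and the fix can be observed from a shell on an affected node. A rough sketch using bpftool and sysctl (the cgroup path and program ID are placeholders; the exact error reported for a blinded program depends on the kernel and bpftool version):

# Check the effective JIT hardening level; after this change it reads 1,
# i.e. hardening only applies to programs loaded by unprivileged users.
sysctl net.core.bpf_jit_harden

# List the device programs attached to a container's cgroup. The path is
# a placeholder for whatever cgroup the container runtime created.
bpftool cgroup show /sys/fs/cgroup/kubepods.slice/<pod-cgroup>

# Try to read a program back, similar to what nvcgo has to do. With
# bpf_jit_harden=2 the instructions of a blinded program cannot be
# retrieved from user space, which breaks the read/modify/write cycle
# described above.
bpftool prog dump xlated id <program-id>
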
@markusboehme
Member Author

Was missing the proper setup to make cross compilation of the newly added Go library work, hence the aarch64 builds failed before on the x86_64 builders. Fixed this and successfully redid my testing with an x86_64 AMI that was built on an aarch64 machine (naturally, cross compilation previously failed in either direction, though in different ways).

@markusboehme
Member Author

One build got unlucky twice in a row: First another build job stole the show, then a request to fetch dependencies failed. Third time was a charm.

@stmcginnis
Contributor

Looks good. Did some testing of aws-k8s-1.23-nvidia on g5g.2xlarge instances.

Built image with changes and published AMI. Created EKS cluster.

Created an NVIDIA workload test image. Ran with default booted nodes:

...
=========================================
  Running sample vectorAdd
=========================================

[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done

=========================================
  Running sample warpAggregatedAtomicsCG
=========================================

GPU Device 0: "Turing" with compute capability 7.5

CPU max matches GPU max

Warp Aggregated Atomics PASSED

$ k get pods
NAME            READY   STATUS      RESTARTS   AGE
nvsmoke-wj4cz   0/1     Completed   0          48s

Enabled cgroup v2 with apiclient set -j '{"settings": {"boot": {"init": {"systemd.unified_cgroup_hierarchy": ["1"]}}}}' and rebooted the node.

Created new job to rerun workload test:

...
=========================================
  Running sample vectorAdd
=========================================

[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done

=========================================
  Running sample warpAggregatedAtomicsCG
=========================================

GPU Device 0: "Turing" with compute capability 7.5

CPU max matches GPU max

Warp Aggregated Atomics PASSED

$ k get pods
NAME            READY   STATUS      RESTARTS   AGE
nvsmoke-whmjs   0/1     Completed   0          4m31s

@markusboehme
Member Author

Wonderful, thank you Sean! I need to play around with that workload test myself.

@arnaldo2792
Contributor

Very nice troubleshooting, and very insightful (I'm still learning eBPF!) 🎉

@markusboehme merged commit ed063a0 into bottlerocket-os:develop on Feb 17, 2023