Add support for cgroup v2 to NVIDIA variants #2802

Merged

Conversation

markusboehme
Member

Issue number:

Closes #2504

Description of changes:

Make the NVIDIA variants of Bottlerocket work on hosts with cgroup v2 enabled ("unified cgroup hierarchy"). For this

  • build libnvidia-container with its Go library that can handle both cgroup v1 and cgroup v2 (libnvidia-container-go, also known as nvcgo)
  • fix up a previous downstream patch to also create symlinks for the new Go library (which wasn't around when the patch was introduced, and wasn't yet needed when the library became available with a dependency update); a quick spot check is sketched after this list
  • disable eBPF JIT hardening for privileged users on NVIDIA variants; see the commit message for an explanation why, and #2504 (Investigate what is required to support cgroups v2 in libnvidia-container) for debugging notes
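
Both libraries and their symlinks can be spot-checked on a built image. A minimal sketch, assuming the libraries live under /usr/lib64 (the directory and the version numbers shown are illustrative, not taken from this build):

# List the installed libnvidia-container libraries and their symlinks.
ls -l /usr/lib64/libnvidia-container*.so*
# Expected shape of the output (versions are placeholders):
#   libnvidia-container-go.so.1 -> libnvidia-container-go.so.1.x.y
#   libnvidia-container.so.1    -> libnvidia-container.so.1.x.y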

Testing done:

  • Build an aws-k8s-1.24-nvidia AMI to be used in a Kubernetes cluster with a g3.16xlarge EC2 instance (x86_64, 4x Tesla M60 GPUs)
  • Create a pod with 4 GPUs assigned (manifest see below)
  • Run kubectl exec test -- nvidia-smi to verify the container is running, and verify all 4 GPUs show up and report data
  • Enable cgroup v2 on the host (needs a reboot): apiclient set -j '{"settings": {"boot": {"init": {"systemd.unified_cgroup_hierarchy": ["1"]}}}}' followed by apiclient reboot
  • Wait for reboot
  • Run kubectl exec test -- nvidia-smi to verify the container is running, and verify all 4 GPUs show up and report data
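
The manifest itself isn't reproduced here; a minimal sketch of a pod spec requesting all four GPUs, assuming the NVIDIA device plugin exposes the nvidia.com/gpu resource (the pod name matches the test commands above; the image and command are placeholder assumptions):

kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: test
spec:
  containers:
    - name: cuda
      image: nvidia/cuda:11.8.0-base-ubuntu22.04  # placeholder image
      command: ["sleep", "infinity"]              # keep the pod running for kubectl exec
      resources:
        limits:
          nvidia.com/gpu: 4                       # assign all four GPUs on the node
EOF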

Terms of contribution:

By submitting this pull request, I agree that this contribution is dual-licensed under the terms of both the Apache License, version 2.0, and the MIT license.

@markusboehme
Member Author

Will take a look at those aarch64 build failures. Converting to draft for now.

Uses the new libnvidia-container-go library. Since this is a Go library,
set up proper environment variables for Go cross compilation that were
not needed before.

Signed-off-by: Markus Boehme <[email protected]>
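
Cross compiling a cgo-based library like nvcgo needs the target architecture and a matching C cross compiler spelled out explicitly. A minimal sketch of the kind of environment involved, assuming an aarch64 target built on an x86_64 machine (the compiler name is an assumption; the real values come from the Bottlerocket SDK and the upstream makefile):

# Illustrative environment for cross compiling the cgo-based nvcgo shared library.
export GOOS=linux
export GOARCH=arm64
export CGO_ENABLED=1                          # nvcgo uses cgo
export CC=aarch64-bottlerocket-linux-gnu-gcc  # illustrative toolchain name
go build -buildmode=c-shared -o libnvidia-container-go.so .
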
On a host using cgroup v2, access to the GPU is governed via cgroup
devices eBPF programs, i.e. eBPF programs attached to a cgroup. To grant
access to a GPU, libnvidia-container adds an eBPF program that allows
processes inside a cgroup access to the desired GPU.

If a cgroup devices program is already attached to a cgroup or access to
multiple GPUs is to be allowed, libnvidia-container reads the existing
program from the kernel, modifies it, and loads the modified program.

Commit 0508365 ("release: Add several security-related sysctls")
followed the recommendations by the Kernel Self Protection Project and
set the `net.core.bpf_jit_harden` sysctl to a value of `2`, i.e. started
applying JIT hardening to all eBPF programs being loaded. Constant
blinding is one such hardening measure and modifies the eBPF byte code.
Programs that underwent such a hardening pass cannot be read back by
user space.

As a consequence of the sysctl setting, the read/modify/write cycle
performed by libnvidia-container fails on a Bottlerocket host.
Containers with a GPU cannot be launched if the host uses cgroup v2.

Fix this by only applying eBPF JIT hardening to programs loaded by
unprivileged users, i.e. setting the `net.core.bpf_jit_harden` sysctl to
a value of `1`. Other sysctls may be tweaked to achieve a similar
outcome (`kernel.kptr_restrict`, `kernel.perf_event_paranoid`) and allow
reading back blinded eBPF programs, but would otherwise lead to a weaker
security posture and hence are undesirable options.

This revised sysctl setting only applies to `nvidia` variants. All other
variants retain the current behavior of applying JIT hardening for all
users. Strictly speaking, it is only necessary to lower this sysctl's
value on hosts with cgroup v2 enabled. However, cgroup v2 will become
the default soon. It does not seem worthwhile to introduce additional
complexity to `corndog` to apply the setting selectively.

Note that by default, Bottlerocket does not permit unprivileged users to
load eBPF programs. Therefore, the change effectively disables JIT
hardening for `nvidia` variants. The value is lowered to `1` instead of
being disabled completely as a defense-in-depth mechanism and to guard against
potential future changes.

Signed-off-by: Markus Boehme <[email protected]>
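
Both the failure mode and the fix can be observed from a shell on an affected node. A rough sketch using bpftool and sysctl (the cgroup path and program ID are placeholders; the exact error reported for a blinded program depends on the kernel and bpftool version):

# Check the effective JIT hardening level; after this change it reads 1,
# i.e. hardening only applies to programs loaded by unprivileged users.
sysctl net.core.bpf_jit_harden

# List the device programs attached to a container's cgroup. The path is
# a placeholder for whatever cgroup the container runtime created.
bpftool cgroup show /sys/fs/cgroup/kubepods.slice/<pod-cgroup>

# Try to read a program back, similar to what nvcgo has to do. With
# bpf_jit_harden=2 the instructions of a blinded program cannot be
# retrieved from user space, which breaks the read/modify/write cycle
# described above.
bpftool prog dump xlated id <program-id>
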
@markusboehme
Member Author

Was missing the proper setup to make cross compilation of the newly added Go library work, hence the aarch64 builds failed before on the x86_64 builders. Fixed this and successfully redid my testing with an x86_64 AMI that was built on an aarch64 machine (naturally, cross compilation previously failed in either direction, though in different ways).

@markusboehme
Member Author

One build got unlucky twice in a row: First another build job stole the show, then a request to fetch dependencies failed. Third time was a charm.

@stmcginnis
Contributor

Looks good. Did some testing of aws-k8s-1.23-nvidia on g5g.2xlarge instances.

Built image with changes and published AMI. Created EKS cluster.

Created an NVIDIA workload test image. Ran with default booted nodes:

...
=========================================
  Running sample vectorAdd
=========================================

[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done

=========================================
  Running sample warpAggregatedAtomicsCG
=========================================

GPU Device 0: "Turing" with compute capability 7.5

CPU max matches GPU max

Warp Aggregated Atomics PASSED

$ k get pods
NAME            READY   STATUS      RESTARTS   AGE
nvsmoke-wj4cz   0/1     Completed   0          48s

Enabled cgroup v2 with apiclient set -j '{"settings": {"boot": {"init": {"systemd.unified_cgroup_hierarchy": ["1"]}}}}' and rebooted the node.

Created new job to rerun workload test:

...
=========================================
  Running sample vectorAdd
=========================================

[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done

=========================================
  Running sample warpAggregatedAtomicsCG
=========================================

GPU Device 0: "Turing" with compute capability 7.5

CPU max matches GPU max

Warp Aggregated Atomics PASSED

$ k get pods
NAME            READY   STATUS      RESTARTS   AGE
nvsmoke-whmjs   0/1     Completed   0          4m31s

@markusboehme
Member Author

Wonderful, thank you Sean! I need to play around with that workload test myself.

@arnaldo2792
Contributor

Very nice troubleshooting, and very insightful (I'm still learning eBPF!) 🎉

@markusboehme merged commit ed063a0 into bottlerocket-os:develop on Feb 17, 2023