Added Nvidia GPU support to the buildah-remote task#1529
|
I left you a couple of review comments. The buildah-remote tasks are generated by running /hack/generate-buildah-remote.sh in this repo, which keeps them consistent with the normal buildah task. You need to modify the main.go called by generate-buildah-remote.sh so that running it produces the same diff this PR has. Once you run it, the PR should have 3 changed files: main.go, buildah-remote/0.1/buildah-remote.yaml and buildah-remote/0.2/buildah-remote.yaml. After that, you also need to run /hack/generate-ta-tasks.sh, which will update 2 more files (the trusted artifacts versions of the 2 buildah-remote tasks). Summary: you will modify 1 file (main.go), run two generate commands and add all those changes here. |
|
Thanks, @brianwcook! PTAL |
|
/ok-to-test |
| chmod +x scripts/script-build.sh | ||
|
|
||
| PODMAN_NVIDIA_ARGS=() | ||
| if [[ "$PLATFORM" == "linux-g"* ]]; then |
I have tried in the past to not depend on the semantics of the PLATFORM parameter. @ifireball @mshaposhnik, what do you think?
The use of the PLATFORM parameter like this would fall in line with the functionality requested in https://issues.redhat.com/browse/KONFLUX-4073.
|
Could you describe what these changes do, how and why? The code change on its own doesn't give me much to work with |
|
|
||
| PODMAN_NVIDIA_ARGS=() | ||
| if [[ "$PLATFORM" == "linux-g"* ]]; then | ||
| PODMAN_NVIDIA_ARGS+=("--device nvidia.com/gpu=all" "--security-opt=label=disable") |
What are the implications of --security-opt=label=disable?
|
I would also consider dropping this PR in favor of #1530, although that one seems like it potentially gives the user too much control |
The goal here is to allow Konflux builds to access Nvidia GPUs on machines that have them. An example is running PyTorch during a container build: https://github.com/openshift/lightspeed-rag-content/blob/main/Containerfile. This PR is one building block towards supporting this scenario. The others are the AWS instance type(s) in the multi-platform controller (https://github.com/redhat-appstudio/infra-deployments/blob/0b936310854c7b4031b967eda33ad8399f12da60/components/multi-platform-controller/production/common/host-config.yaml#L528) and an AMI with Nvidia drivers. For platforms whose name starts with "linux-g", this PR tells podman to pass through Nvidia GPU devices to the containers it runs.
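To make the enablement logic concrete, here is a minimal sketch of the prefix check described above. It is not the code from the task: the function name `nvidia_podman_args` and the example platform string are hypothetical, and unlike the diff it stores `--device` and its value as separate array elements, which is an assumption about how the arguments would reach podman when the array is expanded directly.

```shell
#!/usr/bin/env bash
# Sketch only: nvidia_podman_args is a hypothetical helper, not part of the task.
nvidia_podman_args() {
  local platform="$1"
  local -a args=()
  # Platforms whose name starts with "linux-g" are treated as GPU-equipped.
  if [[ "$platform" == "linux-g"* ]]; then
    # Flags taken from the diff in this PR; --device and its value are split
    # into two elements here (an assumption, see the note above).
    args+=("--device" "nvidia.com/gpu=all" "--security-opt=label=disable")
  fi
  # Print one argument per line so the caller can read them into an array.
  if ((${#args[@]})); then
    printf '%s\n' "${args[@]}"
  fi
}

# Example platform value (hypothetical) that matches the prefix.
mapfile -t PODMAN_NVIDIA_ARGS < <(nvidia_podman_args "linux-g6xlarge/amd64")
echo "extra podman args: ${PODMAN_NVIDIA_ARGS[*]}"
```

For a platform such as `linux/arm64` the function prints nothing, so the array stays empty and the podman invocation is unchanged.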
Not too sure about it beyond the obvious, but this came from Nvidia docs https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/1.14.2/cdi-support.html Upd: attempted dropping |
|
/ok-to-test |
|
@brianwcook @chmeliik @arewm It looks to me like #1530 has morphed into something we can't use instead of this PR. Could you PTAL and see if we can revive this PR? |
|
I am +1 to merging this. We can revisit the enablement logic (VMs with instance types starting with 'g') at a later date if / when MPC changes how it labels platforms, probably with no impact to end users. In addition, the impact of disabling security labeling looks minimal since 1) we do not reuse build VMs, 2) we do not run concurrent builds on a VM, and 3) this only applies to remote builders, not OpenShift-based builds. |
|
/ok-to-test |
Added in konflux-ci#1529 due to tektoncd/pipeline#8388 as this is not yet deployed in the cluster. This reverts commit 51cb724.