Skip to content

Added Nvidia GPU support to the buildah-remote task#1529

Merged
chmeliik merged 1 commit into
konflux-ci:mainfrom
syedriko:syedriko-gpu
Nov 21, 2024
Merged

Added Nvidia GPU support to the buildah-remote task#1529
chmeliik merged 1 commit into
konflux-ci:mainfrom
syedriko:syedriko-gpu

Conversation

@syedriko

@syedriko syedriko commented Oct 23, 2024

Copy link
Copy Markdown
Contributor

Added Nvidia GPU support to the buildah-remote task

@brianwcook

brianwcook commented Oct 24, 2024

Copy link
Copy Markdown
Member

I left you a couple reviews. The buildah-remote tasks are generated by running /hack/generate-buildah-remote.sh in this repo. This ensures they stay consistent with the normal buildah task. You need to modify the main.go called by generate-buildah-remote.sh so that when you run it, it produces the same diff this PR has. Once you run it, the PR should have 3 changed files: the generate script, buildah-remote/0.1/buildah-remote.yaml and buildah-remote/0.2/buildah-remote.yaml.

After that, you also need to run the /hack/generate-ta-tasks.sh. which will update 2 more files (trusted artifacts versions of the 2 buildah remote tasks).

Summary: you will modify 1 file (main.go), run two generate commands and add all those changes here.

@syedriko syedriko changed the title Added Nvidia GPU support to the buildah-remote 2.0 task Added Nvidia GPU support to the buildah-remote task Oct 24, 2024
@syedriko

Copy link
Copy Markdown
Contributor Author

Thanks, @brianwcook! PTAL

@brianwcook

Copy link
Copy Markdown
Member

/ok-to-test

chmod +x scripts/script-build.sh

PODMAN_NVIDIA_ARGS=()
if [[ "$PLATFORM" == "linux-g"* ]]; then

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have tried in the past to not depend on the semantics of the PLATFORM parameter. @ifireball @mshaposhnik, what do you think?

The use of the PLATFORM parameter like this would fall in line with the functionality requested in https://issues.redhat.com/browse/KONFLUX-4073.

@chmeliik

Copy link
Copy Markdown
Contributor

Could you describe what these changes do, how and why? The code change on its own doesn't give me much to work with


PODMAN_NVIDIA_ARGS=()
if [[ "$PLATFORM" == "linux-g"* ]]; then
PODMAN_NVIDIA_ARGS+=("--device nvidia.com/gpu=all" "--security-opt=label=disable")

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What are the implications of --security-opt=label=disable?

@chmeliik

Copy link
Copy Markdown
Contributor

I would also consider dropping this PR in favor of #1530, although that one seems like it potentially gives the user too much control

@syedriko

syedriko commented Oct 25, 2024

Copy link
Copy Markdown
Contributor Author

Could you describe what these changes do, how and why? The code change on its own doesn't give me much to work with

The goal here is to allow Konflux builds to access Nvidia GPUs on machines so equipped. An example is running PyTorch during container build - https://github.com/openshift/lightspeed-rag-content/blob/main/Containerfile.

This PR is a building block towards support of this scenario. The others are AWS instance type(s) in multi-platform controller https://github.com/redhat-appstudio/infra-deployments/blob/0b936310854c7b4031b967eda33ad8399f12da60/components/multi-platform-controller/production/common/host-config.yaml#L528 and an AMI with Nvidia drivers.

This PR, for platforms that start with "linux-g", tells podman to pass though Nvidia GPU devices to the containers it runs.

What are the implications of --security-opt=label=disable?

Not too sure about it beyond the obvious, but this came from Nvidia docs https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/1.14.2/cdi-support.html

Upd: attempted dropping --security-opt=label=disable, build container couldn't access the GPU device.

@brianwcook

Copy link
Copy Markdown
Member

/ok-to-test

@syedriko

Copy link
Copy Markdown
Contributor Author

@brianwcook @chmeliik @arewm It looks to me like #1530 has morphed into something we can't use instead of this PR. Could you PTAL and see if we can revive this PR.

@brianwcook

Copy link
Copy Markdown
Member

I am +1 to merging this. We can revisit the enablement logic (vm's with instance types starting with 'g') at a later date if / when MPC changes how to label platforms, probably with no impact to end users. In addition the impact from disabling security labeling looks minimal since we 1) do not reuse build VMs 2) we do not run concurrent builds on a VM and 3) this only applies to remote builders, not OpenShift based builds.

Comment thread task/buildah-remote/0.2/buildah-remote.yaml Outdated
@syedriko syedriko force-pushed the syedriko-gpu branch 2 times, most recently from 95fa30d to 72ee87f Compare November 20, 2024 19:00
@chmeliik

Copy link
Copy Markdown
Contributor

/ok-to-test

@chmeliik chmeliik left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 👍

@chmeliik chmeliik enabled auto-merge November 21, 2024 08:59
@chmeliik chmeliik added this pull request to the merge queue Nov 21, 2024
Merged via the queue into konflux-ci:main with commit 51cb724 Nov 21, 2024
arewm added a commit to arewm/build-definitions that referenced this pull request Nov 21, 2024
Added in konflux-ci#1529 due to tektoncd/pipeline#8388
as this is not yet deployed in the cluster.

This reverts commit 51cb724.
github-merge-queue Bot pushed a commit that referenced this pull request Nov 22, 2024
Added in #1529 due to tektoncd/pipeline#8388
as this is not yet deployed in the cluster.

This reverts commit 51cb724.
manish-jangra pushed a commit to manish-jangra/build-definitions that referenced this pull request Aug 14, 2025
Added in konflux-ci#1529 due to tektoncd/pipeline#8388
as this is not yet deployed in the cluster.

This reverts commit 51cb724.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants