
interfaces/builtin/opengl: add support for cuda workloads on Tegra iGPU in opengl interface #14536

Merged

3 commits merged into canonical:master on Oct 3, 2024

Conversation

DocSepp
Contributor

@DocSepp DocSepp commented Sep 25, 2024

Add support for CUDA workloads on Tegra iGPU in opengl interface

When run on a Tegra iGPU, CUDA workloads require four more device nodes
than are currently included in the opengl interface.

Three of these device nodes are related to the GPU itself. On an integrated GPU on
Tegra platforms, they correspond to `/dev/nvgpu/igpu0/power`,
`/dev/nvgpu/igpu0/ctrl` and `/dev/nvgpu/igpu0/prof`.

Additionally, we need read-write access to `/dev/host1x-fence`, which is a
Tegra-specific DMA barrier.

In addition to granting write access in the AppArmor profiles, we also
need to tag these device nodes in the udev rules.
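
As context for reviewers, here is a minimal sketch of how those additions could be expressed in `interfaces/builtin/opengl.go`, assuming the usual snapd pattern of an embedded AppArmor snippet plus udev tagging; the constant and variable names and the udev match strings below are illustrative assumptions, not the merged diff:

```go
package builtin

// Additional AppArmor rules for CUDA workloads on a Tegra iGPU (sketch).
const openglTegraIGPUAppArmor = `
# Tegra iGPU power, control and profiling nodes
/dev/nvgpu/igpu0/power rw,
/dev/nvgpu/igpu0/ctrl rw,
/dev/nvgpu/igpu0/prof rw,
# Tegra-specific DMA barrier (read-write)
/dev/host1x-fence rw,
`

// Matching udev match strings so the same nodes get tagged for the snap's
// device cgroup; the KERNEL/SUBSYSTEM keys here are assumptions.
var openglTegraIGPUUdevRules = []string{
	`SUBSYSTEM=="nvgpu"`,
	`KERNEL=="host1x-fence"`,
}
```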

This pull request is related to a previous one that created a new interface for this functionality: #14188

The device nodes listed are the ones needed to run CUDA workloads without errors. On top of that, CUDA workloads also try to access other iGPU device nodes such as `as`, `channel` and `sched`, but only via the `faccessat` system call, so I don't think they are necessary to run the workloads.
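
For illustration only, this is roughly what such an access probe looks like; the probed paths are the ones named above, and a denial here is harmless for the workload (a sketch, not code from the CUDA runtime):

```go
package main

import (
	"fmt"

	"golang.org/x/sys/unix"
)

func main() {
	// The runtime merely probes these nodes for accessibility with faccessat;
	// the workload still runs if the probe is denied.
	nodes := []string{
		"/dev/nvgpu/igpu0/as",
		"/dev/nvgpu/igpu0/channel",
		"/dev/nvgpu/igpu0/sched",
	}
	for _, p := range nodes {
		err := unix.Faccessat(unix.AT_FDCWD, p, unix.R_OK|unix.W_OK, 0)
		fmt.Printf("%s accessible: %v\n", p, err == nil)
	}
}
```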

I tested the change to this interface using 5 different snaps that are listed here:

For testing I used the opengl branches when available. The cuda-runtime and tensorrt-libs snaps only contain runtime libraries and nothing related to the opengl interface.

In order to test this on actual hardware we set up a Jenkins job that provisions our devices with an Ubuntu Core image, then fetches the modified snapd snap and the nvidia-tegra-drivers-36 snap from here: https://people.canonical.com/~sebwey/snaps/

For the cuda and tensorrt snaps we don't have distribution rights, so I'm using a personal access token to download the build artifacts from their respective GitHub Actions workflows.

The snaps are then installed and connected with the respective interfaces.
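
A rough sketch of that install-and-connect step; the snap name and plug used here are placeholders standing in for the snaps listed above, not the exact commands from the Jenkins job:

```go
package main

import (
	"log"
	"os"
	"os/exec"
)

// run executes a command and aborts the provisioning step on failure.
func run(args ...string) {
	cmd := exec.Command(args[0], args[1:]...)
	cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
	if err := cmd.Run(); err != nil {
		log.Fatalf("%v failed: %v", args, err)
	}
}

func main() {
	// Locally built snaps are unsigned, hence --dangerous.
	run("snap", "install", "--dangerous", "cuda-workload-test.snap")
	// Connect the snap's plug to the system opengl slot so the device nodes
	// and AppArmor rules from this PR become available to the snap.
	run("snap", "connect", "cuda-workload-test:opengl")
}
```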

We then finally run the riverside-core-gpu tests defined in this checkbox provider: https://git.launchpad.net/~riverside-team/riverside/+git/checkbox-provider-riverside/tree/units/riverside/gpu.pxu

You can see one of the test runs on this Jenkins job (VPN necessary): http://10.102.156.15:8080/job/partner-engineering/job/riverside/job/riverside-core-jetson-orin-nano-daily-dangerous/22/

I look forward to hearing your feedback on what's still missing. I'm happy to set up a meeting and provide access to the hardware we have in the lab so you can take a look yourselves.

@DocSepp
Contributor Author

DocSepp commented Sep 26, 2024

Changed the wording to better reflect that four device nodes are missing from the opengl interface for running CUDA workloads on a Tegra iGPU.

@zyga zyga self-requested a review October 2, 2024 14:40
Contributor

@zyga zyga left a comment


The changes look good. Please try to use `[0-9]*` for numeric things that may go all the way up to eleven (counting from zero) if someone has a particularly big server with many GPUs.

I'll request security review for host1x-fence.

For reference, the source code submission and discussion are at https://lore.kernel.org/all/[email protected]/
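
For clarity, the suggestion amounts to something like the following in the AppArmor snippet (a sketch; the constant name is illustrative and the rule text is inferred from the review comments, not copied from the diff):

```go
package builtin

// Same rules as above, but with an AppArmor glob instead of a fixed igpu0
// index, so machines with more than ten iGPUs are covered as well.
const openglTegraIGPUAppArmorGlobbed = `
/dev/nvgpu/igpu[0-9]*/power rw,
/dev/nvgpu/igpu[0-9]*/ctrl rw,
/dev/nvgpu/igpu[0-9]*/prof rw,
/dev/host1x-fence rw,
`
```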

@zyga zyga added the "Needs security review" label (Can only be merged once security gave a :+1:) Oct 2, 2024
Member

@alfonsosanchezbeato alfonsosanchezbeato left a comment


LGTM, but +1 on @zyga's comment on using `[0-9]*`.

When run on a Tegra iGPU, CUDA workloads require four more device nodes
than are currently included in the opengl interface.

Three of these device nodes are related to the GPU itself. On an integrated GPU on
Tegra platforms, they correspond to `/dev/nvgpu/igpu0/power`,
`/dev/nvgpu/igpu0/ctrl` and `/dev/nvgpu/igpu0/prof`.

Additionally, we need read-write access to `/dev/host1x-fence`, which is a
Tegra-specific DMA barrier.

In addition to granting write access in the AppArmor profiles, we also
need to tag these device nodes in the udev rules.

Signed-off-by: Sebastian Weyer <[email protected]>

Instead of just granting AppArmor permissions to igpu 0 through 9, use a
wildcard so the permissions also cover indices beyond 9.

Signed-off-by: Sebastian Weyer <[email protected]>
@DocSepp
Contributor Author

DocSepp commented Oct 3, 2024

Thanks for the review. I added the change you mentioned.

Contributor

@alexmurray alexmurray left a comment


LGTM - thanks

@bboozzoo bboozzoo changed the title from "Add support for cuda workloads on Tegra iGPU in opengl interface" to "interfaces/builtin/opengl: add support for cuda workloads on Tegra iGPU in opengl interface" Oct 3, 2024
Contributor

@bboozzoo bboozzoo left a comment


Thanks

@bboozzoo bboozzoo added this to the 2.66 milestone Oct 3, 2024
@bboozzoo
Contributor

bboozzoo commented Oct 3, 2024

@ernestl this one is worth considering for a cherry-pick to 2.66

@ernestl ernestl closed this Oct 3, 2024
@ernestl ernestl reopened this Oct 3, 2024
@ernestl ernestl mentioned this pull request Oct 3, 2024
@bboozzoo
Contributor

bboozzoo commented Oct 3, 2024

@DocSepp I took the liberty of pushing a fix for a failing unit test.


codecov bot commented Oct 3, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 78.87%. Comparing base (ac897ee) to head (ac1521e).
Report is 52 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master   #14536      +/-   ##
==========================================
+ Coverage   78.85%   78.87%   +0.01%     
==========================================
  Files        1079     1083       +4     
  Lines      145615   146105     +490     
==========================================
+ Hits       114828   115234     +406     
- Misses      23601    23674      +73     
- Partials     7186     7197      +11     
Flag        Coverage Δ
unittests   78.87% <ø> (+0.01%) ⬆️


@ernestl
Collaborator

ernestl commented Oct 3, 2024

Failures:

ubuntu-arm |

ubuntu-core-22 |

  • google-core:ubuntu-core-22-64:tests/core/gadget-update-pc | unrelated to changes

ubuntu-daily |

  • google:ubuntu-24.10-64:tests/main/snapd-state | unrelated to changes

ubuntu-focal-jammy |

  • google:ubuntu-20.04-64:tests/main/lxd:snapd_cgroup_both (failed task: prepare) | unrelated to changes
  • google:ubuntu-22.04-64:tests/main/install-sideload-epochs:reexec0 | unrelated to changes

ubuntu-xenial-bionic |

  • google:ubuntu-16.04-64:tests/main/auto-refresh-gating

@ernestl ernestl merged commit 9a82ad9 into canonical:master Oct 3, 2024
51 of 58 checks passed
ernestl pushed a commit to ernestl/snapd that referenced this pull request Oct 3, 2024
ernestl pushed a commit that referenced this pull request Oct 4, 2024
Labels: cherry-picked, Needs security review (Can only be merged once security gave a :+1:)