interfaces/builtin/opengl: add support for cuda workloads on Tegra iGPU in opengl interface #14536
Changed wording to better reflect that 4 device nodes are missing from the opengl interface to run cuda workloads on a Tegra iGPU.
The changes look good. Please try to use `[0-9]*` for numeric things that may go all the way up to eleven (counting from zero) if someone has a particularly big server with many GPUs.
I'll request security review for `host1x-fence`.
For reference, the source code submission and discussion are at https://lore.kernel.org/all/[email protected]/
LGTM, but +1 on @zyga's comment on using `[0-9]*`.
When run on a Tegra iGPU, cuda workloads require 4 more device nodes than what is currently included in the opengl interface. Notably, three device nodes are related to the GPU. On an integrated GPU on tegra platforms, those correspond to `/dev/nvgpu/igpu0/power`, `/dev/nvgpu/igpu0/ctrl` and `/dev/nvgpu/igpu0/prof`. Additionally, we need read-write access to `/dev/host1x-fence`, which is a tegra specific dma barrier. In addition to granting write access in the apparmor profiles, we also need to tag these device nodes in the udev rules. Signed-off-by: Sebastian Weyer <[email protected]>
Instead of just granting apparmor permissions to igpu 0 to 9, use a wildcard to grant permissions going beyond 9. Signed-off-by: Sebastian Weyer <[email protected]>
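To illustrate the difference the wildcard makes, here is a minimal sketch. It uses Go's `path.Match` as a stand-in for AppArmor path globbing, which is an assumption: the two behave alike for these simple patterns (`*` does not cross `/`), but they are not the same matcher. The pattern strings and the `matchesDevice` helper are hypothetical and are not taken from the actual diff.

```go
package main

import (
	"fmt"
	"path"
)

// matchesDevice reports whether devPath matches pattern.
// Assumption: path.Match approximates AppArmor globbing for these patterns.
func matchesDevice(pattern, devPath string) bool {
	ok, err := path.Match(pattern, devPath)
	return err == nil && ok
}

func main() {
	// Hypothetical patterns: one digit only vs. one digit followed by anything.
	singleDigit := "/dev/nvgpu/igpu[0-9]/ctrl"
	wildcard := "/dev/nvgpu/igpu[0-9]*/ctrl"

	devices := []string{"/dev/nvgpu/igpu0/ctrl", "/dev/nvgpu/igpu11/ctrl"}
	for _, dev := range devices {
		fmt.Printf("%s single=%v wildcard=%v\n",
			dev, matchesDevice(singleDigit, dev), matchesDevice(wildcard, dev))
	}
}
```

Note that `[0-9]*` is looser than "one or more digits": it requires a leading digit and then accepts any characters up to the next `/`, so a name like `igpu1abc` would match as well.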
Thanks for the review. I added the change you mentioned.
LGTM - thanks
Thanks
@ernestl this one is worth considering for a cherry pick to 2.66
Signed-off-by: Maciej Borzecki <[email protected]>
@DocSepp I took the liberty of pushing a fix for a failing unit test.
Codecov Report
All modified and coverable lines are covered by tests ✅
Additional details and impacted files
@@ Coverage Diff @@
## master #14536 +/- ##
==========================================
+ Coverage 78.85% 78.87% +0.01%
==========================================
Files 1079 1083 +4
Lines 145615 146105 +490
==========================================
+ Hits 114828 115234 +406
- Misses 23601 23674 +73
- Partials 7186 7197 +11
Failures: ubuntu-arm, ubuntu-core-22, ubuntu-daily, ubuntu-focal-jammy, ubuntu-xenial-bionic
Add support for cuda workloads on Tegra iGPU in opengl interface
When run on a Tegra iGPU, cuda workloads require 4 more device nodes
than what is currently included in the opengl interface.
Notably, three device nodes are related to the GPU. On an integrated GPU on
Tegra platforms, those correspond to `/dev/nvgpu/igpu0/power`, `/dev/nvgpu/igpu0/ctrl` and `/dev/nvgpu/igpu0/prof`.
Additionally, we need read-write access to `/dev/host1x-fence`, which is a Tegra-specific DMA barrier.
In addition to granting write access in the apparmor profiles, we also
need to tag these device nodes in the udev rules.
This pull request is related to a previous one that created a new interface for this functionality: #14188
The device nodes listed are the ones needed to run cuda workloads without errors. On top of that, cuda workloads also try to access other iGPU device nodes such as `as`, `channel` and `sched`, but only through the `faccessat` system call, so I don't think they are necessary to run the workloads.
I tested the change to this interface using 5 different snaps that are listed here:
For testing I use the `opengl` branches when available. The cuda-runtime snap and tensorrt-libs snap only contain runtime libraries and nothing related to the opengl interface.
In order to test this on actual hardware we set up a Jenkins job that provisions our devices with an Ubuntu Core image, then fetches the modified snapd snap and the nvidia-tegra-drivers-36 snap from here: https://people.canonical.com/~sebwey/snaps/
We don't have distribution rights for the cuda and tensorrt snaps, so I'm using a personal access token to download the build artifacts from their respective GitHub Actions.
The snaps are then installed and connected with the respective interfaces.
We then finally run the riverside-core-gpu tests defined in this checkbox provider: https://git.launchpad.net/~riverside-team/riverside/+git/checkbox-provider-riverside/tree/units/riverside/gpu.pxu
You can see one of the test runs on this Jenkins job (VPN necessary): http://10.102.156.15:8080/job/partner-engineering/job/riverside/job/riverside-core-jetson-orin-nano-daily-dangerous/22/
I look forward to hearing your feedback on what's still missing. I'm happy to set up a meeting and provide access to hardware we have in the lab so you can have a look yourself.