
interfaces/builtin/opengl: add support for cuda workloads on Tegra iGPU in opengl interface #14536

Merged

3 commits merged into canonical:master on Oct 3, 2024

Conversation

DocSepp
Contributor

@DocSepp DocSepp commented Sep 25, 2024

Add support for CUDA workloads on Tegra iGPU in opengl interface

When run on a Tegra iGPU, CUDA workloads require four more device nodes
than are currently included in the opengl interface.

Three of these device nodes are related to the GPU itself. On an integrated GPU on
Tegra platforms, they correspond to `/dev/nvgpu/igpu0/power`,
`/dev/nvgpu/igpu0/ctrl` and `/dev/nvgpu/igpu0/prof`.

Additionally, we need read-write access to `/dev/host1x-fence`, which is a
Tegra-specific DMA barrier.

In addition to granting write access in the AppArmor profiles, we also
need to tag these device nodes in the udev rules.
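
As context for reviewers, here is a minimal sketch of how those additions could be expressed in `interfaces/builtin/opengl.go`, assuming the usual snapd pattern of an embedded AppArmor snippet plus udev tagging; the constant and variable names and the udev match strings below are illustrative assumptions, not the merged diff:

```go
package builtin

// Additional AppArmor rules for CUDA workloads on a Tegra iGPU (sketch).
const openglTegraIGPUAppArmor = `
# Tegra iGPU power, control and profiling nodes
/dev/nvgpu/igpu0/power rw,
/dev/nvgpu/igpu0/ctrl rw,
/dev/nvgpu/igpu0/prof rw,
# Tegra-specific DMA barrier (read-write)
/dev/host1x-fence rw,
`

// Matching udev match strings so the same nodes get tagged for the snap's
// device cgroup; the KERNEL/SUBSYSTEM keys here are assumptions.
var openglTegraIGPUUdevRules = []string{
	`SUBSYSTEM=="nvgpu"`,
	`KERNEL=="host1x-fence"`,
}
```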

This pull request is related to a previous one that created a new interface for this functionality: #14188

The device nodes listed are the ones needed to run CUDA workloads without errors. On top of that, CUDA workloads also try to access other iGPU device nodes such as `as`, `channel` and `sched`, but only via the `faccessat` system call, so I don't think they are necessary to run the workloads.
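
For illustration only, this is roughly what such an access probe looks like; the probed paths are the ones named above, and a denial here is harmless for the workload (a sketch, not code from the CUDA runtime):

```go
package main

import (
	"fmt"

	"golang.org/x/sys/unix"
)

func main() {
	// The runtime merely probes these nodes for accessibility with faccessat;
	// the workload still runs if the probe is denied.
	nodes := []string{
		"/dev/nvgpu/igpu0/as",
		"/dev/nvgpu/igpu0/channel",
		"/dev/nvgpu/igpu0/sched",
	}
	for _, p := range nodes {
		err := unix.Faccessat(unix.AT_FDCWD, p, unix.R_OK|unix.W_OK, 0)
		fmt.Printf("%s accessible: %v\n", p, err == nil)
	}
}
```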

I tested the change to this interface using 5 different snaps that are listed here:

For testing I used the opengl branches when available. The cuda-runtime and tensorrt-libs snaps only contain runtime libraries and nothing related to the opengl interface.

In order to test this on actual hardware we set up a Jenkins job that provisions our devices with an Ubuntu Core image, then fetches the modified snapd snap and the nvidia-tegra-drivers-36 snap from here: https://people.canonical.com/~sebwey/snaps/

For the cuda and tensorrt snaps we don't have distribution rights, so I'm using a personal access token to download the build artifacts from their respective GitHub Actions workflows.

The snaps are then installed and connected with the respective interfaces.
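
A rough sketch of that install-and-connect step; the snap name and plug used here are placeholders standing in for the snaps listed above, not the exact commands from the Jenkins job:

```go
package main

import (
	"log"
	"os"
	"os/exec"
)

// run executes a command and aborts the provisioning step on failure.
func run(args ...string) {
	cmd := exec.Command(args[0], args[1:]...)
	cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
	if err := cmd.Run(); err != nil {
		log.Fatalf("%v failed: %v", args, err)
	}
}

func main() {
	// Locally built snaps are unsigned, hence --dangerous.
	run("snap", "install", "--dangerous", "cuda-workload-test.snap")
	// Connect the snap's plug to the system opengl slot so the device nodes
	// and AppArmor rules from this PR become available to the snap.
	run("snap", "connect", "cuda-workload-test:opengl")
}
```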

We then finally run the riverside-core-gpu tests defined in this checkbox provider: https://git.launchpad.net/~riverside-team/riverside/+git/checkbox-provider-riverside/tree/units/riverside/gpu.pxu

You can see one of the test runs on this Jenkins job (VPN necessary): http://10.102.156.15:8080/job/partner-engineering/job/riverside/job/riverside-core-jetson-orin-nano-daily-dangerous/22/

I look forward to hearing your feedback on what's still missing. I'm happy to set up a meeting and provide access to the hardware we have in the lab so you can take a look yourselves.

@DocSepp
Contributor Author

DocSepp commented Sep 26, 2024

Changed the wording to better reflect that four device nodes are missing from the opengl interface for running CUDA workloads on a Tegra iGPU.

@zyga zyga self-requested a review October 2, 2024 14:40
Contributor

@zyga zyga left a comment


The changes look good. Please try to use `[0-9]*` for numeric things that may go all the way up to eleven (counting from zero) if someone has a particularly big server with many GPUs.

I'll request security review for host1x-fence.

For reference, the source code submission and discussion are at https://lore.kernel.org/all/[email protected]/
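
For clarity, the suggestion amounts to something like the following in the AppArmor snippet (a sketch; the constant name is illustrative and the rule text is inferred from the review comments, not copied from the diff):

```go
package builtin

// Same rules as above, but with an AppArmor glob instead of a fixed igpu0
// index, so machines with more than ten iGPUs are covered as well.
const openglTegraIGPUAppArmorGlobbed = `
/dev/nvgpu/igpu[0-9]*/power rw,
/dev/nvgpu/igpu[0-9]*/ctrl rw,
/dev/nvgpu/igpu[0-9]*/prof rw,
/dev/host1x-fence rw,
`
```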

@zyga zyga added the "Needs security review" label (Can only be merged once security gave a :+1:) Oct 2, 2024
Member

@alfonsosanchezbeato alfonsosanchezbeato left a comment


LGTM, but +1 on @zyga's comment on using `[0-9]*`.

When run on a Tegra iGPU, CUDA workloads require four more device nodes
than are currently included in the opengl interface.

Three of these device nodes are related to the GPU itself. On an integrated GPU on
Tegra platforms, they correspond to `/dev/nvgpu/igpu0/power`,
`/dev/nvgpu/igpu0/ctrl` and `/dev/nvgpu/igpu0/prof`.

Additionally, we need read-write access to `/dev/host1x-fence`, which is a
Tegra-specific DMA barrier.

In addition to granting write access in the AppArmor profiles, we also
need to tag these device nodes in the udev rules.

Signed-off-by: Sebastian Weyer <[email protected]>

Instead of just granting AppArmor permissions to igpu 0 through 9, use a
wildcard so the permissions also cover indices beyond 9.

Signed-off-by: Sebastian Weyer <[email protected]>
@DocSepp
Contributor Author

DocSepp commented Oct 3, 2024

Thanks for the review. I added the change you mentioned.

Contributor

@alexmurray alexmurray left a comment


LGTM - thanks

@bboozzoo bboozzoo changed the title from "Add support for cuda workloads on Tegra iGPU in opengl interface" to "interfaces/builtin/opengl: add support for cuda workloads on Tegra iGPU in opengl interface" Oct 3, 2024
Contributor

@bboozzoo bboozzoo left a comment


Thanks

@bboozzoo bboozzoo added this to the 2.66 milestone Oct 3, 2024
@bboozzoo
Contributor

bboozzoo commented Oct 3, 2024

@ernestl this one is worth considering for a cherry-pick to 2.66

@ernestl ernestl closed this Oct 3, 2024
@ernestl ernestl reopened this Oct 3, 2024
@ernestl ernestl mentioned this pull request Oct 3, 2024
@bboozzoo
Contributor

bboozzoo commented Oct 3, 2024

@DocSepp I took the liberty of pushing a fix for a failing unit test.


codecov bot commented Oct 3, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 78.87%. Comparing base (ac897ee) to head (ac1521e).
Report is 52 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master   #14536      +/-   ##
==========================================
+ Coverage   78.85%   78.87%   +0.01%     
==========================================
  Files        1079     1083       +4     
  Lines      145615   146105     +490     
==========================================
+ Hits       114828   115234     +406     
- Misses      23601    23674      +73     
- Partials     7186     7197      +11     
Flag        Coverage Δ
unittests   78.87% <ø> (+0.01%) ⬆️


@ernestl
Collaborator

ernestl commented Oct 3, 2024

Failures:

ubuntu-arm |

ubuntu-core-22 |

  • google-core:ubuntu-core-22-64:tests/core/gadget-update-pc | unrelated to changes

ubuntu-daily |

  • google:ubuntu-24.10-64:tests/main/snapd-state | unrelated to changes

ubuntu-focal-jammy |

  • google:ubuntu-20.04-64:tests/main/lxd:snapd_cgroup_both (failed task: prepare) | unrelated to changes
  • google:ubuntu-22.04-64:tests/main/install-sideload-epochs:reexec0 | unrelated to changes

ubuntu-xenial-bionic |

  • google:ubuntu-16.04-64:tests/main/auto-refresh-gating

@ernestl ernestl merged commit 9a82ad9 into canonical:master Oct 3, 2024
51 of 58 checks passed
ernestl pushed a commit to ernestl/snapd that referenced this pull request Oct 3, 2024
ernestl pushed a commit that referenced this pull request Oct 4, 2024
Labels: cherry-picked, Needs security review (Can only be merged once security gave a :+1:)