Add Module Setup and Teardown Callback to Linkable Code Interface #145
isVoid merged 41 commits into NVIDIA:main from
Conversation
Co-authored-by: Graham Markall <535640+gmarkall@users.noreply.github.com>
I just merged main into this PR so it can (potentially) succeed on CI.
module_obj = ctx.modules.get(key, None)

if module_obj is not None:
    weakref.finalize(
I spent a while trying to refresh my memory of what we do with finalizers in Numba - numba/numba#10001 may be worth a quick look in conjunction with thinking about the behaviour here.
It even took me 30 minutes to understand the PR you linked. I think this is super subtle, especially the point that "the first finalizer registered also calls atexit.register to place a function that invokes all finalizers." Is this documented anywhere, or is it just a CPython implementation detail?
It's up to the user to decide whether to inspect the shutting_down variable. If they decide to check that, it's probably a good idea to put a reminder about this subtlety. For the most part, if the finalizer is to operate on just the modules, we are probably fine since they are still alive when the finalizer is invoked.
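To make the subtlety concrete, here is a minimal sketch of the pattern as I understand it (plain Python, no CUDA; shutting_down, Module, and the handle string are illustrative stand-ins, not the actual numba-cuda names):

```python
import atexit
import weakref

# Illustrative flag mirroring the pattern discussed above: an atexit handler
# flips it so finalizers can tell whether the interpreter is exiting.
shutting_down = False


def _at_shutdown():
    global shutting_down
    shutting_down = True


atexit.register(_at_shutdown)


class Module:
    """Stand-in for a loaded cumodule."""


def teardown(handle):
    # Skip driver work during interpreter shutdown; the driver state may
    # already be gone by the time this runs.
    if shutting_down:
        return
    print(f"tearing down {handle!r}")


mod = Module()
# The first weakref.finalize call also registers an atexit hook that invokes
# any finalizers still alive at interpreter exit (the CPython detail quoted above).
weakref.finalize(mod, teardown, "handle-0")

del mod  # the finalizer fires here, before shutdown, so teardown runs normally
```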
Thanks for the comments - I need to have another go at explaining it, and your questions are good questions whose answers I should incorporate into an updated version of the PR.
debug=self.debug, lineinfo=self.lineinfo,
call_helper=self.call_helper, extensions=self.extensions)

@module_init_lock
Is there any risk that, by not acquiring this lock at destruction / unloading, we end up in a bad situation?
I'm crafting a concurrent test so that we can get a more in-depth view of the behavior of setup and teardown with threads. I'll report back.
When I delved into multi-threaded kernel launches, this is what I discovered:

- The compilation is run once, by the first thread that acquires the compilation lock; subsequent threads load the same compiled binary from the cache.
- Each thread creates its own cumodule pointer and independently invokes cuModuleLoadDataEx on its own copy of the pointer, so N modules will be created if there are N threads.

What this means for the setup callback function:

- The setup function is called N times if there are N threads, each invocation with a unique module handle.
- There is a lock around the setup section, so the N invocations take place serially. No race condition.

What this means for the teardown callback function (today):

- Today, the user cannot tear down the kernel, so we assume that all teardown happens after all threads join, and the main thread takes care of interpreter shutdown and the finalizers. Finalizers are placed on a FILO stack, so they are invoked serially (see the sketch after this list). No race condition.

What this means for the teardown callback function (if we implement #171):
[EDITED]

- Each thread holds a reference to the kernel, so del kernel only decrements the count that each thread incremented. The main thread still holds the initial reference to the kernel, so the kernel is still finalized at interpreter shutdown and the above still holds.
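The sketch referred to above (plain Python stand-ins rather than real cumodules; it only illustrates the finalizer ordering, not the driver calls):

```python
import threading
import weakref

modules = []  # keep the stand-in modules alive until interpreter shutdown


class FakeModule:
    """Stand-in for a loaded cumodule owned by one thread."""
    def __init__(self, tid):
        self.tid = tid


def teardown(tid):
    # Runs in the main thread during interpreter shutdown, one call at a time.
    print(f"teardown for module created by thread {tid}")


def worker(tid):
    mod = FakeModule(tid)
    modules.append(mod)
    weakref.finalize(mod, teardown, tid)


threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# At exit, the surviving finalizers run serially in reverse registration
# order (a FILO stack), so there is no concurrent teardown.
```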
> Each thread creates its own cumodule pointer and independently invokes cuModuleLoadDataEx on its own copy of the pointer, so N modules will be created if there are N threads.
Does calling cuModuleLoadDataEx consume valuable resources like global device memory? I would assume so.
Do we have one CUDA Context being used by multiple threads?
If the answer is yes to both of these questions, maybe we should implement a lock where the first thread acquires the lock and creates the module and then all future threads can retrieve the module from a cache? Then we presumably only need to call the setup callback function once instead of N times?
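Roughly what I have in mind, as a sketch (get_or_load_module and load_module are placeholder names, not the existing numba-cuda API):

```python
import threading

_module_cache = {}                     # (context, code) -> loaded module
_module_cache_lock = threading.Lock()


def get_or_load_module(context, code, load_module, setup_callback=None):
    """Return the module for (context, code), loading it at most once.

    load_module stands in for the cuModuleLoadDataEx call; setup_callback
    only runs for the thread that actually performs the load, so the setup
    callback fires once per context rather than once per thread.
    """
    key = (context, code)
    with _module_cache_lock:
        module = _module_cache.get(key)
        if module is None:
            module = load_module(code)
            if setup_callback is not None:
                setup_callback(module)
            _module_cache[key] = module
    return module
```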
> > Each thread creates its own cumodule pointer and independently invokes cuModuleLoadDataEx on its own copy of the pointer, so N modules will be created if there are N threads.
>
> Does calling cuModuleLoadDataEx consume valuable resources like global device memory? I would assume so. Do we have one CUDA Context being used by multiple threads?
>
> If the answer is yes to both of these questions, maybe we should implement a lock where the first thread acquires the lock and creates the module and then all future threads can retrieve the module from a cache? Then we presumably only need to call the setup callback function once instead of N times?

Yes and yes. I think Graham and I touched on these questions before, and we agreed there was a subtle bug with module creation in Numba. Your suggestion makes sense to me.
> Each thread creates its own cumodule pointer and independently invokes cuModuleLoadDataEx on its own copy of the pointer, so N modules will be created if there are N threads.
I thought this sounded odd, and it's definitely not what we want. There should be one module per context, not per thread - I noted why in the main PR comments: #145 (comment)
I'm happy to address the issue in this PR.
Just circling back to this PR. This failure seems to suggest we have an issue somewhere:
If I put a global compiler lock on the dispatcher's compile() method:

diff --git a/numba_cuda/numba/cuda/dispatcher.py b/numba_cuda/numba/cuda/dispatcher.py
index e68c3b7..72344aa 100644
--- a/numba_cuda/numba/cuda/dispatcher.py
+++ b/numba_cuda/numba/cuda/dispatcher.py
@@ -1111,6 +1111,7 @@ class CUDADispatcher(Dispatcher, serialize.ReduceMixin):
         self._insert(c_sig, kernel, cuda=True)
         self.overloads[argtypes] = kernel
 
+    @global_compiler_lock
     def compile(self, sig):
         """
         Compile and bind to the current context a version of this kernel

then the test fails in a way that is more like what I'd expect given what the test does - there are two kernel compilations (and therefore there should be two modules) because of the int parameter and the float parameter. I'm not sure my diff above is the correct fix, but we clearly have a race condition exposed.
I think this means that one thread is only scheduled after one of the other 7 threads has finished compiling, and that thread then reloads the module from the cache, which makes our current test non-deterministic. I will implement #145 (comment) and get this behavior stable.
I think that's OK - the outcome of the test was non-deterministic because we had a race condition.
This was accidentally deleted in #145.
- Locate nvvm, libdevice, nvrtc, and cudart from nvidia-*-cu12 wheels (#155)
- reinstate test (#226)
- Restore PR #185 (Stop Certain Driver API Discovery for "v2") (#223)
- Report NVRTC builtin operation failures to the user (#196)
- Add Module Setup and Teardown Callback to Linkable Code Interface (#145)
- Test CUDA 12.8. (#187)
- Ensure RTC Bindings Clamp to the Maximum Supported CC (#189)
- Migrate code style to ruff (#170)
- Use less GPU memory in test_managed_alloc_driver_undersubscribe. (#188)
- Update workflows to always use proxy cache. (#191)
This PR adds setup_callback and teardown_callback fields to the linkable code interface. When library developers pass in an external module, they can also pass in an external function to initialize the compiled and loaded module. When the cumodule is garbage collected in Python, the teardown_callback is also invoked to make sure no resources are leaked.

Additionally, this PR fixes a bug in module creation under multi-threaded scenarios. Currently, when multiple threads launch a kernel, it is non-deterministic how many times the compiler is run and how many times modules are reloaded from the cache. We implement a compiler lock in this case to make sure that only a single instance per kernel is compiled and all subsequent threads reload from that instance. This saves compiler resources and makes the pipeline more predictable.
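For illustration, usage looks roughly like the sketch below (CUSource, the keyword argument names, and the callback signatures are shown indicatively rather than copied from the source; the merged interface is authoritative):

```python
import numpy as np
from numba import cuda


def setup(mod):
    # Invoked once the external module has been compiled and loaded; `mod`
    # is the loaded module handle, so per-module initialisation goes here.
    print("module loaded:", mod)


def teardown(mod):
    # Invoked when the module is garbage collected, so any resources
    # acquired in `setup` can be released and nothing leaks.
    print("module unloading:", mod)


# Assumed names: CUSource and the setup_callback/teardown_callback keyword
# arguments are indicative of the new interface.
ext_source = cuda.CUSource(
    """
    extern "C" __device__ int ext_add(int *out, int a, int b)
    {
        *out = a + b;
        return 0;
    }
    """,
    setup_callback=setup,
    teardown_callback=teardown,
)

# Declare the external device function using the usual FFI convention.
ext_add = cuda.declare_device("ext_add", "int32(int32, int32)")


@cuda.jit(link=[ext_source])
def kernel(out, a, b):
    out[0] = ext_add(a, b)


out = np.zeros(1, dtype=np.int32)
kernel[1, 1](out, 2, 3)  # setup runs when the module is loaded for this kernel
```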
closes #138