
Add Module Setup and Teardown Callback to Linkable Code Interface #145

Merged
isVoid merged 41 commits into NVIDIA:main from isVoid:fea-link-callback
Apr 11, 2025

Conversation

@isVoid
Contributor

@isVoid isVoid commented Mar 3, 2025

This PR adds setup_callback and teardown_callback fields to the linkable code interface. When library developers pass in an external module, they can also pass in external functions to initialize the compiled and loaded module. When the CUmodule is garbage collected in Python, the teardown_callback is invoked to make sure no resources are leaked.
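A minimal sketch of how a library developer might use the new fields. This is not taken verbatim from the PR diff: the constructor keyword arguments and the callback signature (a single module-handle argument) are assumptions based on the description above; the exact interface is in the diff.

```python
# Hedged sketch: setup_callback / teardown_callback kwargs and the callback
# signature are assumed, based on this PR's description.
from numba import cuda

def setup(handle):
    # Initialize library-owned state in the freshly loaded module.
    print("module loaded:", handle)

def teardown(handle):
    # Release any per-module resources before the module is unloaded.
    print("module unloading:", handle)

lib = cuda.CUSource(
    'extern "C" __device__ int nop(int* out) { *out = 0; return 0; }',
    setup_callback=setup,
    teardown_callback=teardown,
)

@cuda.jit(link=[lib])
def kernel():
    pass

# Loading the module on first launch triggers setup; teardown runs when the
# module is garbage collected or at interpreter shutdown.
kernel[1, 1]()
```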

Additionally, this PR fixes a bug in module creation under multi-threaded scenarios. Currently, when multiple threads launch a kernel, it is non-deterministic how many times the compiler is run and how many times modules are reloaded from the cache. We add a compiler lock so that only a single instance per kernel is compiled, and all subsequent threads reload from that instance. This saves compiler resources and makes the pipeline more predictable.
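Roughly, the locking scheme looks like the sketch below (hypothetical names, not the PR's actual decorator): a process-wide lock serializes compilation, so the first thread compiles and later threads find the result in the cache.

```python
# Rough sketch of the compile-lock idea described above (hypothetical names).
import functools
import threading

_compile_lock = threading.RLock()

def compiler_locked(fn):
    """Serialize calls to fn so only one thread compiles at a time."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        with _compile_lock:
            return fn(*args, **kwargs)
    return wrapper
```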

closes #138

@isVoid isVoid changed the title Add module init callback to linkable code interface Add Module Init Callback to Linkable Code Interface Mar 3, 2025
@gmarkall gmarkall added the 2 - In Progress Currently a work in progress label Mar 4, 2025
@isVoid isVoid changed the title Add Module Init Callback to Linkable Code Interface Add Module Init and Finalize Callback to Linkable Code Interface Mar 5, 2025
@isVoid isVoid marked this pull request as ready for review March 17, 2025 18:29
@isVoid isVoid changed the title Add Module Init and Finalize Callback to Linkable Code Interface Add Module Setup and Teardown Callback to Linkable Code Interface Mar 17, 2025
@gmarkall gmarkall added 3 - Ready for Review Ready for review by team and removed 2 - In Progress Currently a work in progress labels Mar 18, 2025
isVoid and others added 2 commits March 18, 2025 13:07
Co-authored-by: Graham Markall <535640+gmarkall@users.noreply.github.com>
@gmarkall
Contributor

I just merged main into this PR so it can (potentially) succeed on CI.

module_obj = ctx.modules.get(key, None)

if module_obj is not None:
    weakref.finalize(
Contributor

I spent a while trying to refresh my memory of what we do with finalizers in Numba - numba/numba#10001 may be worth a quick look in conjunction with thinking about the behaviour here.

Contributor Author

It took me a good 30 minutes to understand that refresher PR. I think this is super subtle, especially the point that "the first finalizer registered also calls atexit.register to place a function that invokes all finalizers." Is this documented anywhere, or is it just a CPython detail?

It's up to the user to decide whether to inspect the shutting_down variable. If they decide to check it, it's probably a good idea to include a reminder about this subtlety. For the most part, if the finalizer only operates on the modules, we are probably fine, since they are still alive when the finalizer is invoked.
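For reference, a small illustration of the CPython behaviour under discussion: a finalizer created with weakref.finalize still runs at interpreter exit if its referent is alive then (unless the finalizer's atexit attribute is set to False), because weakref.finalize hooks into the atexit machinery.

```python
import weakref

class FakeModule:
    pass

def teardown(name):
    print("tearing down", name)

mod = FakeModule()
fin = weakref.finalize(mod, teardown, "my_module")

# If `mod` is garbage collected first, `teardown` runs at that point;
# otherwise it runs during interpreter shutdown via the atexit hook.
# Setting fin.atexit = False opts out of the shutdown-time call.
```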

Contributor

Thanks for the comments - I need to have another go at explaining it, and your questions are good questions whose answers I should incorporate into an updated version of the PR.

@gmarkall gmarkall added the 4 - Waiting on CI Waiting for a CI run to finish successfully label Mar 24, 2025
@gmarkall gmarkall added 4 - Waiting on reviewer Waiting for reviewer to respond to author and removed 4 - Waiting on CI Waiting for a CI run to finish successfully labels Mar 24, 2025
@isVoid isVoid requested a review from gmarkall March 26, 2025 06:53
debug=self.debug, lineinfo=self.lineinfo,
call_helper=self.call_helper, extensions=self.extensions)

@module_init_lock
Contributor

Is there any risk that, by not acquiring this lock at destruction / unloading time, we end up in a bad situation?

Contributor Author

I'm crafting a concurrent test so that we can get a more in-depth view of the behavior of setup and teardown with threads. I'll report back.

Contributor Author
isVoid Mar 26, 2025

When I delved into multi-threaded kernel launches, this is what I discovered:

  • Compilation runs once, in the first thread that acquires the compilation lock; subsequent threads load the same compiled binary from the cache.
  • Each thread creates its own CUmodule pointer and independently invokes cuModuleLoadDataEx on its own copy of the pointer, so N modules are created if there are N threads.

What this means for the setup callback function:

  1. The setup function is called N times if there are N threads, each invocation with a unique module handle.
  2. Because there is a lock around the setup section, the N invocations take place serially; there is no race condition.

What this means for the teardown callback function (today):

  • Today, the user cannot tear down the kernel, so we assume that all teardown happens after all threads join and the main thread takes care of interpreter shutdown and the finalizers. Finalizers are placed on a FILO stack, so they are invoked serially; no race condition.

What this means for the teardown callback function (if we implement #171):

[EDITED]

  • Each thread holds a reference to the kernel, so del kernel only decrements the count that each thread incremented. The main thread still holds the initial reference to the kernel, so the kernel is still finalized at interpreter shutdown and the above still holds.

Contributor

  • Each thread creates its own CUmodule pointer and independently invokes cuModuleLoadDataEx on its own copy of the pointer, so N modules are created if there are N threads.

Does calling cuModuleLoadDataEx consume valuable resources like global device memory? I would assume so.

Do we have one CUDA Context being used by multiple threads?

If the answer is yes to both of these questions, maybe we should implement a lock where the first thread acquires the lock and creates the module and then all future threads can retrieve the module from a cache? Then we presumably only need to call the setup callback function once instead of N times?

Contributor Author


Yes and yes. I think Graham and I touched on these questions before, and we agreed there was a subtle bug with module creation in Numba. Your suggestion makes sense to me.
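A hedged sketch of that suggestion (hypothetical helper names, loosely modelled on the driver Context's module cache): the first thread loads the module under a lock and caches it per context, so later threads reuse it and the setup callback runs once instead of N times.

```python
# Hypothetical sketch: `ctx`, `ctx.modules`, `create_module_image`, and
# `module.handle` are illustrative names, not the merged implementation.
import threading

_module_load_lock = threading.Lock()

def get_or_load_module(ctx, key, image, setup_callback=None):
    # Fast path: another thread already loaded this module for the context.
    module = ctx.modules.get(key)
    if module is not None:
        return module
    with _module_load_lock:
        # Re-check inside the lock in case another thread won the race.
        module = ctx.modules.get(key)
        if module is None:
            module = ctx.create_module_image(image)  # wraps cuModuleLoadDataEx
            if setup_callback is not None:
                setup_callback(module.handle)
            ctx.modules[key] = module
        return module
```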

Contributor

Each thread creates its own CUmodule pointer and independently invokes cuModuleLoadDataEx on its own copy of the pointer, so N modules are created if there are N threads.

I thought this sounded odd, and it's definitely not what we want. There should be one module per context, not per thread - I noted why in the main PR comments: #145 (comment)

Contributor Author

I'm happy to address the issue in this PR.

@gmarkall
Contributor

Just circling back to this PR. This failure seems to suggest we have an issue somewhere:

======================================================================
FAIL: test_concurrent_initialization_different_args (numba.cuda.tests.cudadrv.test_module_callbacks.TestMultithreadedCallbacks.test_concurrent_initialization_different_args)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/pyenv/versions/3.12.9/lib/python3.12/site-packages/numba_cuda/numba/cuda/tests/cudadrv/test_module_callbacks.py", line 290, in test_concurrent_initialization_different_args
    self.assertEqual(max_seen_mods, 8)  # 2 kernels per thread
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError: 7 != 8

@gmarkall
Contributor

If I put a global compiler lock on the Dispatcher.compile() method:

diff --git a/numba_cuda/numba/cuda/dispatcher.py b/numba_cuda/numba/cuda/dispatcher.py
index e68c3b7..72344aa 100644
--- a/numba_cuda/numba/cuda/dispatcher.py
+++ b/numba_cuda/numba/cuda/dispatcher.py
@@ -1111,6 +1111,7 @@ class CUDADispatcher(Dispatcher, serialize.ReduceMixin):
         self._insert(c_sig, kernel, cuda=True)
         self.overloads[argtypes] = kernel
 
+    @global_compiler_lock
     def compile(self, sig):
         """
         Compile and bind to the current context a version of this kernel

then the test fails as:

FAIL: test_concurrent_initialization_different_args (numba.cuda.tests.cudadrv.test_module_callbacks.TestMultithreadedCallbacks.test_concurrent_initialization_different_args)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/gmarkall/numbadev/numba-cuda/numba_cuda/numba/cuda/tests/cudadrv/test_module_callbacks.py", line 290, in test_concurrent_initialization_different_args
    self.assertEqual(max_seen_mods, 8)  # 2 kernels per thread
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError: 2 != 8

which is more like what I'd expect given what the test does - there are two kernel compilations (and therefore there should be two modules) because of the int parameter and the float parameter. I'm not sure my diff above is the correct fix, but we clearly have a race condition exposed in Dispatcher.compile(), which is almost certainly related to the CI failure.

@gmarkall gmarkall added the 4 - Waiting on author Waiting for author to respond to review label Apr 10, 2025
@isVoid
Contributor Author

isVoid commented Apr 10, 2025

AssertionError: 7 != 8

I think this means that one thread is only scheduled after one of the other 7 threads has finished compiling, and that thread then reloads the module from the cache, which means our current test is non-deterministic.

I will implement #145 (comment) and get this behavior stable.

@gmarkall
Contributor

I think this means that one thread is only scheduled after one of the other 7 threads has finished compiling, and that thread then reloads the module from the cache, which means our current test is non-deterministic.

I think that's OK - the outcome of the test was non-deterministic because we had a race condition in Dispatcher.compile(). Are we thinking about things the same way here?

@isVoid isVoid merged commit 43218af into NVIDIA:main Apr 11, 2025
35 checks passed
gmarkall added a commit to gmarkall/numba-cuda that referenced this pull request Apr 18, 2025
kkraus14 pushed a commit that referenced this pull request Apr 18, 2025
gmarkall added a commit to gmarkall/numba-cuda that referenced this pull request Apr 22, 2025
- Locate nvvm, libdevice, nvrtc, and cudart from nvidia-*-cu12 wheels (NVIDIA#155)
- reinstate test (NVIDIA#226)
- Restore PR NVIDIA#185 (Stop Certain Driver API Discovery for "v2") (NVIDIA#223)
- Report NVRTC builtin operation failures to the user (NVIDIA#196)
- Add Module Setup and Teardown Callback to Linkable Code Interface (NVIDIA#145)
- Test CUDA 12.8. (NVIDIA#187)
- Ensure RTC Bindings Clamp to the Maximum Supported CC (NVIDIA#189)
- Migrate code style to ruff (NVIDIA#170)
- Use less GPU memory in test_managed_alloc_driver_undersubscribe. (NVIDIA#188)
- Update workflows to always use proxy cache. (NVIDIA#191)
@gmarkall gmarkall mentioned this pull request Apr 22, 2025
gmarkall added a commit that referenced this pull request Apr 22, 2025
Successfully merging this pull request may close these issues.

[FEA] Support Custom Module Initialization Callbacks
