Add Module Setup and Teardown Callback to Linkable Code Interface #145
isVoid merged 41 commits into NVIDIA:main from
Conversation
Co-authored-by: Graham Markall <535640+gmarkall@users.noreply.github.com>
I just merged main into this PR so it can (potentially) succeed on CI.
module_obj = ctx.modules.get(key, None)

if module_obj is not None:
    weakref.finalize(
I spent a while trying to refresh my memory of what we do with finalizers in Numba - numba/numba#10001 may be worth a quick look in conjunction with thinking about the behaviour here.
It even took me 30 minutes to understand the PR you linked. I think this is super subtle, especially the point that "the first finalizer registered also calls atexit.register to place a function that invokes all finalizers." Is this documented anywhere, or is it just a CPython implementation detail?
It's up to the user to decide whether to inspect the shutting_down variable. If they decide to check that, it's probably a good idea to put a reminder about this subtlety. For the most part, if the finalizer is to operate on just the modules, we are probably fine since they are still alive when the finalizer is invoked.
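To make the subtlety concrete, here is a minimal sketch of the pattern as I understand it (plain Python, no CUDA; shutting_down, Module, and the handle string are illustrative stand-ins, not the actual numba-cuda names):

```python
import atexit
import weakref

# Illustrative flag mirroring the pattern discussed above: an atexit handler
# flips it so finalizers can tell whether the interpreter is exiting.
shutting_down = False


def _at_shutdown():
    global shutting_down
    shutting_down = True


atexit.register(_at_shutdown)


class Module:
    """Stand-in for a loaded cumodule."""


def teardown(handle):
    # Skip driver work during interpreter shutdown; the driver state may
    # already be gone by the time this runs.
    if shutting_down:
        return
    print(f"tearing down {handle!r}")


mod = Module()
# The first weakref.finalize call also registers an atexit hook that invokes
# any finalizers still alive at interpreter exit (the CPython detail quoted above).
weakref.finalize(mod, teardown, "handle-0")

del mod  # the finalizer fires here, before shutdown, so teardown runs normally
```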
Thanks for the comments - I need to have another go at explaining it, and your questions are good questions whose answers I should incorporate into an updated version of the PR.
debug=self.debug, lineinfo=self.lineinfo,
call_helper=self.call_helper, extensions=self.extensions)

@module_init_lock
Is there any risk that, by not acquiring this lock at destruction / unloading, we end up in a bad situation?
I'm crafting a concurrent test so that we can get a more in-depth view of the behavior of setup and teardown with threads. I'll report back.
When I delved into multi-threaded kernel launches, this is what I discovered:

- The compilation is run once, by the first thread that acquires the compilation lock; subsequent threads load the same compiled binary from the cache.
- Each thread creates its own cumodule pointer and independently invokes cuModuleLoadDataEx on its own copy of the pointer, so N modules will be created if there are N threads.

What this means for the setup callback function:

- The setup function is called N times if there are N threads, each invocation with a unique module handle.
- There is a lock around the setup section, so the N invocations take place serially. No race condition.

What this means for the teardown callback function (today):

- Today, the user cannot tear down the kernel, so we assume that all teardown happens after all threads join, and the main thread takes care of interpreter shutdown and the finalizers. Finalizers are placed on a FILO stack, so they are invoked serially (see the sketch after this list). No race condition.

What this means for the teardown callback function (if we implement #171):
[EDITED]

- Each thread holds a reference to the kernel, so del kernel only decrements the count that each thread incremented. The main thread still holds the initial reference to the kernel, so the kernel is still finalized at interpreter shutdown and the above still holds.
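The sketch referred to above (plain Python stand-ins rather than real cumodules; it only illustrates the finalizer ordering, not the driver calls):

```python
import threading
import weakref

modules = []  # keep the stand-in modules alive until interpreter shutdown


class FakeModule:
    """Stand-in for a loaded cumodule owned by one thread."""
    def __init__(self, tid):
        self.tid = tid


def teardown(tid):
    # Runs in the main thread during interpreter shutdown, one call at a time.
    print(f"teardown for module created by thread {tid}")


def worker(tid):
    mod = FakeModule(tid)
    modules.append(mod)
    weakref.finalize(mod, teardown, tid)


threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# At exit, the surviving finalizers run serially in reverse registration
# order (a FILO stack), so there is no concurrent teardown.
```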
> Each thread creates its own cumodule pointer and independently invokes cuModuleLoadDataEx on its own copy of the pointer, so N modules will be created if there are N threads.
Does calling cuModuleLoadDataEx consume valuable resources like global device memory? I would assume so.
Do we have one CUDA Context being used by multiple threads?
If the answer is yes to both of these questions, maybe we should implement a lock where the first thread acquires the lock and creates the module and then all future threads can retrieve the module from a cache? Then we presumably only need to call the setup callback function once instead of N times?
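Roughly what I have in mind, as a sketch (get_or_load_module and load_module are placeholder names, not the existing numba-cuda API):

```python
import threading

_module_cache = {}                     # (context, code) -> loaded module
_module_cache_lock = threading.Lock()


def get_or_load_module(context, code, load_module, setup_callback=None):
    """Return the module for (context, code), loading it at most once.

    load_module stands in for the cuModuleLoadDataEx call; setup_callback
    only runs for the thread that actually performs the load, so the setup
    callback fires once per context rather than once per thread.
    """
    key = (context, code)
    with _module_cache_lock:
        module = _module_cache.get(key)
        if module is None:
            module = load_module(code)
            if setup_callback is not None:
                setup_callback(module)
            _module_cache[key] = module
    return module
```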
> > Each thread creates its own cumodule pointer and independently invokes cuModuleLoadDataEx on its own copy of the pointer, so N modules will be created if there are N threads.
>
> Does calling cuModuleLoadDataEx consume valuable resources like global device memory? I would assume so. Do we have one CUDA Context being used by multiple threads?
>
> If the answer is yes to both of these questions, maybe we should implement a lock where the first thread acquires the lock and creates the module and then all future threads can retrieve the module from a cache? Then we presumably only need to call the setup callback function once instead of N times?

Yes and yes. I think Graham and I touched on these questions before, and we agreed there was a subtle bug with module creation in Numba. Your suggestion makes sense to me.
> Each thread creates its own cumodule pointer and independently invokes cuModuleLoadDataEx on its own copy of the pointer, so N modules will be created if there are N threads.
I thought this sounded odd, and it's definitely not what we want. There should be one module per context, not per thread - I noted why in the main PR comments: #145 (comment)
I'm happy to address the issue in this PR.
Just circling back to this PR. This failure seems to suggest we have an issue somewhere:
If I put a global compiler lock on the dispatcher's compile() method:

diff --git a/numba_cuda/numba/cuda/dispatcher.py b/numba_cuda/numba/cuda/dispatcher.py
index e68c3b7..72344aa 100644
--- a/numba_cuda/numba/cuda/dispatcher.py
+++ b/numba_cuda/numba/cuda/dispatcher.py
@@ -1111,6 +1111,7 @@ class CUDADispatcher(Dispatcher, serialize.ReduceMixin):
         self._insert(c_sig, kernel, cuda=True)
         self.overloads[argtypes] = kernel
 
+    @global_compiler_lock
     def compile(self, sig):
         """
         Compile and bind to the current context a version of this kernel

then the test fails in a way that is more like what I'd expect given what the test does - there are two kernel compilations (and therefore there should be two modules) because of the int parameter and the float parameter. I'm not sure my diff above is the correct fix, but we clearly have a race condition exposed.
I think this means that one thread is only scheduled after one of the other 7 threads has finished compiling, and that thread then reloads the module from the cache, which makes our current test non-deterministic. I will implement #145 (comment) and get this behavior stable.
I think that's OK - the outcome of the test was non-deterministic because we had a race condition.
This was accidentally deleted in #145.
- Locate nvvm, libdevice, nvrtc, and cudart from nvidia-*-cu12 wheels (#155)
- reinstate test (#226)
- Restore PR #185 (Stop Certain Driver API Discovery for "v2") (#223)
- Report NVRTC builtin operation failures to the user (#196)
- Add Module Setup and Teardown Callback to Linkable Code Interface (#145)
- Test CUDA 12.8. (#187)
- Ensure RTC Bindings Clamp to the Maximum Supported CC (#189)
- Migrate code style to ruff (#170)
- Use less GPU memory in test_managed_alloc_driver_undersubscribe. (#188)
- Update workflows to always use proxy cache. (#191)
This PR adds setup_callback and teardown_callback fields to the linkable code interface. When library developers pass in an external module, they can also pass in an external function to initialize the compiled and loaded module. When the cumodule is garbage collected in Python, the teardown_callback is also invoked to make sure no resources are leaked.

Additionally, this PR fixes a bug in module creation under multi-threaded scenarios. Currently, when multiple threads launch a kernel, it is non-deterministic how many times the compiler is run and how many times modules are reloaded from the cache. We implement a compiler lock in this case to make sure that only a single instance per kernel is compiled and all subsequent threads reload from that instance. This saves compiler resources and makes the pipeline more predictable.
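For illustration, usage looks roughly like the sketch below (CUSource, the keyword argument names, and the callback signatures are shown indicatively rather than copied from the source; the merged interface is authoritative):

```python
import numpy as np
from numba import cuda


def setup(mod):
    # Invoked once the external module has been compiled and loaded; `mod`
    # is the loaded module handle, so per-module initialisation goes here.
    print("module loaded:", mod)


def teardown(mod):
    # Invoked when the module is garbage collected, so any resources
    # acquired in `setup` can be released and nothing leaks.
    print("module unloading:", mod)


# Assumed names: CUSource and the setup_callback/teardown_callback keyword
# arguments are indicative of the new interface.
ext_source = cuda.CUSource(
    """
    extern "C" __device__ int ext_add(int *out, int a, int b)
    {
        *out = a + b;
        return 0;
    }
    """,
    setup_callback=setup,
    teardown_callback=teardown,
)

# Declare the external device function using the usual FFI convention.
ext_add = cuda.declare_device("ext_add", "int32(int32, int32)")


@cuda.jit(link=[ext_source])
def kernel(out, a, b):
    out[0] = ext_add(a, b)


out = np.zeros(1, dtype=np.int32)
kernel[1, 1](out, 2, 3)  # setup runs when the module is loaded for this kernel
```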
closes #138