Merged

Commits (41)
2827d0d
initial
isVoid Mar 3, 2025
7806707
mocking object type using cuda.core objects
isVoid Mar 5, 2025
bc10ce3
Update to ctypes module wrapper and type checks
isVoid Mar 13, 2025
54bc46c
add managed module object
isVoid Mar 14, 2025
067ea82
Merge branch 'fea-link-callback' of https://github.com/isVoid/numba-c…
isVoid Mar 14, 2025
51dd293
add two kernels test
isVoid Mar 15, 2025
d05f76b
add kernel finalizer
isVoid Mar 17, 2025
b329831
removing kernel finalizers
isVoid Mar 17, 2025
c6a9b73
add docstring
isVoid Mar 17, 2025
881648c
add doc
isVoid Mar 17, 2025
6f7fe2a
Update docs/source/user/cuda_ffi.rst
isVoid Mar 18, 2025
856536e
Merge remote-tracking branch 'NVIDIA/main' into fea-link-callback
gmarkall Mar 19, 2025
d4cc4ac
add a test that involves two streams
isVoid Mar 19, 2025
af2d618
add wipe all module call
isVoid Mar 19, 2025
a5a3671
use context reset is a better option to unload modules
isVoid Mar 19, 2025
577c8ff
add test for stream completeness
isVoid Mar 19, 2025
d8f2f23
move logic into CtypesModule
isVoid Mar 20, 2025
bb13580
update docstrings
isVoid Mar 20, 2025
379a69b
consolidate changes into create_module_image
isVoid Mar 20, 2025
c5d21fe
explicitly delete kernel reference
isVoid Mar 20, 2025
36ad115
remove stream from setup and teardown callbacks
isVoid Mar 21, 2025
2cc4c28
remove counter
isVoid Mar 21, 2025
01d2d85
add API coverage test
isVoid Mar 21, 2025
24e4260
asserting types of passed in module handle
isVoid Mar 21, 2025
328c430
update documentation
isVoid Mar 21, 2025
43bcfa2
address review comments
isVoid Mar 21, 2025
7663093
update linkable code doc
isVoid Mar 21, 2025
97367ce
update the tests to acommodate nvidia bindings
isVoid Mar 21, 2025
1535233
setup should raise an error if module is already initialized
isVoid Mar 21, 2025
79dd76e
Update docs/source/user/cuda_ffi.rst
isVoid Mar 21, 2025
b0ff099
add input type guards for linkable code
isVoid Mar 21, 2025
497f0eb
Fix docstrings
gmarkall Mar 24, 2025
df7427e
add lock to protect initialization secton
isVoid Mar 25, 2025
6aa1120
add documentation
isVoid Mar 25, 2025
40f4e9b
add multithreaded callback behavior test
isVoid Mar 27, 2025
0ba5dee
Merge branch 'fea-link-callback' of github.com:isVoid/numba-cuda into…
isVoid Mar 27, 2025
c24e0b2
Replace flake8 with ruff and pre-commit-hooks
ZzEeKkAa Mar 21, 2025
075b7bd
Apply precommit
isVoid Apr 9, 2025
4b3650a
Merge remote-tracking branch 'origin' into fea-link-callback
isVoid Apr 9, 2025
28d9dc8
apply compile lock to make sure modules are not compiled more than on…
isVoid Apr 10, 2025
6781263
Merge branch 'main' of https://github.com/NVIDIA/numba-cuda into fea-…
isVoid Apr 10, 2025
21 changes: 21 additions & 0 deletions docs/source/user/cuda_ffi.rst
@@ -160,6 +160,27 @@ CUDA C/C++ source code will be compiled with the `NVIDIA Runtime Compiler
kernel as either PTX or LTOIR, depending on whether LTO is enabled. Other files
will be passed directly to the CUDA Linker.

A ``LinkableCode`` object may have setup and teardown callback functions that
perform module-specific initialization and cleanup tasks.

* Setup functions are invoked once for every new module loaded.
* Teardown functions are invoked just prior to module unloading.

Both setup and teardown callbacks are called with a handle to the relevant
module. In practice, Numba creates a new module each time a kernel is compiled
for a specific set of argument types.

For each module, the setup callback is invoked only once. When a module is
used by multiple threads, only one thread will execute the setup
callback.

The callbacks are defined as follows:

.. code::

def setup_callback(mod: cuda.cudadrv.drvapi.cu_module):...
def teardown_callback(mod: cuda.cudadrv.drvapi.cu_module):...
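
As an illustrative sketch (the device source, file name, and callback bodies
below are placeholders, and it assumes ``CUSource`` forwards the
``setup_callback`` and ``teardown_callback`` parameters of its base class),
callbacks can be attached when the code object is constructed and linked
into a kernel:

.. code:: python

   from numba import cuda

   def setup_callback(mod):
       # Invoked once per loaded module; mod is a handle to that module.
       print("module loaded:", mod)

   def teardown_callback(mod):
       # Invoked just before the module is unloaded.
       print("module unloading:", mod)

   source = cuda.CUSource(
       'extern "C" __device__ int nop(int* out) { *out = 0; return 0; }',
       name="callbacks_demo.cu",
       setup_callback=setup_callback,
       teardown_callback=teardown_callback,
   )

   @cuda.jit(link=[source])
   def kernel():
       pass

   kernel[1, 1]()  # loading the kernel's module triggers setup_callback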

:class:`LinkableCode <numba.cuda.LinkableCode>` objects are initialized using
the parameters of their base class:

21 changes: 18 additions & 3 deletions numba_cuda/numba/cuda/codegen.py
@@ -4,6 +4,7 @@
from numba.core.codegen import Codegen, CodeLibrary
from .cudadrv import devices, driver, nvvm, runtime
from numba.cuda.cudadrv.libs import get_cudalib
from numba.cuda.cudadrv.linkable_code import LinkableCode

import os
import subprocess
@@ -99,6 +100,12 @@ def __init__(
# Files to link with the generated PTX. These are linked using the
# Driver API at link time.
self._linking_files = set()
# List of setup functions for the loaded module; the order is
# determined by the order in which they are added to the codelib.
self._setup_functions = []
# List of teardown functions for the loaded module; the order is
# determined by the order in which they are added to the codelib.
self._teardown_functions = []
# Should we link libcudadevrt?
self.needs_cudadevrt = False

@@ -251,7 +258,9 @@ def get_cufunc(self):
return cufunc

cubin = self.get_cubin(cc=device.compute_capability)
module = ctx.create_module_image(cubin)
module = ctx.create_module_image(
cubin, self._setup_functions, self._teardown_functions
)

# Load
cufunc = module.get_function(self._entry_name)
@@ -289,8 +298,14 @@ def add_linking_library(self, library):

self._linking_libraries.add(library)

def add_linking_file(self, filepath):
self._linking_files.add(filepath)
def add_linking_file(self, path_or_obj):
if isinstance(path_or_obj, LinkableCode):
if path_or_obj.setup_callback:
self._setup_functions.append(path_or_obj.setup_callback)
if path_or_obj.teardown_callback:
self._teardown_functions.append(path_or_obj.teardown_callback)

self._linking_files.add(path_or_obj)

def get_function(self, name):
for fn in self._module.functions:
80 changes: 69 additions & 11 deletions numba_cuda/numba/cuda/cudadrv/driver.py
@@ -381,9 +381,6 @@ def _find_api(self, fname):
else:
variants = ("_v2", "")

if fname in ("cuCtxGetDevice", "cuCtxSynchronize"):
return getattr(self.lib, fname)

for variant in variants:
try:
return getattr(self.lib, f"{fname}{variant}")
@@ -1478,8 +1475,12 @@ def create_module_ptx(self, ptx):
image = c_char_p(ptx)
return self.create_module_image(image)

def create_module_image(self, image):
module = load_module_image(self, image)
def create_module_image(
self, image, setup_callbacks=None, teardown_callbacks=None
):
module = load_module_image(
self, image, setup_callbacks, teardown_callbacks
)
if USE_NV_BINDING:
key = module.handle
else:
@@ -1578,17 +1579,25 @@ def __ne__(self, other):
return not self.__eq__(other)


def load_module_image(context, image):
def load_module_image(
context, image, setup_callbacks=None, teardown_callbacks=None
):
"""
image must be a pointer
"""
if USE_NV_BINDING:
return load_module_image_cuda_python(context, image)
return load_module_image_cuda_python(
context, image, setup_callbacks, teardown_callbacks
)
else:
return load_module_image_ctypes(context, image)
return load_module_image_ctypes(
context, image, setup_callbacks, teardown_callbacks
)


def load_module_image_ctypes(context, image):
def load_module_image_ctypes(
context, image, setup_callbacks, teardown_callbacks
):
logsz = config.CUDA_LOG_SIZE

jitinfo = (c_char * logsz)()
@@ -1621,10 +1630,14 @@ def load_module_image_ctypes(context, image):
handle,
info_log,
_module_finalizer(context, handle),
setup_callbacks,
teardown_callbacks,
)


def load_module_image_cuda_python(context, image):
def load_module_image_cuda_python(
context, image, setup_callbacks, teardown_callbacks
):
"""
image must be a pointer
"""
Expand Down Expand Up @@ -1661,6 +1674,8 @@ def load_module_image_cuda_python(context, image):
handle,
info_log,
_module_finalizer(context, handle),
setup_callbacks,
teardown_callbacks,
)


@@ -2455,13 +2470,27 @@ def event_elapsed_time(evtstart, evtend):
class Module(metaclass=ABCMeta):
"""Abstract base class for modules"""

def __init__(self, context, handle, info_log, finalizer=None):
def __init__(
self,
context,
handle,
info_log,
finalizer=None,
setup_callbacks=None,
teardown_callbacks=None,
):
self.context = context
self.handle = handle
self.info_log = info_log
if finalizer is not None:
self._finalizer = weakref.finalize(self, finalizer)

self.initialized = False
self.setup_functions = setup_callbacks
self.teardown_functions = teardown_callbacks

self._set_finalizers()

def unload(self):
"""Unload this module from the context"""
self.context.unload_module(self)
@@ -2474,6 +2503,35 @@ def get_function(self, name):
def get_global_symbol(self, name):
"""Return a MemoryPointer referring to the named symbol"""

def setup(self):
"""Call the setup functions for the module"""
if self.initialized:
raise RuntimeError("The module has already been initialized.")

if self.setup_functions is None:
return

for f in self.setup_functions:
f(self.handle)

self.initialized = True

def _set_finalizers(self):
"""Create finalizers that tear down the module."""
if self.teardown_functions is None:
return

def _teardown(teardowns, handle):
for f in teardowns:
f(handle)

weakref.finalize(
self,
_teardown,
self.teardown_functions,
self.handle,
)


class CtypesModule(Module):
def get_function(self, name):
17 changes: 16 additions & 1 deletion numba_cuda/numba/cuda/cudadrv/linkable_code.py
@@ -7,11 +7,26 @@ class LinkableCode:
:param data: A buffer containing the data to link.
:param name: The name of the file to be referenced in any compilation or
linking errors that may be produced.
:param setup_callback: A function called prior to the launch of a kernel
contained within a module that has this code object
linked into it.
:param teardown_callback: A function called just prior to the unloading of
a module that has this code object linked into
it.
"""

def __init__(self, data, name=None):
def __init__(
self, data, name=None, setup_callback=None, teardown_callback=None
):
if setup_callback and not callable(setup_callback):
raise TypeError("setup_callback must be callable")
if teardown_callback and not callable(teardown_callback):
raise TypeError("teardown_callback must be callable")

self.data = data
self._name = name
self.setup_callback = setup_callback
self.teardown_callback = teardown_callback

@property
def name(self):
9 changes: 9 additions & 0 deletions numba_cuda/numba/cuda/dispatcher.py
@@ -30,6 +30,7 @@
)
from numba.cuda import types as cuda_types
from numba.cuda.runtime.nrt import rtsys
from numba.cuda.locks import module_init_lock

from numba import cuda
from numba import _dispatcher
@@ -347,12 +348,19 @@ def _reduce_states(self):
extensions=self.extensions,
)

@module_init_lock
Contributor:

Is there any risk that not acquiring this lock at destruction / unloading could leave us in a bad situation?

Contributor Author (isVoid):

I'm crafting a concurrent test so that we can get a more in-depth view of the behavior of setup and teardown with threads. I'll report back.

Contributor Author (isVoid), Mar 26, 2025:

When I delved into multi-threaded kernel launches, this is what I discovered:

  • The compilation runs once, for the first thread that acquires the compilation lock; subsequent threads load the same compiled binary from the cache.
  • Each thread creates its own cuModule pointer and independently invokes cuModuleLoadDataEx on its own copy of the pointer. So N modules will be created if there are N threads.

What this means for the setup callback function:

  1. The setup function is called N times if there are N threads, each invocation with a unique module handle.
  2. There is a lock around the setup section, so the N invocations take place serially. No race condition.

What this means for the teardown callback function (today):

  • Today, the user cannot tear down the kernel, so we assume that all teardown happens after all threads join and the main thread takes care of interpreter shutdown and the finalizers. Finalizers are placed on a FILO stack, so they are invoked serially. No race condition.

What this means for the teardown callback function (if we implement #171):

[EDITED]

  • Each thread holds a reference to the kernel, so del kernel only decrements the count that each thread incremented; the main thread still holds the initial reference to the kernel. In this case the kernel is still finalized at interpreter shutdown, and the above still holds.
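
As an aside, a minimal sketch (hypothetical names) of the FILO behavior described above: weakref.finalize callbacks that are still alive at interpreter exit are invoked in reverse order of registration.

import weakref

class Anchor:
    """Stand-in for a Module object kept alive until interpreter exit."""

def teardown(tag):
    print("teardown:", tag)

anchor = Anchor()
# Finalizers still alive at exit run newest-first, like a FILO stack of
# teardown callbacks.
weakref.finalize(anchor, teardown, "registered first")
weakref.finalize(anchor, teardown, "registered second")
# On normal exit this prints "registered second" before "registered first".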

Contributor:

> • Each thread creates its own cuModule pointer and independently invokes cuModuleLoadDataEx on its own copy of the pointer. So N modules will be created if there are N threads.

Does calling cuModuleLoadDataEx consume valuable resources like global device memory? I would assume so.

Do we have one CUDA Context being used by multiple threads?

If the answer is yes to both of these questions, maybe we should implement a lock where the first thread acquires the lock and creates the module, and then all future threads can retrieve the module from a cache? Then we presumably only need to call the setup callback function once instead of N times?

Contributor Author (isVoid):

> > • Each thread creates its own cuModule pointer and independently invokes cuModuleLoadDataEx on its own copy of the pointer. So N modules will be created if there are N threads.
>
> Does calling cuModuleLoadDataEx consume valuable resources like global device memory? I would assume so.
>
> Do we have one CUDA Context being used by multiple threads?
>
> If the answer is yes to both of these questions, maybe we should implement a lock where the first thread acquires the lock and creates the module, and then all future threads can retrieve the module from a cache? Then we presumably only need to call the setup callback function once instead of N times?

Yes and yes. I think Graham and I touched on these questions before, and we agreed there was a subtle bug with module creation in Numba. Your suggestion makes sense to me.

Contributor:

> Each thread creates its own cuModule pointer and independently invokes cuModuleLoadDataEx on its own copy of the pointer. So N modules will be created if there are N threads.

I thought this sounded odd, and it's definitely not what we want. There should be one module per context, not per thread - I noted why in the main PR comments: #145 (comment)

Contributor Author (isVoid):

I'm happy to address the issue in this PR.

def initialize_once(self, mod):
if not mod.initialized:
mod.setup()

def bind(self):
"""
Force binding to current CUDA context
"""
cufunc = self._codelibrary.get_cufunc()

self.initialize_once(cufunc.module)

if (
hasattr(self, "target_context")
and self.target_context.enable_nrt
@@ -1103,6 +1111,7 @@ def add_overload(self, kernel, argtypes):
self._insert(c_sig, kernel, cuda=True)
self.overloads[argtypes] = kernel

@global_compiler_lock
def compile(self, sig):
"""
Compile and bind to the current context a version of this kernel
16 changes: 16 additions & 0 deletions numba_cuda/numba/cuda/locks.py
@@ -0,0 +1,16 @@
from threading import Lock
from functools import wraps

# Thread safety guard for module initialization.
_module_init_lock = Lock()


def module_init_lock(func):
"""Decorator to make sure initialization is invoked once for all threads."""

@wraps(func)
def wrapper(*args, **kwargs):
with _module_init_lock:
return func(*args, **kwargs)

return wrapper
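
A minimal usage sketch (the wrapped function below is illustrative and mirrors the dispatcher's `initialize_once`): any callable decorated with `module_init_lock` is serialized across threads, so combined with the module's `initialized` flag the setup callbacks run exactly once per module.

from numba.cuda.locks import module_init_lock

@module_init_lock
def initialize_once(mod):
    # Only one thread at a time can enter this section; the flag check
    # ensures the module's setup callbacks are invoked exactly once.
    if not mod.initialized:
        mod.setup()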