Capture global device arrays in kernels and device functions #666
gmarkall merged 17 commits into NVIDIA:main
Conversation
/ok to test c17b44e
Greptile Summary: Implements support for capturing global device arrays (objects implementing __cuda_array_interface__) in CUDA kernels and device functions.
Confidence Score: 5/5
cpcloud left a comment
Looks good.
I'd like to see a couple changes on the hasattr use and computing strides.
Additional Comments (1)
- numba_cuda/numba/cuda/np/arrayobj.py, lines 3674-3683: logic: strides are calculated in reverse order. For shape (3, 4, 5) with itemsize 4, this produces (4, 20, 80) instead of the correct C-contiguous strides (80, 20, 4), causing incorrect memory access for multidimensional arrays.
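For reference, NumPy reports the expected C-contiguous strides for this case, so it can serve as a quick sanity check (the array below is only illustrative):

```python
import numpy as np

# For shape (3, 4, 5) with a 4-byte itemsize, the C-contiguous strides
# should be (80, 20, 4): the last axis varies fastest.
a = np.empty((3, 4, 5), dtype=np.float32)
print(a.strides)  # (80, 20, 4)
```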
5 files reviewed, 1 comment
Force-pushed from 4c316e7 to a7f451c
/ok to test 11bc98f
Additional Comments (1)
- numba_cuda/numba/cuda/np/numpy_support.py, lines 24-40: logic: edge case: 0-D arrays (shape ()) will return (itemsize,) instead of (). For C-order with shape=(), shape[1:] gives (), reversed is [], accumulate yields [1], multiplying by itemsize gives [itemsize], and reversing gives (itemsize,) instead of the expected (). However, this is unlikely to affect real usage since:
  - in typeof.py:332-333, 0-D arrays return early with layout "C" without calling this function
  - in arrayobj.py:3683-3684, this is only called when strides is None, which is rare for 0-D device arrays
  - 0-D CUDA arrays are uncommon in practice
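For comparison, NumPy itself reports empty strides for a 0-D array, which is the behavior the comment above expects for shape ():

```python
import numpy as np

# A 0-D array has no axes, so both its shape and strides are empty tuples.
a = np.empty((), dtype=np.float32)
print(a.shape)    # ()
print(a.strides)  # ()
```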
8 files reviewed, 1 comment
/ok to test 98799f8
    Tests for capturing device arrays (objects implementing __cuda_array_interface__)
    from global scope in CUDA kernels and device functions.

    This tests the capture of third-party arrays (like CuPy) that implement
The docs say this works for Numba device arrays but this implies that it doesn't (because Numba device arrays have _numba_type_). I think this docstring rather than the documentation is incorrect - is that right?
I've fixed this module docstring to say the correct thing.
I also replaced the use of CuPy with ForeignArray, and extended the tests to use both DeviceNDArray and ForeignArray.
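For readers unfamiliar with the ForeignArray helper, the idea is a thin wrapper that exposes only the CUDA Array Interface of an underlying device array, standing in for a third-party array type such as CuPy. A minimal sketch (not the actual test helper) might look like:

```python
from numba import cuda
import numpy as np


class ForeignArraySketch:
    """Sketch of a wrapper exposing only __cuda_array_interface__."""

    def __init__(self, arr):
        # Keep the underlying Numba device array alive.
        self._arr = arr

    @property
    def __cuda_array_interface__(self):
        return self._arr.__cuda_array_interface__


foreign = ForeignArraySketch(cuda.to_device(np.arange(4, dtype=np.float32)))
```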
numba_cuda/numba/cuda/np/arrayobj.py (outdated)

    interface = pyval.__cuda_array_interface__

    # Hold on to the device-array-like object to prevent garbage collection.
    # The code library maintains a dictionary of referenced objects.
Since you showed (elsewhere) that the method I suggested to prevent pickling (patching the active code library to raise an error when it is serialized) doesn't work, I realised that this method for keeping the arrays alive will also not work - the code library that needs to survive is the one associated with the kernel, not a device function that referenced the array. So I think this will also not work for keeping referenced arrays alive.
numba_cuda/numba/cuda/np/arrayobj.py (outdated)

    lib = context.active_code_library
    referenced_objects = getattr(lib, "referenced_objects", None)
    if referenced_objects is None:
        lib.referenced_objects = referenced_objects = {}
This approach made sense for an external library like CCCL. For Numba-CUDA itself, we should initialize lib.referenced_objects in CUDACodeLibrary.__init__ instead.
When a code library is linked into another code library, we should also add the referenced objects of the library we're linking in to the current one. This should resolve the issue I mentioned above about the code library of a kernel not necessarily holding references to the device arrays it will access.
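A minimal sketch of that suggestion (the linking hook and attribute names here are illustrative assumptions, not the actual CUDACodeLibrary API):

```python
class CodeLibraryWithReferences:
    """Sketch of the ownership scheme described above."""

    def __init__(self):
        # Always present, so callers never need hasattr()/getattr() checks.
        self.referenced_objects = {}
        self._linking_libraries = []

    def add_linking_library(self, library):
        # Propagate references from the linked-in library so the kernel's
        # code library ends up holding every captured device array.
        self.referenced_objects.update(
            getattr(library, "referenced_objects", {})
        )
        self._linking_libraries.append(library)
```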
/ok to test 54d5fb3
/ok to test d4db14a
/ok to test 8d20e24
Additional Comments (1)
- numba_cuda/numba/cuda/np/numpy_support.py, lines 24-40: logic: strides_from_shape returns incorrect strides for 0-D arrays (shape ()). For a 0-D array, it returns (itemsize,) instead of (). This could cause issues if a 0-D device array is captured.
9 files reviewed, 1 comment
gmarkall left a comment
I started reviewing this but picked up on a couple of issues so I wanted to submit the review so you could look into the comments rather than waiting until I looked everything over.
The examples in this file look like they should be copy-pastable to execute, but they don't run because of missing declarations. Should they be expected to work? (If they are, it may be good to convert them to doctests, like e.g. https://github.com/NVIDIA/numba-cuda/blob/main/numba_cuda/numba/cuda/tests/doc_examples/test_random.py / https://github.com/NVIDIA/numba-cuda/blob/main/docs/source/user/examples.rst?plain=1)
They were meant to be representative/illustrative. If you think it's useful, I'll replace them with real doctests; thanks!
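For context, a minimal example of the feature being discussed (a kernel reading a device array captured from global scope); values and sizes are arbitrary, and it assumes a CUDA-capable device:

```python
from numba import cuda
import numpy as np

# A device array at global scope, captured by the kernel below.
lut = cuda.to_device(np.arange(8, dtype=np.float32))

@cuda.jit
def scale_by_lut(out):
    i = cuda.grid(1)
    if i < out.size:
        out[i] = lut[i] * 2.0

out = cuda.device_array(8, dtype=np.float32)
scale_by_lut[1, 32](out)
print(out.copy_to_host())
```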
    __cuda_array_interface__, preventing this issue.
    """

    def test_caching_rejects_captured_pointer(self):
If I run the test code from this test outside of the test suite:

    from numba import cuda
    import numpy as np

    host_data = np.array([1.0, 2.0, 3.0], dtype=np.float32)
    captured_arr = cuda.to_device(host_data)

    @cuda.jit(cache=True)
    def cached_kernel(output):
        i = cuda.grid(1)
        if i < output.size:
            output[i] = captured_arr[i] * 2.0

    output = cuda.device_array(3, dtype=np.float32)
    cached_kernel[1, 3](output)
    print(output.copy_to_host())

then it still caches the kernel, and the run fails the second time round:
Traceback (most recent call last):
File "/home/gmarkall/numbadev/issues/numba-cuda-666/test_caching.py", line 17, in <module>
print(output.copy_to_host())
^^^^^^^^^^^^^^^^^^^^^
File "/home/gmarkall/numbadev/numba-cuda/numba_cuda/numba/cuda/cudadrv/devices.py", line 233, in _require_cuda_context
return fn(*args, **kws)
^^^^^^^^^^^^^^^^
File "/home/gmarkall/numbadev/numba-cuda/numba_cuda/numba/cuda/cudadrv/devicearray.py", line 272, in copy_to_host
_driver.device_to_host(
File "/home/gmarkall/numbadev/numba-cuda/numba_cuda/numba/cuda/cudadrv/driver.py", line 2708, in device_to_host
fn(*args)
File "/home/gmarkall/numbadev/numba-cuda/numba_cuda/numba/cuda/cudadrv/driver.py", line 358, in safe_cuda_api_call
return self._check_cuda_python_error(fname, libfn(*args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/gmarkall/numbadev/numba-cuda/numba_cuda/numba/cuda/cudadrv/driver.py", line 417, in _check_cuda_python_error
raise CudaAPIError(retcode, msg)
numba.cuda.cudadrv.driver.CudaAPIError: [700] Call to cuMemcpyDtoH results in CUDA_ERROR_ILLEGAL_ADDRESS
I don't think this test is working correctly (either the code is not running or the assertion is not firing) and the code that is intended to prevent caching is not preventing caching.
See my comment below; the issue is that we handle closure variables correctly (and test appropriately), but not globals.
    if type(obj) in self.disabled_types:
        _no_pickle(obj)  # noreturn

    # Prevent pickling of objects implementing __cuda_array_interface__
I think this has no effect because it never gets to see an object with the CUDA Array Interface when pickling. The _reduce_states() method of the CUDACodeLibrary has no reference to referenced_objects, so the serialization erroneously succeeds.
To prevent caching, the CUDACodeLibrary needs to detect that it holds referenced objects:
diff --git a/numba_cuda/numba/cuda/codegen.py b/numba_cuda/numba/cuda/codegen.py
index 957dd72e..9ee91e29 100644
--- a/numba_cuda/numba/cuda/codegen.py
+++ b/numba_cuda/numba/cuda/codegen.py
@@ -463,6 +463,10 @@ class CUDACodeLibrary(serialize.ReduceMixin, CodeLibrary):
         if not self._finalized:
             raise RuntimeError("Cannot pickle unfinalized CUDACodeLibrary")
+
+        if self.referenced_objects:
+            raise RuntimeError("Cannot pickle...")
+
         return dict(
             codegen=None,
             name=self.name,
I see -- the issue is that for closure variables, we are able to raise this PicklingError correctly - this happens starting at
I followed your suggestion to correctly handle global device arrays in CUDACodeLibrary. Now we raise a PicklingError for both cases.
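A sketch of what that check could look like in _reduce_states (illustrative only; the exact wording and placement follow the discussion above rather than the merged code):

```python
import pickle

def _reduce_states_sketch(self):
    # Refuse to serialize (and therefore cache) a code library that holds
    # references to captured device arrays: their device pointers are only
    # valid in the current process.
    if self.referenced_objects:
        raise pickle.PicklingError(
            "Cannot pickle a CUDACodeLibrary that references device arrays"
        )
    # ... build and return the serializable state here ...
```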
/ok to test acce40d
Additional Comments (1)
- numba_cuda/numba/cuda/np/numpy_support.py, lines 24-40: logic: Potential issue with 0-dimensional arrays: when shape = (), this function returns (itemsize,) instead of (). For a 0-D array, shape[1:] is (), but itertools.accumulate with initial=1 produces (1,), which gets multiplied by itemsize to give (itemsize,). Check if 0-D device arrays are a valid use case for __cuda_array_interface__. If so, add special handling:
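The suggested special handling was not captured above; an illustrative sketch (the function name mirrors the helper under discussion, but the body is not the actual numpy_support.py code) could be:

```python
import operator
from itertools import accumulate

def strides_from_shape_sketch(shape, itemsize, layout="C"):
    # Special-case 0-D arrays: no axes means no strides.
    if shape == ():
        return ()
    if layout == "C":
        # Last axis varies fastest: accumulate products of trailing dims.
        rev = accumulate(reversed(shape[1:] + (1,)), operator.mul)
        return tuple(itemsize * s for s in reversed(list(rev)))
    # "F": the leading axis varies fastest.
    fwd = accumulate(shape[:-1], operator.mul, initial=1)
    return tuple(itemsize * s for s in fwd)

assert strides_from_shape_sketch((), 4) == ()
assert strides_from_shape_sketch((3, 4, 5), 4) == (80, 20, 4)
assert strides_from_shape_sketch((3, 4, 5), 4, layout="F") == (4, 12, 48)
```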
9 files reviewed, 1 comment
/ok to test 87ee0fe
numba_cuda/numba/cuda/serialize.py (outdated)

    f"pointers that cannot be safely serialized. "
    f"Disable caching (cache=False) for kernels that capture "
    f"device arrays from global scope."
    "Cannot cache kernels or device functions referencing "
If Numba-CUDA is used with a distributed runtime (like Ray) that serializes kernels to send them to remote workers, then the error message will reference caching ("Cannot cache...") rather than pickling ("Cannot pickle..."). Do you think this could be confusing? Should we stick to referring to pickling here?
A quick summary (pulling in discussions on this PR and Slack) of items I think are outstanding / noted:
See 2c3ead1
See f848c0a
See 8313619
See 2630d25
With regards to:
I wonder if mucking about with the caching infrastructure changes was causing this, especially when tests run in a certain order. Anyway, I'm hoping that with 2c3ead1 this is resolved 🤞
Additional Comments (1)
- numba_cuda/numba/cuda/np/numpy_support.py, lines 24-40: logic: strides_from_shape returns (itemsize,) for 0-D arrays instead of (). Currently not triggered due to caller checks at arrayobj.py:3679 and typeof.py:331, but could cause issues if called directly.
11 files reviewed, 1 comment
Force-pushed from 9dbf9a8 to f848c0a
/ok to test f848c0a
gmarkall left a comment
Looks good, assuming CI passes. I checked the rendering of the doctest code in the new documentation page locally.
v0.23.0

- Capture global device arrays in kernels and device functions (#666)
- Fix #624: Accept Numba IR nodes in all places Numba-CUDA IR nodes are expected (#643)
- Fix Issue #588: separate compilation of NVVM IR modules when generating debuginfo (#591)
- feat: allow printing nested tuples (#667)
- build(deps): bump actions/setup-python from 5.6.0 to 6.1.0 (#655)
- build(deps): bump actions/upload-artifact from 4 to 5 (#652)
- Test RAPIDS 25.12 (#661)
- Do not manually set DUMP_ASSEMBLY in `nvjitlink` tests (#662)
- feat: add print support for int64 tuples (#663)
- Only run dependabot monthly and open fewer PRs (#658)
- test: fix bogus `self` argument to `Context` (#656)
- Fix false negative NRT link decision when NRT was previously toggled on (#650)
- Add support for dependabot (#647)
- refactor: cull dead linker objects (#649)
- Migrate numba-cuda driver to use cuda.core.launch API (#609)
- feat: add set_shared_memory_carveout (#629)
- chore: bump version in pixi.toml (#641)
- refactor: remove devicearray code to reduce complexity (#600)
Closes #659