Capture global device arrays in kernels and device functions #666
gmarkall merged 17 commits into NVIDIA:main
Conversation
/ok to test c17b44e
Greptile Summary: Implements support for capturing global device arrays (objects implementing __cuda_array_interface__) in CUDA kernels and device functions.
Confidence Score: 5/5
cpcloud left a comment
Looks good.
I'd like to see a couple changes on the hasattr use and computing strides.
Additional Comments (1)
- numba_cuda/numba/cuda/np/arrayobj.py, lines 3674-3683: logic: strides are calculated in reverse order. For shape (3, 4, 5) with itemsize 4, this produces (4, 20, 80) instead of the correct C-contiguous strides (80, 20, 4), causing incorrect memory access for multidimensional arrays.
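For reference, NumPy reports the expected C-contiguous strides for this case, so it can serve as a quick sanity check (the array below is only illustrative):

```python
import numpy as np

# For shape (3, 4, 5) with a 4-byte itemsize, the C-contiguous strides
# should be (80, 20, 4): the last axis varies fastest.
a = np.empty((3, 4, 5), dtype=np.float32)
print(a.strides)  # (80, 20, 4)
```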
5 files reviewed, 1 comment
Force-pushed from 4c316e7 to a7f451c
/ok to test 11bc98f
Additional Comments (1)
- numba_cuda/numba/cuda/np/numpy_support.py, lines 24-40: logic: edge case: 0-D arrays (shape ()) will return (itemsize,) instead of (). For C-order with shape=(), shape[1:] gives (), reversed is [], accumulate yields [1], multiplying by itemsize gives [itemsize], and reversing gives (itemsize,) instead of the expected (). However, this is unlikely to affect real usage since:
  - in typeof.py:332-333, 0-D arrays return early with layout "C" without calling this function
  - in arrayobj.py:3683-3684, this is only called when strides is None, which is rare for 0-D device arrays
  - 0-D CUDA arrays are uncommon in practice
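For comparison, NumPy itself reports empty strides for a 0-D array, which is the behavior the comment above expects for shape ():

```python
import numpy as np

# A 0-D array has no axes, so both its shape and strides are empty tuples.
a = np.empty((), dtype=np.float32)
print(a.shape)    # ()
print(a.strides)  # ()
```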
8 files reviewed, 1 comment
/ok to test 98799f8
    Tests for capturing device arrays (objects implementing __cuda_array_interface__)
    from global scope in CUDA kernels and device functions.

    This tests the capture of third-party arrays (like CuPy) that implement
The docs say this works for Numba device arrays but this implies that it doesn't (because Numba device arrays have _numba_type_). I think this docstring rather than the documentation is incorrect - is that right?
I've fixed this module docstring to say the correct thing.
I also replaced the use of CuPy with ForeignArray, and extended the tests to use both DeviceNDArray and ForeignArray.
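For readers unfamiliar with the ForeignArray helper, the idea is a thin wrapper that exposes only the CUDA Array Interface of an underlying device array, standing in for a third-party array type such as CuPy. A minimal sketch (not the actual test helper) might look like:

```python
from numba import cuda
import numpy as np


class ForeignArraySketch:
    """Sketch of a wrapper exposing only __cuda_array_interface__."""

    def __init__(self, arr):
        # Keep the underlying Numba device array alive.
        self._arr = arr

    @property
    def __cuda_array_interface__(self):
        return self._arr.__cuda_array_interface__


foreign = ForeignArraySketch(cuda.to_device(np.arange(4, dtype=np.float32)))
```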
numba_cuda/numba/cuda/np/arrayobj.py (outdated)

    interface = pyval.__cuda_array_interface__

    # Hold on to the device-array-like object to prevent garbage collection.
    # The code library maintains a dictionary of referenced objects.
Since you showed (elsewhere) that the method I suggested to prevent pickling (patching the active code library to raise an error when it is serialized) doesn't work, I realised that this method for keeping the arrays alive will also not work - the code library that needs to survive is the one associated with the kernel, not a device function that referenced the array. So I think this will also not work for keeping referenced arrays alive.
numba_cuda/numba/cuda/np/arrayobj.py (outdated)

    lib = context.active_code_library
    referenced_objects = getattr(lib, "referenced_objects", None)
    if referenced_objects is None:
        lib.referenced_objects = referenced_objects = {}
This approach made sense for an external library like CCCL. For Numba-CUDA itself, we should initialize lib.referenced_objects in CUDACodeLibrary.__init__ instead.
When a code library is linked into another code library, we should also add the referenced objects of the library we're linking in to the current one. This should resolve the issue I mentioned above about the code library of a kernel not necessarily holding references to the device arrays it will access.
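A minimal sketch of that suggestion (the linking hook and attribute names here are illustrative assumptions, not the actual CUDACodeLibrary API):

```python
class CodeLibraryWithReferences:
    """Sketch of the ownership scheme described above."""

    def __init__(self):
        # Always present, so callers never need hasattr()/getattr() checks.
        self.referenced_objects = {}
        self._linking_libraries = []

    def add_linking_library(self, library):
        # Propagate references from the linked-in library so the kernel's
        # code library ends up holding every captured device array.
        self.referenced_objects.update(
            getattr(library, "referenced_objects", {})
        )
        self._linking_libraries.append(library)
```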
/ok to test 54d5fb3
/ok to test d4db14a
/ok to test 8d20e24
Additional Comments (1)
- numba_cuda/numba/cuda/np/numpy_support.py, lines 24-40: logic: strides_from_shape returns incorrect strides for 0-D arrays (shape ()). For a 0-D array, it returns (itemsize,) instead of (). This could cause issues if a 0-D device array is captured.
9 files reviewed, 1 comment
gmarkall left a comment
I started reviewing this but picked up on a couple of issues so I wanted to submit the review so you could look into the comments rather than waiting until I looked everything over.
The examples in this file look like they should be copy-pastable to execute, but they don't run because of missing declarations. Should they be expected to work? (If they are, it may be good to convert them to doctests, like e.g. https://github.com/NVIDIA/numba-cuda/blob/main/numba_cuda/numba/cuda/tests/doc_examples/test_random.py / https://github.com/NVIDIA/numba-cuda/blob/main/docs/source/user/examples.rst?plain=1)
They were meant to be representative/illustrative. If you think it's useful, I'll replace them with real doctests; thanks!
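For context, a minimal example of the feature being discussed (a kernel reading a device array captured from global scope); values and sizes are arbitrary, and it assumes a CUDA-capable device:

```python
from numba import cuda
import numpy as np

# A device array at global scope, captured by the kernel below.
lut = cuda.to_device(np.arange(8, dtype=np.float32))

@cuda.jit
def scale_by_lut(out):
    i = cuda.grid(1)
    if i < out.size:
        out[i] = lut[i] * 2.0

out = cuda.device_array(8, dtype=np.float32)
scale_by_lut[1, 32](out)
print(out.copy_to_host())
```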
    __cuda_array_interface__, preventing this issue.
    """

    def test_caching_rejects_captured_pointer(self):
If I run the test code from this test outside of the test suite:

    from numba import cuda
    import numpy as np

    host_data = np.array([1.0, 2.0, 3.0], dtype=np.float32)
    captured_arr = cuda.to_device(host_data)

    @cuda.jit(cache=True)
    def cached_kernel(output):
        i = cuda.grid(1)
        if i < output.size:
            output[i] = captured_arr[i] * 2.0

    output = cuda.device_array(3, dtype=np.float32)
    cached_kernel[1, 3](output)
    print(output.copy_to_host())

then it still caches the kernel, and the run fails the second time round:
Traceback (most recent call last):
File "/home/gmarkall/numbadev/issues/numba-cuda-666/test_caching.py", line 17, in <module>
print(output.copy_to_host())
^^^^^^^^^^^^^^^^^^^^^
File "/home/gmarkall/numbadev/numba-cuda/numba_cuda/numba/cuda/cudadrv/devices.py", line 233, in _require_cuda_context
return fn(*args, **kws)
^^^^^^^^^^^^^^^^
File "/home/gmarkall/numbadev/numba-cuda/numba_cuda/numba/cuda/cudadrv/devicearray.py", line 272, in copy_to_host
_driver.device_to_host(
File "/home/gmarkall/numbadev/numba-cuda/numba_cuda/numba/cuda/cudadrv/driver.py", line 2708, in device_to_host
fn(*args)
File "/home/gmarkall/numbadev/numba-cuda/numba_cuda/numba/cuda/cudadrv/driver.py", line 358, in safe_cuda_api_call
return self._check_cuda_python_error(fname, libfn(*args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/gmarkall/numbadev/numba-cuda/numba_cuda/numba/cuda/cudadrv/driver.py", line 417, in _check_cuda_python_error
raise CudaAPIError(retcode, msg)
numba.cuda.cudadrv.driver.CudaAPIError: [700] Call to cuMemcpyDtoH results in CUDA_ERROR_ILLEGAL_ADDRESS
I don't think this test is working correctly (either the code is not running or the assertion is not firing) and the code that is intended to prevent caching is not preventing caching.
See my comment below; the issue is that we handle closure variables correctly (and test appropriately), but not globals.
    if type(obj) in self.disabled_types:
        _no_pickle(obj)  # noreturn

    # Prevent pickling of objects implementing __cuda_array_interface__
I think this has no effect because it never gets to see an object with the CUDA Array Interface when pickling. The _reduce_states() method of the CUDACodeLibrary has no reference to referenced_objects, so the serialization erroneously succeeds.
To prevent caching, the CUDACodeLibrary needs to detect that it holds referenced objects:
diff --git a/numba_cuda/numba/cuda/codegen.py b/numba_cuda/numba/cuda/codegen.py
index 957dd72e..9ee91e29 100644
--- a/numba_cuda/numba/cuda/codegen.py
+++ b/numba_cuda/numba/cuda/codegen.py
@@ -463,6 +463,10 @@ class CUDACodeLibrary(serialize.ReduceMixin, CodeLibrary):
         if not self._finalized:
             raise RuntimeError("Cannot pickle unfinalized CUDACodeLibrary")
+
+        if self.referenced_objects:
+            raise RuntimeError("Cannot pickle...")
+
         return dict(
             codegen=None,
             name=self.name,
I see -- the issue is that for closure variables, we are able to raise this PicklingError correctly - this happens starting at
I followed your suggestion to correctly handle global device arrays in CUDACodeLibrary. Now we raise a PicklingError for both cases.
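A sketch of what that check could look like in _reduce_states (illustrative only; the exact wording and placement follow the discussion above rather than the merged code):

```python
import pickle

def _reduce_states_sketch(self):
    # Refuse to serialize (and therefore cache) a code library that holds
    # references to captured device arrays: their device pointers are only
    # valid in the current process.
    if self.referenced_objects:
        raise pickle.PicklingError(
            "Cannot pickle a CUDACodeLibrary that references device arrays"
        )
    # ... build and return the serializable state here ...
```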
/ok to test acce40d
Additional Comments (1)
- numba_cuda/numba/cuda/np/numpy_support.py, lines 24-40: logic: Potential issue with 0-dimensional arrays: when shape = (), this function returns (itemsize,) instead of (). For a 0-D array, shape[1:] is (), but itertools.accumulate with initial=1 produces (1,), which gets multiplied by itemsize to give (itemsize,). Check if 0-D device arrays are a valid use case for __cuda_array_interface__. If so, add special handling:
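The suggested special handling was not captured above; an illustrative sketch (the function name mirrors the helper under discussion, but the body is not the actual numpy_support.py code) could be:

```python
import operator
from itertools import accumulate

def strides_from_shape_sketch(shape, itemsize, layout="C"):
    # Special-case 0-D arrays: no axes means no strides.
    if shape == ():
        return ()
    if layout == "C":
        # Last axis varies fastest: accumulate products of trailing dims.
        rev = accumulate(reversed(shape[1:] + (1,)), operator.mul)
        return tuple(itemsize * s for s in reversed(list(rev)))
    # "F": the leading axis varies fastest.
    fwd = accumulate(shape[:-1], operator.mul, initial=1)
    return tuple(itemsize * s for s in fwd)

assert strides_from_shape_sketch((), 4) == ()
assert strides_from_shape_sketch((3, 4, 5), 4) == (80, 20, 4)
assert strides_from_shape_sketch((3, 4, 5), 4, layout="F") == (4, 12, 48)
```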
9 files reviewed, 1 comment
/ok to test 87ee0fe
numba_cuda/numba/cuda/serialize.py (outdated)

    f"pointers that cannot be safely serialized. "
    f"Disable caching (cache=False) for kernels that capture "
    f"device arrays from global scope."
    "Cannot cache kernels or device functions referencing "
If Numba-CUDA is used with a distributed runtime (like Ray) that serializes kernels to send them to remote workers, then the error message will reference caching ("Cannot cache...") rather than pickling ("Cannot pickle..."). Do you think this could be confusing? Should we stick to referring to pickling here?
A quick summary (pulling in discussions on this PR and Slack) of items I think are outstanding / noted:
See 2c3ead1
See f848c0a
See 8313619
See 2630d25
With regards to:
I wonder if mucking about with the caching infrastructure changes was causing this, especially when tests run in a certain order. Anyway, I'm hoping that with 2c3ead1 this is resolved 🤞
Additional Comments (1)
- numba_cuda/numba/cuda/np/numpy_support.py, lines 24-40: logic: strides_from_shape returns (itemsize,) for 0-D arrays instead of (). Currently not triggered due to caller checks at arrayobj.py:3679 and typeof.py:331, but could cause issues if called directly.
11 files reviewed, 1 comment
Force-pushed from 9dbf9a8 to f848c0a
/ok to test f848c0a
gmarkall left a comment
Looks good, assuming CI passes. I checked the rendering of the doctest code in the new documentation page locally.
v0.23.0

- Capture global device arrays in kernels and device functions (#666)
- Fix #624: Accept Numba IR nodes in all places Numba-CUDA IR nodes are expected (#643)
- Fix Issue #588: separate compilation of NVVM IR modules when generating debuginfo (#591)
- feat: allow printing nested tuples (#667)
- build(deps): bump actions/setup-python from 5.6.0 to 6.1.0 (#655)
- build(deps): bump actions/upload-artifact from 4 to 5 (#652)
- Test RAPIDS 25.12 (#661)
- Do not manually set DUMP_ASSEMBLY in `nvjitlink` tests (#662)
- feat: add print support for int64 tuples (#663)
- Only run dependabot monthly and open fewer PRs (#658)
- test: fix bogus `self` argument to `Context` (#656)
- Fix false negative NRT link decision when NRT was previously toggled on (#650)
- Add support for dependabot (#647)
- refactor: cull dead linker objects (#649)
- Migrate numba-cuda driver to use cuda.core.launch API (#609)
- feat: add set_shared_memory_carveout (#629)
- chore: bump version in pixi.toml (#641)
- refactor: remove devicearray code to reduce complexity (#600)
Closes #659