Capture global device arrays in kernels and device functions #666

Merged
gmarkall merged 17 commits into NVIDIA:main from shwina:global-arrays on Dec 17, 2025

Conversation

@shwina (Contributor) commented Dec 15, 2025

Closes #659
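
(For context, a minimal sketch of what this PR enables, adapted from the reproduction gmarkall posts later in this thread; shown here without cache=True, since caching such kernels is deliberately unsupported:)

from numba import cuda
import numpy as np

# A device array created at module scope and captured by the kernel below as
# a global. With this PR, the array's device pointer is embedded as a
# constant (the data is not copied), so mutations remain visible across
# kernel launches.
weights = cuda.to_device(np.array([1.0, 2.0, 3.0], dtype=np.float32))

@cuda.jit
def scale(out):
    i = cuda.grid(1)
    if i < out.size:
        out[i] = weights[i] * 2.0

out = cuda.device_array(3, dtype=np.float32)
scale[1, 3](out)
print(out.copy_to_host())  # [2. 4. 6.]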

copy-pr-bot (bot) commented Dec 15, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.


@shwina (Contributor Author) commented Dec 15, 2025

/ok to test c17b44e

greptile-apps (bot) commented Dec 15, 2025

Greptile Summary

Implements support for capturing global device arrays (objects implementing __cuda_array_interface__) in CUDA kernels and device functions. Device pointers are embedded as constants rather than copying data, making mutations visible across kernel calls.

Key changes:

  • Typing system (typeof.py) now recognizes __cuda_array_interface__ objects and determines their array type with proper layout detection
  • Lowering (arrayobj.py) embeds device pointers directly as constants and stores references to prevent garbage collection
  • Code generation (codegen.py) maintains referenced_objects dict to keep captured arrays alive
  • Type inference (typeinfer.py) exempts device arrays from readonly marking, preserving mutability
  • Serialization properly prevents caching of kernels with captured device arrays via checks in both serialize.py and codegen.py
  • Comprehensive test coverage for various array types, dimensions, dtypes, and mutability
  • Clear documentation explaining the difference between constant capture (host arrays) and pointer capture (device arrays)

Confidence Score: 5/5

  • This PR is safe to merge with high confidence
  • The implementation is well-architected with proper safeguards: device arrays are kept alive via reference counting, caching is correctly prevented to avoid stale pointers, the typing and lowering logic is sound, and comprehensive tests cover edge cases including 0-D arrays, multiple dtypes, and mutability
  • No files require special attention

Important Files Changed

  • numba_cuda/numba/cuda/typing/typeof.py: Adds _typeof_cuda_array_interface to handle typing of objects implementing __cuda_array_interface__ (e.g., CuPy arrays), enabling them to be captured as globals. Layout detection logic is comprehensive and correct.
  • numba_cuda/numba/cuda/np/arrayobj.py: Modifies constant_array to detect device arrays and call _lower_constant_device_array, which embeds the device pointer as a constant and stores a reference to prevent GC. Implementation is sound.
  • numba_cuda/numba/cuda/codegen.py: Adds a referenced_objects dict to keep device arrays alive and prevents caching of kernels with captured device arrays via a _reduce_states check, preventing invalid pointer serialization.
  • numba_cuda/numba/cuda/core/typeinfer.py: Exempts device arrays implementing __cuda_array_interface__ from readonly marking, allowing them to be mutable when captured as globals. Clean, minimal change.
  • numba_cuda/numba/cuda/serialize.py: Adds a check in NumbaPickler.reducer_override to prevent pickling objects with __cuda_array_interface__, raising a clear error message about invalid device pointers.
  • numba_cuda/numba/cuda/np/numpy_support.py: Adds a strides_from_shape utility to compute array strides for C/F-contiguous layouts. Correctly handles 0-D arrays. Used in device array lowering.
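
(To make the layout-detection point concrete, here is a hedged sketch of how a typer can classify layout from a __cuda_array_interface__ shape/strides pair. The function name and structure are illustrative, not the PR's actual code; the None-means-C-contiguous rule comes from the CUDA Array Interface specification.)

def cai_layout(shape, strides, itemsize):
    # Illustrative classifier: decide "C", "F", or "A" from shape/strides.
    if not shape:
        return "C"  # 0-D arrays are trivially contiguous
    if strides is None:
        return "C"  # the CAI spec uses strides=None for C-contiguous data
    c, acc = [], itemsize
    for dim in reversed(shape):
        c.append(acc)
        acc *= dim
    c = tuple(reversed(c))
    f, acc = [], itemsize
    for dim in shape:
        f.append(acc)
        acc *= dim
    if tuple(strides) == c:
        return "C"
    if tuple(strides) == tuple(f):
        return "F"
    return "A"  # arbitrary / non-contiguous layout

print(cai_layout((3, 4, 5), (80, 20, 4), 4))  # C
print(cai_layout((3, 4, 5), (4, 12, 48), 4))  # F
print(cai_layout((3, 4, 5), None, 4))         # C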

@greptile-apps (bot) left a comment

5 files reviewed, no comments

@cpcloud (Contributor) left a comment

Looks good.

I'd like to see a couple of changes on the hasattr use and computing strides.

@greptile-apps (bot) left a comment

Additional Comments (1)

  1. numba_cuda/numba/cuda/np/arrayobj.py, line 3674-3683 (link)

    logic: strides are calculated in reverse order - for shape (3, 4, 5) with itemsize 4, this produces (4, 20, 80) instead of correct C-contiguous strides (80, 20, 4), causing incorrect memory access for multidimensional arrays
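
(To make the reported orientation concrete, NumPy's own strides for that example confirm the expected values:)

import numpy as np

# For shape (3, 4, 5) with 4-byte items, C-contiguous strides are (80, 20, 4):
# the innermost axis moves by one itemsize, the outermost by a full 4x5 plane.
print(np.empty((3, 4, 5), dtype=np.float32).strides)  # (80, 20, 4)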

5 files reviewed, 1 comment


@greptile-apps (bot) left a comment

5 files reviewed, no comments

@gmarkall added the "3 - Ready for Review" label on Dec 16, 2025
@shwina (Contributor Author) commented Dec 16, 2025

/ok to test 11bc98f

@greptile-apps (bot) left a comment

Additional Comments (1)

  1. numba_cuda/numba/cuda/np/numpy_support.py, line 24-40 (link)

    logic: edge case: 0D arrays (shape ()) will return (itemsize,) instead of (). For C-order with shape=(), shape[1:] gives (), reversed is [], accumulate yields [1], multiply by itemsize gives [itemsize], and reverse gives (itemsize,) instead of the expected ().

    however, this is unlikely to affect real usage since:

    1. in typeof.py:332-333, 0D arrays return early with layout "C" without calling this function
    2. in arrayobj.py:3683-3684, this is only called when strides is None, which is rare for 0D device arrays
    3. 0D CUDA arrays are uncommon in practice
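
(An illustrative reconstruction of the accumulate-based logic described above, showing the 0-D edge case; this mirrors the description, not necessarily the exact source:)

import itertools
import operator

def strides_from_shape_sketch(shape, itemsize):
    # C-order: drop the outermost dimension, then take running products of
    # the remaining dims from the inside out, scaled by the item size.
    prods = itertools.accumulate(reversed(shape[1:]), operator.mul, initial=1)
    return tuple(reversed([p * itemsize for p in prods]))

print(strides_from_shape_sketch((3, 4, 5), 4))  # (80, 20, 4) -- correct
print(strides_from_shape_sketch((), 4))         # (4,) -- should be () for 0-D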

8 files reviewed, 1 comment


@shwina (Contributor Author) commented Dec 16, 2025

/ok to test 98799f8

@cpcloud (Contributor) left a comment

LGTM, ship it!

@greptile-apps (bot) left a comment

8 files reviewed, no comments

(Quoted from the test module docstring:)

Tests for capturing device arrays (objects implementing __cuda_array_interface__)
from global scope in CUDA kernels and device functions.

This tests the capture of third-party arrays (like CuPy) that implement
Contributor:

The docs say this works for Numba device arrays, but this implies that it doesn't (because Numba device arrays have _numba_type_). I think this docstring, rather than the documentation, is incorrect - is that right?

Contributor Author:
I've fixed this module docstring to say the correct thing.

I also replaced the use of CuPy with ForeignArray, and extended the tests to use both DeviceNDArray and ForeignArray.

(Quoted diff context from arrayobj.py:)

interface = pyval.__cuda_array_interface__

# Hold on to the device-array-like object to prevent garbage collection.
# The code library maintains a dictionary of referenced objects.
Contributor:

Since you showed (elsewhere) that the method I suggested to prevent pickling (patching the active code library to raise an error when it is serialized) doesn't work, I realised that this method for keeping the arrays alive will also not work - the code library that needs to survive is the one associated with the kernel, not a device function that referenced the array. So I think this will also not work for keeping referenced arrays alive.

(Quoted diff context, continued:)

lib = context.active_code_library
referenced_objects = getattr(lib, "referenced_objects", None)
if referenced_objects is None:
    lib.referenced_objects = referenced_objects = {}
Contributor:

This approach made sense for an external library like CCCL. For Numba-CUDA itself, we should initialize lib.referenced_objects in CUDACodeLibrary.__init__ instead.

When a code library is linked into another code library, we should also add the referenced objects of the library we're linking in to the current one. This should resolve the issue I mentioned above about the code library of a kernel not necessarily holding references to the device arrays it will access.
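
(A rough sketch of the shape of that suggestion; the class and method names here are stand-ins, not the actual change, which landed in 54d5fb3:)

class CodeLibrarySketch:
    """Hypothetical stand-in for CUDACodeLibrary illustrating the design."""

    def __init__(self, name):
        self.name = name
        # Owned from construction rather than attached lazily via getattr.
        self.referenced_objects = {}

    def add_linking_library(self, other):
        # Adopt the linked library's kept-alive objects so the top-level
        # kernel's library ends up holding every captured device array.
        self.referenced_objects.update(other.referenced_objects)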

Contributor Author:

Implemented this solution in 54d5fb3

@shwina (Contributor Author) commented Dec 16, 2025

/ok to test 54d5fb3

@shwina (Contributor Author) commented Dec 16, 2025

/ok to test d4db14a

@greptile-apps (bot) left a comment

9 files reviewed, no comments

@shwina (Contributor Author) commented Dec 16, 2025

/ok to test 8d20e24

@shwina (Contributor Author) commented Dec 16, 2025

/ok to test 8d20e24

@greptile-apps (bot) left a comment

Additional Comments (1)

  1. numba_cuda/numba/cuda/np/numpy_support.py, line 24-40 (link)

    logic: strides_from_shape returns incorrect strides for 0-D arrays (shape ()). For a 0-D array, it returns (itemsize,) instead of (). This could cause issues if a 0-D device array is captured.

9 files reviewed, 1 comment


@gmarkall (Contributor) left a comment

I started reviewing this but picked up on a couple of issues so I wanted to submit the review so you could look into the comments rather than waiting until I looked everything over.

Contributor:

The examples in this file look like they should be copy-pastable to execute, but they don't run because of missing declarations. Should they be expected to work? (If they are, it may be good to convert them to doctests, e.g. https://github.com/NVIDIA/numba-cuda/blob/main/numba_cuda/numba/cuda/tests/doc_examples/test_random.py / https://github.com/NVIDIA/numba-cuda/blob/main/docs/source/user/examples.rst?plain=1)

Contributor Author:

They were meant to be representative/illustrative. If you think it's useful, I'll replace them with real doctests; thanks!

(Quoted test code context:)

__cuda_array_interface__, preventing this issue.
"""

def test_caching_rejects_captured_pointer(self):
Contributor:

If I run the test code from this test outside of the test suite:

from numba import cuda
import numpy as np

host_data = np.array([1.0, 2.0, 3.0], dtype=np.float32)
captured_arr = cuda.to_device(host_data)

@cuda.jit(cache=True)
def cached_kernel(output):
    i = cuda.grid(1)
    if i < output.size:
        output[i] = captured_arr[i] * 2.0

output = cuda.device_array(3, dtype=np.float32)

cached_kernel[1, 3](output)

print(output.copy_to_host())

then it still caches the kernel, and the run fails the second time round:

Traceback (most recent call last):
  File "/home/gmarkall/numbadev/issues/numba-cuda-666/test_caching.py", line 17, in <module>
    print(output.copy_to_host())
          ^^^^^^^^^^^^^^^^^^^^^
  File "/home/gmarkall/numbadev/numba-cuda/numba_cuda/numba/cuda/cudadrv/devices.py", line 233, in _require_cuda_context
    return fn(*args, **kws)
           ^^^^^^^^^^^^^^^^
  File "/home/gmarkall/numbadev/numba-cuda/numba_cuda/numba/cuda/cudadrv/devicearray.py", line 272, in copy_to_host
    _driver.device_to_host(
  File "/home/gmarkall/numbadev/numba-cuda/numba_cuda/numba/cuda/cudadrv/driver.py", line 2708, in device_to_host
    fn(*args)
  File "/home/gmarkall/numbadev/numba-cuda/numba_cuda/numba/cuda/cudadrv/driver.py", line 358, in safe_cuda_api_call
    return self._check_cuda_python_error(fname, libfn(*args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/gmarkall/numbadev/numba-cuda/numba_cuda/numba/cuda/cudadrv/driver.py", line 417, in _check_cuda_python_error
    raise CudaAPIError(retcode, msg)
numba.cuda.cudadrv.driver.CudaAPIError: [700] Call to cuMemcpyDtoH results in CUDA_ERROR_ILLEGAL_ADDRESS

I don't think this test is working correctly (either the code is not running or the assertion is not firing) and the code that is intended to prevent caching is not preventing caching.

Contributor Author:

See my comment below; the issue is that we handle closure variables correctly (and test appropriately), but not globals.

(Quoted diff context from serialize.py:)

if type(obj) in self.disabled_types:
    _no_pickle(obj)  # noreturn

# Prevent pickling of objects implementing __cuda_array_interface__
Contributor:

I think this has no effect because it never gets to see an object with the CUDA Array Interface when pickling. The _reduce_states() method of the CUDACodeLibrary has no reference to referenced_objects, so the serialization erroneously succeeds.

To prevent caching, the CUDACodeLibrary needs to detect that it holds referenced objects:

diff --git a/numba_cuda/numba/cuda/codegen.py b/numba_cuda/numba/cuda/codegen.py
index 957dd72e..9ee91e29 100644
--- a/numba_cuda/numba/cuda/codegen.py
+++ b/numba_cuda/numba/cuda/codegen.py
@@ -463,6 +463,10 @@ class CUDACodeLibrary(serialize.ReduceMixin, CodeLibrary):
 
         if not self._finalized:
             raise RuntimeError("Cannot pickle unfinalized CUDACodeLibrary")
+
+        if self.referenced_objects:
+            raise RuntimeError("Cannot pickle...")
+
         return dict(
             codegen=None,
             name=self.name,

@shwina (Contributor Author) commented Dec 17, 2025:

I see -- the issue is that for closure variables, we are able to raise this PicklingError correctly; this happens starting at `cvarbytes = dumps(cvars)`. The tests I have currently use closure variables (not globals).

I followed your suggestion to correctly handle global device arrays in CUDACodeLibrary. Now we raise a PicklingError for both cases.

acce40d
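
(For context, a minimal sketch of the kind of reducer_override guard the Greptile file summary above describes; the class here is a hypothetical stand-in, not the real NumbaPickler:)

import io
import pickle

class GuardedPickler(pickle.Pickler):
    """Hypothetical stand-in illustrating the pickling guard."""

    def reducer_override(self, obj):
        # A raw device pointer is only meaningful in the process (and CUDA
        # context) that allocated it, so refuse to serialize such objects.
        if hasattr(obj, "__cuda_array_interface__"):
            raise pickle.PicklingError(
                "cannot pickle an object exposing __cuda_array_interface__: "
                "its device pointer would be invalid when deserialized"
            )
        return NotImplemented  # defer to normal pickling for everything else

class FakeDeviceArray:
    __cuda_array_interface__ = {}  # minimal marker for the demo

buf = io.BytesIO()
try:
    GuardedPickler(buf).dump(FakeDeviceArray())
except pickle.PicklingError as e:
    print(e)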

@shwina (Contributor Author) commented Dec 17, 2025

/ok to test acce40d

@greptile-apps (bot) left a comment

Additional Comments (1)

  1. numba_cuda/numba/cuda/np/numpy_support.py, line 24-40 (link)

    logic: Potential issue with 0-dimensional arrays: when shape = (), this function returns (itemsize,) instead of ().

For a 0-d array, shape[1:] is (), but itertools.accumulate with initial=1 produces (1,), which gets multiplied by itemsize to give (itemsize,).

    Check if 0-d device arrays are a valid use case for __cuda_array_interface__. If so, add special handling:
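
(The suggested snippet was not captured above; an obvious guard, shown here as a hedged sketch building on the reconstruction earlier in this thread, would be an early return for empty shapes:)

import itertools
import operator

def strides_from_shape_sketch(shape, itemsize):
    if not shape:  # a 0-D array has no axes, hence an empty strides tuple
        return ()
    prods = itertools.accumulate(reversed(shape[1:]), operator.mul, initial=1)
    return tuple(reversed([p * itemsize for p in prods]))

assert strides_from_shape_sketch((), 4) == ()
assert strides_from_shape_sketch((3, 4, 5), 4) == (80, 20, 4)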

9 files reviewed, 1 comment


@shwina (Contributor Author) commented Dec 17, 2025

/ok to test 87ee0fe

@greptile-apps (bot) left a comment

9 files reviewed, no comments


f"pointers that cannot be safely serialized. "
f"Disable caching (cache=False) for kernels that capture "
f"device arrays from global scope."
"Cannot cache kernels or device functions referencing "
Contributor:

If Numba-CUDA is used with a distributed runtime (like Ray) that serializes kernels to send them to remote workers, then the error message will reference caching ("Cannot cache...") rather than pickling ("Cannot pickle..."). Do you think this could be confusing? Should we stick to referring to pickling here?

@gmarkall (Contributor) commented:

A quick summary (pulling in discussions on this PR and Slack) of items I think are outstanding / noted:

  • Modify the caching tests to use the same form as test_caching.py (possibly moving them there for ease of reuse of caching test infrastructure).
  • Keep the PicklingError in CUDACodeLibrary._reduce_states(). The other error case raises RuntimeError, but that was a mistake to have used that exception class, so we should not copy it just for consistency.
  • Potentially add a test with a 0D array (Greptile seems to think this will be an issue).
  • Potentially reword exception messages, due to the serialization use cases that are not caching.
  • Resolution of the test_tanhf_compile_ptx issue (I do not yet know why this is occurring).

@shwina (Contributor Author) commented Dec 17, 2025

> Modify the caching tests to use the same form as test_caching.py (possibly moving them there for ease of reuse of caching test infrastructure)

See 2c3ead1

> Potentially add a test with a 0D array (Greptile seems to think this will be an issue).

See f848c0a

> Potentially reword exception messages, due to the serialization use cases that are not caching.

See 8313619

> Add doctests (from the comment above)

See 2630d25

With regards to:

> Resolution of the test_tanhf_compile_ptx issue (I do not yet know why this is occurring).

I wonder if mucking about with the caching infrastructure was causing this, especially when tests run in a certain order. Anyway, I'm hoping that with 2c3ead1 this is resolved 🤞

@greptile-apps (bot) left a comment

Additional Comments (1)

  1. numba_cuda/numba/cuda/np/numpy_support.py, line 24-40 (link)

    logic: strides_from_shape returns (itemsize,) for 0-D arrays instead of (). Currently not triggered due to caller checks at arrayobj.py:3679 and typeof.py:331, but could cause issues if called directly.

11 files reviewed, 1 comment


@shwina (Contributor Author) commented Dec 17, 2025

/ok to test f848c0a

@gmarkall (Contributor) left a comment

Looks good, assuming CI passes. I checked the rendering of the doctest code in the new documentation page locally.

gmarkall merged commit 09715e3 into NVIDIA:main on Dec 17, 2025
72 checks passed
gmarkall added a commit that referenced this pull request on Dec 17, 2025:

- Capture global device arrays in kernels and device functions (#666)
- Fix #624: Accept Numba IR nodes in all places Numba-CUDA IR nodes are expected (#643)
- Fix Issue #588: separate compilation of NVVM IR modules when generating debuginfo (#591)
- feat: allow printing nested tuples (#667)
- build(deps): bump actions/setup-python from 5.6.0 to 6.1.0 (#655)
- build(deps): bump actions/upload-artifact from 4 to 5 (#652)
- Test RAPIDS 25.12 (#661)
- Do not manually set DUMP_ASSEMBLY in `nvjitlink` tests (#662)
- feat: add print support for int64 tuples (#663)
- Only run dependabot monthly and open fewer PRs (#658)
- test: fix bogus `self` argument to `Context` (#656)
- Fix false negative NRT link decision when NRT was previously toggled on (#650)
- Add support for dependabot (#647)
- refactor: cull dead linker objects (#649)
- Migrate numba-cuda driver to use cuda.core.launch API (#609)
- feat: add set_shared_memory_carveout (#629)
- chore: bump version in pixi.toml (#641)
- refactor: remove devicearray code to reduce complexity (#600)
ZzEeKkAa added a commit to ZzEeKkAa/numba-cuda that referenced this pull request on Jan 8, 2026:
v0.23.0

- Capture global device arrays in kernels and device functions (NVIDIA#666)
- Fix NVIDIA#624: Accept Numba IR nodes in all places Numba-CUDA IR nodes are expected (NVIDIA#643)
- Fix Issue NVIDIA#588: separate compilation of NVVM IR modules when generating debuginfo (NVIDIA#591)
- feat: allow printing nested tuples (NVIDIA#667)
- build(deps): bump actions/setup-python from 5.6.0 to 6.1.0 (NVIDIA#655)
- build(deps): bump actions/upload-artifact from 4 to 5 (NVIDIA#652)
- Test RAPIDS 25.12 (NVIDIA#661)
- Do not manually set DUMP_ASSEMBLY in `nvjitlink` tests (NVIDIA#662)
- feat: add print support for int64 tuples (NVIDIA#663)
- Only run dependabot monthly and open fewer PRs (NVIDIA#658)
- test: fix bogus `self` argument to `Context` (NVIDIA#656)
- Fix false negative NRT link decision when NRT was previously toggled on (NVIDIA#650)
- Add support for dependabot (NVIDIA#647)
- refactor: cull dead linker objects (NVIDIA#649)
- Migrate numba-cuda driver to use cuda.core.launch API (NVIDIA#609)
- feat: add set_shared_memory_carveout (NVIDIA#629)
- chore: bump version in pixi.toml (NVIDIA#641)
- refactor: remove devicearray code to reduce complexity (NVIDIA#600)

Labels

3 - Ready for Review (Ready for review by team)


Development

Successfully merging this pull request may close these issues.

[FEA] Support capture of device arrays from globals/closures by reference

3 participants