Add support for ops that capture device arrays from enclosing scope#6963
shwina wants to merge 11 commits into NVIDIA:main
Conversation
bdice
left a comment
I think you'll want some expert review here - but I commend you on the progress so far! This is a super cool feature. Closures over device arrays in numba-cuda would be a huge win.
I recommend hashing the resulting id value.
-    return id(self._identity)
+    return hash(id(self._identity))
In CPython these two things give identical values, although the spec does not require them to be:
Python 3.13.3 | packaged by conda-forge | (main, Apr 14 2025, 20:44:03) [GCC 13.3.0]
Type 'copyright', 'credits' or 'license' for more information
IPython 9.4.0 -- An enhanced Interactive Python. Type '?' for help.
Tip: You can use LaTeX or Unicode completion, `\alpha<tab>` will insert the α symbol.
In [1]: o = list()
In [2]: id(o), hash(id(o))
Out[2]: (138960802540160, 138960802540160)
In [3]: %timeit hash(id(o))
36.7 ns ± 0.17 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
In [4]: %timeit id(o)
21.7 ns ± 0.909 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
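For context, a minimal sketch of the identity-keying pattern being suggested (the class name and surrounding structure here are assumed, not taken from the PR):

```python
class ArrayIdentityKey:
    """Keys a wrapped object by identity: equal only if it wraps the same object."""

    def __init__(self, identity):
        self._identity = identity

    def __eq__(self, other):
        return (
            isinstance(other, ArrayIdentityKey)
            and self._identity is other._identity
        )

    def __hash__(self):
        # In CPython hash(n) is usually n itself for small non-negative ints,
        # so this costs almost nothing; wrapping in hash() keeps the __hash__
        # contract explicit across Python implementations.
        return hash(id(self._identity))
```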
# =============================================================================
# Step 5: Register lower_constant for our type
These step numbers don't align with the 1, 2, 3 at the top of the file. Maybe these comments aren't essential?
def __init__(self, dtype, ndim, layout, readonly=False, aligned=True):
    type_name = "device_array"
    if readonly:
        type_name = "readonly " + type_name
For the parent class, "unaligned" also appears in the name if aligned is False. Is there a guarantee that things of DeviceArrayType are aligned?
Hmm, maybe this is a question for @leofang ? Do CAI (and DLPack) make any guarantees about alignment?
I don't think either one guarantees or constrains alignment.
I've defaulted to aligned=False and added "unaligned" to the type name.
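A rough sketch of what the naming could end up looking like with that change (assuming DeviceArrayType subclasses numba's types.Array; this is illustrative, not the PR's exact code):

```python
from numba.core import types


class DeviceArrayType(types.Array):
    def __init__(self, dtype, ndim, layout, readonly=False, aligned=False):
        type_name = "device_array"
        if readonly:
            type_name = "readonly " + type_name
        if not aligned:
            type_name = "unaligned " + type_name
        name = f"{type_name}({dtype}, {ndim}d, {layout})"
        # types.Array accepts readonly/aligned/name keyword arguments.
        super().__init__(
            dtype, ndim, layout, readonly=readonly, aligned=aligned, name=name
        )
```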
_original_generic = _original_typeof_impl.dispatch(object)


def _patched_generic(val, c):
Numba-CUDA already has a mechanism for typing CAI object, but it lived in the CUDA dispatcher: https://github.com/NVIDIA/numba-cuda/blob/94321646a046ea5dd1b64e8dc078e27121552e23/numba_cuda/numba/cuda/dispatcher.py#L1626-L1643
This was because typeof came from upstream Numba, but I think we can push this check into typeof in Numba-CUDA now that it contains its own typeof implementation.
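For reference, the dispatcher-level check looks roughly like this (paraphrased rather than copied from the linked code):

```python
from numba import cuda
from numba.core.typing.typeof import Purpose, typeof


def typeof_pyval(self, val):
    try:
        return typeof(val, Purpose.argument)
    except ValueError:
        if cuda.is_cuda_array(val):
            # Wrap CAI-exporting objects so regular device array typing applies.
            return typeof(cuda.as_cuda_array(val, sync=False), Purpose.argument)
        raise
```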
# =============================================================================


def _typeof_device_array(val, c):
When typing an argument, this will be called with c.purpose == Purpose.constant (from numba.cuda.typing.typeof.Purpose). Perhaps we don't want to affect typing of arguments - in which case we should return None when c.purpose == Purpose.argument.
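Something like the following guard, perhaps (the body and import path are assumed from the comment above):

```python
from numba.cuda.typing.typeof import Purpose  # import path as given in the comment


def _typeof_device_array(val, c):
    if c.purpose == Purpose.argument:
        # Don't change how arguments are typed; only affect constants/globals.
        return None
    ...  # existing device-array typing logic continues here
```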
Also - I haven't read through this implementation, but it's probably faster than the implementation in Dispatcher.typeof_pyval, so we should probably try upstreaming it into typeof_pyval to see if it makes kernel launch a bit faster for CAI arrays (cc @cpcloud)
This looks the same as the ArrayModel, so I think we could just register it:
-@register_default(DeviceArrayType)
-class DeviceArrayModel(StructModel):
-    """Data model for DeviceArrayType - same as regular Array."""
-    def __init__(self, dmm, fe_type):
-        ndim = fe_type.ndim
-        members = [
-            ("meminfo", types.MemInfoPointer(fe_type.dtype)),
-            ("parent", types.pyobject),
-            ("nitems", types.intp),
-            ("itemsize", types.intp),
-            ("data", types.CPointer(fe_type.dtype)),
-            ("shape", types.UniTuple(types.intp, ndim)),
-            ("strides", types.UniTuple(types.intp, ndim)),
-        ]
-        super().__init__(dmm, fe_type, members)
+register_default(DeviceArrayType)(ArrayModel)
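That is, since DeviceArrayType appears to carry exactly the same struct members as a regular array, numba's existing ArrayModel can be reused directly rather than duplicating its member layout.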
/ok to test 12f1899

/ok to test 8d6faa0
interface = pyval.__cuda_array_interface__

# hold on to the device-array-like object
@gmarkall - added your code snippet for keeping references to the arrays here.
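For readers following along, a hypothetical sketch of the reference-keeping pattern (names and structure assumed, not the PR's exact code):

```python
# Captured device-array-like objects must stay alive: the compiled kernel embeds
# a raw pointer to their data, so the Python objects must not be garbage
# collected while a kernel that uses them can still run.
_captured_device_arrays = []


def _lower_constant_device_array(context, builder, ty, pyval):
    interface = pyval.__cuda_array_interface__
    # hold on to the device-array-like object
    _captured_device_arrays.append(pyval)
    data_ptr, _readonly = interface["data"]
    ...  # build the native array struct from data_ptr, shape, and strides
```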
itemsize = np.dtype(interface["typestr"]).itemsize
# Check C-contiguous
c_strides = []
stride = itemsize
for i in range(ndim - 1, -1, -1):
    c_strides.insert(0, stride)
    stride *= shape[i]

if tuple(strides) == tuple(c_strides):
    layout = "C"
else:
    # Check F-contiguous
    f_strides = []
    stride = itemsize
    for i in range(ndim):
        f_strides.append(stride)
        stride *= shape[i]
    if tuple(strides) == tuple(f_strides):
        layout = "F"
    else:
        layout = "A"
Perhaps one could do a quick rejection test to avoid reconstructing the entire c_strides or f_strides: if neither the first nor the last element of the strides sequence equals the itemsize (the strides here are in bytes), we can set layout = "A" immediately.
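A hedged sketch of that shortcut, consistent with the byte-stride convention in the snippet above (variable names assumed from that snippet):

```python
# Quick rejection: C order needs the innermost (last) stride to equal the
# itemsize, and F order needs the outermost (first) stride to equal the
# itemsize. If neither holds, the layout can only be "A".
if ndim > 0 and itemsize not in (strides[0], strides[-1]):
    layout = "A"
else:
    ...  # fall back to the full C-/F-stride reconstruction shown above
```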
Closing this PR, as I've upstreamed the bulk of the work to numba-cuda (merged): NVIDIA/numba-cuda#666. I opened #7008 as a follow-up to that, to add the minor changes needed to
Description
This PR adds support for ops that can capture state. This can be used to implement algorithms with side effects:
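A rough, hypothetical sketch of the kind of usage this enables (a made-up example using plain numba.cuda, not the PR's actual snippet or API):

```python
import numpy as np
from numba import cuda

# Device array defined in the enclosing scope, not passed as a kernel argument.
counts = cuda.to_device(np.zeros(1, dtype=np.int32))


@cuda.jit
def increment():
    # `counts` is captured from the enclosing scope; the point of this work is
    # to let the compiler lower such device-array-like objects as constants.
    cuda.atomic.add(counts, 0, 1)


increment[1, 32]()
print(counts.copy_to_host())  # prints: [32]
```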
It works by extending numba-cuda to recognize device-array-like objects and teaching it how to capture them from the enclosing scope using lower_constant.

Checklist