
Conversation

@cpcloud
Contributor

@cpcloud cpcloud commented Jan 7, 2026

Summary

Refactor kernel argument handling to use StridedMemoryView internally,
enabling direct support for __dlpack__ objects and improving interoperability
with libraries like CuPy.

Closes: #152
Tracking issue: #128

Key Changes

New capability: Kernel arguments now accept objects implementing the __dlpack__
protocol directly (e.g., CuPy arrays); a minimal usage sketch follows.
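
As a minimal usage sketch (a trivial element-wise kernel, not taken from this PR's
test suite), a CuPy array can now be handed straight to a @cuda.jit kernel, with the
dispatcher consuming it through __dlpack__ internally:

import cupy as cp
from numba import cuda

@cuda.jit
def scale(arr, factor):
    i = cuda.grid(1)
    if i < arr.size:
        arr[i] *= factor

x = cp.arange(1024, dtype=cp.float32)   # exposes __dlpack__
scale.forall(x.size)(x, 2.0)            # no explicit conversion required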

Internals: Replaced array interface handling with
cuda.core.utils.StridedMemoryView for:

  • __dlpack__ objects (new)
  • __cuda_array_interface__ objects
  • Device arrays (existing DeviceNDArray)
  • NumPy arrays (copied to device as before)

Performance:

  • CuPy arrays: ~3x improvement on kernel launch (initial measurements)
  • device_array() arrays: ~2.5x regression (initial measurements)
  • torch also shows a regression, probably because its DLPack implementation is
    slow. Previously it went through CAI, but its CAI version isn't supported by
    StridedMemoryView.

Performance Trade-off Discussion

The 2.5x slowdown for device_array() is worth discussing (and perhaps the
torch regression is as well):

Arguments for accepting this regression:

  1. CuPy and other __dlpack__ libraries represent the primary ecosystem (or at
    least the end goal) for GPU computing in Python
  2. The 3x speedup benefits the common interoperability use case for objects
    that we are prioritizing
  3. device_array() is primarily used in legacy code and tests and is
    deprecated

Why this might be worth merging despite the regression:

  • Blocking ecosystem integration to optimize a legacy API doesn't align with
    the project's direction
  • The gains where it matters (external arrays) are substantial
  • Performance parity for device_arrays could be addressed in follow-up work if
    it proves important

Implementation Details

  • Added _to_strided_memory_view() and _make_strided_memory_view() helper
    functions (numba_cuda/numba/cuda/cudadrv/devicearray.py:247-359)
  • Updated kernel argument marshaling in dispatcher.py:541-614
  • Added LRU caching to typeof for CAI objects to reduce type inference
    overhead (typing/typeof.py:315-365)
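
A rough sketch of the caching pattern used on the typeof fast path (names and fields
here are illustrative assumptions, not the PR's actual helpers): hashable pieces of the
array interface are extracted so the expensive constructions can be memoized.

from functools import lru_cache
import numpy as np

@lru_cache(maxsize=None)
def _dtype_from_typestr(typestr):
    # np.dtype construction is not free; cache it on the hashable typestr.
    return np.dtype(typestr)

@lru_cache(maxsize=None)
def _layout_from_flags(c_contiguous, f_contiguous):
    # Booleans are hashable, so the layout decision can be memoized too.
    if c_contiguous:
        return "C"
    if f_contiguous:
        return "F"
    return "A"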

Testing

Existing test suite passes.


TL;DR: Adds __dlpack__ support (~3x faster for CuPy), with ~2.5x
regression on legacy device_array(). Trade-off favors ecosystem integration,
but open to discussion.

@copy-pr-bot

copy-pr-bot bot commented Jan 7, 2026

Auto-sync is disabled for ready for review pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@cpcloud cpcloud force-pushed the move-to-smv-for-kernel-launch branch from f5a1c5c to c62d013 on January 7, 2026 18:29
@cpcloud
Contributor Author

cpcloud commented Jan 7, 2026

/ok to test

@greptile-apps
Contributor

greptile-apps bot commented Jan 7, 2026

Greptile Overview

Greptile Summary

This PR refactors internal kernel argument handling to use StridedMemoryView from cuda.core.utils, enabling direct __dlpack__ protocol support and improving interoperability with libraries like CuPy.

Key improvements:

  • Adds native __dlpack__ support for kernel arguments (CuPy arrays ~3x faster on kernel launch)
  • Replaces legacy array interface handling with unified StridedMemoryView approach
  • Adds LRU caching to type inference operations for performance
  • Comprehensive test coverage for new functionality (F-contiguous arrays, datetime support)

Trade-offs:

  • ~2.5x regression for legacy device_array() (deprecated API)
  • Prioritizes ecosystem integration (CuPy, external libraries) over legacy code performance

Implementation quality:

  • Clean refactoring with proper separation of concerns
  • Simulator mode properly updated with matching arg handling
  • cuDF compatibility addressed via patch file
  • All previously raised concerns have been addressed

Confidence Score: 5/5

  • This PR is safe to merge with minimal risk
  • Well-structured refactoring with comprehensive test coverage, all existing tests pass, and previous review comments have been addressed. The performance trade-off is documented and justified.
  • No files require special attention

Important Files Changed

Filename | Overview
numba_cuda/numba/cuda/cudadrv/devicearray.py | Core refactoring to add StridedMemoryView with new _to_strided_memory_view and _make_strided_memory_view helper functions, proper dlpack support
numba_cuda/numba/cuda/dispatcher.py | Updated kernel argument marshaling to use StridedMemoryView shim, changed from device_ctypes_pointer to ptr attribute
numba_cuda/numba/cuda/args.py | Refactored Out/InOut classes to use StridedMemoryView, added custom copy_to_host implementation with proper stream handling
numba_cuda/numba/cuda/np/numpy_support.py | Changed strides_from_shape signature to use boolean flags instead of order string, added lru_cache to from_dtype
numba_cuda/numba/cuda/typing/typeof.py | Added LRU caching to CAI typeof operations to reduce type inference overhead

@cpcloud cpcloud requested review from gmarkall and kkraus14 January 7, 2026 18:35
@cpcloud
Contributor Author

cpcloud commented Jan 7, 2026

This PR can't be merged until the next release of cuda-core, because I depend on some unreleased features there.

However, it's still worth reviewing.

@cpcloud
Contributor Author

cpcloud commented Jan 7, 2026

Current benchmarks versus main (NOW is this PR, the other is upstream/main as of when I ran the command):

Full text of benchmark results
------------------------------------------------------------------------------ benchmark 'test_many_args[dispatch-cupy]': 2 tests ------------------------------------------------------------------------------
Name (time in ms)                                    Min                Max               Mean            StdDev             Median               IQR            Outliers      OPS            Rounds  Iterations
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_many_args[dispatch-cupy] (NOW)              28.5491 (1.0)      32.0701 (1.0)      29.3013 (1.0)      0.8509 (3.39)     29.0451 (1.0)      0.7380 (2.20)          3;1  34.1282 (1.0)          20           1
test_many_args[dispatch-cupy] (0001_459b8c0)     81.2223 (2.85)     82.0122 (2.56)     81.6227 (2.79)     0.2509 (1.0)      81.5803 (2.81)     0.3350 (1.0)           3;0  12.2515 (0.36)          8           1
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

------------------------------------------------------------------------------- benchmark 'test_many_args[dispatch-device_array]': 2 tests ------------------------------------------------------------------------------
Name (time in ms)                                            Min                Max               Mean            StdDev             Median               IQR            Outliers       OPS            Rounds  Iterations
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_many_args[dispatch-device_array] (0001_459b8c0)      6.5039 (1.0)       6.6605 (1.0)       6.5602 (1.0)      0.0347 (1.0)       6.5567 (1.0)      0.0391 (1.0)           5;1  152.4334 (1.0)          21           1
test_many_args[dispatch-device_array] (NOW)              16.1680 (2.49)     18.7992 (2.82)     16.4472 (2.51)     0.5087 (14.68)    16.3114 (2.49)     0.1429 (3.65)          2;3   60.8005 (0.40)         28           1
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

------------------------------------------------------------------------------ benchmark 'test_many_args[dispatch-torch]': 2 tests ------------------------------------------------------------------------------
Name (time in ms)                                     Min                Max               Mean            StdDev             Median               IQR            Outliers      OPS            Rounds  Iterations
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_many_args[dispatch-torch] (NOW)              61.4096 (1.0)      73.2956 (1.03)     65.9567 (1.0)      3.3014 (6.78)     65.8106 (1.0)      3.7330 (7.42)          4;1  15.1615 (1.0)          13           1
test_many_args[dispatch-torch] (0001_459b8c0)     69.6396 (1.13)     71.3505 (1.0)      70.7103 (1.07)     0.4867 (1.0)      70.8821 (1.08)     0.5033 (1.0)           3;1  14.1422 (0.93)         10           1
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

------------------------------------------------------------------------------ benchmark 'test_many_args[signature-cupy]': 2 tests ------------------------------------------------------------------------------
Name (time in ms)                                     Min                Max               Mean            StdDev             Median               IQR            Outliers      OPS            Rounds  Iterations
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_many_args[signature-cupy] (NOW)              20.3893 (1.0)      22.5916 (1.0)      20.7331 (1.0)      0.2960 (1.79)     20.6997 (1.0)      0.1337 (1.0)           3;3  48.2320 (1.0)          48           1
test_many_args[signature-cupy] (0001_459b8c0)     61.7753 (3.03)     62.3395 (2.76)     61.9959 (2.99)     0.1649 (1.0)      61.9470 (2.99)     0.1919 (1.44)          5;0  16.1301 (0.33)         17           1
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

------------------------------------------------------------------------------ benchmark 'test_many_args[signature-device_array]': 2 tests -------------------------------------------------------------------------------
Name (time in ms)                                             Min                Max               Mean            StdDev             Median               IQR            Outliers       OPS            Rounds  Iterations
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_many_args[signature-device_array] (0001_459b8c0)      6.7572 (1.0)       6.9647 (1.0)       6.8144 (1.0)      0.0366 (1.0)       6.8028 (1.0)      0.0550 (1.0)          38;2  146.7473 (1.0)         141           1
test_many_args[signature-device_array] (NOW)              16.7750 (2.48)     18.1621 (2.61)     17.2505 (2.53)     0.2715 (7.42)     17.1982 (2.53)     0.2356 (4.28)         12;5   57.9693 (0.40)         58           1
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

------------------------------------------------------------------------------ benchmark 'test_many_args[signature-torch]': 2 tests ------------------------------------------------------------------------------
Name (time in ms)                                      Min                Max               Mean            StdDev             Median               IQR            Outliers      OPS            Rounds  Iterations
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_many_args[signature-torch] (0001_459b8c0)     49.9804 (1.0)      50.5731 (1.0)      50.1364 (1.0)      0.1298 (1.0)      50.1156 (1.0)      0.1224 (1.0)           4;1  19.9456 (1.0)          20           1
test_many_args[signature-torch] (NOW)              51.7247 (1.03)     53.2339 (1.05)     52.3643 (1.04)     0.4227 (3.26)     52.5188 (1.05)     0.7008 (5.73)          6;0  19.0970 (0.96)         20           1
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

----------------------------------------------------------------------------- benchmark 'test_one_arg[dispatch-cupy]': 2 tests ----------------------------------------------------------------------------
Name (time in ms)                                 Min               Max              Mean            StdDev            Median               IQR            Outliers       OPS            Rounds  Iterations
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_one_arg[dispatch-cupy] (NOW)              3.1364 (1.0)      3.2981 (1.0)      3.2120 (1.0)      0.0336 (1.17)     3.2084 (1.0)      0.0362 (1.09)         18;3  311.3370 (1.0)          63           1
test_one_arg[dispatch-cupy] (0001_459b8c0)     5.6052 (1.79)     5.7344 (1.74)     5.6561 (1.76)     0.0286 (1.0)      5.6565 (1.76)     0.0333 (1.0)           8;2  176.7993 (0.57)         35           1
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

----------------------------------------------------------------------------- benchmark 'test_one_arg[dispatch-device_array]': 2 tests ----------------------------------------------------------------------------
Name (time in ms)                                         Min               Max              Mean            StdDev            Median               IQR            Outliers       OPS            Rounds  Iterations
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_one_arg[dispatch-device_array] (0001_459b8c0)     1.2796 (1.0)      1.3412 (1.0)      1.3013 (1.0)      0.0261 (2.85)     1.2922 (1.0)      0.0400 (3.88)          1;0  768.4784 (1.0)           5           1
test_one_arg[dispatch-device_array] (NOW)              1.9005 (1.49)     1.9257 (1.44)     1.9143 (1.47)     0.0091 (1.0)      1.9148 (1.48)     0.0103 (1.0)           2;0  522.3939 (0.68)          5           1
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

---------------------------------------------------------------------------- benchmark 'test_one_arg[dispatch-torch]': 2 tests -----------------------------------------------------------------------------
Name (time in ms)                                  Min               Max              Mean            StdDev            Median               IQR            Outliers       OPS            Rounds  Iterations
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_one_arg[dispatch-torch] (NOW)              4.9126 (1.0)      5.9787 (1.16)     5.0667 (1.0)      0.2255 (15.06)    4.9729 (1.0)      0.0839 (3.53)        10;11  197.3664 (1.0)          58           1
test_one_arg[dispatch-torch] (0001_459b8c0)     5.1037 (1.04)     5.1586 (1.0)      5.1313 (1.01)     0.0150 (1.0)      5.1302 (1.03)     0.0238 (1.0)          13;0  194.8823 (0.99)         32           1
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

---------------------------------------------------------------------------- benchmark 'test_one_arg[signature-cupy]': 2 tests -----------------------------------------------------------------------------
Name (time in ms)                                  Min               Max              Mean            StdDev            Median               IQR            Outliers       OPS            Rounds  Iterations
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_one_arg[signature-cupy] (NOW)              2.1288 (1.0)      2.7003 (1.0)      2.2073 (1.0)      0.0588 (1.0)      2.1964 (1.0)      0.0452 (1.0)         41;20  453.0358 (1.0)         401           1
test_one_arg[signature-cupy] (0001_459b8c0)     3.8849 (1.82)     4.4341 (1.64)     4.0874 (1.85)     0.1246 (2.12)     4.0341 (1.84)     0.2278 (5.04)        109;0  244.6541 (0.54)        231           1
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

---------------------------------------------------------------------------- benchmark 'test_one_arg[signature-device_array]': 2 tests -----------------------------------------------------------------------------
Name (time in ms)                                          Min               Max              Mean            StdDev            Median               IQR            Outliers       OPS            Rounds  Iterations
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_one_arg[signature-device_array] (0001_459b8c0)     1.2409 (1.0)      1.9163 (1.0)      1.2736 (1.0)      0.0424 (1.0)      1.2686 (1.0)      0.0137 (1.0)         23;32  785.1979 (1.0)         646           1
test_one_arg[signature-device_array] (NOW)              1.8690 (1.51)     2.8435 (1.48)     1.9830 (1.56)     0.1331 (3.14)     1.9313 (1.52)     0.1464 (10.69)       32;12  504.2930 (0.64)        453           1
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

----------------------------------------------------------------------------- benchmark 'test_one_arg[signature-torch]': 2 tests ----------------------------------------------------------------------------
Name (time in ms)                                   Min               Max              Mean            StdDev            Median               IQR            Outliers       OPS            Rounds  Iterations
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_one_arg[signature-torch] (0001_459b8c0)     3.4580 (1.0)      3.7149 (1.0)      3.5206 (1.0)      0.0280 (1.0)      3.5193 (1.0)      0.0324 (1.0)          75;6  284.0446 (1.0)         259           1
test_one_arg[signature-torch] (NOW)              4.1100 (1.19)     4.6233 (1.24)     4.2768 (1.21)     0.1268 (4.53)     4.3273 (1.23)     0.2216 (6.83)         87;0  233.8189 (0.82)        224           1
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Legend:
  Outliers: 1 Standard Deviation from Mean; 1.5 IQR (InterQuartile Range) from 1st Quartile and 3rd Quartile.
  OPS: Operations Per Second, computed as 1 / Mean
======================================================================================== 12 passed in 11.54s =========================================================================================

@cpcloud
Contributor Author

cpcloud commented Jan 7, 2026

I managed to recover a good amount of the device_array performance by avoiding the SMV conversion entirely and spoofing the interface.

@cpcloud
Contributor Author

cpcloud commented Jan 7, 2026

However, there is still a slowdown of ~60%, but only in the many-args case (it's about 15% in the single-argument case). This is much better than the previous commit, which was upwards of 2.5x.

Contributor

@greptile-apps greptile-apps bot left a comment

Greptile Overview

Greptile Summary

Refactored kernel argument handling to use StridedMemoryView internally, enabling direct __dlpack__ protocol support and improving CuPy interoperability (~3x speedup).

Key Changes

  • Replaced auto_device() calls with _to_strided_memory_view() for unified array handling
  • Added LRU caching to type inference functions (typeof, from_dtype, strides_from_shape) to reduce overhead
  • Converted several properties to @functools.cached_property for performance
  • Refactored Out/InOut classes to use inheritance pattern with copy_input class variable
  • Changed strides_from_shape() API from order="C"/"F" to boolean flags c_contiguous/f_contiguous

Issues Found

  • Logic bug in strides_from_shape(): when both c_contiguous and f_contiguous are False, the function produces incorrect strides (it computes F-contiguous strides and then reverses them, which yields neither a C nor an F layout); see the numeric illustration below
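
A quick numeric illustration of that mismatch (not code from the PR): reversing
F-contiguous strides does not, in general, produce C-contiguous strides.

import numpy as np

shape, itemsize = (2, 3, 4), 8

# F-contiguous strides, computed and then reversed (the flagged code path):
f_strides, acc = [], itemsize
for extent in shape:
    f_strides.append(acc)
    acc *= extent
reversed_f = tuple(reversed(f_strides))               # (48, 16, 8)

# Actual C-contiguous strides for the same shape and itemsize:
c_strides = np.empty(shape, dtype=np.int64).strides   # (96, 32, 8)

assert reversed_f != c_strides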

Performance Trade-offs

The PR documents a ~2.5x regression for legacy device_array() in exchange for ~3x improvement for CuPy arrays. This aligns with the project's strategic direction toward ecosystem integration.

Confidence Score: 4/5

  • This PR is safe to merge with one logic issue that needs fixing
  • Score reflects well-structured refactoring with proper caching optimizations, but one critical logic bug in strides_from_shape() when both contiguity flags are False needs resolution before merge
  • numba_cuda/numba/cuda/np/numpy_support.py - fix the strides_from_shape() logic for handling non-contiguous arrays

Important Files Changed

File Analysis

Filename | Score | Overview
numba_cuda/numba/cuda/np/numpy_support.py | 3/5 | Added LRU caching to strides_from_shape and from_dtype; changed API from order parameter to c_contiguous/f_contiguous flags. Logic issue: when both flags are False, function computes F-contiguous strides then reverses them unexpectedly.
numba_cuda/numba/cuda/cudadrv/devicearray.py | 4/5 | Added _to_strided_memory_view and _make_strided_memory_view helper functions to support DLPack protocol; converted nbytes and added _strided_memory_view_shim to cached properties. Implementation looks solid.
numba_cuda/numba/cuda/args.py | 4/5 | Refactored Out and InOut classes to use StridedMemoryView; changed _numba_type_ to cached property. Clean refactor with proper class inheritance.
numba_cuda/numba/cuda/dispatcher.py | 4/5 | Updated kernel argument marshaling to work with StridedMemoryView objects instead of DeviceNDArray. Uses fallback to strides_from_shape when strides not available.
numba_cuda/numba/cuda/typing/typeof.py | 5/5 | Added LRU caching to _typeof_cuda_array_interface by extracting logic into cached helper functions. All parameters are hashable, caching is safe and should improve performance.
numba_cuda/numba/cuda/np/arrayobj.py | 5/5 | Updated call to strides_from_shape to use new keyword-only argument API with c_contiguous=True. Minimal, straightforward change.

Contributor

@greptile-apps greptile-apps bot left a comment

6 files reviewed, 1 comment


@cpcloud cpcloud force-pushed the move-to-smv-for-kernel-launch branch 2 times, most recently from 9ff51b9 to 1032275 on January 7, 2026 19:34
@rparolin rparolin self-requested a review January 7, 2026 22:11
Contributor

@rparolin rparolin left a comment

Generally looks good to me. I'm a bit on the fence about shipping a known performance regression for a deprecated type; I'd feel better if we removed that type first instead of regressing its performance. All that being said, the regression has improved from the initially reported 2.5x.

I'd still wait to merge until @gmarkall gives the final 👍

@kkraus14
Contributor

kkraus14 commented Jan 8, 2026

So where we've ended up is that using CuPy arrays has ~3x less latency, but using device arrays or torch tensors has ~60% more latency?

On the torch front NVIDIA/cuda-python#1439 may help in bypassing the slow __dlpack__ implementation.

@cpcloud
Contributor Author

cpcloud commented Jan 8, 2026

So where we've ended up is that using CuPy arrays has ~3x less latency, but using device arrays or torch tensors has ~60% more latency?

Almost. Will post new numbers in a bit.

On the torch front NVIDIA/cuda-python#1439 may help in bypassing the slow __dlpack__ implementation.

Passing stream_ptr=-1 removes most of the additional torch overhead versus cupy, which I believe indicates that it's the stream synchronization done by torch (or StridedMemoryView, or both) that causes the slowness. I think that also means removing Python overhead won't help with that source of performance loss.
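
For reference, a minimal sketch of the stream_ptr=-1 path (assuming StridedMemoryView
can be constructed directly from the tensor with a stream_ptr argument; this is not the
exact call site in this PR):

import torch
from cuda.core.utils import StridedMemoryView

t = torch.arange(1024, device="cuda")

# stream_ptr=-1 asks the exporter (torch here) to skip stream synchronization;
# the caller then takes responsibility for ordering the kernel launch against
# the stream the tensor was produced on.
view = StridedMemoryView(t, stream_ptr=-1)
print(view.ptr, view.shape)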

@cpcloud cpcloud force-pushed the move-to-smv-for-kernel-launch branch from 1032275 to 739cb5b on January 8, 2026 18:21
Contributor

@greptile-apps greptile-apps bot left a comment

Greptile Overview

Greptile Summary

Refactors internal kernel argument handling to use StridedMemoryView from cuda-python, enabling direct __dlpack__ protocol support for external arrays like CuPy. Replaces __cuda_array_interface__ handling with the unified StridedMemoryView API and adds LRU caching to type inference paths to reduce overhead. Performance measurements show ~3x improvement for CuPy arrays but ~2.5x regression for legacy device_array() objects, which the PR justifies as an acceptable trade-off favoring ecosystem integration over deprecated APIs.

Confidence Score: 1/5

  • Critical logic bug in strides fallback for 0-dimensional arrays will cause incorrect behavior
  • The PR contains a critical logic error in dispatcher.py line 558, where the strides fallback uses the `or` operator with potentially empty tuples. For 0-dimensional arrays, strides_in_bytes is legitimately (), but empty tuples are falsy in Python, so the fallback computation is triggered unnecessarily. While the fallback should also return (), this points to a misunderstanding of the truthiness semantics that could mask other issues. Additionally, there are multiple stream-handling edge cases around stream=0 that should be verified for correctness.
  • numba_cuda/numba/cuda/dispatcher.py requires immediate attention for the strides fallback bug; numba_cuda/numba/cuda/args.py needs verification of stream_ptr=0 handling semantics
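
The falsy-empty-tuple concern in isolation (illustrative only, not the PR's code):

strides_in_bytes = ()            # legitimate value for a 0-dimensional array
fallback = (8,)                  # stand-in for a computed fallback

# `or` treats the empty tuple as missing and silently picks the fallback:
assert (strides_in_bytes or fallback) == fallback

# An explicit None check preserves the legitimate empty tuple:
chosen = strides_in_bytes if strides_in_bytes is not None else fallback
assert chosen == ()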

Important Files Changed

File Analysis

Filename | Score | Overview
numba_cuda/numba/cuda/dispatcher.py | 1/5 | Refactored kernel argument marshaling to use StridedMemoryView; critical bug in strides fallback logic
numba_cuda/numba/cuda/cudadrv/devicearray.py | 2/5 | Added _to_strided_memory_view and _make_strided_memory_view functions for dlpack/CAI conversion; changed nbytes to cached_property
numba_cuda/numba/cuda/args.py | 3/5 | Refactored Out and InOut classes to use _to_strided_memory_view; InOut now inherits from Out with copy_input=True

@cpcloud
Contributor Author

cpcloud commented Jan 8, 2026

@kkraus14

  • "-" indicates an improvement
  • "+" indicates a regression

cupy:

  • one arg: -2x
  • many args: -3x

torch:

  • one arg: +17%
  • many args: +7% for explicit signature, -12% for dispatch

device_array:

  • one arg: roughly +9%
  • many args: roughly +60%
(screenshot of the updated benchmark results omitted)

@cpcloud
Contributor Author

cpcloud commented Jan 8, 2026

I also added a benchmark demonstrating that the additional overhead with device_array is entirely due to having to construct a shim to fit device array information into an object that is API-compatible with the attributes we pull out of StridedMemoryView when preparing arguments for kernel launch.

Maybe there's some way that we can reduce that further, but I haven't looked into it.
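
For context, a sketch of what such a shim might look like (attribute names are
assumptions for illustration, not the PR's exact implementation):

class _DeviceArrayView:
    # Mimics the handful of attributes the dispatcher reads off a
    # StridedMemoryView so a DeviceNDArray can skip the full SMV conversion.
    __slots__ = ("ptr", "shape", "strides", "dtype")

    def __init__(self, devary):
        self.ptr = devary.device_ctypes_pointer.value
        self.shape = devary.shape
        self.strides = devary.strides
        self.dtype = devary.dtype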

@kkraus14
Contributor

The devicearray regressions are somewhat concerning, but given we are actively working towards deprecating it, I think it would still be worth it.

Do we have a sense on follow up work from here that helps to ameliorate the performance overheads related to torch?

@cpcloud
Contributor Author

cpcloud commented Jan 12, 2026

The devicearray regressions are somewhat concerning, but given we are actively working towards deprecating it, I think it would still be worth it.

Do we have a sense on follow up work from here that helps to ameliorate the performance overheads related to torch?

At least some of the remaining overhead is related to stream synchronization, but that may be justified/useful in some cases I'm guessing.

After that, I'm not sure. It will require more investigation.

Just to make sure we're on the same page, our expectation is that if an array is on device then we should expect the kernel launch overhead to amount to a collection of relatively cheap attribute accesses. Is that correct?

@kkraus14
Contributor

At least some of the remaining overhead is related to stream synchronization, but that may be justified/useful in some cases I'm guessing.

My 2c: numba-cuda shouldn't be in the business of handling stream synchronization; if someone passes an array on a different stream through dlpack / CAI, it becomes their responsibility to launch the kernel on a stream that is synchronized with respect to the passed stream. This is likely a breaking change that would need to be clearly and loudly deprecated and subsequently removed.

Just to make sure we're on the same page, our expectation is that if an array is on device then we should expect the kernel launch overhead to amount to a collection of relatively cheap attribute accesses. Is that correct?

Yes. Kernel launch latency is quite important where we should aim for less than 1us overhead.

@cpcloud
Contributor Author

cpcloud commented Jan 12, 2026

My 2c: numba-cuda shouldn't be in the business of handling stream synchronization and that if someone is passing an array on a different stream through dlpack / CAI, it becomes their responsibility to launch the kernel on a stream that is synchronized with respect to the passed stream. This is likely a breaking change that would need to be clearly and loudly deprecated and subsequently removed.

Got it, yeah I don't really know enough about how this functionality is used or assumed to be used to have an informed opinion (yet!), but simply removing sync (in the torch case by passing -1 as the stream pointer to __dlpack__) saves a noticeable amount of time.

Yes. Kernel launch latency is quite important where we should aim for less than 1us overhead.

Roger that, I think we can get there if not very close.

@cpcloud cpcloud force-pushed the move-to-smv-for-kernel-launch branch from 739cb5b to 891ccb7 on January 12, 2026 18:04
@cpcloud
Contributor Author

cpcloud commented Jan 12, 2026

/ok to test

@cpcloud cpcloud force-pushed the move-to-smv-for-kernel-launch branch from e2a664c to 4632643 on January 26, 2026 15:35
@cpcloud
Contributor Author

cpcloud commented Jan 26, 2026

/ok to test

Contributor

@greptile-apps greptile-apps bot left a comment

No files reviewed, no comments


cpcloud and others added 4 commits January 26, 2026 12:29
There is no external behavioural change expected from switching to
SMV, but using SMV from the simulator causes issues. To avoid this, we
copy the original args code into the simulator and use it there.
@cpcloud cpcloud force-pushed the move-to-smv-for-kernel-launch branch from 4632643 to c7743b5 on January 26, 2026 17:31
@cpcloud
Contributor Author

cpcloud commented Jan 26, 2026

/ok to test

Contributor

@greptile-apps greptile-apps bot left a comment

No files reviewed, no comments


@cpcloud
Copy link
Contributor Author

cpcloud commented Jan 26, 2026

/ok to test

Contributor

@greptile-apps greptile-apps bot left a comment

1 file reviewed, 1 comment


@cpcloud
Copy link
Contributor Author

cpcloud commented Jan 26, 2026

/ok to test

"numba>=0.60.0",
"cuda-bindings>=12.9.1,<14.0.0",
"cuda-core>=0.3.2,<1.0.0",
"cuda-core>=0.5.1,<1.0.0",
Contributor Author

The lower bound increased because the various APIs that this PR uses are only available in cuda-core >= 0.5.1.

The cu12 and cu13 constraints are removed, since the new cuda-core >=0.5.1 constraint invalidates the previously declared constraints.

@cpcloud
Contributor Author

cpcloud commented Jan 26, 2026

/ok to test

Contributor

@greptile-apps greptile-apps bot left a comment

No files reviewed, no comments


@cpcloud
Copy link
Contributor Author

cpcloud commented Jan 26, 2026

/ok to test

Contributor

@greptile-apps greptile-apps bot left a comment

No files reviewed, no comments


@cpcloud cpcloud merged commit 8928503 into NVIDIA:main Jan 27, 2026
104 checks passed
@cpcloud cpcloud deleted the move-to-smv-for-kernel-launch branch January 27, 2026 14:33
gmarkall added a commit to gmarkall/numba-cuda that referenced this pull request Jan 27, 2026
- Add Python 3.14 to the wheel publishing matrix (NVIDIA#750)
- feat: swap out internal device array usage with `StridedMemoryView` (NVIDIA#703)
- Fix max block size computation in `forall` (NVIDIA#744)
- Fix prologue debug line info pointing to decorator instead of def line (NVIDIA#746)
- Fix kernel return type in DISubroutineType debug metadata (NVIDIA#745)
- Fix missing line info in Jupyter notebooks (NVIDIA#742)
- Fix: Pass correct flags to linker when debugging in the presence of LTOIR code (NVIDIA#698)
- chore(deps): add cuda-pathfinder to pixi deps (NVIDIA#741)
- fix: enable flake8-bugbear lints and fix found problems (NVIDIA#708)
- fix: Fix race condition in CUDA Simulator (NVIDIA#690)
- ci: run tests in parallel (NVIDIA#740)
- feat: users can pass `shared_memory_carveout` to @cuda.jit (NVIDIA#642)
- Fix compatibility with NumPy 2.4: np.trapz and np.in1d removed (NVIDIA#739)
- Pass the -numba-debug flag to libnvvm (NVIDIA#681)
- ci: remove rapids containers from conda ci (NVIDIA#737)
- Use `pathfinder` for dynamic libraries (NVIDIA#308)
- CI: Add CUDA 13.1 testing support (NVIDIA#705)
- Adding `pixi run test` and `pixi run test-par` support (NVIDIA#724)
- Disable per-PR nvmath tests + follow same test practice (NVIDIA#723)
- chore(deps): regenerate pixi lockfile (NVIDIA#722)
- Fix DISubprogram line number to point to function definition line (NVIDIA#695)
- revert: chore(dev): build pixi using rattler (NVIDIA#713) (NVIDIA#719)
- [feat] Initial version of the Numba CUDA GDB pretty-printer (NVIDIA#692)
- chore(dev): build pixi using rattler (NVIDIA#713)
- build(deps): bump the actions-monthly group across 1 directory with 8 updates (NVIDIA#704)
@gmarkall gmarkall mentioned this pull request Jan 27, 2026
kkraus14 pushed a commit that referenced this pull request Jan 28, 2026
- Add Python 3.14 to the wheel publishing matrix (#750)
- feat: swap out internal device array usage with `StridedMemoryView` (#703)
- Fix max block size computation in `forall` (#744)
- Fix prologue debug line info pointing to decorator instead of def line (#746)
- Fix kernel return type in DISubroutineType debug metadata (#745)
- Fix missing line info in Jupyter notebooks (#742)
- Fix: Pass correct flags to linker when debugging in the presence of LTOIR code (#698)
- chore(deps): add cuda-pathfinder to pixi deps (#741)
- fix: enable flake8-bugbear lints and fix found problems (#708)
- fix: Fix race condition in CUDA Simulator (#690)
- ci: run tests in parallel (#740)
- feat: users can pass `shared_memory_carveout` to @cuda.jit (#642)
- Fix compatibility with NumPy 2.4: np.trapz and np.in1d removed (#739)
- Pass the -numba-debug flag to libnvvm (#681)
- ci: remove rapids containers from conda ci (#737)
- Use `pathfinder` for dynamic libraries (#308)
- CI: Add CUDA 13.1 testing support (#705)
- Adding `pixi run test` and `pixi run test-par` support (#724)
- Disable per-PR nvmath tests + follow same test practice (#723)
- chore(deps): regenerate pixi lockfile (#722)
- Fix DISubprogram line number to point to function definition line (#695)
- revert: chore(dev): build pixi using rattler (#713) (#719)
- [feat] Initial version of the Numba CUDA GDB pretty-printer (#692)
- chore(dev): build pixi using rattler (#713)
- build(deps): bump the actions-monthly group across 1 directory with 8 updates (#704)
