@cpcloud
Contributor

@cpcloud cpcloud commented Jan 6, 2026

Remove the from_cuda_array_interface call when inferring the type of cupy and torch arrays. The speedup mainly benefits dispatch calls and reaches roughly 2.5x in some cases.

@copy-pr-bot

copy-pr-bot bot commented Jan 6, 2026

Auto-sync is disabled for ready for review pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@cpcloud
Contributor Author

cpcloud commented Jan 6, 2026

/ok to test

@cpcloud
Contributor Author

cpcloud commented Jan 6, 2026

Benchmarks:

-------------------------------------------------------------------------------- benchmark 'test_many_args[dispatch-cupy]': 2 tests --------------------------------------------------------------------------------
Name (time in ms)                                     Min                 Max                Mean            StdDev              Median               IQR            Outliers      OPS            Rounds  Iterations
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_many_args[dispatch-cupy] (NOW)               89.6508 (1.0)       91.4552 (1.0)       90.2616 (1.0)      0.5774 (1.0)       90.0616 (1.0)      0.6963 (1.0)           2;0  11.0789 (1.0)           9           1
test_many_args[dispatch-cupy] (0001_2291482)     214.5098 (2.39)     222.1062 (2.43)     218.8504 (2.42)     3.0810 (5.34)     219.9658 (2.44)     4.7617 (6.84)          2;0   4.5693 (0.41)          5           1
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

----------------------------------------------------------------------------- benchmark 'test_many_args[dispatch-device_array]': 2 tests ----------------------------------------------------------------------------
Name (time in ms)                                           Min               Max              Mean            StdDev            Median               IQR            Outliers       OPS            Rounds  Iterations
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_many_args[dispatch-device_array] (NOW)              7.3913 (1.0)      8.0769 (1.0)      7.8257 (1.01)     0.1785 (1.17)     7.8325 (1.02)     0.2293 (2.76)         13;1  127.7840 (0.99)         33           1
test_many_args[dispatch-device_array] (0001_2291482)     7.5903 (1.03)     8.3959 (1.04)     7.7254 (1.0)      0.1521 (1.0)      7.6848 (1.0)      0.0831 (1.0)           2;3  129.4427 (1.0)          34           1
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

-------------------------------------------------------------------------------- benchmark 'test_many_args[dispatch-torch]': 2 tests --------------------------------------------------------------------------------
Name (time in ms)                                      Min                 Max                Mean            StdDev              Median               IQR            Outliers      OPS            Rounds  Iterations
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_many_args[dispatch-torch] (NOW)               76.3577 (1.0)       81.2349 (1.0)       78.0553 (1.0)      1.7612 (1.0)       77.6029 (1.0)      1.8356 (1.0)           2;1  12.8114 (1.0)          10           1
test_many_args[dispatch-torch] (0001_2291482)     194.0611 (2.54)     207.7775 (2.56)     199.2025 (2.55)     5.1382 (2.92)     198.3603 (2.56)     4.7597 (2.59)          2;0   5.0200 (0.39)          5           1
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

------------------------------------------------------------------------------ benchmark 'test_many_args[signature-cupy]': 2 tests ------------------------------------------------------------------------------
Name (time in ms)                                     Min                Max               Mean            StdDev             Median               IQR            Outliers      OPS            Rounds  Iterations
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_many_args[signature-cupy] (0001_2291482)     63.8329 (1.0)      67.9061 (1.0)      65.8729 (1.0)      1.3112 (1.0)      65.6053 (1.0)      2.0993 (1.0)           6;0  15.1808 (1.0)          16           1
test_many_args[signature-cupy] (NOW)              66.3651 (1.04)     71.7770 (1.06)     69.4699 (1.05)     1.6888 (1.29)     69.7620 (1.06)     2.5991 (1.24)          5;0  14.3947 (0.95)         15           1
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

---------------------------------------------------------------------------- benchmark 'test_many_args[signature-device_array]': 2 tests -----------------------------------------------------------------------------
Name (time in ms)                                            Min               Max              Mean            StdDev            Median               IQR            Outliers       OPS            Rounds  Iterations
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_many_args[signature-device_array] (NOW)              7.6645 (1.0)      8.6032 (1.03)     7.9021 (1.0)      0.1163 (1.0)      7.8817 (1.0)      0.1170 (1.0)          20;6  126.5485 (1.0)         125           1
test_many_args[signature-device_array] (0001_2291482)     7.7900 (1.02)     8.3932 (1.0)      8.0326 (1.02)     0.1220 (1.05)     8.0036 (1.02)     0.1795 (1.53)         35;1  124.4927 (0.98)        124           1
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

------------------------------------------------------------------------------ benchmark 'test_many_args[signature-torch]': 2 tests ------------------------------------------------------------------------------
Name (time in ms)                                      Min                Max               Mean            StdDev             Median               IQR            Outliers      OPS            Rounds  Iterations
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_many_args[signature-torch] (0001_2291482)     49.2430 (1.0)      52.5490 (1.0)      50.0586 (1.0)      0.6955 (1.0)      49.9634 (1.0)      0.7920 (2.91)          2;1  19.9766 (1.0)          21           1
test_many_args[signature-torch] (NOW)              53.5552 (1.09)     56.9416 (1.08)     56.3639 (1.13)     0.9894 (1.42)     56.6450 (1.13)     0.2725 (1.0)           2;2  17.7419 (0.89)         19           1
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

------------------------------------------------------------------------------- benchmark 'test_one_arg[dispatch-cupy]': 2 tests ------------------------------------------------------------------------------
Name (time in ms)                                  Min                Max               Mean            StdDev             Median               IQR            Outliers       OPS            Rounds  Iterations
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_one_arg[dispatch-cupy] (NOW)               6.6094 (1.0)       7.6386 (1.0)       7.0844 (1.0)      0.2969 (1.0)       6.9811 (1.0)      0.3962 (1.0)          21;0  141.1562 (1.0)          51           1
test_one_arg[dispatch-cupy] (0001_2291482)     11.3337 (1.71)     13.9110 (1.82)     12.3475 (1.74)     0.5709 (1.92)     12.2268 (1.75)     0.5489 (1.39)         10;3   80.9880 (0.57)         41           1
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

----------------------------------------------------------------------------- benchmark 'test_one_arg[dispatch-device_array]': 2 tests ----------------------------------------------------------------------------
Name (time in ms)                                         Min               Max              Mean            StdDev            Median               IQR            Outliers       OPS            Rounds  Iterations
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_one_arg[dispatch-device_array] (NOW)              1.6835 (1.0)      1.7297 (1.0)      1.7092 (1.0)      0.0188 (1.0)      1.7080 (1.0)      0.0324 (1.06)          3;0  585.0609 (1.0)           6           1
test_one_arg[dispatch-device_array] (0001_2291482)     1.7268 (1.03)     1.7738 (1.03)     1.7555 (1.03)     0.0197 (1.05)     1.7639 (1.03)     0.0306 (1.0)           1;0  569.6463 (0.97)          5           1
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

------------------------------------------------------------------------------ benchmark 'test_one_arg[dispatch-torch]': 2 tests -------------------------------------------------------------------------------
Name (time in ms)                                   Min                Max               Mean            StdDev             Median               IQR            Outliers       OPS            Rounds  Iterations
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_one_arg[dispatch-torch] (NOW)               5.9903 (1.0)       6.3998 (1.0)       6.2124 (1.0)      0.1173 (1.0)       6.2603 (1.0)      0.2288 (1.0)          18;0  160.9694 (1.0)          52           1
test_one_arg[dispatch-torch] (0001_2291482)     10.5430 (1.76)     11.2606 (1.76)     10.8350 (1.74)     0.2048 (1.75)     10.7178 (1.71)     0.3808 (1.66)         14;0   92.2931 (0.57)         40           1
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

---------------------------------------------------------------------------- benchmark 'test_one_arg[signature-cupy]': 2 tests -----------------------------------------------------------------------------
Name (time in ms)                                  Min               Max              Mean            StdDev            Median               IQR            Outliers       OPS            Rounds  Iterations
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_one_arg[signature-cupy] (0001_2291482)     4.5082 (1.0)      6.9082 (1.07)     5.3411 (1.02)     0.5337 (1.17)     5.3418 (1.06)     0.9416 (1.21)         81;0  187.2274 (0.98)        187           1
test_one_arg[signature-cupy] (NOW)              4.6543 (1.03)     6.4592 (1.0)      5.2385 (1.0)      0.4572 (1.0)      5.0360 (1.0)      0.7785 (1.0)          73;0  190.8960 (1.0)         198           1
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

---------------------------------------------------------------------------- benchmark 'test_one_arg[signature-device_array]': 2 tests -----------------------------------------------------------------------------
Name (time in ms)                                          Min               Max              Mean            StdDev            Median               IQR            Outliers       OPS            Rounds  Iterations
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_one_arg[signature-device_array] (0001_2291482)     1.6652 (1.0)      2.4609 (1.14)     1.8046 (1.0)      0.1041 (1.44)     1.7810 (1.0)      0.0754 (1.0)         47;25  554.1507 (1.0)         456           1
test_one_arg[signature-device_array] (NOW)              1.7005 (1.02)     2.1632 (1.0)      1.8792 (1.04)     0.0726 (1.0)      1.9043 (1.07)     0.1381 (1.83)        178;1  532.1514 (0.96)        489           1
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

----------------------------------------------------------------------------- benchmark 'test_one_arg[signature-torch]': 2 tests ----------------------------------------------------------------------------
Name (time in ms)                                   Min               Max              Mean            StdDev            Median               IQR            Outliers       OPS            Rounds  Iterations
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_one_arg[signature-torch] (NOW)              4.0829 (1.0)      4.5733 (1.0)      4.3557 (1.00)     0.0800 (1.0)      4.3773 (1.02)     0.0945 (1.0)         51;10  229.5819 (1.00)        220           1
test_one_arg[signature-torch] (0001_2291482)     4.1086 (1.01)     5.1700 (1.13)     4.3389 (1.0)      0.1538 (1.92)     4.2765 (1.0)      0.1822 (1.93)         27;7  230.4755 (1.0)         221           1
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Legend:
  Outliers: 1 Standard Deviation from Mean; 1.5 IQR (InterQuartile Range) from 1st Quartile and 3rd Quartile.
  OPS: Operations Per Second, computed as 1 / Mean
======================================================================================== 12 passed in 12.21s =========================================================================================
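
The parenthesized ratios and the OPS column can be re-derived from the Mean column. The sketch below does that for the four dispatch cases that changed the most. The mean values are copied from the tables above; the helper names `speedup` and `ops_per_second` are made up for illustration and are not part of pytest-benchmark.

```python
# Mean times in milliseconds, copied from the benchmark tables above.
MEANS_MS = {
    "test_many_args[dispatch-cupy]": {"now": 90.2616, "before": 218.8504},
    "test_many_args[dispatch-torch]": {"now": 78.0553, "before": 199.2025},
    "test_one_arg[dispatch-cupy]": {"now": 7.0844, "before": 12.3475},
    "test_one_arg[dispatch-torch]": {"now": 6.2124, "before": 10.8350},
}


def speedup(before_ms: float, now_ms: float) -> float:
    """Ratio of the old mean to the new mean; above 1 means the new code is faster."""
    return before_ms / now_ms


def ops_per_second(mean_ms: float) -> float:
    """OPS as defined in the legend (1 / Mean), with Mean converted to seconds."""
    return 1000.0 / mean_ms


for name, t in MEANS_MS.items():
    ratio = speedup(t["before"], t["now"])
    print(f"{name}: {ratio:.2f}x faster, {ops_per_second(t['now']):.2f} ops/s now")
```

The many-argument dispatch cases land near 2.4x to 2.55x, while the single-argument cases are closer to 1.74x, which is consistent with the per-argument typing cost being the part that was removed.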

@cpcloud cpcloud requested a review from gmarkall January 6, 2026 17:36
handled in numba.cuda.np.arrayobj.
"""
# Only handle constants, not arguments (arguments use regular array typing)
if c.purpose == Purpose.argument:
Contributor Author

Returning None here caused from_cuda_array_interface to be called, which is much more expensive than just executing the rest of this code.
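
The control flow described in this comment can be sketched as a toy model, assuming that an unhandled value makes the typing step raise ValueError and that the dispatcher then falls back. Every name below (`FakeDeviceArray`, `handler_before`, `handler_after`, `typeof`, `typeof_pyval`) is a hypothetical stand-in, not the numba-cuda implementation, and the "fallback" branch only records that the expensive path would have been taken.

```python
from enum import Enum, auto


class Purpose(Enum):
    argument = auto()
    constant = auto()


class FakeDeviceArray:
    """Hypothetical stand-in for a CuPy/PyTorch array; no GPU is involved."""

    __cuda_array_interface__ = {
        "shape": (4, 4),
        "typestr": "<f4",
        "data": (0, False),
        "strides": None,
        "version": 3,
    }


def type_from_interface(obj):
    """Cheap path: describe the array using only the interface dict."""
    cai = obj.__cuda_array_interface__
    return (cai["typestr"], len(cai["shape"]))


def handler_before(obj, purpose):
    # Old behaviour described in the comment: arguments are not handled here.
    if purpose is Purpose.argument:
        return None
    return type_from_interface(obj)


def handler_after(obj, purpose):
    # New behaviour: arguments and constants take the same direct path.
    return type_from_interface(obj)


def typeof(obj, purpose, handler):
    """Toy typing step that rejects values the handler did not type."""
    ty = handler(obj, purpose)
    if ty is None:
        raise ValueError(f"cannot determine type of {type(obj).__name__}")
    return ty


def typeof_pyval(obj, handler, trace):
    """Toy dispatcher-side typing that records which path was taken."""
    try:
        ty = typeof(obj, Purpose.argument, handler)
        trace.append("direct")
    except ValueError:
        # Stand-in for the costly from_cuda_array_interface round trip.
        trace.append("fallback")
        ty = type_from_interface(obj)
    return ty


trace = []
arr = FakeDeviceArray()
before = typeof_pyval(arr, handler_before, trace)
after = typeof_pyval(arr, handler_after, trace)
print(trace, before == after)
```

Both handlers produce the same type in this model; only the path differs, which is the point of the change.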

@greptile-apps
Contributor

greptile-apps bot commented Jan 6, 2026

Greptile Summary

This PR optimizes type inference for third-party device arrays (CuPy, PyTorch) that implement __cuda_array_interface__. Previously, the dispatcher would catch a ValueError and fall back to calling from_cuda_array_interface to create an intermediate DeviceNDArray, then type that object. Now it lets the typing system handle the interface directly, eliminating the intermediate object allocation and improving dispatch performance by up to 2.5x.
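
As a rough illustration of typing the interface directly, the sketch below derives a (typestr, ndim, layout) triple from a `__cuda_array_interface__` dict alone, without building any intermediate array object. It assumes the documented convention that `strides` set to None means C-contiguous. The function names are hypothetical, the dicts are hand-written rather than taken from real CuPy or PyTorch arrays, and the actual typing code in numba-cuda may differ.

```python
def c_strides(shape, itemsize):
    """Byte strides for a C-contiguous (row-major) array."""
    strides = []
    step = itemsize
    for dim in reversed(shape):
        strides.append(step)
        step *= dim
    return tuple(reversed(strides))


def f_strides(shape, itemsize):
    """Byte strides for a Fortran-contiguous (column-major) array."""
    strides = []
    step = itemsize
    for dim in shape:
        strides.append(step)
        step *= dim
    return tuple(strides)


def describe_interface(cai):
    """Return (typestr, ndim, layout) using only the interface dict."""
    shape = tuple(cai["shape"])
    typestr = cai["typestr"]
    itemsize = int(typestr[2:])  # e.g. "<f4" -> 4 bytes per element
    strides = cai.get("strides")
    if strides is None or tuple(strides) == c_strides(shape, itemsize):
        layout = "C"
    elif tuple(strides) == f_strides(shape, itemsize):
        layout = "F"
    else:
        layout = "A"  # neither contiguous order
    return typestr, len(shape), layout


contiguous = {"shape": (2, 3), "typestr": "<f4", "strides": None}
transposed = {"shape": (2, 3), "typestr": "<f4", "strides": (4, 8)}
print(describe_interface(contiguous), describe_interface(transposed))
```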

Key changes:

  • Removed redundant exception handling in typeof_pyval that called from_cuda_array_interface
  • Removed the Purpose.argument check in _typeof_cuda_array_interface to enable direct typing of CUDA array interface objects for arguments (not just constants)
  • Added @functools.lru_cache to strides_from_shape to memoize stride calculations
  • Minor code cleanups: simplified comparisons (not shape vs len(shape) == 0), used operator.eq with map instead of zip comprehensions, and tuple unpacking for readability
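
The memoization mentioned in the list above can be sketched as follows. This is a minimal example assuming a two-argument signature; the function name `c_strides_from_shape` and its parameters are hypothetical, and the real `strides_from_shape` in numba-cuda may take different arguments.

```python
import functools


@functools.lru_cache(maxsize=None)
def c_strides_from_shape(shape, itemsize):
    """Byte strides of a C-contiguous array; arguments must be hashable."""
    if not shape:  # zero-dimensional: same check as len(shape) == 0
        return ()
    strides = [itemsize]
    for dim in reversed(shape[1:]):
        strides.append(strides[-1] * dim)
    return tuple(reversed(strides))


first = c_strides_from_shape((2, 3, 4), 8)
second = c_strides_from_shape((2, 3, 4), 8)  # served from the cache
info = c_strides_from_shape.cache_info()
print(first, info.hits, info.misses)
```

Because `lru_cache` keys on the arguments, the shape has to be passed as a tuple; a list would raise TypeError. The cached result is also a tuple, so sharing it between callers is safe.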

Confidence Score: 5/5

  • This PR is safe to merge with no identified risks
  • The changes are well-structured performance optimizations that simplify the code path by removing redundant logic. The removal of the Purpose.argument check in _typeof_cuda_array_interface is safe because the function now correctly handles both arguments and constants directly. The lru_cache decorator is correctly applied to a function with immutable, hashable arguments. All changes maintain the same logical behavior while improving performance.
  • No files require special attention

Important Files Changed

| Filename | Overview |
| --- | --- |
| numba_cuda/numba/cuda/dispatcher.py | Simplified typeof_pyval by removing redundant exception handling and from_cuda_array_interface call - CUDA array interface objects now fall through to direct typing |
| numba_cuda/numba/cuda/typing/typeof.py | Removed Purpose.argument check to enable direct typing of CUDA array interface objects, added micro-optimizations with operator.eq and tuple unpacking |
| numba_cuda/numba/cuda/np/numpy_support.py | Added @functools.lru_cache to strides_from_shape for memoization, simplified zero-length check with not shape |
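
For readers unfamiliar with the `operator.eq` idiom named in the overview, here is a small sketch comparing it with the zip-based form. The function names are illustrative and the exact expressions changed in the PR are not reproduced here.

```python
import operator


def same_elements_zip(a, b):
    """Element-wise equality using a generator expression over zip."""
    return len(a) == len(b) and all(x == y for x, y in zip(a, b))


def same_elements_map(a, b):
    """Same check; map feeds pairs to operator.eq without a Python-level generator."""
    return len(a) == len(b) and all(map(operator.eq, a, b))


expected = (96, 32, 8)
print(same_elements_zip(expected, (96, 32, 8)), same_elements_map(expected, (96, 32, 4)))
```

Both forms stop at the first mismatch. The explicit length check is there because `zip` and `map` silently stop at the shorter input.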

@cpcloud
Contributor Author

cpcloud commented Jan 6, 2026

/ok to test

@cpcloud
Contributor Author

cpcloud commented Jan 6, 2026

Threw in a commit with some more shape/stride caching, which gives a bit more of a perf gain across the board compared with just removing the from_cuda_array_interface overhead.

@cpcloud
Contributor Author

cpcloud commented Jan 6, 2026

/ok to test

@cpcloud cpcloud merged commit 53d1c78 into NVIDIA:main Jan 7, 2026
130 checks passed
@cpcloud cpcloud deleted the remove-unnecessary-cai branch January 7, 2026 14:55
gmarkall added a commit to gmarkall/numba-cuda that referenced this pull request Jan 12, 2026
- Add arch specific target support (NVIDIA#549)
- chore: disable `locked` flag to bypass prefix-dev/pixi#5256 (NVIDIA#714)
- ci: relock pixi (NVIDIA#712)
- ci: remove redundant conda build in ci (NVIDIA#711)
- chore(deps): bump numba-cuda version and relock pixi (NVIDIA#707)
- Dropping bits in the old CI & Propagating recent changes from cuda-python (NVIDIA#683)
- Fix `test_wheel_deps_wheels.sh` to actually uninstall `nvvm` and `nvrtc` packages for CUDA 13 (NVIDIA#701)
- perf: remove some exception control flow and buffer-exception penalization for arrays (NVIDIA#700)
- perf: let CAI fall through instead of calling from_cuda_array_interface (NVIDIA#694)
- chore: perf lint (NVIDIA#697)
- chore(deps): bump deps in pixi lockfile (NVIDIA#693)
- fix: use freethreading-supported `_PySet_NextItemRef` where possible (NVIDIA#682)
- Support python `3.14` (NVIDIA#599)
- Remove customized address space tracking and address class emission in debug info (NVIDIA#669)
- Drop `experimental` from cuda.core namespace imports (NVIDIA#676)
- Remove dangling references to NUMBA_CUDA_ENABLE_MINOR_VERSION_COMPATIBILITY (NVIDIA#675)
- Use `rapidsai/sccache` in CI (NVIDIA#674)
- chore(dev-deps): remove ipython and pyinstrument (NVIDIA#670)
- Set up a new VM-based CI infrastructure (NVIDIA#604)
@gmarkall gmarkall mentioned this pull request Jan 12, 2026
gmarkall added a commit that referenced this pull request Jan 12, 2026