perf: let CAI fall through instead of calling from_cuda_array_interface #694

cpcloud · 2026-01-06T17:34:54Z

Remove the from_cuda_array_interface call when inferring the type of cupy and torch arrays. The speedup mainly benefits dispatch calls and reaches roughly 2.5x in some cases.

copy-pr-bot · 2026-01-06T17:34:58Z

Auto-sync is disabled for ready for review pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

cpcloud · 2026-01-06T17:35:07Z

/ok to test

cpcloud · 2026-01-06T17:36:26Z

Benchmarks:

-------------------------------------------------------------------------------- benchmark 'test_many_args[dispatch-cupy]': 2 tests --------------------------------------------------------------------------------
Name (time in ms)                                     Min                 Max                Mean            StdDev              Median               IQR            Outliers      OPS            Rounds  Iterations
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_many_args[dispatch-cupy] (NOW)               89.6508 (1.0)       91.4552 (1.0)       90.2616 (1.0)      0.5774 (1.0)       90.0616 (1.0)      0.6963 (1.0)           2;0  11.0789 (1.0)           9           1
test_many_args[dispatch-cupy] (0001_2291482)     214.5098 (2.39)     222.1062 (2.43)     218.8504 (2.42)     3.0810 (5.34)     219.9658 (2.44)     4.7617 (6.84)          2;0   4.5693 (0.41)          5           1
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

----------------------------------------------------------------------------- benchmark 'test_many_args[dispatch-device_array]': 2 tests ----------------------------------------------------------------------------
Name (time in ms)                                           Min               Max              Mean            StdDev            Median               IQR            Outliers       OPS            Rounds  Iterations
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_many_args[dispatch-device_array] (NOW)              7.3913 (1.0)      8.0769 (1.0)      7.8257 (1.01)     0.1785 (1.17)     7.8325 (1.02)     0.2293 (2.76)         13;1  127.7840 (0.99)         33           1
test_many_args[dispatch-device_array] (0001_2291482)     7.5903 (1.03)     8.3959 (1.04)     7.7254 (1.0)      0.1521 (1.0)      7.6848 (1.0)      0.0831 (1.0)           2;3  129.4427 (1.0)          34           1
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

-------------------------------------------------------------------------------- benchmark 'test_many_args[dispatch-torch]': 2 tests --------------------------------------------------------------------------------
Name (time in ms)                                      Min                 Max                Mean            StdDev              Median               IQR            Outliers      OPS            Rounds  Iterations
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_many_args[dispatch-torch] (NOW)               76.3577 (1.0)       81.2349 (1.0)       78.0553 (1.0)      1.7612 (1.0)       77.6029 (1.0)      1.8356 (1.0)           2;1  12.8114 (1.0)          10           1
test_many_args[dispatch-torch] (0001_2291482)     194.0611 (2.54)     207.7775 (2.56)     199.2025 (2.55)     5.1382 (2.92)     198.3603 (2.56)     4.7597 (2.59)          2;0   5.0200 (0.39)          5           1
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

------------------------------------------------------------------------------ benchmark 'test_many_args[signature-cupy]': 2 tests ------------------------------------------------------------------------------
Name (time in ms)                                     Min                Max               Mean            StdDev             Median               IQR            Outliers      OPS            Rounds  Iterations
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_many_args[signature-cupy] (0001_2291482)     63.8329 (1.0)      67.9061 (1.0)      65.8729 (1.0)      1.3112 (1.0)      65.6053 (1.0)      2.0993 (1.0)           6;0  15.1808 (1.0)          16           1
test_many_args[signature-cupy] (NOW)              66.3651 (1.04)     71.7770 (1.06)     69.4699 (1.05)     1.6888 (1.29)     69.7620 (1.06)     2.5991 (1.24)          5;0  14.3947 (0.95)         15           1
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

---------------------------------------------------------------------------- benchmark 'test_many_args[signature-device_array]': 2 tests -----------------------------------------------------------------------------
Name (time in ms)                                            Min               Max              Mean            StdDev            Median               IQR            Outliers       OPS            Rounds  Iterations
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_many_args[signature-device_array] (NOW)              7.6645 (1.0)      8.6032 (1.03)     7.9021 (1.0)      0.1163 (1.0)      7.8817 (1.0)      0.1170 (1.0)          20;6  126.5485 (1.0)         125           1
test_many_args[signature-device_array] (0001_2291482)     7.7900 (1.02)     8.3932 (1.0)      8.0326 (1.02)     0.1220 (1.05)     8.0036 (1.02)     0.1795 (1.53)         35;1  124.4927 (0.98)        124           1
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

------------------------------------------------------------------------------ benchmark 'test_many_args[signature-torch]': 2 tests ------------------------------------------------------------------------------
Name (time in ms)                                      Min                Max               Mean            StdDev             Median               IQR            Outliers      OPS            Rounds  Iterations
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_many_args[signature-torch] (0001_2291482)     49.2430 (1.0)      52.5490 (1.0)      50.0586 (1.0)      0.6955 (1.0)      49.9634 (1.0)      0.7920 (2.91)          2;1  19.9766 (1.0)          21           1
test_many_args[signature-torch] (NOW)              53.5552 (1.09)     56.9416 (1.08)     56.3639 (1.13)     0.9894 (1.42)     56.6450 (1.13)     0.2725 (1.0)           2;2  17.7419 (0.89)         19           1
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

------------------------------------------------------------------------------- benchmark 'test_one_arg[dispatch-cupy]': 2 tests ------------------------------------------------------------------------------
Name (time in ms)                                  Min                Max               Mean            StdDev             Median               IQR            Outliers       OPS            Rounds  Iterations
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_one_arg[dispatch-cupy] (NOW)               6.6094 (1.0)       7.6386 (1.0)       7.0844 (1.0)      0.2969 (1.0)       6.9811 (1.0)      0.3962 (1.0)          21;0  141.1562 (1.0)          51           1
test_one_arg[dispatch-cupy] (0001_2291482)     11.3337 (1.71)     13.9110 (1.82)     12.3475 (1.74)     0.5709 (1.92)     12.2268 (1.75)     0.5489 (1.39)         10;3   80.9880 (0.57)         41           1
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

----------------------------------------------------------------------------- benchmark 'test_one_arg[dispatch-device_array]': 2 tests ----------------------------------------------------------------------------
Name (time in ms)                                         Min               Max              Mean            StdDev            Median               IQR            Outliers       OPS            Rounds  Iterations
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_one_arg[dispatch-device_array] (NOW)              1.6835 (1.0)      1.7297 (1.0)      1.7092 (1.0)      0.0188 (1.0)      1.7080 (1.0)      0.0324 (1.06)          3;0  585.0609 (1.0)           6           1
test_one_arg[dispatch-device_array] (0001_2291482)     1.7268 (1.03)     1.7738 (1.03)     1.7555 (1.03)     0.0197 (1.05)     1.7639 (1.03)     0.0306 (1.0)           1;0  569.6463 (0.97)          5           1
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

------------------------------------------------------------------------------ benchmark 'test_one_arg[dispatch-torch]': 2 tests -------------------------------------------------------------------------------
Name (time in ms)                                   Min                Max               Mean            StdDev             Median               IQR            Outliers       OPS            Rounds  Iterations
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_one_arg[dispatch-torch] (NOW)               5.9903 (1.0)       6.3998 (1.0)       6.2124 (1.0)      0.1173 (1.0)       6.2603 (1.0)      0.2288 (1.0)          18;0  160.9694 (1.0)          52           1
test_one_arg[dispatch-torch] (0001_2291482)     10.5430 (1.76)     11.2606 (1.76)     10.8350 (1.74)     0.2048 (1.75)     10.7178 (1.71)     0.3808 (1.66)         14;0   92.2931 (0.57)         40           1
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

---------------------------------------------------------------------------- benchmark 'test_one_arg[signature-cupy]': 2 tests -----------------------------------------------------------------------------
Name (time in ms)                                  Min               Max              Mean            StdDev            Median               IQR            Outliers       OPS            Rounds  Iterations
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_one_arg[signature-cupy] (0001_2291482)     4.5082 (1.0)      6.9082 (1.07)     5.3411 (1.02)     0.5337 (1.17)     5.3418 (1.06)     0.9416 (1.21)         81;0  187.2274 (0.98)        187           1
test_one_arg[signature-cupy] (NOW)              4.6543 (1.03)     6.4592 (1.0)      5.2385 (1.0)      0.4572 (1.0)      5.0360 (1.0)      0.7785 (1.0)          73;0  190.8960 (1.0)         198           1
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

---------------------------------------------------------------------------- benchmark 'test_one_arg[signature-device_array]': 2 tests -----------------------------------------------------------------------------
Name (time in ms)                                          Min               Max              Mean            StdDev            Median               IQR            Outliers       OPS            Rounds  Iterations
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_one_arg[signature-device_array] (0001_2291482)     1.6652 (1.0)      2.4609 (1.14)     1.8046 (1.0)      0.1041 (1.44)     1.7810 (1.0)      0.0754 (1.0)         47;25  554.1507 (1.0)         456           1
test_one_arg[signature-device_array] (NOW)              1.7005 (1.02)     2.1632 (1.0)      1.8792 (1.04)     0.0726 (1.0)      1.9043 (1.07)     0.1381 (1.83)        178;1  532.1514 (0.96)        489           1
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

----------------------------------------------------------------------------- benchmark 'test_one_arg[signature-torch]': 2 tests ----------------------------------------------------------------------------
Name (time in ms)                                   Min               Max              Mean            StdDev            Median               IQR            Outliers       OPS            Rounds  Iterations
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_one_arg[signature-torch] (NOW)              4.0829 (1.0)      4.5733 (1.0)      4.3557 (1.00)     0.0800 (1.0)      4.3773 (1.02)     0.0945 (1.0)         51;10  229.5819 (1.00)        220           1
test_one_arg[signature-torch] (0001_2291482)     4.1086 (1.01)     5.1700 (1.13)     4.3389 (1.0)      0.1538 (1.92)     4.2765 (1.0)      0.1822 (1.93)         27;7  230.4755 (1.0)         221           1
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Legend:
  Outliers: 1 Standard Deviation from Mean; 1.5 IQR (InterQuartile Range) from 1st Quartile and 3rd Quartile.
  OPS: Operations Per Second, computed as 1 / Mean
======================================================================================== 12 passed in 12.21s =========================================================================================
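The ratios shown in parentheses can be reproduced from the Mean columns. A minimal sketch, using mean values copied from the dispatch tables above; the dictionary names are chosen here for illustration and are not part of the benchmark suite:

```python
# Mean times in milliseconds, copied from the benchmark tables above:
# (mean on this branch, mean on the 0001_2291482 baseline).
means_ms = {
    "test_many_args[dispatch-cupy]": (90.2616, 218.8504),
    "test_many_args[dispatch-torch]": (78.0553, 199.2025),
    "test_one_arg[dispatch-cupy]": (7.0844, 12.3475),
    "test_one_arg[dispatch-torch]": (6.2124, 10.8350),
}

# Speedup is the baseline mean divided by the new mean.
speedups = {name: old / new for name, (new, old) in means_ms.items()}

# OPS as defined in the legend: 1 / Mean, with the mean converted to seconds.
ops_now = {name: 1000.0 / new for name, (new, _) in means_ms.items()}

for name, ratio in speedups.items():
    print(f"{name}: {ratio:.2f}x faster, {ops_now[name]:.2f} ops/s")
```

The many-argument dispatch cases land near 2.4x to 2.55x and the one-argument cases near 1.74x, which matches the parenthesised figures in the tables.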

cpcloud · 2026-01-06T17:38:43Z

numba_cuda/numba/cuda/typing/typeof.py

    handled in numba.cuda.np.arrayobj.
    """
-    # Only handle constants, not arguments (arguments use regular array typing)
-    if c.purpose == Purpose.argument:


Returning None here caused from_cuda_array_interface to be called, which is much more expensive than just executing the rest of this code.
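To illustrate the difference between the two paths, here is a rough sketch that runs without a GPU. Every name in it (`FakeDeviceArray`, `WrapperSketch`, `ArrayTypeSketch`, `typeof_old`, `typeof_new`) is hypothetical and stands in for the real numba-cuda machinery, which is not reproduced here; the layout rule is also deliberately simplified.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ArrayTypeSketch:
    """Hypothetical stand-in for a Numba array type."""
    typestr: str
    ndim: int
    layout: str


class FakeDeviceArray:
    """Hypothetical stand-in for a CuPy or torch array; owns no device memory."""

    def __init__(self, shape, typestr="<f4", strides=None):
        self.__cuda_array_interface__ = {
            "shape": shape,
            "typestr": typestr,
            "strides": strides,  # None means C-contiguous in the CAI spec
            "data": (0, False),
            "version": 3,
        }


class WrapperSketch:
    """Stand-in for the intermediate device array the old path built."""
    created = 0

    def __init__(self, cai):
        WrapperSketch.created += 1
        self.cai = cai


def type_from_cai(cai):
    # Simplification: anything with explicit strides is treated as layout "A".
    layout = "C" if cai["strides"] is None else "A"
    return ArrayTypeSketch(cai["typestr"], len(cai["shape"]), layout)


def typeof_old(obj):
    # Old behaviour (sketch): typing declined arguments, so the caller
    # built a wrapper object first and typed that instead.
    wrapper = WrapperSketch(obj.__cuda_array_interface__)
    return type_from_cai(wrapper.cai)


def typeof_new(obj):
    # New behaviour (sketch): type the interface dict directly, no wrapper.
    return type_from_cai(obj.__cuda_array_interface__)
```

Both functions return the same type for the same input; only the old one allocates the intermediate object, which is the cost this change avoids on every dispatch.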

greptile-apps · 2026-01-06T17:39:15Z

Greptile Summary

This PR optimizes type inference for third-party device arrays (CuPy, PyTorch) that implement __cuda_array_interface__. Previously, the dispatcher would catch a ValueError and fall back to calling from_cuda_array_interface to create an intermediate DeviceNDArray, then type that object. Now it lets the typing system handle the interface directly, eliminating the intermediate object allocation and improving dispatch performance by up to 2.5x.

Key changes:

- Removed redundant exception handling in typeof_pyval that called from_cuda_array_interface
- Removed the Purpose.argument check in _typeof_cuda_array_interface to enable direct typing of CUDA array interface objects for arguments (not just constants)
- Added @functools.lru_cache to strides_from_shape to memoize stride calculations
- Minor code cleanups: simplified comparisons (not shape vs len(shape) == 0), used operator.eq with map instead of zip comprehensions, and tuple unpacking for readability
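As an illustration of the memoization idea, the following sketch caches a C-order stride computation with `functools.lru_cache`. The function `c_strides_sketch` and its signature are assumptions made for this example; it is not the library's `strides_from_shape`, whose exact signature is not shown in this thread.

```python
import functools


@functools.lru_cache(maxsize=None)
def c_strides_sketch(shape, itemsize):
    """Byte strides for a C-contiguous array; shape must be a tuple."""
    if not shape:  # zero-dimensional: no strides
        return ()
    strides = [itemsize]
    # Walk the dimensions right to left, skipping the outermost one.
    for extent in reversed(shape[1:]):
        strides.append(strides[-1] * extent)
    return tuple(reversed(strides))


print(c_strides_sketch((2, 3, 4), 8))
print(c_strides_sketch((2, 3, 4), 8))  # second call is served from the cache
print(c_strides_sketch.cache_info())
```

Because `lru_cache` keys on the arguments, they have to be hashable: a tuple shape works, while a list shape would raise `TypeError`.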

Confidence Score: 5/5

This PR is safe to merge with no identified risks
The changes are well-structured performance optimizations that simplify the code path by removing redundant logic. The removal of the Purpose.argument check in _typeof_cuda_array_interface is safe because the function now correctly handles both arguments and constants directly. The lru_cache decorator is correctly applied to a function with immutable, hashable arguments. All changes maintain the same logical behavior while improving performance.
No files require special attention

Important Files Changed

| Filename | Overview |
| --- | --- |
| numba_cuda/numba/cuda/dispatcher.py | Simplified `typeof_pyval` by removing redundant exception handling and `from_cuda_array_interface` call - CUDA array interface objects now fall through to direct typing |
| numba_cuda/numba/cuda/typing/typeof.py | Removed `Purpose.argument` check to enable direct typing of CUDA array interface objects, added micro-optimizations with `operator.eq` and tuple unpacking |
| numba_cuda/numba/cuda/np/numpy_support.py | Added `@functools.lru_cache` to `strides_from_shape` for memoization, simplified zero-length check with `not shape` |
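For readers unfamiliar with the two micro-optimizations named in the table, here is a small sketch of the equivalent forms. The helper names are hypothetical and the real call sites in numba-cuda are not reproduced.

```python
import operator


def same_extents_zip(a, b):
    # Generator-expression form over zipped pairs.
    return all(x == y for x, y in zip(a, b))


def same_extents_map(a, b):
    # map calls operator.eq pairwise and stops at the shorter input, like zip.
    return all(map(operator.eq, a, b))


def is_zero_dim(shape):
    # For a tuple, `not shape` is equivalent to `len(shape) == 0`.
    return not shape
```

The `not shape` form relies on `shape` being a plain tuple; it would not be a safe substitute for objects that define truthiness differently, such as NumPy arrays.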

cpcloud · 2026-01-06T17:54:39Z

/ok to test

cpcloud · 2026-01-06T18:04:13Z

Threw in a commit with some more shape/stride caching, which gives a bit more perf gains across the board relative to just removing the from_cuda_array_interface overhead.

cpcloud · 2026-01-06T18:10:00Z

/ok to test

- Add arch specific target support (NVIDIA#549)
- chore: disable `locked` flag to bypass prefix-dev/pixi#5256 (NVIDIA#714)
- ci: relock pixi (NVIDIA#712)
- ci: remove redundant conda build in ci (NVIDIA#711)
- chore(deps): bump numba-cuda version and relock pixi (NVIDIA#707)
- Dropping bits in the old CI & Propagating recent changes from cuda-python (NVIDIA#683)
- Fix `test_wheel_deps_wheels.sh` to actually uninstall `nvvm` and `nvrtc` packages for CUDA 13 (NVIDIA#701)
- perf: remove some exception control flow and buffer-exception penalization for arrays (NVIDIA#700)
- perf: let CAI fall through instead of calling from_cuda_array_interface (NVIDIA#694)
- chore: perf lint (NVIDIA#697)
- chore(deps): bump deps in pixi lockfile (NVIDIA#693)
- fix: use freethreading-supported `_PySet_NextItemRef` where possible (NVIDIA#682)
- Support python `3.14` (NVIDIA#599)
- Remove customized address space tracking and address class emission in debug info (NVIDIA#669)
- Drop `experimental` from cuda.core namespace imports (NVIDIA#676)
- Remove dangling references to NUMBA_CUDA_ENABLE_MINOR_VERSION_COMPATIBILITY (NVIDIA#675)
- Use `rapidsai/sccache` in CI (NVIDIA#674)
- chore(dev-deps): remove ipython and pyinstrument (NVIDIA#670)
- Set up a new VM-based CI infrastructure (NVIDIA#604)

perf: let CAI fall through instead of calling from_cuda_array_interface

b58b629

cpcloud requested a review from gmarkall January 6, 2026 17:36

chore: remove unnecessary exception

61e2866

perf: cache more shape and stride computations

103c8ae

chore: remove == 0 comparison in favor of simpler not

68013da

gmarkall approved these changes Jan 7, 2026

cpcloud merged commit 53d1c78 into NVIDIA:main Jan 7, 2026
130 checks passed

cpcloud deleted the remove-unnecessary-cai branch January 7, 2026 14:55

gmarkall mentioned this pull request Jan 12, 2026

Bump version to 0.24.0 #716

Merged
