Conversation

@cpcloud (Contributor) commented Nov 19, 2025

Overview

This PR removes the C++ extension-based DeviceArray class and replaces it with a pure Python implementation, significantly reducing codebase complexity while maintaining functionality. The changes also introduce type computation caching to mitigate potential performance overhead.

Changes

Implementation Changes

numba_cuda/numba/cuda/cudadrv/devicearray.py

  1. DeviceNDArrayBase now inherits from object instead of _devicearray.DeviceArray
  2. Added _numba_type_ instance attribute caching in _numba_type_ property to avoid repeated type computation
    • Type computation now occurs once per array instance
    • Subsequent accesses return cached value via walrus operator pattern
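A minimal sketch of this caching pattern (illustrative only; aside from `_numba_type_`, the class and attribute names here are hypothetical and not the actual numba-cuda source):

```python
class DeviceNDArrayLike:
    """Illustrative stand-in for DeviceNDArrayBase."""

    def __init__(self, shape, dtype):
        self.shape = shape
        self.dtype = dtype
        self._cached_type = None  # hypothetical backing attribute
        self.type_computations = 0  # instrumentation for this example

    @property
    def _numba_type_(self):
        # First access computes the type; later accesses return the
        # cached value via the walrus operator.
        if (cached := self._cached_type) is not None:
            return cached
        self._cached_type = result = self._compute_type()
        return result

    def _compute_type(self):
        # Stand-in for the real (comparatively expensive) type construction.
        self.type_computations += 1
        return (len(self.shape), self.dtype, "C")
```

Repeated accesses to `_numba_type_` on the same instance trigger the computation only once.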

Performance Considerations

Removed Optimization: The C++ implementation provided fast-path type fingerprinting through direct table lookup for device arrays during dispatch:

  • Cached type codes indexed by [ndim-1][layout][dtype]
  • Avoided Python property access and type construction for common array configurations

Mitigation: Instance-level caching of _numba_type_ property reduces repeated type computation overhead:

  • Type is computed once per DeviceNDArrayBase instance
  • Subsequent kernel launches with the same array instance use cached type
  • Overhead primarily affects first use of each array instance

Expected Impact:

  • Potential slight slowdown on first kernel launch with a given array (fallback to fingerprinting instead of table lookup)
  • Negligible impact on subsequent launches with the same array instance due to caching
  • Reduced build complexity
  • Easier experimentation with swapping the underlying implementation for a StridedMemoryView-based one

Trade-offs

Benefits:

  • Reduced codebase complexity (~370 lines of C++ code removed)
  • Simplified build process (one fewer extension module)
  • Easier maintenance and debugging

Costs:

  • Potential minor performance regression on first kernel launch with new array instances

@copy-pr-bot bot commented Nov 19, 2025

Auto-sync is disabled for ready for review pull requests in this repository. Workflows must be run manually.


@cpcloud cpcloud requested a review from gmarkall November 19, 2025 16:32
@cpcloud (Contributor, Author) commented Nov 19, 2025

/ok to test

@gmarkall (Contributor) left a comment

The code changes look good.

I need to spend a few minutes investigating a bit, because I am surprised that the "fast" path seemed to be slower than the default one. But, things have changed a lot since I initially implemented it, so it's also plausible we've drifted into a place where it no longer helps.

@gmarkall gmarkall added the 3 - Ready for Review Ready for review by team label Nov 19, 2025
@cpcloud (Contributor, Author) commented Nov 19, 2025

I am surprised that the "fast" path seemed to be slower than the default one

I will post benchmarks in a bit, but the difference isn't huge. I wouldn't have been surprised if the fast path had come out ahead; in fact, that was my initial interpretation of the results, until I realized it was actually the other way around.

@cpcloud cpcloud force-pushed the remove-devicearray-dispatching-code branch 2 times, most recently from c9b0501 to 6628ea7 Compare November 19, 2025 18:36
@cpcloud (Contributor, Author) commented Nov 19, 2025

/ok to test

1 similar comment

@cpcloud cpcloud force-pushed the remove-devicearray-dispatching-code branch from 6628ea7 to f320ce6 Compare November 19, 2025 18:38
@cpcloud (Contributor, Author) commented Nov 19, 2025

Benchmarks are a little variable (here the best improvement is around 7%)

  • NOW is this PR
  • 0001_1478723 is the first benchmark result set, taken at commit 1478723 (the first commit in this PR, which contains only the new benchmarking code)

[benchmark comparison screenshot]

@greptile-apps bot (Contributor) commented Nov 21, 2025

Greptile Overview

Greptile Summary

This PR successfully removes the C++ extension-based DeviceArray implementation in favor of a pure Python approach, eliminating ~370 lines of C++ code and simplifying the build process. The key changes include:

  • Core refactor: DeviceNDArrayBase now inherits from object instead of the C++ _devicearray.DeviceArray class
  • Performance mitigation: Added @functools.cached_property to _numba_type_ to cache type computation per array instance
  • Type dispatch change: Removed fast-path table lookup (typecode_devicendarray) in favor of fingerprinting-based dispatch that leverages the Python _numba_type_ property
  • Build simplification: Removed _devicearray extension module from setup.py
  • Testing infrastructure: Added dispatch vs signature benchmarking variants to measure performance impact

The implementation is correct and safe. The cached_property decorator is appropriate since array properties (shape, strides, dtype) are immutable after construction. Device arrays will now follow the standard fingerprinting path, accessing the cached _numba_type_ property.
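For reference, the `functools.cached_property` mechanism mentioned above works like this (a generic illustration; the class here is hypothetical, not the numba-cuda source):

```python
import functools

class ArrayLike:
    """Hypothetical array wrapper demonstrating per-instance caching."""

    def __init__(self, shape, dtype):
        self.shape = shape
        self.dtype = dtype

    @functools.cached_property
    def _numba_type_(self):
        # Runs once per instance; functools stores the result in the
        # instance __dict__, so later lookups never re-enter this body.
        return ("Array", self.dtype, len(self.shape))
```

Because the result is cached on the instance, this is only safe when the attributes it depends on (shape, strides, dtype) are immutable after construction, as the summary notes.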

Confidence Score: 5/5

  • Safe to merge - well-structured refactoring with appropriate performance mitigation
  • The code changes are clean, correct, and well-documented. The removal of C++ code reduces complexity without introducing bugs. The cached_property implementation is safe because array properties are immutable. Benchmarking infrastructure confirms performance considerations were addressed.
  • No files require special attention

Important Files Changed

File Analysis

numba_cuda/numba/cuda/cudadrv/devicearray.py (5/5): Changed base class from the C++ _devicearray.DeviceArray to pure Python object; added the @functools.cached_property decorator to _numba_type_ as a performance optimization
numba_cuda/numba/cuda/cext/_dispatcher.cpp (5/5): Removed the devicearray import function and initialization call; cleaned up header includes
numba_cuda/numba/cuda/cext/_typeof.cpp (5/5): Removed the fast-path typecode_devicendarray function (~100 lines); device arrays now use the fingerprinting fallback
setup.py (5/5): Removed the ext_devicearray extension module from the build configuration

@greptile-apps bot (Contributor) left a comment

8 files reviewed, no comments

@gmarkall (Contributor) commented

Benchmarks added in a previous commit show the C extension provides minimal performance benefit (0-4% faster) which doesn't justify the added complexity.

My understanding is that the opposite is shown - the "fast path" being removed here is actually 0-4% slower than the "fallback", so removing this code is both a performance and complexity improvement.

@gmarkall (Contributor) commented

I've been experimenting with this locally, in two configurations:

  • main branch at commit d08e8a9 and the updated pytest.ini and test_kernel_launch.py from this branch checked out, so the benchmarks run are the same.
  • This branch checked out, with main from the above commit merged in

The idea being that we're comparing main with and without this PR applied.

With main the timings I get are:

--------------------------------------------------------------------------------------------------------- benchmark: 8 tests --------------------------------------------------------------------------------------------------------
Name (time in us)                                   Min                     Max                    Mean              StdDev                  Median                 IQR            Outliers         OPS            Rounds  Iterations
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_one_arg[dispatch-device_array]            748.7840 (1.0)        1,008.8570 (1.23)         776.2828 (1.00)      15.2857 (1.90)         775.0270 (1.00)       9.2160 (1.12)       116;70  1,288.1904 (1.00)       1158           1
test_one_arg[signature-device_array]           753.5910 (1.01)         819.9070 (1.0)          774.2301 (1.0)        8.0595 (1.0)          773.6250 (1.0)        8.2115 (1.0)          10;2  1,291.6056 (1.0)          67           1
test_one_arg[dispatch-cupy]                  1,839.6880 (2.46)       2,325.0710 (2.84)       1,903.9161 (2.46)      32.7335 (4.06)       1,898.5565 (2.45)      22.8490 (2.78)        60;35    525.2332 (0.41)        486           1
test_one_arg[signature-cupy]                 5,387.6110 (7.20)       5,718.3980 (6.97)       5,498.9842 (7.10)      53.1874 (6.60)       5,496.8680 (7.11)      66.7358 (8.13)         16;2    181.8518 (0.14)         69           1
test_many_args[dispatch-device_array]        6,061.3230 (8.09)       6,354.0870 (7.75)       6,133.8225 (7.92)      33.8836 (4.20)       6,131.5130 (7.93)      38.2100 (4.65)         31;3    163.0305 (0.13)        160           1
test_many_args[signature-device_array]       5,959.2460 (7.96)       6,214.8450 (7.58)       6,039.5247 (7.80)      41.9748 (5.21)       6,036.7950 (7.80)      48.5348 (5.91)         10;1    165.5759 (0.13)         53           1
test_many_args[dispatch-cupy]               26,816.2350 (35.81)     27,309.8710 (33.31)     26,934.2632 (34.79)     90.4253 (11.22)     26,917.1950 (34.79)     76.1835 (9.28)          5;2     37.1274 (0.03)         35           1
test_many_args[signature-cupy]             101,691.6180 (135.81)   102,231.6370 (124.69)   101,974.4584 (131.71)   201.6629 (25.02)    102,016.2960 (131.87)   333.7150 (40.64)         3;0      9.8064 (0.01)          9           1
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

With this branch, I get:

---------------------------------------------------------------------------------------------------------- benchmark: 8 tests ---------------------------------------------------------------------------------------------------------
Name (time in us)                                   Min                     Max                    Mean              StdDev                  Median                   IQR            Outliers         OPS            Rounds  Iterations
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_one_arg[dispatch-device_array]            740.8460 (1.0)          962.4230 (1.0)          772.3301 (1.0)       16.1450 (1.0)          769.9750 (1.0)          9.2210 (1.0)       250;171  1,294.7831 (1.0)        1135           1
test_one_arg[signature-device_array]         1,261.4590 (1.70)       1,383.1550 (1.44)       1,330.0216 (1.72)      26.6591 (1.65)       1,329.2700 (1.73)        37.7452 (4.09)         13;0    751.8675 (0.58)         45           1
test_one_arg[dispatch-cupy]                  1,827.6040 (2.47)       2,132.8190 (2.22)       1,880.6175 (2.43)      32.6976 (2.03)       1,874.8390 (2.43)        22.8098 (2.47)        60;22    531.7402 (0.41)        465           1
test_one_arg[signature-cupy]                 5,374.7870 (7.25)       6,002.8690 (6.24)       5,512.2594 (7.14)     101.2299 (6.27)       5,492.6985 (7.13)        74.6990 (8.10)          7;3    181.4138 (0.14)         50           1
test_many_args[dispatch-device_array]        5,830.0840 (7.87)       6,124.8530 (6.36)       5,918.4807 (7.66)      51.1397 (3.17)       5,900.9190 (7.66)        66.8490 (7.25)         44;2    168.9623 (0.13)        164           1
test_many_args[signature-device_array]      17,224.7620 (23.25)     17,850.9780 (18.55)     17,392.8900 (22.52)    146.1040 (9.05)      17,359.4530 (22.55)      220.6965 (23.93)         5;1     57.4948 (0.04)         25           1
test_many_args[dispatch-cupy]               26,270.7640 (35.46)     26,632.8730 (27.67)     26,389.2848 (34.17)     75.3687 (4.67)      26,380.0700 (34.26)       67.9575 (7.37)          7;3     37.8942 (0.03)         37           1
test_many_args[signature-cupy]             103,262.9600 (139.39)   105,188.3970 (109.30)   104,006.8155 (134.67)   740.4433 (45.86)    103,701.8830 (134.68)   1,205.4975 (130.73)        3;0      9.6148 (0.01)          8           1
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

So it looks like performance is mostly the same, except in test_many_args[signature-device_array], which is about 3x slower with this branch.

I'm looking into why this could be.

@gmarkall (Contributor) commented

Also, it seems that this branch is faster in all cases with the dispatch variant of the tests. This is the more important use case (it's not normally recommended to add signatures to kernels).

@cpcloud (Contributor, Author) commented Dec 2, 2025

There's a straightforward way to see the comparisons against each other, so that it's clear where the differences are:

git checkout $THIS_BRANCH
git checkout HEAD~ # the commit with just the benchmarks
pixi run -e cu-12-9-py312 bench
git checkout $THIS_BRANCH
pixi reinstall -e cu-12-9-py312 numba-cuda # to ensure that the changes to the C extensions are compiled and installed into the environment
pixi run -e cu-12-9-py312 benchcmp 0001 # 0001 is an autogenerated name from the previous run

which results in

[benchmark comparison screenshot]

I may wrap this up in a pixi task to reduce the number of steps needed to do the comparison.

@cpcloud cpcloud force-pushed the remove-devicearray-dispatching-code branch 2 times, most recently from d598a52 to aa2a2bf Compare December 5, 2025 15:08
@greptile-apps bot (Contributor) left a comment

7 files reviewed, no comments

@cpcloud cpcloud force-pushed the remove-devicearray-dispatching-code branch from aa2a2bf to 2fd2e58 Compare December 5, 2025 17:59
@greptile-apps bot (Contributor) left a comment

8 files reviewed, no comments

@cpcloud cpcloud force-pushed the remove-devicearray-dispatching-code branch from 2fd2e58 to 3bca842 Compare December 5, 2025 18:12
@greptile-apps bot (Contributor) left a comment

8 files reviewed, no comments

@cpcloud cpcloud force-pushed the remove-devicearray-dispatching-code branch 2 times, most recently from 868376c to b89d7fd Compare December 5, 2025 18:47
@cpcloud (Contributor, Author) commented Dec 5, 2025

/ok to test

@greptile-apps bot (Contributor) left a comment

8 files reviewed, no comments

@cpcloud cpcloud force-pushed the remove-devicearray-dispatching-code branch from b89d7fd to 8a6657d Compare December 5, 2025 19:09
@cpcloud (Contributor, Author) commented Dec 5, 2025

/ok to test

@greptile-apps bot (Contributor) left a comment

Additional Comments (2)

  1. numba_cuda/numba/cuda/tests/benchmarks/test_kernel_launch.py, line 42 (link)

    syntax: IDs are swapped - cuda.jit is dispatch mode, cuda.jit("void(float32[::1])") is signature mode

  2. numba_cuda/numba/cuda/tests/benchmarks/test_kernel_launch.py, line 96 (link)

    syntax: IDs are swapped here too - cuda.jit is dispatch mode, cuda.jit("void(...)") is signature mode

8 files reviewed, 2 comments


@cpcloud (Contributor, Author) commented Dec 5, 2025

/ok to test

@cpcloud cpcloud force-pushed the remove-devicearray-dispatching-code branch from e4101fe to 5d4c45b Compare December 5, 2025 19:16
@greptile-apps bot (Contributor) left a comment

8 files reviewed, no comments

@cpcloud (Contributor, Author) commented Dec 5, 2025

Here are the new benchmarks:

[benchmark comparison screenshot]

At most there's a 7% slowdown, likely from the initial cost of computing the numba type.

@cpcloud cpcloud force-pushed the remove-devicearray-dispatching-code branch from 5d4c45b to edfdf00 Compare December 8, 2025 15:35
@cpcloud (Contributor, Author) commented Dec 8, 2025

/ok to test

@greptile-apps bot (Contributor) left a comment

8 files reviewed, no comments

@cpcloud cpcloud enabled auto-merge (squash) December 8, 2025 15:42
@cpcloud cpcloud merged commit 15750e6 into NVIDIA:main Dec 8, 2025
71 checks passed
@cpcloud cpcloud deleted the remove-devicearray-dispatching-code branch December 8, 2025 16:16
gmarkall added a commit to gmarkall/numba-cuda that referenced this pull request Dec 17, 2025
- Fix NVIDIA#624: Accept Numba IR nodes in all places Numba-CUDA IR nodes are expected (NVIDIA#643)
- Fix Issue NVIDIA#588: separate compilation of NVVM IR modules when generating debuginfo (NVIDIA#591)
- feat: allow printing nested tuples (NVIDIA#667)
- build(deps): bump actions/setup-python from 5.6.0 to 6.1.0 (NVIDIA#655)
- build(deps): bump actions/upload-artifact from 4 to 5 (NVIDIA#652)
- Test RAPIDS 25.12 (NVIDIA#661)
- Do not manually set DUMP_ASSEMBLY in `nvjitlink` tests (NVIDIA#662)
- feat: add print support for int64 tuples (NVIDIA#663)
- Only run dependabot monthly and open fewer PRs (NVIDIA#658)
- test: fix bogus `self` argument to `Context` (NVIDIA#656)
- Fix false negative NRT link decision when NRT was previously toggled on (NVIDIA#650)
- Add support for dependabot (NVIDIA#647)
- refactor: cull dead linker objects (NVIDIA#649)
- Migrate numba-cuda driver to use cuda.core.launch API (NVIDIA#609)
- feat: add set_shared_memory_carveout (NVIDIA#629)
- chore: bump version in pixi.toml (NVIDIA#641)
- refactor: remove devicearray code to reduce complexity (NVIDIA#600)
@gmarkall gmarkall mentioned this pull request Dec 17, 2025
gmarkall added a commit that referenced this pull request Dec 17, 2025
- Capture global device arrays in kernels and device functions (#666)
- Fix #624: Accept Numba IR nodes in all places Numba-CUDA IR nodes are expected (#643)
- Fix Issue #588: separate compilation of NVVM IR modules when generating debuginfo (#591)
- feat: allow printing nested tuples (#667)
- build(deps): bump actions/setup-python from 5.6.0 to 6.1.0 (#655)
- build(deps): bump actions/upload-artifact from 4 to 5 (#652)
- Test RAPIDS 25.12 (#661)
- Do not manually set DUMP_ASSEMBLY in `nvjitlink` tests (#662)
- feat: add print support for int64 tuples (#663)
- Only run dependabot monthly and open fewer PRs (#658)
- test: fix bogus `self` argument to `Context` (#656)
- Fix false negative NRT link decision when NRT was previously toggled on (#650)
- Add support for dependabot (#647)
- refactor: cull dead linker objects (#649)
- Migrate numba-cuda driver to use cuda.core.launch API (#609)
- feat: add set_shared_memory_carveout (#629)
- chore: bump version in pixi.toml (#641)
- refactor: remove devicearray code to reduce complexity (#600)
ZzEeKkAa added a commit to ZzEeKkAa/numba-cuda that referenced this pull request Jan 8, 2026
v0.23.0
