refactor: remove devicearray code to reduce complexity #600
Conversation
Auto-sync is disabled for ready-for-review pull requests in this repository. Workflows must be run manually.

/ok to test
gmarkall left a comment
The code changes look good.
I need to spend a few minutes investigating a bit, because I am surprised that the "fast" path seemed to be slower than the default one. But, things have changed a lot since I initially implemented it, so it's also plausible we've drifted into a place where it no longer helps.
I will post benchmarks in a bit, but the difference isn't huge. I wouldn't have been surprised if it were the other way around (which was my initial interpretation until I realized it was the other way around).
Force-pushed from c9b0501 to 6628ea7
/ok to test

1 similar comment

/ok to test
Force-pushed from 6628ea7 to f320ce6
Benchmarks are a little variable (here the best improvement is around 7%).
Greptile Overview

Greptile Summary

This PR successfully removes the C++ extension-based `DeviceArray` class in favor of a pure Python implementation. The implementation is correct and safe. Confidence Score: 5/5
8 files reviewed, no comments
My understanding is that the opposite is shown: the "fast path" being removed here is actually 0-4% slower than the "fallback", so removing this code is both a performance and complexity improvement.
I've been experimenting with this locally, in two configurations: […]

The idea being that we're comparing […]. With […], I get: […] With this branch, I get: […]

So it looks like performance is mostly the same, except for in […]. I'm looking into why this could be.
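For reference, a minimal sketch of the kind of launch-overhead microbenchmark under discussion; this is not the author's actual setup, and the kernels, sizes, and iteration counts are illustrative:

```python
import time

import numpy as np
from numba import cuda


@cuda.jit  # dispatch mode: argument types inferred at first call
def dispatch_kernel(arr):
    i = cuda.grid(1)
    if i < arr.size:
        arr[i] += 1.0


@cuda.jit("void(float32[::1])")  # signature mode: compiled eagerly
def signature_kernel(arr):
    i = cuda.grid(1)
    if i < arr.size:
        arr[i] += 1.0


def mean_launch_time(kernel, d_arr, n=10_000):
    kernel[1, 256](d_arr)  # warm up (triggers any lazy compilation)
    cuda.synchronize()
    start = time.perf_counter()
    for _ in range(n):
        kernel[1, 256](d_arr)
    cuda.synchronize()
    return (time.perf_counter() - start) / n


d_arr = cuda.to_device(np.zeros(256, dtype=np.float32))
print("dispatch: ", mean_launch_time(dispatch_kernel, d_arr))
print("signature:", mean_launch_time(signature_kernel, d_arr))
```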
Also, it seems that this branch is faster in all cases with the […]
Force-pushed from d598a52 to aa2a2bf
7 files reviewed, no comments
Force-pushed from aa2a2bf to 2fd2e58
8 files reviewed, no comments
Force-pushed from 2fd2e58 to 3bca842
8 files reviewed, no comments
Force-pushed from 868376c to b89d7fd
/ok to test
8 files reviewed, no comments
Force-pushed from b89d7fd to 8a6657d
/ok to test
Additional Comments (2)

- `numba_cuda/numba/cuda/tests/benchmarks/test_kernel_launch.py`, line 42: syntax: IDs are swapped; `cuda.jit` is dispatch mode, `cuda.jit("void(float32[::1])")` is signature mode
- `numba_cuda/numba/cuda/tests/benchmarks/test_kernel_launch.py`, line 96: syntax: IDs are swapped here too; `cuda.jit` is dispatch mode, `cuda.jit("void(...)")` is signature mode (see the sketch below)

8 files reviewed, 2 comments
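A hedged sketch of the parametrization pattern those comments describe (illustrative, not the actual test file; assumes the pytest-benchmark `benchmark` fixture is available):

```python
import numpy as np
import pytest
from numba import cuda


@pytest.mark.parametrize(
    "decorator",
    [cuda.jit, cuda.jit("void(float32[::1])")],
    ids=["dispatch", "signature"],  # ids must line up with the params above
)
def test_kernel_launch(benchmark, decorator):
    @decorator
    def kernel(arr):
        arr[0] = 1.0

    d_arr = cuda.to_device(np.zeros(1, dtype=np.float32))
    benchmark(lambda: kernel[1, 1](d_arr))
```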
/ok to test
Force-pushed from e4101fe to 5d4c45b
8 files reviewed, no comments
Force-pushed from 5d4c45b to edfdf00
/ok to test
8 files reviewed, no comments
v0.23.0

- Capture global device arrays in kernels and device functions (NVIDIA#666)
- Fix NVIDIA#624: Accept Numba IR nodes in all places Numba-CUDA IR nodes are expected (NVIDIA#643)
- Fix Issue NVIDIA#588: separate compilation of NVVM IR modules when generating debuginfo (NVIDIA#591)
- feat: allow printing nested tuples (NVIDIA#667)
- build(deps): bump actions/setup-python from 5.6.0 to 6.1.0 (NVIDIA#655)
- build(deps): bump actions/upload-artifact from 4 to 5 (NVIDIA#652)
- Test RAPIDS 25.12 (NVIDIA#661)
- Do not manually set DUMP_ASSEMBLY in `nvjitlink` tests (NVIDIA#662)
- feat: add print support for int64 tuples (NVIDIA#663)
- Only run dependabot monthly and open fewer PRs (NVIDIA#658)
- test: fix bogus `self` argument to `Context` (NVIDIA#656)
- Fix false negative NRT link decision when NRT was previously toggled on (NVIDIA#650)
- Add support for dependabot (NVIDIA#647)
- refactor: cull dead linker objects (NVIDIA#649)
- Migrate numba-cuda driver to use cuda.core.launch API (NVIDIA#609)
- feat: add set_shared_memory_carveout (NVIDIA#629)
- chore: bump version in pixi.toml (NVIDIA#641)
- refactor: remove devicearray code to reduce complexity (NVIDIA#600)



Overview
This PR removes the C++ extension-based `DeviceArray` class and replaces it with a pure Python implementation, significantly reducing codebase complexity while maintaining functionality. The changes also introduce type computation caching to mitigate potential performance overhead.

Changes
Implementation Changes
numba_cuda/numba/cuda/cudadrv/devicearray.py
- `DeviceNDArrayBase` now inherits from `object` instead of `_devicearray.DeviceArray`
- Added `_numba_type_` instance attribute caching in the `_numba_type_` property to avoid repeated type computation

Performance Considerations
Removed Optimization: The C++ implementation provided fast-path type fingerprinting through a direct table lookup for device arrays during dispatch, indexed as `[ndim-1][layout][dtype]`.
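A rough Python sketch of that table-lookup idea (not the removed C++ code; the dtype set, layout set, and dimension cap are illustrative):

```python
import numpy as np

LAYOUTS = ("C", "F", "A")
DTYPES = (np.dtype(np.float32), np.dtype(np.float64), np.dtype(np.int64))
MAX_NDIM = 8

# table[ndim-1][layout_index][dtype_index] -> precomputed type fingerprint
TABLE = [
    [[f"array({dt}, {nd + 1}d, {lay})" for dt in DTYPES] for lay in LAYOUTS]
    for nd in range(MAX_NDIM)
]


def fingerprint(ndim, layout, dtype):
    # An O(1) lookup stands in for recomputing the type on every dispatch.
    return TABLE[ndim - 1][LAYOUTS.index(layout)][DTYPES.index(dtype)]


print(fingerprint(2, "C", np.dtype(np.float32)))
```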
_numba_type_property reduces repeated type computation overhead:DeviceNDArrayBaseinstanceExpected Impact:
Expected Impact: dispatch performance should be comparable overall, since benchmarks showed the removed fast path was no faster than the `StridedMemoryView`-based one.

Trade-offs
Benefits: […]

Costs: […]