[WIP] Port numpy reduction tests to CUDA by brandon-b-miller · Pull Request #523 · NVIDIA/numba-cuda

brandon-b-miller · 2025-10-14T15:29:27Z

This works towards a reimplementation of upstream test_array_reductions.py. The goal is to allocate a NumPy array on a single thread and then perform the CPU check using the result.

cc @atmnp

copy-pr-bot · 2025-10-14T15:29:30Z

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.


brandon-b-miller · 2025-10-14T15:52:24Z

numba_cuda/numba/cuda/cpython/listobj.py

        fnty = ir.FunctionType(ir.VoidType(), [cgutils.voidptr_t])
        fn = cgutils.get_or_insert_function(
-            mod, fnty, ".dtor.list.{}".format(self.dtype)
+            mod, fnty, "numba_cuda_dtor_list_{}".format(self.dtype)


NVVM has special rules about identifier names, and apparently the name of a global value can't start with a period (or contain one anywhere).

numba.cuda.cudadrv.error.NvvmError: Failed to verify error: Error: : Global Value `.dtor.list.float64': Invalid identifier name: .dtor.list.float64 Must match [a-zA-Z$_][a-zA-Z$_0-9]*
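For illustration, the naming rule quoted in the error can be checked up front. This is a minimal sketch, not code from the PR: the regular expression is copied from the error message, while `is_valid_nvvm_identifier` and `dtor_name` are hypothetical helper names invented here.

```python
import re

# Pattern copied verbatim from the NvvmError message above.
_NVVM_IDENT = re.compile(r"[a-zA-Z$_][a-zA-Z$_0-9]*")


def is_valid_nvvm_identifier(name: str) -> bool:
    """Return True if the whole name matches the pattern NVVM reports."""
    return _NVVM_IDENT.fullmatch(name) is not None


def dtor_name(dtype: str) -> str:
    """Hypothetical helper mirroring the new format string in the diff."""
    return "numba_cuda_dtor_list_{}".format(dtype)


print(is_valid_nvvm_identifier(".dtor.list.float64"))  # False
print(is_valid_nvvm_identifier(dtor_name("float64")))  # True
```

The old name fails on its first character, which is why replacing the periods with underscores and adding a prefix is enough for the `float64` case shown in the error.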

numba_cuda/numba/cuda/target.py

copy-pr-bot · 2025-10-15T14:53:10Z

Auto-sync is disabled for ready for review pull requests in this repository. Workflows must be run manually.


brandon-b-miller · 2025-10-15T14:53:19Z

/ok to test

brandon-b-miller · 2025-10-15T18:00:34Z

/ok to test

brandon-b-miller · 2025-10-15T22:24:50Z

/ok to test

brandon-b-miller · 2025-10-16T13:30:54Z

/ok to test

numba_cuda/numba/cuda/tests/test_array_reductions.py

cpcloud

LGTM, no comments are blocking.

numba_cuda/numba/cuda/cpython/listobj.py

numba_cuda/numba/cuda/memory_management/nrt.cu

Co-authored-by: Phillip Cloud <417981+cpcloud@users.noreply.github.com>

brandon-b-miller · 2025-10-22T18:41:20Z

/ok to test

gmarkall · 2025-10-27T17:19:08Z

numba_cuda/numba/cuda/tests/test_array_reductions.py

This should probably be in cudapy. I just realised from this, that we have a bunch of other uncategorized tests (test_byteflow, etc.) that have accidentally crept in too.

Yeah, I noticed that too. Should everything frontend-related be tested under cudapy then? Happy to move the rest of them before/after this PR.

Yeah, I think cudapy makes sense for all frontend-related stuff.

gmarkall

Accidentally started a review when trying to leave a comment - please ignore this.

numba_cuda/numba/cuda/tests/test_array_reductions.py

brandon-b-miller · 2025-10-27T21:16:05Z

/ok to test

brandon-b-miller · 2025-10-27T22:08:39Z

/ok to test

gmarkall · 2025-10-28T10:43:53Z

numba_cuda/numba/cuda/device_init.py


-from numba.cuda.misc.special import literal_unroll
-from numba.cuda.misc import literal
+# from numba.cuda.misc.special import literal_unroll


The purpose of importing this is to provide numba.cuda.literal_unroll() as a public API, so if we're going to use / support literal_unroll() (which I think is a good thing, and we should do it), we need this import.

gmarkall · 2025-10-28T10:46:35Z

numba_cuda/numba/cuda/device_init.py

-from numba.cuda.misc.special import literal_unroll
-from numba.cuda.misc import literal
+# from numba.cuda.misc.special import literal_unroll
+# from numba.cuda.misc import literal


I added this import so that the overload for literal_unroll() would get registered. Without it, the implementation won't be found. It doesn't necessarily need to be imported here (and if we want to avoid exposing it, we could do del literal afterwards) but it needs to get imported somehow when everything is initialized / imported.

This is easy to solve in part: I believe the imports themselves can be moved into the target context's load_additional_registries function, which delays the import until a compilation is invoked. However, making the API publicly available as from numba.cuda import literal violates the test that checks that we don't register lowerings upon import.
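As a rough picture of the deferred-registration idea, here is a toy model. It is an assumption-laden sketch: `ToyContext` and `register_literal` are hypothetical stand-ins written for this example and are not numba-cuda classes. Only the method name `load_additional_registries` is taken from the comment above.

```python
# Toy model only: none of these classes exist in numba-cuda.
class ToyContext:
    def __init__(self, loaders):
        self._loaders = list(loaders)
        self._loaded = False
        self.lowerings = {}

    def load_additional_registries(self):
        # Run every deferred loader; called once, on first compile.
        for loader in self._loaders:
            loader(self.lowerings)

    def compile(self, name):
        if not self._loaded:
            self.load_additional_registries()
            self._loaded = True
        return self.lowerings[name]


def register_literal(lowerings):
    # Stands in for importing a module whose import registers an overload.
    lowerings["literal_unroll"] = lambda container: container


ctx = ToyContext([register_literal])
print(len(ctx.lowerings))             # 0
impl = ctx.compile("literal_unroll")
print(sorted(ctx.lowerings))          # ['literal_unroll']
```

Nothing is registered when the context is constructed. The registration only happens on the first `compile` call, which is the behaviour an import-time check would want to see.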

To me there are two solutions: either refactor the literal module so that it registers its lowering lazily (and remove it from the banlist), or implement a PEP 562-style module-level __getattr__ that delays the actual import of literal until a user attempts to access it (or the context has already imported it).
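The second option could look roughly like the sketch below. It is an illustration under stated assumptions rather than the code in the PR: `fakepkg` is a throwaway module object standing in for a real package, and the standard-library `json` module stands in for the submodule whose import should be delayed.

```python
import importlib
import types

# Hypothetical stand-in for a package; built in memory for the example.
pkg = types.ModuleType("fakepkg")

# Attribute name -> module imported on first access.
# "json" is only a placeholder for an import we want to delay.
_LAZY = {"literal": "json"}


def _lazy_getattr(name):
    # PEP 562: called only when normal attribute lookup on the module fails.
    target = _LAZY.get(name)
    if target is None:
        raise AttributeError(f"module 'fakepkg' has no attribute {name!r}")
    module = importlib.import_module(target)
    setattr(pkg, name, module)  # cache, so later lookups bypass __getattr__
    return module


pkg.__getattr__ = _lazy_getattr

print("literal" in vars(pkg))   # False
print(pkg.literal.__name__)     # json
print("literal" in vars(pkg))   # True
```

In a real package the function would simply be defined as `__getattr__` at module level in `__init__.py`. The attribute is absent until first access, so merely importing the package does not trigger the delayed import.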

Something to get us off the ground at f210837

The import of

from numba.cuda.misc.special import literal_unroll

needs to be here so that we provide the API for public use.

I think we can remove numba.cuda.misc from the banlist, as it doesn't overload anything that affects other targets - I would suggest doing that for this PR.

There are other CUDA modules (e.g. numba.cuda.cg) that register overloads for the CUDA target on import, so I don't think there's a problem with doing it. The banlist in that test has grown as we've vendored code, but it's protecting against the risk of us polluting the CPU target, so a lot of it is unnecessary.

numba_cuda/numba/cuda/typing/context.py

…egistered a few things from npydecl to support NumPy dtypes in kernels and a small set of ufuncs. Now that the real npydecl is registered, these items should not also be registered in cudadecl, because this leads to an erroneous double-registration. Co-authored-by: Graham Markall <gmarkall@nvidia.com>

brandon-b-miller · 2025-10-28T13:20:22Z

/ok to test

numba_cuda/numba/cuda/ufuncs.py

brandon-b-miller · 2025-10-28T15:19:11Z

/ok to test

brandon-b-miller · 2025-10-28T15:33:26Z

/ok to test

Follow up for #523 (comment) This moves some test files to the cudapy namespace and merges `tests/test_extending` and `cudapy/test_extending.py`.

- Add support for cache-hinted load and store operations (NVIDIA#587) - Add more thirdparty tests (NVIDIA#586) - Add sphinx-lint to pre-commit and fix errors (NVIDIA#597) - Add DWARF variant part support for polymorphic variables in CUDA debug info (NVIDIA#544) - chore: clean up dead workaround for unavailable `lru_cache` (NVIDIA#598) - chore(docs): format types docs (NVIDIA#596) - refactor: decouple `Context` from `Stream` and `Event` objects (NVIDIA#579) - Fix freezing in of constant arrays with negative strides (NVIDIA#589) - Update tests to accept variants of generated PTX (NVIDIA#585) - refactor: replace device functionality with `cuda.core` APIs (NVIDIA#581) - Move frontend tests to `cudapy` namespace (NVIDIA#558) - Generalize the concurrency group for main merges (NVIDIA#582) - ci: move pre-commit checks to pre commit action (NVIDIA#577) - chore(pixi): set up doc builds; remove most `build-conda` dependencies (NVIDIA#574) - ci: ensure that python version in ci matches matrix (NVIDIA#575) - Fix the `cuda.is_supported_version()` API (NVIDIA#571) - Fix checks on main (NVIDIA#576) - feat: add `math.nextafter` (NVIDIA#543) - ci: replace conda testing with pixi (NVIDIA#554) - [CI] Run PR workflow on merge to main (NVIDIA#572) - Propose Alternative Module Path for `ext_types` and Maintain `numba.cuda.types.bfloat16` Import API (NVIDIA#569) - test: enable fail-on-warn and clean up resulting failures (NVIDIA#529) - [Refactor][NFC] Vendor-in compiler_lock for future CUDA-specific changes (NVIDIA#565) - Fix registration with Numba, vendor MakeFunctionToJITFunction tests (NVIDIA#566) - [Refactor][NFC][Cleanups] Update imports to upstream numba to use the numba.cuda modules (NVIDIA#561) - test: refactor process-based tests to use concurrent futures in order to simplify tests (NVIDIA#550) - test: revert back to ipc futures that await each iteration (NVIDIA#564) - chore(deps): move to self-contained pixi.toml to avoid mixed-pypi-pixi environments (NVIDIA#551) - 
[Refactor][NFC] Vendor-in errors for future CUDA-specific changes (NVIDIA#534) - Remove dependencies on target_extension for CUDA target (NVIDIA#555) - Relax the pinning to `cuda-core` to allow it floating across minor releases (NVIDIA#559) - [WIP] Port numpy reduction tests to CUDA (NVIDIA#523) - ci: add timeout to avoid blocking the job queue (NVIDIA#556) - Handle `cuda.core.Stream` in driver operations (NVIDIA#401) - feat: add support for `math.exp2` (NVIDIA#541) - Vendor in types and datamodel for CUDA-specific changes (NVIDIA#533) - refactor: cleanup device constructor (NVIDIA#548) - bench: add cupy to array constructor kernel launch benchmarks (NVIDIA#547) - perf: cache dimension computations (NVIDIA#542) - perf: remove duplicated size computation (NVIDIA#537) - chore(perf): add torch to benchmark (NVIDIA#539) - test: speed up ipc tests by ~6.5x (NVIDIA#527) - perf: speed up kernel launch (NVIDIA#510) - perf: remove context threading in various pointer abstractions (NVIDIA#536) - perf: reduce the number of `__cuda_array_interface__` accesses (NVIDIA#538) - refactor: remove unnecessary custom map and set implementations (NVIDIA#530) - [Refactor][NFC] Vendor-in vectorize decorators for future CUDA-specific changes (NVIDIA#513) - test: add benchmarks for kernel launch for reproducibility (NVIDIA#528) - test(pixi): update pixi testing command to work with the new `testing` directory (NVIDIA#522) - refactor: fully remove `USE_NV_BINDING` (NVIDIA#525) - Draft: Vendor in the IR module (NVIDIA#439) - pyproject.toml: add search path for Pyrefly (NVIDIA#524) - Vendor in numba.core.typing for CUDA-specific changes (NVIDIA#473) - Use numba.config when available, otherwise use numba.cuda.config (NVIDIA#497) - [MNT] Drop NUMBA_CUDA_USE_NVIDIA_BINDING; always use cuda.core and cuda.bindings as fallback (NVIDIA#479) - Vendor in dispatcher, entrypoints, pretty_annotate for CUDA-specific changes (NVIDIA#502) - build: allow parallelization of nvcc testing builds (NVIDIA#521) - 
chore(dev-deps): add pixi (NVIDIA#505) - Vendor the imputils module for CUDA refactoring (NVIDIA#448) - Don't use `MemoryLeakMixin` for tests that don't use NRT (NVIDIA#519) - Switch back to stable cuDF release in thirdparty tests (NVIDIA#518) - Updating .gitignore with binaries in the `testing` folder (NVIDIA#516) - Remove some unnecessary uses of ContextResettingTestCase (NVIDIA#507) - Vendor in _helperlib cext for CUDA-specific changes (NVIDIA#512) - Vendor in typeconv for future CUDA-specific changes (NVIDIA#499) - [Refactor][NFC] Vendor-in numba.cpython modules for future CUDA-specific changes (NVIDIA#493) - [Refactor][NFC] Vendor-in numba.np modules for future CUDA-specific changes (NVIDIA#494) - Make the CUDA target the default for CUDA overload decorators (NVIDIA#511) - Remove C extension loading hacks (NVIDIA#506) - Ensure NUMBA can manipulate memory from CUDA graphs before the graph is launched (NVIDIA#437) - [Refactor][NFC] Vendor-in core Numba analysis utils for CUDA-specific changes (NVIDIA#433) - Fix Bf16 Test OB Error (NVIDIA#509) - Vendor in components from numba.core.runtime for CUDA-specific changes (NVIDIA#498) - [Refactor] Vendor in _dispatcher, _devicearray, mviewbuf C extension for CUDA-specific customization (NVIDIA#373) - [MNT] Managed UM memset fallback and skip CUDA IPC tests on WSL2 (NVIDIA#488) - Improve debug value range coverage (NVIDIA#461) - Add `compile_all` API (NVIDIA#484) - Vendor in core.registry for CUDA-specific changes (NVIDIA#485) - [Refactor][NFC] Vendor in numba.misc for CUDA-specific changes (NVIDIA#457) - Vendor in optional, boxing for CUDA-specific changes, fix dangling imports (NVIDIA#476) - [test] Remove dependency on cpu_target (NVIDIA#490) - Change dangling imports of numba.core.lowering to numba.cuda.lowering (NVIDIA#475) - [test] Use numpy's tolerance for float16 (NVIDIA#491) - [Refactor][NFC] Vendor-in numba.extending for future CUDA-specific changes (NVIDIA#466) - [Refactor][NFC] Vendor-in more cpython 
registries for future CUDA-specific changes (NVIDIA#478)


basic test, bits and pieces of nrt needed

beeadd1

brandon-b-miller commented Oct 14, 2025

numba_cuda/numba/cuda/target.py

brandon-b-miller added 3 commits October 14, 2025 13:30

some progress

634e976

Refactor

9e12144

more reductions

c4d4abe

gmarkall added the "2 - In Progress" label Oct 15, 2025

fix ufuncs

06226f2

brandon-b-miller marked this pull request as ready for review October 15, 2025 14:53

brandon-b-miller added 2 commits October 15, 2025 10:51

Merge branch 'main' into vendor-test-array-reductions

2f48967

enable nrt

6611805

fixes

7bc54a0

brandon-b-miller added 2 commits October 16, 2025 06:07

pass

8f87a44

faster?

9969223

brandon-b-miller commented Oct 16, 2025

numba_cuda/numba/cuda/tests/test_array_reductions.py

brandon-b-miller added the "3 - Ready for Review" label and removed the "2 - In Progress" label Oct 16, 2025

Merge branch 'main' into vendor-test-array-reductions

3ad3e1f

cpcloud approved these changes Oct 21, 2025

numba_cuda/numba/cuda/cpython/listobj.py (outdated)

numba_cuda/numba/cuda/memory_management/nrt.cu (outdated)

numba_cuda/numba/cuda/memory_management/nrt.cu

brandon-b-miller and others added 3 commits October 22, 2025 11:08

Merge branch 'main' into vendor-test-array-reductions

ddb0062

Apply suggestions from code review

398489b

Co-authored-by: Phillip Cloud <417981+cpcloud@users.noreply.github.com>

export

b58d757

gmarkall self-requested a review October 27, 2025 16:07

gmarkall reviewed Oct 27, 2025

numba_cuda/numba/cuda/tests/test_array_reductions.py

brandon-b-miller added 3 commits October 27, 2025 13:51

Merge branch 'main' into vendor-test-array-reductions

6ae9712

patch

f24b3e9

refactor

1de3332

brandon-b-miller requested a review from gmarkall October 27, 2025 21:16

dont import numba.cuda.misc in device_init

c896826

gmarkall reviewed Oct 28, 2025

numba_cuda/numba/cuda/typing/context.py (outdated)

gmarkall reviewed Oct 28, 2025

numba_cuda/numba/cuda/ufuncs.py

brandon-b-miller added 2 commits October 28, 2025 08:09

use truediv int impl that casts to float

3aa7174

delay import of literal

f210837

just remove literal from banlist instead

36f30ef

gmarkall approved these changes Oct 28, 2025


brandon-b-miller merged commit d71c033 into NVIDIA:main Oct 28, 2025
70 checks passed

brandon-b-miller deleted the vendor-test-array-reductions branch October 28, 2025 16:31

This was referenced Oct 28, 2025

Move frontend tests to cudapy namespace #558

Merged

Vendor numpy median, percentile, and quantile reduction tests #567

Open

brandon-b-miller added a commit that referenced this pull request Nov 7, 2025

Move frontend tests to cudapy namespace (#558)

d193a64

Follow up for #523 (comment) This moves some test files to the cudapy namespace and merges `tests/test_extending` and `cudapy/test_extending.py`.

gmarkall mentioned this pull request Nov 20, 2025

Bump version to 0.21.0 #602

Merged

3 participants
