Conversation

@kaeun97
Contributor

@kaeun97 kaeun97 commented Nov 10, 2025

Follow up of #51 requested by @gmarkall.

  • Add documentation.
  • Test cases for erroneous arguments. It would be good to check for accidental use on shared or local arrays, but this may not be easy to do.
  • Add additional validations as described in comments in ld_cache_operator and st_cache_operator.
  • Add test for bitwidth validation
  • Refactor the implementation - the load and store implementations contain a lot of common code.
  • Decide on whether to support complex, and work out why it presently doesn't work.
  • Support CPointer()
  • Add a test for CPointer() support
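
For context, a minimal sketch of how the cache-hinted operations might be used once this lands. The ldcg/stcg names follow the PTX cache operators this PR is based on; the exact signatures (array plus index, with stores also taking a value) are an assumption for illustration, not the final API:

    from numba import cuda
    import numpy as np

    # Sketch: a copy kernel using cache-hinted loads and stores.
    # Assumes cuda.ldcg(array, i) loads with the .cg (cache-global)
    # hint and cuda.stcg(array, i, value) stores with the matching
    # hint - these signatures are assumptions, not the final API.
    @cuda.jit
    def copy_with_hints(dst, src):
        i = cuda.grid(1)
        if i < src.size:
            val = cuda.ldcg(src, i)
            cuda.stcg(dst, i, val)

    src = np.arange(1024, dtype=np.float32)
    dst = np.zeros_like(src)
    copy_with_hints.forall(src.size)(dst, src)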

Updated items from feedback.

  • Small nits on the documentation
  • Small nits on the tests
  • Add tests with 2D arrays
  • Fix the simulator, either with stub implementations and test skips, or by implementing the operations in the simulator

@copy-pr-bot

copy-pr-bot bot commented Nov 10, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@kaeun97 kaeun97 changed the title chore: move work from gmarkall branch [wip] Add support for cache-hinted load and store operations Nov 10, 2025
@gmarkall
Contributor

There is also some relevant discussion in https://github.com/NVIDIA/numba-cuda/pull/51/files#r2513450704

@gmarkall
Contributor

Decide on whether to support complex, and work out why it presently doesn't work.

I should mention that I think it's fine not to support complex types in this PR; they could be added in a follow-up PR.

@gmarkall gmarkall added the 2 - In Progress Currently a work in progress label Nov 12, 2025
@gmarkall
Contributor

A quick follow-up on a couple of items:

Add documentation.

Do you need any assistance here? I know that with documentation it is often hard to find the right place and wording, so I'm happy to add something here if you want.

Test cases for erroneous arguments. It would be good to check for accidental use on shared or local arrays, but this may not be easy to do.

I see you have test cases for various user errors already. I think checking for accidental use on shared or local arrays is probably hard / impossible without something like #236, so maybe this item can be considered already done?

@gmarkall
Contributor

/ok to test

@copy-pr-bot

copy-pr-bot bot commented Nov 17, 2025

/ok to test

@gmarkall, there was an error processing your request: E1

See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/1/

@gmarkall
Contributor

/ok to test 393554a

@kaeun97
Contributor Author

kaeun97 commented Nov 19, 2025

Thanks for the follow-up, @gmarkall!

I see you have test cases for various user errors already. I think checking for accidental use on shared or local arrays is probably hard / impossible without something like #236, so maybe this item can be considered already done?

I agree. If needed, I can follow up on that later (ideally after #236 is merged).

Do you need any assistance here? I know that adding documentation is often a bit hard to find the right place and wording, so I'm happy to add something here if you want.

I've added a rough draft that needs refinement (in terms of content, placement, etc.). Would you be able to take a quick look? I'd be happy to iterate on it. If needed, we can hop on a call, or feel free to make changes directly on this branch.

@gmarkall gmarkall added 3 - Ready for Review Ready for review by team and removed 2 - In Progress Currently a work in progress labels Nov 19, 2025
@gmarkall gmarkall marked this pull request as ready for review November 19, 2025 12:16
Contributor

@gmarkall gmarkall left a comment


Many thanks for all the efforts so far - this is looking very good!

I have a few comments on the diff - I know that some of the code I've commented on in the tests comes from my original PR, so I apologise for now requesting changes to it in review.

I think the only gap in testing is coverage of arrays with more than one dimension - the _get_element_pointer function looks correct at the moment, but I do think its else branch could do with being exercised by a test.
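
To illustrate, a 2D test exercising that branch might look something like this (a sketch: it assumes the ldcg/stcg names used elsewhere in this thread, and that multi-dimensional elements are addressed with a tuple index):

    import numpy as np
    from numba import cuda

    @cuda.jit
    def copy_2d(dst, src):
        # Tuple indices on a 2D array should route element address
        # computation through the multi-dimensional (else) branch of
        # _get_element_pointer.
        i, j = cuda.grid(2)
        if i < src.shape[0] and j < src.shape[1]:
            cuda.stcg(dst, (i, j), cuda.ldcg(src, (i, j)))

    def test_2d_cache_hints():
        src = np.arange(64, dtype=np.float32).reshape(8, 8)
        dst = np.zeros_like(src)
        copy_2d[(1, 1), (8, 8)](dst, src)
        np.testing.assert_array_equal(dst, src)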

The CI reveals that the simulator also needs some stubs for these additions, so that the test code can be imported on the simulator. These could just look like:

    ldca = None
    ...

somewhere like numba_cuda/numba/cuda/simulator/api.py. The tests could then be skipped on the simulator by adding a @skip_on_cudasim decorator to the test case.
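
Spelled out a little more, the stub-and-skip approach might look like this (a sketch: skip_on_cudasim is the existing helper in numba.cuda.testing, and the set of stubbed names is assumed from the PTX cache operators):

    # numba_cuda/numba/cuda/simulator/api.py: placeholder names so the
    # test module can be imported under the simulator.
    ldca = ldcg = ldcs = ldlu = ldcv = None
    stcg = stcs = stwb = stwt = None

    # In the test module: skip the whole case on the simulator.
    from numba.cuda.testing import CUDATestCase, skip_on_cudasim

    @skip_on_cudasim("Cache-hinted operations are not available on cudasim")
    class TestCacheHints(CUDATestCase):
        ...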

Alternatively, you could implement these operations in the simulator (they needn't simulate the cache behaviour; they only need to behave as they would on the device, ignoring whatever performance difference the cache policy might make), but I would not bother, as the simulator is already quite limited and on the road to being superseded by better GPU availability and better on-device debugging support.
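
For completeness, those simulator implementations would only need pass-through semantics, since the cache hints cannot change results (a sketch, ignoring the cache behaviour entirely as described above):

    # Simulator versions: the cache policy has no observable effect on
    # results, so each operation is a plain load or store.
    def ldca(array, i):
        return array[i]

    def stcg(array, i, value):
        array[i] = value

    # ...and likewise for the remaining load/store variants.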

So to summarise, I think the remaining items to finish up are:

  • Small nits on the documentation
  • Small nits on the tests
  • Add tests with 2D arrays
  • Fix the simulator, either with stub implementations and test skips (likely preferable) or by implementing the operations in the simulator (more adventurous, though not necessarily more interesting - your choice 🙂).

Many thanks again!

@kaeun97
Contributor Author

kaeun97 commented Nov 20, 2025

@gmarkall Thank you for the thorough feedback. I believe I have addressed all the items you mentioned. It would be great if you could revisit the PR!

@kaeun97 kaeun97 requested a review from gmarkall November 20, 2025 01:41
@kaeun97 kaeun97 changed the title [wip] Add support for cache-hinted load and store operations feat: add support for cache-hinted load and store operations Nov 20, 2025
@gmarkall
Contributor

/ok to test

@copy-pr-bot

copy-pr-bot bot commented Nov 20, 2025

/ok to test

@gmarkall, there was an error processing your request: E1

See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/1/

@gmarkall
Contributor

/ok to test fe9e8ac

Contributor

@gmarkall gmarkall left a comment


This is looking great - many thanks!

@gmarkall gmarkall added 4 - Waiting on CI Waiting for a CI run to finish successfully 5 - Ready to merge Testing and reviews complete, ready to merge and removed 3 - Ready for Review Ready for review by team 4 - Waiting on CI Waiting for a CI run to finish successfully labels Nov 20, 2025
@gmarkall
Contributor

I'm holding off on merging this until https://github.com/NVIDIA/numba-cuda/actions/runs/19534738630 completes, as it's a run of a new CI workflow for main - I'd like to make sure we have one complete successful run of that before continuing to merge other PRs.

@gmarkall gmarkall merged commit dd1aff8 into NVIDIA:main Nov 20, 2025
69 checks passed
@gmarkall
Contributor

Many thanks @kaeun97!

gmarkall added a commit to gmarkall/numba-cuda that referenced this pull request Nov 20, 2025
- Add support for cache-hinted load and store operations (NVIDIA#587)
- Add more thirdparty tests (NVIDIA#586)
- Add sphinx-lint to pre-commit and fix errors (NVIDIA#597)
- Add DWARF variant part support for polymorphic variables in CUDA debug info (NVIDIA#544)
- chore: clean up dead workaround for unavailable `lru_cache` (NVIDIA#598)
- chore(docs): format types docs (NVIDIA#596)
- refactor: decouple `Context` from `Stream` and `Event` objects (NVIDIA#579)
- Fix freezing in of constant arrays with negative strides (NVIDIA#589)
- Update tests to accept variants of generated PTX (NVIDIA#585)
- refactor: replace device functionality with `cuda.core` APIs (NVIDIA#581)
- Move frontend tests to `cudapy` namespace (NVIDIA#558)
- Generalize the concurrency group for main merges (NVIDIA#582)
- ci: move pre-commit checks to pre commit action (NVIDIA#577)
- chore(pixi): set up doc builds; remove most `build-conda` dependencies (NVIDIA#574)
- ci: ensure that python version in ci matches matrix (NVIDIA#575)
- Fix the `cuda.is_supported_version()` API (NVIDIA#571)
- Fix checks on main (NVIDIA#576)
- feat: add `math.nextafter` (NVIDIA#543)
- ci: replace conda testing with pixi (NVIDIA#554)
- [CI] Run PR workflow on merge to main (NVIDIA#572)
- Propose Alternative Module Path for `ext_types` and Maintain `numba.cuda.types.bfloat16` Import API (NVIDIA#569)
- test: enable fail-on-warn and clean up resulting failures (NVIDIA#529)
- [Refactor][NFC] Vendor-in compiler_lock for future CUDA-specific changes (NVIDIA#565)
- Fix registration with Numba, vendor MakeFunctionToJITFunction tests (NVIDIA#566)
- [Refactor][NFC][Cleanups] Update imports to upstream numba to use the numba.cuda modules (NVIDIA#561)
- test: refactor process-based tests to use concurrent futures in order to simplify tests (NVIDIA#550)
- test: revert back to ipc futures that await each iteration (NVIDIA#564)
- chore(deps): move to self-contained pixi.toml to avoid mixed-pypi-pixi environments (NVIDIA#551)
- [Refactor][NFC] Vendor-in errors for future CUDA-specific changes (NVIDIA#534)
- Remove dependencies on target_extension for CUDA target (NVIDIA#555)
- Relax the pinning to `cuda-core` to allow it floating across minor releases (NVIDIA#559)
- [WIP] Port numpy reduction tests to CUDA (NVIDIA#523)
- ci: add timeout to avoid blocking the job queue (NVIDIA#556)
- Handle `cuda.core.Stream` in driver operations (NVIDIA#401)
- feat: add support for `math.exp2` (NVIDIA#541)
- Vendor in types and datamodel for CUDA-specific changes (NVIDIA#533)
- refactor: cleanup device constructor (NVIDIA#548)
- bench: add cupy to array constructor kernel launch benchmarks (NVIDIA#547)
- perf: cache dimension computations (NVIDIA#542)
- perf: remove duplicated size computation (NVIDIA#537)
- chore(perf): add torch to benchmark (NVIDIA#539)
- test: speed up ipc tests by ~6.5x (NVIDIA#527)
- perf: speed up kernel launch (NVIDIA#510)
- perf: remove context threading in various pointer abstractions (NVIDIA#536)
- perf: reduce the number of `__cuda_array_interface__` accesses (NVIDIA#538)
- refactor: remove unnecessary custom map and set implementations (NVIDIA#530)
- [Refactor][NFC] Vendor-in vectorize decorators for future CUDA-specific changes (NVIDIA#513)
- test: add benchmarks for kernel launch for reproducibility (NVIDIA#528)
- test(pixi): update pixi testing command to work with the new `testing` directory (NVIDIA#522)
- refactor: fully remove `USE_NV_BINDING` (NVIDIA#525)
- Draft: Vendor in the IR module (NVIDIA#439)
- pyproject.toml: add search path for Pyrefly (NVIDIA#524)
- Vendor in numba.core.typing for CUDA-specific changes (NVIDIA#473)
- Use numba.config when available, otherwise use numba.cuda.config (NVIDIA#497)
- [MNT] Drop NUMBA_CUDA_USE_NVIDIA_BINDING; always use cuda.core and cuda.bindings as fallback (NVIDIA#479)
- Vendor in dispatcher, entrypoints, pretty_annotate for CUDA-specific changes (NVIDIA#502)
- build: allow parallelization of nvcc testing builds (NVIDIA#521)
- chore(dev-deps): add pixi (NVIDIA#505)
- Vendor the imputils module for CUDA refactoring (NVIDIA#448)
- Don't use `MemoryLeakMixin` for tests that don't use NRT (NVIDIA#519)
- Switch back to stable cuDF release in thirdparty tests (NVIDIA#518)
- Updating .gitignore with binaries in the `testing` folder (NVIDIA#516)
- Remove some unnecessary uses of ContextResettingTestCase (NVIDIA#507)
- Vendor in _helperlib cext for CUDA-specific changes (NVIDIA#512)
- Vendor in typeconv for future CUDA-specific changes (NVIDIA#499)
- [Refactor][NFC] Vendor-in numba.cpython modules for future CUDA-specific changes (NVIDIA#493)
- [Refactor][NFC] Vendor-in numba.np modules for future CUDA-specific changes (NVIDIA#494)
- Make the CUDA target the default for CUDA overload decorators (NVIDIA#511)
- Remove C extension loading hacks (NVIDIA#506)
- Ensure NUMBA can manipulate memory from CUDA graphs before the graph is launched (NVIDIA#437)
- [Refactor][NFC] Vendor-in core Numba analysis utils for CUDA-specific changes (NVIDIA#433)
- Fix Bf16 Test OB Error (NVIDIA#509)
- Vendor in components from numba.core.runtime for CUDA-specific changes (NVIDIA#498)
- [Refactor] Vendor in _dispatcher, _devicearray, mviewbuf C extension for CUDA-specific customization (NVIDIA#373)
- [MNT] Managed UM memset fallback and skip CUDA IPC tests on WSL2 (NVIDIA#488)
- Improve debug value range coverage (NVIDIA#461)
- Add `compile_all` API (NVIDIA#484)
- Vendor in core.registry for CUDA-specific changes (NVIDIA#485)
- [Refactor][NFC] Vendor in numba.misc for CUDA-specific changes (NVIDIA#457)
- Vendor in optional, boxing for CUDA-specific changes, fix dangling imports (NVIDIA#476)
- [test] Remove dependency on cpu_target (NVIDIA#490)
- Change dangling imports of numba.core.lowering to numba.cuda.lowering (NVIDIA#475)
- [test] Use numpy's tolerance for float16 (NVIDIA#491)
- [Refactor][NFC] Vendor-in numba.extending for future CUDA-specific changes (NVIDIA#466)
- [Refactor][NFC] Vendor-in more cpython registries for future CUDA-specific changes (NVIDIA#478)
@gmarkall gmarkall mentioned this pull request Nov 20, 2025
gmarkall added a commit that referenced this pull request Nov 20, 2025