
Add support for cache-hinted load and store operations#51

Closed
gmarkall wants to merge 5 commits into NVIDIA:main from gmarkall:cache-hints

Conversation

@gmarkall gmarkall commented Oct 3, 2024

To-do:

  • Add documentation.
  • Test cases for erroneous arguments. It would be good to check for accidental use on shared or local arrays, but this may not be easy to do.
  • Add additional validations as described in comments in ld_cache_operator and st_cache_operator.
  • Refactor the implementation - the load and store implementations contain a lot of common code.
  • Decide on whether to support complex, and work out why it presently doesn't work.
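
For context, the sort of usage these functions aim at looks roughly like the sketch below. This is hypothetical: it assumes the load function is exposed as cuda.ldca (only the ldca name itself appears in the review below; the final API surface was never settled here).

from numba import cuda

@cuda.jit
def copy_with_hint(src, dst):
    i = cuda.grid(1)
    if i < src.size:
        # Hypothetical cache-hinted load: the PTX .ca operator caches the
        # loaded value at all levels, for data likely to be accessed again.
        dst[i] = cuda.ldca(src, i)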

@gmarkall gmarkall added the 2 - In Progress Currently a work in progress label Oct 3, 2024
@gmarkall gmarkall added this to the v0.0.19 milestone Oct 21, 2024
gmarkall (Contributor Author) commented:

A note: these also need to work for CPointer() types as well as arrays.

@gmarkall gmarkall modified the milestones: v0.0.20, v0.0.21, v0.0.22 Dec 4, 2024
@gmarkall gmarkall modified the milestones: v0.3.0, v0.4.0 Jan 2, 2025
@gmarkall gmarkall modified the milestones: v0.4.0, v0.5.0 Jan 27, 2025
@rparolin rparolin removed this from the v0.21.0 milestone Oct 9, 2025
gmarkall commented Nov 7, 2025

@kaeun97 I started this a while back but never managed to address all the to-do items. Essentially, the PR was aimed at adding new functions for cache-hinted load and store operations, but I didn't get the implementation and error handling to be cleaner than their current "prototype"-like form. It might be interesting as another example of adding new code generation to Numba-CUDA.

What do you think of the PR? Would it be of any interest to you to take it on and complete the to-do list items?

kaeun97 commented Nov 7, 2025

@gmarkall Happy to continue the work :) Thank you!

gmarkall commented Nov 7, 2025

@kaeun97 Thanks! Unfortunately I don't think I have a way to give you permissions to push to this branch / PR, but please feel free to open a new PR for the continuation; I'll close this one at that point.

@cpcloud cpcloud left a comment

Looks good. TIL about these instructions.

msg = f"Expected {array.ndim} indices, got {index.count}"
raise NumbaTypeError(msg)

if all([isinstance(t, types.Integer) for t in index.dtype]):
cpcloud (Contributor):

Suggested change
- if all([isinstance(t, types.Integer) for t in index.dtype]):
+ if all(isinstance(t, types.Integer) for t in index.dtype):

No reason to create a list if you don't need to.
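
(A side note: the generator form also lets all() short-circuit at the first non-Integer element, whereas the list comprehension evaluates every element before all() runs.)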

kaeun97 (Contributor):

Will apply this change in the follow-up PR :)

Comment on lines +183 to +185
def impl(array, i):
    return ldca_intrinsic(array, i)
return impl
cpcloud (Contributor):

Can't this just be:

Suggested change
- def impl(array, i):
-     return ldca_intrinsic(array, i)
- return impl
+ return lcda_intrinsic

for each of these?

gmarkall (Contributor Author):

An overload should return a Python function that gets jit-compiled to become the implementation of the function it's overloading. ldca_intrinsic is an intrinsic, called from within the compiled function.

I suspect it will not work to return an intrinsic as an overload implementation, but even if it did, it would feel jarring to me to collapse a level of abstraction in the implementation here.

cpcloud (Contributor):

It does seem really weird to my eye that a function foo(*args) whose only line is to call bar(*args) cannot simply be replaced with the call to bar(*args).

This feels like a Numba-ism that perhaps violates basic substitution rules, but this is of course not a blocking comment.

gmarkall (Contributor Author):

I think the confusion here comes from reading the code as if it's going to be interpreted by the Python interpreter, rather than seeing it as a form of metaprogramming, which is what functions decorated with @overload implement.

An attempt to summarise the pertinent points:

  • An @overload function returns a Python function that Numba-CUDA compiles.
  • An @intrinsic function (like ldca_intrinsic, generated by ld_cache_operator() above) is a function that returns a tuple of (signature, codegen), where:
    • signature is the typing signature that Numba-CUDA uses during type inference to determine and validate the function's argument and return types, and
    • codegen is a function that Numba-CUDA calls to generate the LLVM IR for the implementation.
  • During the compilation process (when impl() is being compiled), the typing and lowering for intrinsics are resolved, and the implementation of the intrinsic generated by the codegen() function is inserted into the generated code.
  • In the compilation process for impl(), type inference and code generation treat ldca_intrinsic() as a function that accepts an array and an index as arguments and returns a scalar value. The typing is defined by the signature on line 96, and the code generation function follows below it.
  • Therefore, if you replace return impl with return ldca_intrinsic in ol_ldca(), you have replaced a function that accepts an array and an index then returns a scalar (impl(array, i) -> array.dtype) with one that accepts a typing context and the argument types, and returns a signature and a code generation function (ldca_intrinsic(typingctx, array, index) -> (Signature, codegen)). A minimal sketch of the two layers follows below.
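
To make this concrete, here is a minimal self-contained sketch of the same two-layer pattern using a toy intrinsic. The names double_it and double_intrinsic are illustrative only, not the PR's code; the PR's intrinsics have the same shape but emit the cache-hinted loads in their codegen (shown here for the CPU target for brevity).

from numba import njit, types
from numba.extending import intrinsic, overload

def double_it(x):
    # Pure-Python placeholder that the @overload below targets
    # (a hypothetical toy function, not part of the PR).
    return x * 2

@intrinsic
def double_intrinsic(typingctx, x):
    # Typing phase: this body is called with Numba *types* and returns a
    # (signature, codegen) pair.
    if not isinstance(x, types.Integer):
        return None  # no match: typing fails for non-integer arguments
    sig = x(x)  # same integer type in and out

    def codegen(context, builder, sig, args):
        # Lowering phase: called with LLVM IR values; emits IR for x + x.
        (val,) = args
        return builder.add(val, val)

    return sig, codegen

@overload(double_it)
def ol_double_it(x):
    # Must return a Python function for Numba to compile; the intrinsic
    # call inside it is resolved to the codegen above during compilation.
    # Returning double_intrinsic here instead raises the AssertionError
    # quoted further down.
    def impl(x):
        return double_intrinsic(x)
    return impl

@njit
def use_it(x):
    return double_it(x)

print(use_it(21))  # prints 42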

I hope this clarifies things a bit, but for a more complete understanding I can't see a shortcut that avoids working through the low-level and high-level extension API documentation for Numba.

That said, I do have a couple of worked examples that show the whole flow in one notebook for each of these APIs for the CUDA target, which may also help.

They may be a little out of date and need updating in a couple of places, but the general flow of them is still relevant.

We're using the High-level API here (that @overload and @intrinsic are part of), but it's probably hard to understand the High-level API without first understanding the Low-level API. The High-level API is intended to make it quicker and easier to write Numba extensions, but in my view the main thing it provides is some shorthand for a lot of Low-level API work.

Finally: what happens if you implement the suggested change (modulo the typo in the name of the function above)? You will get:

AssertionError: Implementation function returned by `@overload` has an unexpected type.  Got <intrinsic impl>

cc @kaeun97 as the explanation might be helpful in understanding the PR as a whole.

gmarkall (Contributor Author):

Closing this as #587 supersedes it.

@gmarkall gmarkall closed this Nov 11, 2025