Skip to content
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
25 changes: 25 additions & 0 deletions numba_cuda/numba/cuda/api_util.py
Original file line number Diff line number Diff line change
@@ -1,3 +1,5 @@
from numba import types
from numba.core import cgutils
import numpy as np


Expand Down Expand Up @@ -28,3 +30,26 @@ def _fill_stride_by_order(shape, dtype, order):
else:
raise ValueError('must be either C/F order')
return tuple(strides)


def normalize_indices(context, builder, indty, inds, aryty, valty):
    """Normalize an index expression for array element access.

    A scalar integer index is promoted to a one-element UniTuple; a tuple
    index is unpacked and each element cast to intp.

    Returns the (possibly promoted) index type and the list of index
    values.  Raises TypeError when the value type does not match the
    array dtype, or when the index arity does not match the array rank.
    """
    if indty in types.integer_domain:
        # A lone integer index becomes a 1-tuple so the rest of the
        # function can treat every index uniformly.
        indty = types.UniTuple(dtype=indty, count=1)
        indices = [inds]
    else:
        unpacked = cgutils.unpack_tuple(builder, inds, count=len(indty))
        indices = [
            context.cast(builder, val, ty, types.intp)
            for ty, val in zip(indty, unpacked)
        ]

    if aryty.dtype != valty:
        raise TypeError("expect %s but got %s" % (aryty.dtype, valty))

    if aryty.ndim != len(indty):
        raise TypeError("indexing %d-D array with %d-D index"
                        % (aryty.ndim, len(indty)))

    return indty, indices
241 changes: 241 additions & 0 deletions numba_cuda/numba/cuda/cache_hints.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,241 @@
from llvmlite import ir
from numba import types
from numba.core import cgutils
from numba.core.extending import intrinsic, overload
from numba.core.errors import NumbaTypeError
from numba.cuda.api_util import normalize_indices

# Docs references:
# https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#data-movement-and-conversion-instructions-ld
# https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#load-functions-using-cache-hints


# The following functions are stubs: they are never executed as Python.
# Their CUDA implementations are supplied by the @overload registrations
# further down this module, which forward to the generated intrinsics.
def ldca(array, i):
    """Generate a `ld.global.ca` instruction for element `i` of an array."""


def ldcg(array, i):
    """Generate a `ld.global.cg` instruction for element `i` of an array."""


def ldcs(array, i):
    """Generate a `ld.global.cs` instruction for element `i` of an array."""


def ldlu(array, i):
    """Generate a `ld.global.lu` instruction for element `i` of an array."""


def ldcv(array, i):
    """Generate a `ld.global.cv` instruction for element `i` of an array."""


def stcg(array, i, value):
    """Generate a `st.global.cg` instruction for element `i` of an array."""


def stcs(array, i, value):
    """Generate a `st.global.cs` instruction for element `i` of an array."""


def stwb(array, i, value):
    """Generate a `st.global.wb` instruction for element `i` of an array."""


def stwt(array, i, value):
    """Generate a `st.global.wt` instruction for element `i` of an array."""


# Maps an operand bitwidth to the inline-assembly register constraint
# letter used for an operand of that width.  See
# https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#restricted-use-of-sub-word-sizes
# for background on the choice of "r" for 8-bit operands - there is
# no constraint for 8-bit operands, but the operand for loads and
# stores is permitted to be greater than 8 bits.
CONSTRAINT_MAP = {
    1: "b",
    8: "r",
    16: "h",
    32: "r",
    64: "l",
    128: "q"
}


def _validate_arguments(instruction, array, index):
    """Typing-time validation for the cache-hint load/store functions.

    Checks that ``array`` is a Numba array type and that ``index`` is
    either a scalar integer (for a 1-D array) or a uniform tuple of
    integers whose length matches the array rank.

    Raises NumbaTypeError on any violation; returns None on success.
    """
    # This block was reconstructed from a scraped diff that had review-UI
    # chrome interleaved with the code; the only code change applied is
    # the reviewer-accepted generator expression inside all().
    if not isinstance(array, types.Array):
        msg = f"{instruction} operates on arrays. Got type {array}"
        raise NumbaTypeError(msg)

    valid_index = False

    if isinstance(index, types.Integer):
        # A scalar index is only valid for a 1-D array.
        if array.ndim != 1:
            msg = f"Expected {array.ndim} indices, got a scalar"
            raise NumbaTypeError(msg)
        valid_index = True

    if isinstance(index, types.UniTuple):
        if index.count != array.ndim:
            msg = f"Expected {array.ndim} indices, got {index.count}"
            raise NumbaTypeError(msg)

        # NOTE(review): index.dtype is the UniTuple *element* type, not a
        # sequence of per-element types - confirm that iterating it is
        # intended; `isinstance(index.dtype, types.Integer)` may be the
        # equivalent (and clearer) check for a uniform tuple.
        if all(isinstance(t, types.Integer) for t in index.dtype):
            valid_index = True

    if not valid_index:
        raise NumbaTypeError(f"{index} is not a valid index")


def ld_cache_operator(operator):
    """Build an intrinsic that emits a ``ld.global.<operator>`` load.

    ``operator`` is a PTX cache operator ("ca", "cg", "cs", "lu", "cv").
    The returned intrinsic takes (array, index) and returns the loaded
    element.
    """
    @intrinsic
    def impl(typingctx, array, index):
        _validate_arguments(f"ld{operator}", array, index)

        # Need to validate bitwidth

        signature = array.dtype(array, index)

        def codegen(context, builder, sig, args):
            aryty, idxty = sig.args
            elem_type = context.get_value_type(aryty.dtype)
            asm_fnty = ir.FunctionType(elem_type, [elem_type.as_pointer()])

            ary, idx = args

            idxty, indices = normalize_indices(context, builder, idxty,
                                               idx, aryty, aryty.dtype)
            ary_struct = context.make_array(aryty)(context, builder,
                                                   value=ary)
            ptr = cgutils.get_item_pointer(context, builder, aryty,
                                           ary_struct, indices,
                                           wraparound=True)

            # Untyped b<N> load; the constraint letter comes from the
            # operand width.
            nbits = aryty.dtype.bitwidth
            inst = f"ld.global.{operator}.b{nbits}"
            constraints = f"={CONSTRAINT_MAP[nbits]},l"
            asm = ir.InlineAsm(asm_fnty, f"{inst} $0, [$1];", constraints)
            return builder.call(asm, [ptr])

        return signature, codegen

    return impl


# Concrete load intrinsics, one per PTX cache operator.
ldca_intrinsic = ld_cache_operator("ca")
ldcg_intrinsic = ld_cache_operator("cg")
ldcs_intrinsic = ld_cache_operator("cs")
ldlu_intrinsic = ld_cache_operator("lu")
ldcv_intrinsic = ld_cache_operator("cv")


def st_cache_operator(operator):
    """Build an intrinsic that emits a ``st.global.<operator>`` store.

    ``operator`` is a PTX cache operator ("cg", "cs", "wb", "wt").  The
    returned intrinsic takes (array, index, value) and returns nothing;
    the value is cast to the array dtype before being stored.
    """
    @intrinsic
    def impl(typingctx, array, index, value):
        _validate_arguments(f"st{operator}", array, index)

        # Need to validate bitwidth

        signature = types.void(array, index, value)

        def codegen(context, builder, sig, args):
            aryty, idxty, valty = sig.args
            elem_type = context.get_value_type(aryty.dtype)
            asm_fnty = ir.FunctionType(ir.VoidType(),
                                       [elem_type.as_pointer(), elem_type])

            ary, idx, val = args

            idxty, indices = normalize_indices(context, builder, idxty,
                                               idx, aryty, aryty.dtype)
            ary_struct = context.make_array(aryty)(context, builder,
                                                   value=ary)
            ptr = cgutils.get_item_pointer(context, builder, aryty,
                                           ary_struct, indices,
                                           wraparound=True)

            stored = context.cast(builder, val, valty, aryty.dtype)

            # Untyped b<N> store; the memory clobber tells LLVM the asm
            # writes memory.
            nbits = aryty.dtype.bitwidth
            inst = f"st.global.{operator}.b{nbits}"
            constraints = f"l,{CONSTRAINT_MAP[nbits]},~{{memory}}"
            asm = ir.InlineAsm(asm_fnty, f"{inst} [$0], $1;", constraints)
            builder.call(asm, [ptr, stored])

        return signature, codegen

    return impl


# Concrete store intrinsics, one per PTX cache operator.
stcg_intrinsic = st_cache_operator("cg")
stcs_intrinsic = st_cache_operator("cs")
stwb_intrinsic = st_cache_operator("wb")
stwt_intrinsic = st_cache_operator("wt")


@overload(ldca, target='cuda')
def ol_ldca(array, i):
    # CUDA-target overload: the returned function is jit-compiled and
    # forwards to the corresponding intrinsic.
    def ldca_impl(array, i):
        return ldca_intrinsic(array, i)
    return ldca_impl
Comment on lines +183 to +185
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't this just be:

Suggested change
def impl(array, i):
return ldca_intrinsic(array, i)
return impl
return lcda_intrinsic

for each of these?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An overload should return a Python function that gets jit-compiled to become the implementation of the function it's overloading. ldca_intrinsic is an intrinsic, called from within the compiled function.

I suspect it will not work to return an intrinsic as an overload implementation, but even if it did, it would feel jarring to me to contract a level of abstraction in the implementation here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does seem really weird to my eye that a function foo(*args) whose only line is to call bar(*args) cannot simply be replaced with the call to bar(*args).

This feels like a numbaism perhaps that violates basic substitution rules, but this is of course not a blocking comment.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the confusion here comes from reading the code as if it's going to be interpreted by the Python interpreter, rather than seeing it as a form of metaprogramming, which is what functions decorated with @overload implement.

An attempt to summarise the pertinent points:

  • An @overload function returns a Python function that Numba-CUDA compiles.
  • An @intrinsic function (like ldca_intrinsic, generated by ld_cache_operator() above), is a function that returns a tuple of (signature, codegen), where:
    • signature is the typing signature that Numba-CUDA uses during type inference to determine and validate the function's argument and return types, and
    • codegen is a function that Numba-CUDA calls to generate the LLVM IR for the implementation.
  • During the compilation process (when impl() is being compiled), the typing and lowering for intrinsics are resolved, and the implementation of the intrinsic generated by the codegen() function is inserted into the generated code.
  • In the compilation process for impl(), type inference and code generation implements ldca_intrinsic() as a function that returns a scalar value, and accepts an array and an index as arguments. The typing is defined by the signature on line 96, and the code generation function follows below it.
  • Therefore, if you replace return impl with return ldca_intrinsic in ol_ldca(), you have replaced a function that accepts an array and an index then returns a scalar (impl(array, i) -> array.dtype), with one that accepts a typing context and a number of LLVM IR values, and returns a signature and code generation function (impl(typingctx, array, index) -> (Signature, Function)).

I hope this clarifies things a bit, but for a more complete understanding I can't see a shortcut that avoids working through the low- and high-level extension API documentation for Numba:

That said, I do have a couple of worked examples that show the whole flow in one notebook for each of these APIs for the CUDA target, which may also help:

They may be a little out-of-date and need a couple of bits updating, but the general flow of them is still relevant.

We're using the High-level API here (that @overload and @intrinsic are part of), but it's probably hard to understand the High-level API without first understanding the Low-level API. The High-level API is intended to make it quicker and easier to write Numba extensions, but in my view the main thing it provides is some shorthand for a lot of Low-level API work.

Finally - what happens if you implement the suggested change (modulo the typo in the name of the function above)? You will get:

AssertionError: Implementation function returned by `@overload` has an unexpected type.  Got <intrinsic impl>

cc @kaeun97 as the explanation might be helpful in understanding the PR as a whole.



@overload(ldcg, target='cuda')
def ol_ldcg(array, i):
    # CUDA-target overload forwarding to the ldcg intrinsic.
    def ldcg_impl(array, i):
        return ldcg_intrinsic(array, i)
    return ldcg_impl


@overload(ldcs, target='cuda')
def ol_ldcs(array, i):
    # CUDA-target overload forwarding to the ldcs intrinsic.
    def ldcs_impl(array, i):
        return ldcs_intrinsic(array, i)
    return ldcs_impl


@overload(ldlu, target='cuda')
def ol_ldlu(array, i):
    # CUDA-target overload forwarding to the ldlu intrinsic.
    def ldlu_impl(array, i):
        return ldlu_intrinsic(array, i)
    return ldlu_impl


@overload(ldcv, target='cuda')
def ol_ldcv(array, i):
    # CUDA-target overload forwarding to the ldcv intrinsic.
    def ldcv_impl(array, i):
        return ldcv_intrinsic(array, i)
    return ldcv_impl


@overload(stcg, target='cuda')
def ol_stcg(array, i, value):
    # CUDA-target overload forwarding to the stcg intrinsic.
    def stcg_impl(array, i, value):
        return stcg_intrinsic(array, i, value)
    return stcg_impl


@overload(stcs, target='cuda')
def ol_stcs(array, i, value):
    # CUDA-target overload forwarding to the stcs intrinsic.
    def stcs_impl(array, i, value):
        return stcs_intrinsic(array, i, value)
    return stcs_impl


@overload(stwb, target='cuda')
def ol_stwb(array, i, value):
    # CUDA-target overload forwarding to the stwb intrinsic.
    def stwb_impl(array, i, value):
        return stwb_intrinsic(array, i, value)
    return stwb_impl


@overload(stwt, target='cuda')
def ol_stwt(array, i, value):
    # CUDA-target overload forwarding to the stwt intrinsic.
    def stwt_impl(array, i, value):
        return stwt_intrinsic(array, i, value)
    return stwt_impl
32 changes: 5 additions & 27 deletions numba_cuda/numba/cuda/cudaimpl.py
Original file line number Diff line number Diff line change
Expand Up @@ -13,6 +13,7 @@
from numba.np.npyimpl import register_ufuncs
from .cudadrv import nvvm
from numba import cuda
from numba.cuda.api_util import normalize_indices
from numba.cuda import nvvmutils, stubs, errors
from numba.cuda.types import dim3, CUDADispatcher

Expand Down Expand Up @@ -692,38 +693,15 @@ def impl(context, builder, sig, args):
lower(math.degrees, types.f8)(gen_deg_rad(_rad2deg))


def _normalize_indices(context, builder, indty, inds, aryty, valty):
    """
    Convert integer indices into tuple of intp

    A scalar integer index type is promoted to a one-element UniTuple;
    tuple indices are unpacked and each element cast to intp.  Raises
    TypeError on a value-type/dtype mismatch or when the index arity does
    not match the array rank.  Returns (index type, list of index values).
    """
    if indty in types.integer_domain:
        # Promote a lone integer index to a 1-tuple.
        indty = types.UniTuple(dtype=indty, count=1)
        indices = [inds]
    else:
        indices = cgutils.unpack_tuple(builder, inds, count=len(indty))
        indices = [context.cast(builder, i, t, types.intp)
                   for t, i in zip(indty, indices)]

    dtype = aryty.dtype
    if dtype != valty:
        raise TypeError("expect %s but got %s" % (dtype, valty))

    if aryty.ndim != len(indty):
        raise TypeError("indexing %d-D array with %d-D index" %
                        (aryty.ndim, len(indty)))

    return indty, indices


def _atomic_dispatcher(dispatch_fn):
def imp(context, builder, sig, args):
# The common argument handling code
aryty, indty, valty = sig.args
ary, inds, val = args
dtype = aryty.dtype

indty, indices = _normalize_indices(context, builder, indty, inds,
aryty, valty)
indty, indices = normalize_indices(context, builder, indty, inds,
aryty, valty)

lary = context.make_array(aryty)(context, builder, ary)
ptr = cgutils.get_item_pointer(context, builder, aryty, lary, indices,
Expand Down Expand Up @@ -917,8 +895,8 @@ def ptx_atomic_cas(context, builder, sig, args):
aryty, indty, oldty, valty = sig.args
ary, inds, old, val = args

indty, indices = _normalize_indices(context, builder, indty, inds, aryty,
valty)
indty, indices = normalize_indices(context, builder, indty, inds, aryty,
valty)

lary = context.make_array(aryty)(context, builder, ary)
ptr = cgutils.get_item_pointer(context, builder, aryty, lary, indices,
Expand Down
2 changes: 2 additions & 0 deletions numba_cuda/numba/cuda/device_init.py
Original file line number Diff line number Diff line change
@@ -1,6 +1,8 @@
# Re export
import sys
from numba.cuda import cg
from numba.cuda.cache_hints import (ldca, ldcg, ldcs, ldlu, ldcv, stcg, stcs,
stwb, stwt)
from .stubs import (threadIdx, blockIdx, blockDim, gridDim, laneid, warpsize,
syncwarp, shared, local, const, atomic,
shfl_sync_intrinsic, vote_sync_intrinsic, match_any_sync,
Expand Down
Loading