Conversation
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.
/ok to test 2d3f647
```diff
-@cache_with_key(make_cache_key)
+@cache_with_key(_make_cache_key)
 def _make_merge_sort_cached(
```
Note to reviewers: this approach does introduce a layer of indirection here in the caching.
I do have some ideas for unifying caching across all algorithms which should simplify this, but I'll do that in a subsequent PR.
NaderAlAwar left a comment
Looks good, left a few comments. We should also be careful not to introduce too much overhead to the single-phase API.
python/cuda_cccl/cuda/compute/op.py
Outdated
```python
        self._kind = kind

    def get_cache_key(self) -> Hashable:
        return (self.__class__.__name__, self._kind.name, self._kind.value)
```
Question: prior to this change, we only returned `(op.name, op.value)`. Why do we need to include `self.__class__.__name__`?
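For what it's worth, one reason to include the class name is to keep keys from different wrapper types from colliding when their remaining components happen to match. A minimal standalone sketch (the `PlusOp`/`MulOp` names are hypothetical, not the actual `cuda.compute` classes):

```python
from typing import Hashable

# Hypothetical sketch: two different op wrapper types whose kinds happen
# to share the same (name, value) pair.
class PlusOp:
    def __init__(self, name: str, value: int) -> None:
        self._name, self._value = name, value

    def get_cache_key(self) -> Hashable:
        # Without the class name, this key could collide with MulOp's key.
        return (self.__class__.__name__, self._name, self._value)

class MulOp(PlusOp):
    pass

cache: dict[Hashable, str] = {}
cache[PlusOp("PLUS", 0).get_cache_key()] = "plus kernel"
cache[MulOp("PLUS", 0).get_cache_key()] = "mul kernel"

# With __class__.__name__ in the key, the two entries stay distinct.
print(len(cache))  # 2
```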
python/cuda_cccl/cuda/compute/op.py
Outdated
```python
        self._cachable = CachableFunction(func)

    def get_cache_key(self) -> Hashable:
        return (self.__class__.__name__, self._cachable)
```
Same question as above: why do we need `self.__class__.__name__`?
```python
        self.d_in_items_cccl = cccl.to_cccl_input_iter(d_in_items)
        self.d_out_keys_cccl = cccl.to_cccl_output_iter(d_out_keys)
        self.d_out_items_cccl = cccl.to_cccl_output_iter(d_out_items)
        self.op_adapter = op
```
Question: why do we store `op_adapter` as a member variable? This also applies to other algorithms.
When we do introduce stateful operators, the `op_adapter` is what will hold the state arrays. That being said, let me remove this change from this PR and introduce it (or something else) in the subsequent PR.
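To illustrate the point about state ownership, here is a rough, self-contained sketch (hypothetical names, plain Python lists standing in for device state arrays; not the actual `cuda.compute` design):

```python
# Hypothetical sketch of a stateful operator adapter; names and structure
# are illustrative only.
class OpAdapter:
    def __init__(self, func, state=None):
        self._func = func
        # State "arrays" (here just a dict of Python lists) travel with
        # the adapter rather than with the algorithm object.
        self.state = state or {}

    def __call__(self, *args):
        return self._func(self.state, *args)

# A counting predicate: the adapter, not the algorithm, owns the counter.
def count_and_pass(state, x):
    state["count"][0] += 1
    return x > 0

op = OpAdapter(count_and_pass, state={"count": [0]})
selected = [x for x in [-1, 2, 3] if op(x)]
```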
```python
        self.d_in_cccl = cccl.to_cccl_input_iter(d_in)
        self.d_out_cccl = cccl.to_cccl_output_iter(d_out)
        self.h_init_cccl = cccl.to_cccl_value(h_init)
        self.op = op
```
Important: this is named `op_adapter` in `merge_sort`. We should use consistent names.
I'll use `op` everywhere.
```diff
     d_out: DeviceArrayLike | IteratorBase,
     d_num_selected_out: DeviceArrayLike,
-    cond: Callable,
+    cond: Callable | OpAdapter,  # Raw callable or Operator
```
Question: it is not clear to me why this is annotated differently from the other algorithms. Also, the comment seems unnecessary.
/ok to test 557b6a0
🥳 CI Workflow Results: 🟩 Finished in 1h 28m. Pass: 100%/48 | Total: 11h 31m | Max: 44m 25s
Description
This PR is a refactor in preparation for supporting stateful ops in `cuda.compute`.

The problem(s)

Duplicated logic for handling `OpKind` vs. callables

Currently, user-defined operations (e.g., predicates, transformations) can be provided either as "built-in" ops (`OpKind`) or custom functions (callables). These differ in the way they are (1) cached and (2) "compiled" into LTOIR. To handle these differences, we currently have a bunch of `if-else` statements scattered across the algorithms.
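To make the duplication concrete, the branching looks roughly like the following sketch (the `OpKind`, `compile_builtin`, and `compile_callable` names here are stand-ins, not the real API):

```python
# Illustrative only: the kind of per-algorithm branching this PR removes.
from enum import Enum

class OpKind(Enum):
    PLUS = 0

def compile_builtin(op):
    return f"ltoir:builtin:{op.name}"

def compile_callable(fn):
    return f"ltoir:callable:{fn.__name__}"

def make_cache_key(op):
    # Branch 1: duplicated in every algorithm for caching...
    if isinstance(op, OpKind):
        return (op.name, op.value)
    return (op.__module__, op.__qualname__)

def compile_op(op):
    # Branch 2: ...and duplicated again for compilation.
    if isinstance(op, OpKind):
        return compile_builtin(op)
    return compile_callable(op)
```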
Determining signatures from annotations

If provided, type annotations offer a faster way to determine the return type of a user-defined callable than numba type inference. We take advantage of this in, e.g., `transform`, but it would be nice to do this for all ops. Ideally, we don't want to repeat the logic everywhere.

Solution
This PR solves the above by introducing an (internal) `OpAdapter` type that encapsulates the logic for caching, signature determination, and compiling. Furthermore, it will make adding support for stateful ops much easier.

Checklist
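As a rough illustration of the direction described above, an `OpAdapter`-style wrapper could unify cache keys and the annotation fast path behind one interface. All names and details here are hypothetical, not the actual `cuda.compute` implementation:

```python
# Hypothetical sketch of an OpAdapter-style wrapper; illustrative only.
import inspect
from enum import Enum
from typing import Callable, Hashable, Union

class OpKind(Enum):
    PLUS = 0

class OpAdapter:
    """Wraps either a built-in OpKind or a callable behind one interface."""

    def __init__(self, op: Union[OpKind, Callable]):
        self._op = op

    def get_cache_key(self) -> Hashable:
        # One place for the caching branch, instead of per-algorithm if-else.
        if isinstance(self._op, OpKind):
            return (self.__class__.__name__, self._op.name, self._op.value)
        return (self.__class__.__name__, self._op.__module__, self._op.__qualname__)

    def return_annotation(self):
        # Fast path for signature determination when annotations exist.
        if callable(self._op):
            ann = inspect.signature(self._op).return_annotation
            if ann is not inspect.Signature.empty:
                return ann
        return None  # fall back to (numba) type inference elsewhere

def doubled(x: int) -> int:
    return 2 * x
```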