Conversation

@PointKernel PointKernel (Member) commented Oct 23, 2025

Closes #770

This PR replaces naive for loops with cuda::static_for wherever the loop size is known at compile time, guaranteeing compile-time unrolling and improving runtime performance. It also updates CMake to treat device compiler warnings as errors and silences spurious warnings about variables such as gen being declared but never referenced, which appear to be caused by a compiler bug in CUDA 12.0.
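A minimal sketch of the transformation (illustrative only; the array size, function, and names below are made up, not code from this PR):

#include <cuda/utility>  // cuda::static_for

__device__ void scale_slots(int (&slots)[4], int factor)
{
  // Before: a plain loop whose unrolling is left to the compiler's discretion.
  // for (int i = 0; i < 4; ++i) { slots[i] *= factor; }

  // After: the trip count is a compile-time constant, so cuda::static_for
  // guarantees full unrolling; i is an integral-constant-like object and i()
  // yields the compile-time index.
  cuda::static_for<4>([&](auto i) { slots[i()] *= factor; });
}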

@PointKernel PointKernel added the type: improvement (Improvement / enhancement to an existing function) label Oct 23, 2025

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
include/cuco/detail/open_addressing/open_addressing_ref_impl.cuh (2)

627-648: Fix UB: avoid computing slot_ptr with intra_bucket_index = -1 before ballots (insert_and_find CG).
Compute the pointer only in src_lane and broadcast it as intptr_t. The current code evaluates get_slot_ptr even when state stays UNEQUAL, which is UB.

Apply this patch:

@@
-      auto* slot_ptr = this->get_slot_ptr(*probing_iter, intra_bucket_index);
@@
-      if (group_finds_equal) {
+      if (group_finds_equal) {
         auto const src_lane = __ffs(group_finds_equal) - 1;
-        auto const res      = group.shfl(reinterpret_cast<intptr_t>(slot_ptr), src_lane);
-        if (group.thread_rank() == src_lane) {
-          if constexpr (has_payload) {
-            // wait to ensure that the write to the value part also took place
-            this->wait_for_payload(slot_ptr->second, this->empty_value_sentinel());
-          }
-        }
-        group.sync();
-        return {iterator{reinterpret_cast<value_type*>(res)}, false};
+        intptr_t ptr_val = 0;
+        if (group.thread_rank() == src_lane) {
+          auto* slot_ptr = this->get_slot_ptr(*probing_iter, intra_bucket_index);
+          if constexpr (has_payload) {
+            this->wait_for_payload(slot_ptr->second, this->empty_value_sentinel());
+          }
+          ptr_val = reinterpret_cast<intptr_t>(slot_ptr);
+        }
+        ptr_val = group.shfl(ptr_val, src_lane);
+        return {iterator{reinterpret_cast<value_type*>(ptr_val)}, false};
       }
@@
-      if (group_contains_available) {
+      if (group_contains_available) {
         auto const src_lane = __ffs(group_contains_available) - 1;
-        auto const res      = group.shfl(reinterpret_cast<intptr_t>(slot_ptr), src_lane);
-        auto const status   = [&, target_idx = intra_bucket_index]() {
-          if (group.thread_rank() != src_lane) { return insert_result::CONTINUE; }
-          return this->attempt_insert_stable(slot_ptr, bucket_slots[target_idx], val);
-        }();
+        intptr_t ptr_val = 0;
+        auto const status = [&, target_idx = intra_bucket_index]() {
+          if (group.thread_rank() != src_lane) { return insert_result::CONTINUE; }
+          auto* slot_ptr = this->get_slot_ptr(*probing_iter, intra_bucket_index);
+          ptr_val        = reinterpret_cast<intptr_t>(slot_ptr);
+          return this->attempt_insert_stable(slot_ptr, bucket_slots[target_idx], val);
+        }();
@@
-            group.sync();
-            return {iterator{reinterpret_cast<value_type*>(res)}, true};
+            ptr_val = group.shfl(ptr_val, src_lane);
+            return {iterator{reinterpret_cast<value_type*>(ptr_val)}, true};
@@
-            group.sync();
-            return {iterator{reinterpret_cast<value_type*>(res)}, false};
+            ptr_val = group.shfl(ptr_val, src_lane);
+            return {iterator{reinterpret_cast<value_type*>(ptr_val)}, false};

Also applies to: 644-675


946-952: Fix UB similarly in find(CG) equal path.
Broadcast only from src_lane; compute pointer in that lane.

-        auto const res      = group.shfl(
-          reinterpret_cast<intptr_t>(this->get_slot_ptr(*probing_iter, intra_bucket_index)),
-          src_lane);
-        return iterator{reinterpret_cast<value_type*>(res)};
+        intptr_t ptr_val = 0;
+        if (group.thread_rank() == src_lane) {
+          ptr_val = reinterpret_cast<intptr_t>(
+            this->get_slot_ptr(*probing_iter, intra_bucket_index));
+        }
+        ptr_val = group.shfl(ptr_val, src_lane);
+        return iterator{reinterpret_cast<value_type*>(ptr_val)};
🧹 Nitpick comments (3)
include/cuco/detail/hyperloglog/kernels.cuh (1)

75-78: Remainder handling via cuda::static_for looks correct.
Bounds guard i() < remainder prevents OOB; reverse indexing is fine. Optional: read forward as first + (n - remainder) + i() for readability.

Also applies to: 82-85
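A minimal sketch of the suggested forward-indexed tail handling (the unroll width, process, and variable names are assumptions for illustration, not the actual HLL kernel):

#include <cuda/utility>

template <typename It, typename F>
__device__ void process_tail(It first, int n, F process)
{
  constexpr int unroll = 4;      // assumed unroll width
  int const remainder  = n % unroll;

  cuda::static_for<unroll>([&](auto i) {
    // The i() < remainder guard keeps the fully unrolled body in bounds.
    if (i() < remainder) { process(first[(n - remainder) + i()]); }
  });
}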

include/cuco/detail/bloom_filter/bloom_filter_impl.cuh (2)

208-213: Nit: redundant j() < ... guards.
cuda::static_for iterates over [0, N), so the j() < num_threads check (and j() < worker_num_threads in the second loop) is always true. You can drop that part of the predicate to simplify it.

-        cuda::static_for<num_threads>([&](auto j) {
-          if ((j() < num_threads) and (i + j() < num_keys)) {
+        cuda::static_for<num_threads>([&](auto j) {
+          if (i + j() < num_keys) {
             this->add_impl(group, group.shfl(hash_value, j()), group.shfl(block_index, j()));
           }
         });
@@
-        cuda::static_for<worker_num_threads>([&](auto j) {
-          if ((j() < worker_num_threads) and (i + worker_offset + j() < num_keys)) {
+        cuda::static_for<worker_num_threads>([&](auto j) {
+          if (i + worker_offset + j() < num_keys) {
             this->add_impl(worker_group,
                            worker_group.shfl(hash_value, j()),
                            worker_group.shfl(block_index, j()));
           }
         });

Also applies to: 230-237


364-375: Optional: short‑circuit inner loop when mismatch is found.
Guard inner static_for body with if (success) to avoid extra loads once false.

-          cuda::static_for<words_per_thread>([&](auto j) {
-            auto const expected_pattern = policy_.word_pattern(hash_value, thread_offset + j());
-            if ((stored_pattern[j()] & expected_pattern) != expected_pattern) { success = false; }
-          });
+          cuda::static_for<words_per_thread>([&](auto j) {
+            if (success) {
+              auto const expected_pattern =
+                policy_.word_pattern(hash_value, thread_offset + j());
+              if ((stored_pattern[j()] & expected_pattern) != expected_pattern) { success = false; }
+            }
+          });
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e7283ed and fabc460.

📒 Files selected for processing (5)
  • include/cuco/detail/bloom_filter/bloom_filter_impl.cuh (7 hunks)
  • include/cuco/detail/hyperloglog/kernels.cuh (2 hunks)
  • include/cuco/detail/open_addressing/open_addressing_ref_impl.cuh (11 hunks)
  • include/cuco/detail/roaring_bitmap/roaring_bitmap_impl.cuh (1 hunks)
  • include/cuco/detail/static_map/static_map_ref.inl (3 hunks)
🔇 Additional comments (11)
include/cuco/detail/hyperloglog/kernels.cuh (1)

23-23: LGTM: required header for static_for.
#include <cuda/utility> addition is correct and consistent with cuda::static_for usage.

include/cuco/detail/static_map/static_map_ref.inl (1)

27-27: LGTM: header include.
#include <cuda/utility> is required for cuda::static_for and related utilities.

include/cuco/detail/bloom_filter/bloom_filter_impl.cuh (4)

146-154: LGTM: per-word static_for + atomic OR.
Clear, fully unrolled; avoids loop-carried deps and keeps atomics coalesced.


253-260: LGTM: CG-strided static_for path.
Correct per-rank striding over words_per_block; coalesced atomics preserved.


339-345: LGTM: scalar contains with static_for.
Fully unrolled check with short-circuit flag is fine.


37-37: LGTM: header include.
#include <cuda/utility> is necessary for cuda::static_for.

include/cuco/detail/open_addressing/open_addressing_ref_impl.cuh (5)

444-464: LGTM: static_for-based per-bucket scans.
Pattern correctly emulates early-exit via a state accumulator; keeps codegen predictable.

Also applies to: 750-759, 932-942
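Since cuda::static_for always runs all N iterations (there is no break), the early exit has to be emulated; a minimal sketch of the accumulator pattern with made-up names (the real code also handles payloads, CG ballots, etc.):

#include <cuda/utility>

enum class probe_state { CONTINUE, FOUND, EMPTY };

template <int BucketSize, typename Key>
__device__ probe_state scan_bucket(Key const (&bucket)[BucketSize], Key key, Key empty_sentinel)
{
  auto state = probe_state::CONTINUE;
  cuda::static_for<BucketSize>([&](auto i) {
    // Once a decision is recorded, the remaining unrolled iterations are no-ops.
    if (state != probe_state::CONTINUE) { return; }
    if (bucket[i()] == key) { state = probe_state::FOUND; }
    else if (bucket[i()] == empty_sentinel) { state = probe_state::EMPTY; }
  });
  return state;
}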


987-994: LGTM: count() static_for population.
Per-slot result accumulation with arrays is clear; EMPTY short-circuit preserved.

Also applies to: 1030-1034


1338-1395: LGTM: retrieve_impl bucket pass.
Use of static_for for equals/existence and buffered flush is sound; ballots/shfls are gated correctly.


1459-1477: LGTM: non-CG for_each short-circuit.
Guarding with should_return preserves early-exit semantics after EMPTY.


31-31: LGTM: header include.
#include <cuda/utility> required for cuda::static_for.

@NVIDIA NVIDIA deleted a comment from coderabbitai bot Oct 24, 2025 (×6)
Comment on lines +344 to +346
add_impl_functor<HashValue, BlockIndex, policy_type, word_type, thread_scope> functor{
hash_value, block_index, policy_, words_, words_per_block};
cuda::static_for<words_per_block>(functor);
Member Author

Interestingly, in all other cases we can use a lambda with static_for without any issues, but in all Bloom filter implementations we have to resort to functors to avoid compiler errors. It’s unclear what makes the Bloom filter implementation behave differently from other use cases.
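For reference, a simplified sketch of the two shapes being compared (generic names, not the actual add_impl_functor): the same per-word body written once as a lambda and once as a functor whose templated call operator receives the integral-constant index.

#include <cuda/utility>

template <typename Word>
struct or_words_functor {
  Word* words;
  Word pattern;

  template <typename Index>
  __device__ void operator()(Index i) const
  {
    words[i()] |= pattern;  // i() yields the compile-time word index
  }
};

template <int WordsPerBlock, typename Word>
__device__ void or_block(Word* words, Word pattern)
{
  // Lambda form (fine everywhere else in the PR):
  // cuda::static_for<WordsPerBlock>([&](auto i) { words[i()] |= pattern; });

  // Functor form (the workaround used in the Bloom filter paths):
  cuda::static_for<WordsPerBlock>(or_words_functor<Word>{words, pattern});
}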

Collaborator

No clue what's causing this but since the Bloom filter implementation will be replaced soon I'd say the workaround with the functor is fine.

@PointKernel PointKernel added the Needs Review (Awaiting reviews before merging) label Oct 24, 2025
Collaborator

@sleeepyjack sleeepyjack left a comment

Nice work. Can you verify that we're not introducing any regressions with this? It might be that we are now unrolling loops which were previously not unrolled, potentially adding to the register pressure problem. I'd say comparing benchmarks for static_map and static_multimap should be sufficient.

Comment on lines +44 to 45
[[maybe_unused]] key_generator gen{};
gen.generate(dist_from_state<Dist>(state), keys.begin(), keys.end());
Collaborator

How can gen be maybe_unused when it is used immediately in the following line? What am I missing?

Member Author

That’s a compiler bug in CUDA 12.0 where host-side use of host-device APIs isn’t correctly recognized.
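A minimal sketch of the pattern (key_generator here is a hypothetical stand-in for the benchmark utility, not the real one): gen is clearly used on the following line, yet nvcc 12.0 can still emit a "declared but never referenced" warning for it, so [[maybe_unused]] silences the spurious diagnostic without changing behavior.

#include <vector>

struct key_generator {
  template <typename OutputIt>
  __host__ __device__ void generate(OutputIt first, OutputIt last)
  {
    int v = 0;
    for (auto it = first; it != last; ++it) { *it = v++; }
  }
};

void setup_keys(std::vector<int>& keys)
{
  [[maybe_unused]] key_generator gen{};  // [[maybe_unused]] only placates the bogus warning
  gen.generate(keys.begin(), keys.end());
}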

Labels

  • Needs Review (Awaiting reviews before merging)
  • type: improvement (Improvement / enhancement to an existing function)

Development

Successfully merging this pull request may close these issues.

[ENHANCEMENT]: Use static_for to guarantee compile-time loop unrolling
