Conversation

@PointKernel PointKernel (Member) commented Oct 23, 2025

Closes #770

This PR replaces naive for loops with cuda::static_for wherever the loop size is known at compile time, guaranteeing compile-time unrolling and improving runtime performance. It also updates CMake to treat device compiler warnings as errors and silences spurious warnings about variables such as gen being declared but never referenced, which appear to be caused by a compiler bug in CUDA 12.0.
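A minimal sketch of the transformation (illustrative only; the array size, function, and names below are made up, not code from this PR):

#include <cuda/utility>  // cuda::static_for

__device__ void scale_slots(int (&slots)[4], int factor)
{
  // Before: a plain loop whose unrolling is left to the compiler's discretion.
  // for (int i = 0; i < 4; ++i) { slots[i] *= factor; }

  // After: the trip count is a compile-time constant, so cuda::static_for
  // guarantees full unrolling; i is an integral-constant-like object and i()
  // yields the compile-time index.
  cuda::static_for<4>([&](auto i) { slots[i()] *= factor; });
}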

@PointKernel PointKernel added the type: improvement (Improvement / enhancement to an existing function) label Oct 23, 2025

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
include/cuco/detail/open_addressing/open_addressing_ref_impl.cuh (2)

627-648: Fix UB: avoid computing slot_ptr with intra_bucket_index = -1 before ballots (insert_and_find CG).
Compute the pointer only in src_lane and broadcast it as intptr_t. The current code evaluates get_slot_ptr even when state stays UNEQUAL, which is UB.

Apply this patch:

@@
-      auto* slot_ptr = this->get_slot_ptr(*probing_iter, intra_bucket_index);
@@
-      if (group_finds_equal) {
+      if (group_finds_equal) {
         auto const src_lane = __ffs(group_finds_equal) - 1;
-        auto const res      = group.shfl(reinterpret_cast<intptr_t>(slot_ptr), src_lane);
-        if (group.thread_rank() == src_lane) {
-          if constexpr (has_payload) {
-            // wait to ensure that the write to the value part also took place
-            this->wait_for_payload(slot_ptr->second, this->empty_value_sentinel());
-          }
-        }
-        group.sync();
-        return {iterator{reinterpret_cast<value_type*>(res)}, false};
+        intptr_t ptr_val = 0;
+        if (group.thread_rank() == src_lane) {
+          auto* slot_ptr = this->get_slot_ptr(*probing_iter, intra_bucket_index);
+          if constexpr (has_payload) {
+            this->wait_for_payload(slot_ptr->second, this->empty_value_sentinel());
+          }
+          ptr_val = reinterpret_cast<intptr_t>(slot_ptr);
+        }
+        ptr_val = group.shfl(ptr_val, src_lane);
+        return {iterator{reinterpret_cast<value_type*>(ptr_val)}, false};
       }
@@
-      if (group_contains_available) {
+      if (group_contains_available) {
         auto const src_lane = __ffs(group_contains_available) - 1;
-        auto const res      = group.shfl(reinterpret_cast<intptr_t>(slot_ptr), src_lane);
-        auto const status   = [&, target_idx = intra_bucket_index]() {
-          if (group.thread_rank() != src_lane) { return insert_result::CONTINUE; }
-          return this->attempt_insert_stable(slot_ptr, bucket_slots[target_idx], val);
-        }();
+        intptr_t ptr_val = 0;
+        auto const status = [&, target_idx = intra_bucket_index]() {
+          if (group.thread_rank() != src_lane) { return insert_result::CONTINUE; }
+          auto* slot_ptr = this->get_slot_ptr(*probing_iter, intra_bucket_index);
+          ptr_val        = reinterpret_cast<intptr_t>(slot_ptr);
+          return this->attempt_insert_stable(slot_ptr, bucket_slots[target_idx], val);
+        }();
@@
-            group.sync();
-            return {iterator{reinterpret_cast<value_type*>(res)}, true};
+            ptr_val = group.shfl(ptr_val, src_lane);
+            return {iterator{reinterpret_cast<value_type*>(ptr_val)}, true};
@@
-            group.sync();
-            return {iterator{reinterpret_cast<value_type*>(res)}, false};
+            ptr_val = group.shfl(ptr_val, src_lane);
+            return {iterator{reinterpret_cast<value_type*>(ptr_val)}, false};

Also applies to: 644-675


946-952: Fix UB similarly in find(CG) equal path.
Broadcast only from src_lane; compute pointer in that lane.

-        auto const res      = group.shfl(
-          reinterpret_cast<intptr_t>(this->get_slot_ptr(*probing_iter, intra_bucket_index)),
-          src_lane);
-        return iterator{reinterpret_cast<value_type*>(res)};
+        intptr_t ptr_val = 0;
+        if (group.thread_rank() == src_lane) {
+          ptr_val = reinterpret_cast<intptr_t>(
+            this->get_slot_ptr(*probing_iter, intra_bucket_index));
+        }
+        ptr_val = group.shfl(ptr_val, src_lane);
+        return iterator{reinterpret_cast<value_type*>(ptr_val)};
🧹 Nitpick comments (3)
include/cuco/detail/hyperloglog/kernels.cuh (1)

75-78: Remainder handling via cuda::static_for looks correct.
Bounds guard i() < remainder prevents OOB; reverse indexing is fine. Optional: read forward as first + (n - remainder) + i() for readability.

Also applies to: 82-85
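A minimal sketch of the suggested forward-indexed tail handling (the unroll width, process, and variable names are assumptions for illustration, not the actual HLL kernel):

#include <cuda/utility>

template <typename It, typename F>
__device__ void process_tail(It first, int n, F process)
{
  constexpr int unroll = 4;      // assumed unroll width
  int const remainder  = n % unroll;

  cuda::static_for<unroll>([&](auto i) {
    // The i() < remainder guard keeps the fully unrolled body in bounds.
    if (i() < remainder) { process(first[(n - remainder) + i()]); }
  });
}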

include/cuco/detail/bloom_filter/bloom_filter_impl.cuh (2)

208-213: Nit: redundant j() < ... guards.
cuda::static_for iterates over [0, N), so the j() < num_threads check (and j() < worker_num_threads in the second loop) is always true. You can drop that part of the predicate to simplify it.

-        cuda::static_for<num_threads>([&](auto j) {
-          if ((j() < num_threads) and (i + j() < num_keys)) {
+        cuda::static_for<num_threads>([&](auto j) {
+          if (i + j() < num_keys) {
             this->add_impl(group, group.shfl(hash_value, j()), group.shfl(block_index, j()));
           }
         });
@@
-        cuda::static_for<worker_num_threads>([&](auto j) {
-          if ((j() < worker_num_threads) and (i + worker_offset + j() < num_keys)) {
+        cuda::static_for<worker_num_threads>([&](auto j) {
+          if (i + worker_offset + j() < num_keys) {
             this->add_impl(worker_group,
                            worker_group.shfl(hash_value, j()),
                            worker_group.shfl(block_index, j()));
           }
         });

Also applies to: 230-237


364-375: Optional: short‑circuit inner loop when mismatch is found.
Guard inner static_for body with if (success) to avoid extra loads once false.

-          cuda::static_for<words_per_thread>([&](auto j) {
-            auto const expected_pattern = policy_.word_pattern(hash_value, thread_offset + j());
-            if ((stored_pattern[j()] & expected_pattern) != expected_pattern) { success = false; }
-          });
+          cuda::static_for<words_per_thread>([&](auto j) {
+            if (success) {
+              auto const expected_pattern =
+                policy_.word_pattern(hash_value, thread_offset + j());
+              if ((stored_pattern[j()] & expected_pattern) != expected_pattern) { success = false; }
+            }
+          });
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e7283ed and fabc460.

📒 Files selected for processing (5)
  • include/cuco/detail/bloom_filter/bloom_filter_impl.cuh (7 hunks)
  • include/cuco/detail/hyperloglog/kernels.cuh (2 hunks)
  • include/cuco/detail/open_addressing/open_addressing_ref_impl.cuh (11 hunks)
  • include/cuco/detail/roaring_bitmap/roaring_bitmap_impl.cuh (1 hunks)
  • include/cuco/detail/static_map/static_map_ref.inl (3 hunks)
🔇 Additional comments (11)
include/cuco/detail/hyperloglog/kernels.cuh (1)

23-23: LGTM: required header for static_for.
#include <cuda/utility> addition is correct and consistent with cuda::static_for usage.

include/cuco/detail/static_map/static_map_ref.inl (1)

27-27: LGTM: header include.
#include <cuda/utility> is required for cuda::static_for and related utilities.

include/cuco/detail/bloom_filter/bloom_filter_impl.cuh (4)

146-154: LGTM: per-word static_for + atomic OR.
Clear, fully unrolled; avoids loop-carried deps and keeps atomics coalesced.


253-260: LGTM: CG-strided static_for path.
Correct per-rank striding over words_per_block; coalesced atomics preserved.


339-345: LGTM: scalar contains with static_for.
Fully unrolled check with short-circuit flag is fine.


37-37: LGTM: header include.
#include <cuda/utility> is necessary for cuda::static_for.

include/cuco/detail/open_addressing/open_addressing_ref_impl.cuh (5)

444-464: LGTM: static_for-based per-bucket scans.
Pattern correctly emulates early-exit via a state accumulator; keeps codegen predictable.

Also applies to: 750-759, 932-942
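Since cuda::static_for always runs all N iterations (there is no break), the early exit has to be emulated; a minimal sketch of the accumulator pattern with made-up names (the real code also handles payloads, CG ballots, etc.):

#include <cuda/utility>

enum class probe_state { CONTINUE, FOUND, EMPTY };

template <int BucketSize, typename Key>
__device__ probe_state scan_bucket(Key const (&bucket)[BucketSize], Key key, Key empty_sentinel)
{
  auto state = probe_state::CONTINUE;
  cuda::static_for<BucketSize>([&](auto i) {
    // Once a decision is recorded, the remaining unrolled iterations are no-ops.
    if (state != probe_state::CONTINUE) { return; }
    if (bucket[i()] == key) { state = probe_state::FOUND; }
    else if (bucket[i()] == empty_sentinel) { state = probe_state::EMPTY; }
  });
  return state;
}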


987-994: LGTM: count() static_for population.
Per-slot result accumulation with arrays is clear; EMPTY short-circuit preserved.

Also applies to: 1030-1034


1338-1395: LGTM: retrieve_impl bucket pass.
Use of static_for for equals/existence and buffered flush is sound; ballots/shfls are gated correctly.


1459-1477: LGTM: non-CG for_each short-circuit.
Guarding with should_return preserves early-exit semantics after EMPTY.


31-31: LGTM: header include.
#include <cuda/utility> required for cuda::static_for.

@NVIDIA NVIDIA deleted a comment from coderabbitai bot Oct 24, 2025 (×6)
Comment on lines +344 to +346
add_impl_functor<HashValue, BlockIndex, policy_type, word_type, thread_scope> functor{
hash_value, block_index, policy_, words_, words_per_block};
cuda::static_for<words_per_block>(functor);
Member Author

Interestingly, in all other cases we can use a lambda with static_for without any issues, but in all Bloom filter implementations we have to resort to functors to avoid compiler errors. It’s unclear what makes the Bloom filter implementation behave differently from other use cases.
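For reference, a simplified sketch of the two shapes being compared (generic names, not the actual add_impl_functor): the same per-word body written once as a lambda and once as a functor whose templated call operator receives the integral-constant index.

#include <cuda/utility>

template <typename Word>
struct or_words_functor {
  Word* words;
  Word pattern;

  template <typename Index>
  __device__ void operator()(Index i) const
  {
    words[i()] |= pattern;  // i() yields the compile-time word index
  }
};

template <int WordsPerBlock, typename Word>
__device__ void or_block(Word* words, Word pattern)
{
  // Lambda form (fine everywhere else in the PR):
  // cuda::static_for<WordsPerBlock>([&](auto i) { words[i()] |= pattern; });

  // Functor form (the workaround used in the Bloom filter paths):
  cuda::static_for<WordsPerBlock>(or_words_functor<Word>{words, pattern});
}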

Collaborator

No clue what's causing this but since the Bloom filter implementation will be replaced soon I'd say the workaround with the functor is fine.

@PointKernel PointKernel added the Needs Review (Awaiting reviews before merging) label Oct 24, 2025
Collaborator

@sleeepyjack sleeepyjack left a comment

Nice work. Can you verify that we're not introducing any regressions with this? It might be that we are now unrolling loops which were previously not unrolled, potentially adding to the register pressure problem. I'd say comparing benchmarks for static_map and static_multimap should be sufficient.

Comment on lines +44 to 45
[[maybe_unused]] key_generator gen{};
gen.generate(dist_from_state<Dist>(state), keys.begin(), keys.end());
Collaborator

How can gen be maybe_unused when it is used immediately in the following line? What am I missing?

Member Author

That’s a compiler bug in CUDA 12.0 where host-side use of host-device APIs isn’t correctly recognized.
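A minimal sketch of the pattern (key_generator here is a hypothetical stand-in for the benchmark utility, not the real one): gen is clearly used on the following line, yet nvcc 12.0 can still emit a "declared but never referenced" warning for it, so [[maybe_unused]] silences the spurious diagnostic without changing behavior.

#include <vector>

struct key_generator {
  template <typename OutputIt>
  __host__ __device__ void generate(OutputIt first, OutputIt last)
  {
    int v = 0;
    for (auto it = first; it != last; ++it) { *it = v++; }
  }
};

void setup_keys(std::vector<int>& keys)
{
  [[maybe_unused]] key_generator gen{};  // [[maybe_unused]] only placates the bogus warning
  gen.generate(keys.begin(), keys.end());
}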

Labels

  • Needs Review (Awaiting reviews before merging)
  • type: improvement (Improvement / enhancement to an existing function)

Development

Successfully merging this pull request may close these issues.

[ENHANCEMENT]: Use static_for to guarantee compile-time loop unrolling
