Use static_for for fixed-size loops #771
base: dev
Conversation
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
include/cuco/detail/open_addressing/open_addressing_ref_impl.cuh (2)
627-648: Fix UB: avoid computing slot_ptr with intra_bucket_index = -1 before ballots (insert_and_find CG).
Compute pointer only in src_lane and broadcast as intptr_t. Current code evaluates get_slot_ptr even when state stays UNEQUAL, which is UB. Apply this patch:

```diff
@@
-    auto* slot_ptr = this->get_slot_ptr(*probing_iter, intra_bucket_index);
@@
-    if (group_finds_equal) {
+    if (group_finds_equal) {
       auto const src_lane = __ffs(group_finds_equal) - 1;
-      auto const res = group.shfl(reinterpret_cast<intptr_t>(slot_ptr), src_lane);
-      if (group.thread_rank() == src_lane) {
-        if constexpr (has_payload) {
-          // wait to ensure that the write to the value part also took place
-          this->wait_for_payload(slot_ptr->second, this->empty_value_sentinel());
-        }
-      }
-      group.sync();
-      return {iterator{reinterpret_cast<value_type*>(res)}, false};
+      intptr_t ptr_val = 0;
+      if (group.thread_rank() == src_lane) {
+        auto* slot_ptr = this->get_slot_ptr(*probing_iter, intra_bucket_index);
+        if constexpr (has_payload) {
+          this->wait_for_payload(slot_ptr->second, this->empty_value_sentinel());
+        }
+        ptr_val = reinterpret_cast<intptr_t>(slot_ptr);
+      }
+      ptr_val = group.shfl(ptr_val, src_lane);
+      return {iterator{reinterpret_cast<value_type*>(ptr_val)}, false};
     }
@@
-    if (group_contains_available) {
+    if (group_contains_available) {
       auto const src_lane = __ffs(group_contains_available) - 1;
-      auto const res = group.shfl(reinterpret_cast<intptr_t>(slot_ptr), src_lane);
-      auto const status = [&, target_idx = intra_bucket_index]() {
-        if (group.thread_rank() != src_lane) { return insert_result::CONTINUE; }
-        return this->attempt_insert_stable(slot_ptr, bucket_slots[target_idx], val);
-      }();
+      intptr_t ptr_val = 0;
+      auto const status = [&, target_idx = intra_bucket_index]() {
+        if (group.thread_rank() != src_lane) { return insert_result::CONTINUE; }
+        auto* slot_ptr = this->get_slot_ptr(*probing_iter, intra_bucket_index);
+        ptr_val = reinterpret_cast<intptr_t>(slot_ptr);
+        return this->attempt_insert_stable(slot_ptr, bucket_slots[target_idx], val);
+      }();
@@
-        group.sync();
-        return {iterator{reinterpret_cast<value_type*>(res)}, true};
+        ptr_val = group.shfl(ptr_val, src_lane);
+        return {iterator{reinterpret_cast<value_type*>(ptr_val)}, true};
@@
-        group.sync();
-        return {iterator{reinterpret_cast<value_type*>(res)}, false};
+        ptr_val = group.shfl(ptr_val, src_lane);
+        return {iterator{reinterpret_cast<value_type*>(ptr_val)}, false};
```

Also applies to: 644-675
946-952: Fix UB similarly in find(CG) equal path.
Broadcast only from src_lane; compute pointer in that lane.

```diff
-      auto const res = group.shfl(
-        reinterpret_cast<intptr_t>(this->get_slot_ptr(*probing_iter, intra_bucket_index)),
-        src_lane);
-      return iterator{reinterpret_cast<value_type*>(res)};
+      intptr_t ptr_val = 0;
+      if (group.thread_rank() == src_lane) {
+        ptr_val = reinterpret_cast<intptr_t>(
+          this->get_slot_ptr(*probing_iter, intra_bucket_index));
+      }
+      ptr_val = group.shfl(ptr_val, src_lane);
+      return iterator{reinterpret_cast<value_type*>(ptr_val)};
```
🧹 Nitpick comments (3)
include/cuco/detail/hyperloglog/kernels.cuh (1)
75-78: Remainder handling via cuda::static_for looks correct.
Bounds guard i() < remainder prevents OOB; reverse indexing is fine. Optional: read forward as first + (n - remainder) + i() for readability.
Also applies to: 82-85
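For reference, a minimal sketch of the forward-indexed variant suggested above; all names (cg_size, first, n, remainder, on_key) are illustrative stand-ins, not the actual kernels.cuh code:

```cuda
#include <cuda/utility>

// Process the trailing `remainder` elements of [first, first + n) with a
// fully unrolled loop; the bounds guard keeps indices past the remainder idle.
template <int cg_size, class InputIt, class OnKey>
__device__ void process_remainder(InputIt first, int n, int remainder, OnKey on_key)
{
  cuda::static_for<cg_size>([&](auto i) {
    if (i() < remainder) {
      // forward order: visits the same elements as the reverse-indexed original
      on_key(*(first + (n - remainder) + i()));
    }
  });
}
```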
include/cuco/detail/bloom_filter/bloom_filter_impl.cuh (2)
208-213: Nit: redundant j() < ... guards.
cuda::static_for iterates [0, N), so j() < num_threads (resp. j() < worker_num_threads) is always true. You can drop that part to simplify the predicate.

```diff
-      cuda::static_for<num_threads>([&](auto j) {
-        if ((j() < num_threads) and (i + j() < num_keys)) {
+      cuda::static_for<num_threads>([&](auto j) {
+        if (i + j() < num_keys) {
           this->add_impl(group, group.shfl(hash_value, j()), group.shfl(block_index, j()));
         }
       });
@@
-      cuda::static_for<worker_num_threads>([&](auto j) {
-        if ((j() < worker_num_threads) and (i + worker_offset + j() < num_keys)) {
+      cuda::static_for<worker_num_threads>([&](auto j) {
+        if (i + worker_offset + j() < num_keys) {
           this->add_impl(worker_group,
                          worker_group.shfl(hash_value, j()),
                          worker_group.shfl(block_index, j()));
         }
       });
```

Also applies to: 230-237
364-375: Optional: short‑circuit inner loop when mismatch is found.
Guard inner static_for body with if (success) to avoid extra loads once false.

```diff
-      cuda::static_for<words_per_thread>([&](auto j) {
-        auto const expected_pattern = policy_.word_pattern(hash_value, thread_offset + j());
-        if ((stored_pattern[j()] & expected_pattern) != expected_pattern) { success = false; }
-      });
+      cuda::static_for<words_per_thread>([&](auto j) {
+        if (success) {
+          auto const expected_pattern =
+            policy_.word_pattern(hash_value, thread_offset + j());
+          if ((stored_pattern[j()] & expected_pattern) != expected_pattern) { success = false; }
+        }
+      });
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
- include/cuco/detail/bloom_filter/bloom_filter_impl.cuh (7 hunks)
- include/cuco/detail/hyperloglog/kernels.cuh (2 hunks)
- include/cuco/detail/open_addressing/open_addressing_ref_impl.cuh (11 hunks)
- include/cuco/detail/roaring_bitmap/roaring_bitmap_impl.cuh (1 hunks)
- include/cuco/detail/static_map/static_map_ref.inl (3 hunks)
🔇 Additional comments (11)
include/cuco/detail/hyperloglog/kernels.cuh (1)
23-23: LGTM: required header for static_for.
#include <cuda/utility> addition is correct and consistent with cuda::static_for usage.
include/cuco/detail/static_map/static_map_ref.inl (1)
27-27: LGTM: header include.
#include <cuda/utility> is required for cuda::static_for and related utilities.
include/cuco/detail/bloom_filter/bloom_filter_impl.cuh (4)
146-154: LGTM: per-word static_for + atomic OR.
Clear, fully unrolled; avoids loop-carried deps and keeps atomics coalesced.
253-260: LGTM: CG-strided static_for path.
Correct per-rank striding over words_per_block; coalesced atomics preserved.
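As an illustration of the striding scheme this comment refers to, a rough sketch under assumed names (add_strided, block_words, word_pattern returning an unsigned word) — not the actual bloom_filter_impl code:

```cuda
#include <cuda/utility>

// Rank r of a group of group_size threads handles words r, r + group_size,
// r + 2 * group_size, ...: within one unrolled iteration, consecutive ranks
// hit consecutive words, keeping the atomic ORs coalesced.
template <int words_per_block, int group_size, class Group, class PatternFn>
__device__ void add_strided(Group const& group, PatternFn word_pattern, unsigned int* block_words)
{
  static_assert(words_per_block % group_size == 0, "assumed evenly divisible");
  cuda::static_for<words_per_block / group_size>([&](auto i) {
    auto const idx = i() * group_size + group.thread_rank();
    atomicOr(block_words + idx, word_pattern(idx));
  });
}
```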
339-345: LGTM: scalar contains with static_for.
Fully unrolled check with short-circuit flag is fine.
37-37: LGTM: header include.
#include <cuda/utility> is necessary for cuda::static_for.
include/cuco/detail/open_addressing/open_addressing_ref_impl.cuh (5)
444-464: LGTM: static_for-based per-bucket scans.
Pattern correctly emulates early-exit via a state accumulator; keeps codegen predictable.
Also applies to: 750-759, 932-942
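For readers unfamiliar with the pattern: cuda::static_for has no break, so early exit is emulated by a state flag that turns the remaining iterations into no-ops. A minimal sketch with illustrative names (not cuco's actual scan code); the same guard idea underlies the should_return check noted below for for_each:

```cuda
#include <cuda/utility>

enum class scan_result { CONTINUE, FOUND, EMPTY };

// Scan a fixed-size bucket; once a decisive state is recorded, the
// remaining (still unrolled) iterations do nothing.
template <int bucket_size, class Slot, class Pred>
__device__ scan_result scan_bucket(Slot const* slots, Pred is_match, Slot empty_sentinel)
{
  auto state = scan_result::CONTINUE;
  cuda::static_for<bucket_size>([&](auto i) {
    if (state != scan_result::CONTINUE) { return; }  // emulated early exit
    if (is_match(slots[i()])) { state = scan_result::FOUND; }
    else if (slots[i()] == empty_sentinel) { state = scan_result::EMPTY; }
  });
  return state;
}
```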
987-994: LGTM: count() static_for population.
Per-slot result accumulation with arrays is clear; EMPTY short-circuit preserved.
Also applies to: 1030-1034
1338-1395: LGTM: retrieve_impl bucket pass.
Use of static_for for equals/existence and buffered flush is sound; ballots/shfls are gated correctly.
1459-1477: LGTM: non-CG for_each short-circuit.
Guarding with should_return preserves early-exit semantics after EMPTY.
31-31: LGTM: header include.
#include <cuda/utility> required for cuda::static_for.
```cpp
add_impl_functor<HashValue, BlockIndex, policy_type, word_type, thread_scope> functor{
  hash_value, block_index, policy_, words_, words_per_block};
cuda::static_for<words_per_block>(functor);
```
Interestingly, in all other cases we can use a lambda with static_for without any issues, but in all Bloom filter implementations we have to resort to functors to avoid compiler errors. It’s unclear what makes the Bloom filter implementation behave differently from other use cases.
No clue what's causing this but since the Bloom filter implementation will be replaced soon I'd say the workaround with the functor is fine.
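For context, a minimal sketch of what the functor workaround looks like, with illustrative member names (not the actual add_impl_functor): the loop body moves from a lambda into a named callable whose operator() is templated on the index constant:

```cuda
#include <cuda/utility>

// Named callable standing in for the lambda; i() yields the compile-time
// index exactly as in the lambda form.
struct add_word_functor {
  unsigned int pattern;
  unsigned int* words;

  template <class Index>
  __device__ void operator()(Index i) const
  {
    atomicOr(words + i(), pattern);
  }
};

template <int words_per_block>
__device__ void add_all(unsigned int pattern, unsigned int* words)
{
  cuda::static_for<words_per_block>(add_word_functor{pattern, words});
}
```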
Nice work. Can you verify that we're not introducing any regressions with this? It might be that we are now unrolling loops which were previously not unrolled, potentially adding to the register pressure problem. I'd say comparing benchmarks for static_map and static_multimap should be sufficient.
```cpp
[[maybe_unused]] key_generator gen{};
gen.generate(dist_from_state<Dist>(state), keys.begin(), keys.end());
```
How can gen be maybe_unused when it is used immediately in the following line? What am I missing?
That’s a compiler bug in CUDA 12.0 where host-side use of host-device APIs isn’t correctly recognized.
Closes #770
This PR replaces naive for loops with cuda::static_for in cases where the loop size is known at compile time, enabling guaranteed compile-time unrolling and improving runtime performance. It also updates CMake to treat device compiler warnings as errors and addresses warnings about variables like gen being declared but unused, which appears to be a compiler bug in CUDA 12.0.
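For illustration, a minimal before/after sketch of the transformation (a toy example, not code from this PR):

```cuda
#include <cuda/utility>

template <int n>
__device__ int sum_before(int const (&vals)[n])
{
  int acc = 0;
  for (int i = 0; i < n; ++i) { acc += vals[i]; }  // unrolling left to the compiler
  return acc;
}

template <int n>
__device__ int sum_after(int const (&vals)[n])
{
  int acc = 0;
  // body is instantiated n times at compile time, so the loop is
  // fully expanded by construction
  cuda::static_for<n>([&](auto i) { acc += vals[i()]; });
  return acc;
}
```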