Bloom filter optimizations (4/5): Implement device bulk add #672
Conversation
Benchmark results (Ref=#669, Cmp=#672) on NVIDIA H100 80GB HBM3: bloom_filter_add_unique_size[0], bloom_filter_add_unique_hash[0], bloom_filter_add_unique_block_dim[0], arrow_bloom_filter_add_unique_size[0].
Profiling showed that we're only achieving 75% occupancy due to register pressure: we're using 36 registers while the optimum is <=32. I'm not sure whether getting the kernel down to 32 registers would improve performance, since the main bottleneck is MIO throttle, i.e., we're thrashing the memory subsystem with write requests.
Have you tried adding launch bounds together with using cuCollections/include/cuco/detail/utility/cuda.hpp, lines 60 to 77 in 6816740?
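For reference, a minimal sketch of what combining `__launch_bounds__` with a fixed block-size constant could look like; the block size, min-blocks value, kernel name, and `FilterRef::add` call are illustrative assumptions, not the contents of the referenced header:

```cuda
#include <cstddef>

// Assumed values for illustration; the real constants would come from
// include/cuco/detail/utility/cuda.hpp and the target architecture.
constexpr int assumed_block_size        = 128;
constexpr int assumed_min_blocks_per_sm = 16;

// __launch_bounds__(max_threads, min_blocks) tells the compiler the kernel is
// never launched with more than max_threads threads per block and that
// min_blocks blocks should be resident per SM. On an SM with 64K registers,
// 128 threads/block * 16 blocks caps the kernel at 64K / 2048 = 32 registers
// per thread, i.e., the occupancy target discussed above.
template <class FilterRef, class InputIt>
__global__ void __launch_bounds__(assumed_block_size, assumed_min_blocks_per_sm)
add_kernel(FilterRef filter, InputIt first, std::size_t n)
{
  auto const idx = std::size_t{blockIdx.x} * blockDim.x + threadIdx.x;
  if (idx < n) { filter.add(*(first + idx)); }  // placeholder body; the real kernel uses a CG
}
```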
@PointKernel Ha! I found the 4 registers by setting the
The last two commits improve the occupancy and reduce tail effects. However, these improvements only lead to slight speedups in a few cases. I'd say this is good enough for now.
Benchmark results (Ref=#669, Cmp=#672) on NVIDIA H100 80GB HBM3: bloom_filter_add_unique_size[0], bloom_filter_add_unique_hash[0], bloom_filter_add_unique_block_dim[0], arrow_bloom_filter_add_unique_size[0].
```cpp
void bloom_filter_add(nvbench::state& state,
                      nvbench::type_list<Key, Hash, Word, nvbench::enum_type<WordsPerBlock>, Dist>)
{
  using size_type = std::uint32_t;
```
The runtime difference brought by the 100% occupancy is not worth the change IMO.
size_type = uint32_t should be the default size type for bloom_filter IMO. The only things preventing us from setting this default in the public API are (1) we use size_t everywhere else in our library to denote sizes (I could start my usual rant about how it was a mistake for the STL to choose uint64_t as the default size type), and (2) a user might run into the inconvenience of a narrowing conversion if their input value is a size_t.
From an algorithmic standpoint, there is little to no need to use a wider type.
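A tiny illustration of the narrowing-conversion inconvenience mentioned in (2), using a hypothetical signature rather than the actual bloom_filter API:

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical filter whose size/count parameters are 32-bit.
struct filter
{
  void add_n(std::uint32_t num_keys) { (void)num_keys; }
};

void example(filter& f, std::size_t n)  // n is typically 64-bit
{
  f.add_n(n);                              // narrows silently; -Wconversion warns
  f.add_n(static_cast<std::uint32_t>(n));  // user has to convert explicitly
}
```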
The benefit is that the kernel now achieves near-optimal occupancy, although it doesn't show any significant end-to-end effect for our particular benchmark setup. However, let's say a user wants to use a hasher that needs more registers than xxhash64; with the 4 additional registers available, the compiler has more options to optimize the code before running out of resources.
We are using size_t mainly to match the STL. When possible, cudf and the upstream CCCL algorithms always use a signed integer like int32_t/int64_t as the size type since it avoids the unsigned overflow (wraparound) handling.


This PR is part 4/5 of the Bloom filter optimization project and must be merged in the correct order.
This PR introduces a new device bulk `add` API, i.e., a device-side bulk operation for adding multiple items into the filter using a CG.
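A rough usage sketch of such a device bulk `add` from a user kernel; the exact signature, the `filter.add(tile, first, last)` spelling, and the tile size of 4 are assumptions for illustration:

```cuda
#include <cooperative_groups.h>
#include <cstddef>

namespace cg = cooperative_groups;

// Each 4-thread tile cooperatively adds a contiguous chunk of keys.
template <class FilterRef, class KeyIt>
__global__ void user_kernel(FilterRef filter, KeyIt first, std::size_t n)
{
  auto const tile    = cg::tiled_partition<4>(cg::this_thread_block());
  auto const tile_id = (std::size_t{blockIdx.x} * blockDim.x + threadIdx.x) / tile.size();
  auto const n_tiles = (std::size_t{gridDim.x} * blockDim.x) / tile.size();

  auto const chunk = (n + n_tiles - 1) / n_tiles;  // keys per tile
  auto const begin = tile_id * chunk;
  auto const end   = begin + chunk < n ? begin + chunk : n;

  if (begin < n) { filter.add(tile, first + begin, first + end); }  // assumed device bulk add
}
```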
Using this approach in the (host) bulk `add` kernel improves performance for the following reason: the current kernel uses one CG per input element. Each thread in the CG loads the same key and computes a hash value and target block index. This computation is redundant and leads to increased compute pipeline pressure, which becomes a bottleneck when the filter is small, i.e., fits into the L2 cache.
In the new kernel, each thread loads one key at a time, computes the hash value and the block index, and then shuffles these values to the other threads in the group, which then perform the cooperative `add` operation.
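A simplified sketch of that restructuring, assuming a 4-thread tile and hypothetical `hash`, `block_index`, and `add_at` helpers on the filter reference (these names are not the actual cuco internals):

```cuda
#include <cooperative_groups.h>
#include <cstddef>
#include <cstdint>

namespace cg = cooperative_groups;

template <class FilterRef, class KeyIt>
__global__ void restructured_add_kernel(FilterRef filter, KeyIt first, std::size_t n)
{
  auto const tile = cg::tiled_partition<4>(cg::this_thread_block());
  auto const gid  = std::size_t{blockIdx.x} * blockDim.x + threadIdx.x;
  auto const base = gid / tile.size() * tile.size();  // first key handled by this tile

  // Each thread loads and hashes its own key exactly once (no redundant work).
  std::uint64_t my_hash  = 0;
  std::uint32_t my_block = 0;
  if (base + tile.thread_rank() < n) {
    auto const key = *(first + base + tile.thread_rank());
    my_hash  = filter.hash(key);              // assumed helper
    my_block = filter.block_index(my_hash);   // assumed helper
  }

  // Round-robin over the tile's keys: broadcast one thread's precomputed
  // hash/index, then the whole tile cooperatively updates that filter block.
  for (unsigned lane = 0; lane < tile.size(); ++lane) {
    if (base + lane >= n) { break; }
    auto const h   = tile.shfl(my_hash, lane);
    auto const idx = tile.shfl(my_block, lane);
    filter.add_at(tile, idx, h);  // assumed cooperative add primitive
  }
}
```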