
Conversation

@sleeepyjack
Collaborator

@sleeepyjack sleeepyjack commented Feb 12, 2025

This PR is part 4/5 of the Bloom filter optimization project and must be merged in the correct order.

This PR introduces the following API:

template <class CG, class InputIt>
__device__ void add(CG const& group, InputIt first, InputIt last)

i.e., a device-bulk operation for adding multiple items to the filter using a cooperative group (CG).
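
A minimal usage sketch (hedged: the kernel name, tile size, and partitioning below are illustrative assumptions, not code from this PR) of how a kernel might hand each CG a contiguous slice of the input and invoke the new bulk overload:

#include <cooperative_groups.h>
#include <cstddef>

// FilterRef is assumed to expose the new add(group, first, last) overload and to
// operate with a CG size of 8 (matching words_per_block = 8 in the benchmarks below).
template <class FilterRef, class Key>
__global__ void device_bulk_add_example(FilterRef ref, Key const* keys, std::size_t n)
{
  namespace cg = cooperative_groups;
  constexpr unsigned tile_size = 8;
  auto const tile = cg::tiled_partition<tile_size>(cg::this_thread_block());

  // Hypothetical partitioning: one contiguous chunk of keys per tile.
  auto const tile_id   = (blockIdx.x * blockDim.x + threadIdx.x) / tile_size;
  auto const num_tiles = (gridDim.x * blockDim.x) / tile_size;
  auto const chunk     = (n + num_tiles - 1) / num_tiles;
  auto const begin     = tile_id * chunk < n ? tile_id * chunk : n;
  auto const end       = begin + chunk < n ? begin + chunk : n;

  ref.add(tile, keys + begin, keys + end);  // device-bulk add introduced by this PR
}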

Using this approach in the (host) bulk add kernel improves performance for the following reason: the current kernel uses one CG per input element, so each thread in the CG loads the same key and computes the same hash value and target block index. This computation is redundant and increases compute pipeline pressure, which becomes a bottleneck when the filter is small, i.e., when it fits into the L2 cache.
In the new kernel, each thread loads one key at a time, computes the hash value and block index, and then shuffles these values to the other threads in the group, which then perform the cooperative add operation (see the sketch below).
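
To make this concrete, here is a minimal sketch of the per-thread-hash-plus-shuffle strategy, assuming hypothetical Filter::block_index and Filter::cooperative_add helpers (illustrative names, not the actual cuco internals):

#include <cooperative_groups.h>

// Minimal sketch of the per-thread hash + shuffle strategy described above.
template <unsigned TileSize, class Key, class Hasher, class Filter>
__device__ void bulk_add_sketch(
  cooperative_groups::thread_block_tile<TileSize> const& tile,
  Key const* first,
  Key const* last,
  Hasher hash,
  Filter& filter)
{
  auto const n = static_cast<unsigned>(last - first);
  for (unsigned base = 0; base < n; base += TileSize) {
    unsigned const my_idx = base + tile.thread_rank();
    // Each thread loads and hashes its own key exactly once -> no redundant per-key work.
    auto const my_hash  = (my_idx < n) ? hash(first[my_idx]) : decltype(hash(*first)){};
    auto const my_block = (my_idx < n) ? filter.block_index(my_hash) : 0;

    // Broadcast each lane's precomputed hash/index to the whole tile; all threads
    // then cooperatively set the bits of that single target block.
    for (unsigned lane = 0; lane < TileSize && base + lane < n; ++lane) {
      auto const h   = tile.shfl(my_hash, lane);
      auto const blk = tile.shfl(my_block, lane);
      filter.cooperative_add(tile, blk, h);
    }
  }
}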

@sleeepyjack sleeepyjack added the helps: rapids, topic: performance, type: improvement, and topic: bloom_filter labels on Feb 12, 2025
@sleeepyjack sleeepyjack self-assigned this Feb 12, 2025
@sleeepyjack
Collaborator Author

Benchmark results:

Ref=#669 Cmp=#672

bloom_filter_add_unique_size

[0] NVIDIA H100 80GB HBM3

Key Hash Word WordsPerBlock Distribution NumInputs FilterSizeMB Ref Time Ref Noise Cmp Time Cmp Noise Diff %Diff Status
I64 xxhash_64 U32 8 UNIQUE 1000000000 1 68.114 ms 0.00% 24.030 ms 0.01% -44084.030 us -64.72% FAST
I64 xxhash_64 U32 8 UNIQUE 1000000000 2 68.114 ms 0.00% 24.029 ms 0.01% -44084.891 us -64.72% FAST
I64 xxhash_64 U32 8 UNIQUE 1000000000 4 68.114 ms 0.00% 24.029 ms 0.01% -44085.159 us -64.72% FAST
I64 xxhash_64 U32 8 UNIQUE 1000000000 8 68.115 ms 0.00% 24.030 ms 0.01% -44084.273 us -64.72% FAST
I64 xxhash_64 U32 8 UNIQUE 1000000000 16 68.115 ms 0.00% 24.032 ms 0.01% -44082.957 us -64.72% FAST
I64 xxhash_64 U32 8 UNIQUE 1000000000 32 68.116 ms 0.00% 24.042 ms 0.01% -44074.296 us -64.70% FAST
I64 xxhash_64 U32 8 UNIQUE 1000000000 64 68.464 ms 0.00% 42.372 ms 0.05% -26092.074 us -38.11% FAST
I64 xxhash_64 U32 8 UNIQUE 1000000000 128 70.697 ms 0.00% 58.947 ms 0.07% -11749.710 us -16.62% FAST
I64 xxhash_64 U32 8 UNIQUE 1000000000 256 74.422 ms 0.00% 66.796 ms 0.08% -7626.537 us -10.25% FAST
I64 xxhash_64 U32 8 UNIQUE 1000000000 512 77.428 ms 0.00% 71.248 ms 0.63% -6179.363 us -7.98% FAST
I64 xxhash_64 U32 8 UNIQUE 1000000000 1024 79.222 ms 0.00% 74.797 ms 0.73% -4424.728 us -5.59% FAST
I64 xxhash_64 U32 8 UNIQUE 1000000000 2048 80.193 ms 0.00% 75.954 ms 1.04% -4238.657 us -5.29% FAST

bloom_filter_add_unique_hash

[0] NVIDIA H100 80GB HBM3

Key Hash Word WordsPerBlock Distribution NumInputs FilterSizeMB Ref Time Ref Noise Cmp Time Cmp Noise Diff %Diff Status
I64 identity_hash U32 8 UNIQUE 1000000000 2000 37.912 ms 0.00% 44.700 ms 0.09% 6.788 ms 17.90% SLOW
I64 xxhash_64 U32 8 UNIQUE 1000000000 2000 80.170 ms 0.00% 75.621 ms 0.97% -4548.209 us -5.67% FAST

bloom_filter_add_unique_block_dim

[0] NVIDIA H100 80GB HBM3

Key Hash Word WordsPerBlock Distribution NumInputs FilterSizeMB Ref Time Ref Noise Cmp Time Cmp Noise Diff %Diff Status
I64 xxhash_64 U32 1 UNIQUE 1000000000 2000 74.643 ms 0.01% 74.648 ms 0.01% 4.845 us 0.01% SLOW
I64 xxhash_64 U32 2 UNIQUE 1000000000 2000 74.784 ms 0.01% 75.531 ms 0.08% 746.822 us 1.00% SLOW
I64 xxhash_64 U32 4 UNIQUE 1000000000 2000 75.017 ms 0.01% 75.815 ms 0.11% 797.789 us 1.06% SLOW
I64 xxhash_64 U32 8 UNIQUE 1000000000 2000 80.170 ms 0.00% 76.152 ms 0.94% -4017.865 us -5.01% FAST
I64 xxhash_64 U64 1 UNIQUE 1000000000 2000 74.652 ms 0.01% 74.652 ms 0.01% 0.370 us 0.00% SAME
I64 xxhash_64 U64 2 UNIQUE 1000000000 2000 74.774 ms 0.01% 75.570 ms 0.14% 795.296 us 1.06% SLOW
I64 xxhash_64 U64 4 UNIQUE 1000000000 2000 75.041 ms 0.01% 75.922 ms 0.11% 881.271 us 1.17% SLOW
I64 xxhash_64 U64 8 UNIQUE 1000000000 2000 90.665 ms 0.00% 90.758 ms 0.14% 92.846 us 0.10% SLOW

arrow_bloom_filter_add_unique_size

[0] NVIDIA H100 80GB HBM3

Key Distribution NumInputs FilterSizeMB Ref Time Ref Noise Cmp Time Cmp Noise Diff %Diff Status
I64 UNIQUE 1000000000 1 85.883 ms 0.01% 24.326 ms 0.00% -61556.527 us -71.67% FAST
I64 UNIQUE 1000000000 2 85.530 ms 0.01% 24.326 ms 0.01% -61203.391 us -71.56% FAST
I64 UNIQUE 1000000000 4 85.575 ms 0.01% 24.326 ms 0.01% -61248.779 us -71.57% FAST
I64 UNIQUE 1000000000 8 85.581 ms 0.01% 24.328 ms 0.01% -61253.098 us -71.57% FAST
I64 UNIQUE 1000000000 16 85.665 ms 0.01% 24.331 ms 0.01% -61334.338 us -71.60% FAST
I64 UNIQUE 1000000000 32 85.673 ms 0.01% 24.345 ms 0.01% -61328.047 us -71.58% FAST
I64 UNIQUE 1000000000 64 85.723 ms 0.01% 42.384 ms 0.06% -43338.812 us -50.56% FAST
I64 UNIQUE 1000000000 128 86.054 ms 0.01% 58.863 ms 0.09% -27190.534 us -31.60% FAST

@sleeepyjack
Collaborator Author

Profiling showed that we're only achieving 75% occupancy due to register pressure: we're using 36 registers while the optimum is <=32. I'm not sure whether getting the kernel down to 32 registers would improve performance, since the main bottleneck is MIO throttle, i.e., we're thrashing the memory subsystem with write requests.
However, if someone finds a trick that saves us those 4 registers, you're my hero of the day.

@PointKernel
Member

PointKernel commented Feb 13, 2025

Profiling showed that we're only achieving 75% occupancy due to register pressure: we're using 36 registers while the optimum is <=32. I'm not sure whether getting the kernel down to 32 registers would improve performance, since the main bottleneck is MIO throttle, i.e., we're thrashing the memory subsystem with write requests. However, if someone finds a trick that saves us those 4 registers, you're my hero of the day.

Have you tried adding launch bounds together with using

template <typename Kernel>
constexpr auto max_occupancy_grid_size(int32_t block_size,
                                       Kernel kernel,
                                       std::size_t dynamic_shm_size = 0)
{
  int32_t device = 0;
  CUCO_CUDA_TRY(cudaGetDevice(&device));
  int32_t num_multiprocessors = -1;
  CUCO_CUDA_TRY(
    cudaDeviceGetAttribute(&num_multiprocessors, cudaDevAttrMultiProcessorCount, device));
  int32_t max_active_blocks_per_multiprocessor;
  CUCO_CUDA_TRY(cudaOccupancyMaxActiveBlocksPerMultiprocessor(
    &max_active_blocks_per_multiprocessor, kernel, block_size, dynamic_shm_size));
  return max_active_blocks_per_multiprocessor * num_multiprocessors;
}
?
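
For reference, a hedged sketch of combining the two (kernel name, signature, and block size are illustrative assumptions, not the actual kernel in this PR):

#include <cstddef>
#include <cstdint>

constexpr int32_t block_size = 128;  // assumed block size

template <class InputIt, class FilterRef>
__global__ void __launch_bounds__(block_size) add_kernel(InputIt first, std::size_t n, FilterRef ref)
{
  // ... kernel body ...
}

// Host side: size the grid for maximum occupancy of this exact kernel instantiation.
// auto const grid_size = max_occupancy_grid_size(block_size, add_kernel<InputIt, FilterRef>);
// add_kernel<<<grid_size, block_size, 0, stream>>>(first, n, ref);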

@sleeepyjack
Collaborator Author

Have you tried adding launch bounds together with using

Yes, both launch bounds and max_occupancy_grid_size are used. ncu tells me that 36 registers are used by the add kernel, but when I go into SASS/source view I only see a maximum of 33 live registers at a time.
[Screenshot: ncu SASS/source view showing live register counts (2025-02-13)]

@sleeepyjack
Collaborator Author

sleeepyjack commented Feb 13, 2025

@PointKernel Ha! I found the 4 registers by setting the size_type to uint32_t instead of size_t in the benchmark. This is generally applicable: even in the worst-case setup (4-byte word type and 1 word per block), a 32-bit size type still lets us build a 16 GB filter (2^32 addressable blocks × 4 B per block), which is huge in real-world scenarios. Theoretical occupancy is now 100% (84% achieved) and initial results are slightly better. Will update the branch and benchmarks soon.
[Screenshot: ncu occupancy report after the size_type change (2025-02-13)]

@sleeepyjack sleeepyjack marked this pull request as ready for review February 13, 2025 16:05
@sleeepyjack
Collaborator Author

The last two commits improve the occupancy and reduce tail effects. However, these improvements only lead to slight speedups in a few cases. I'd say this is good enough for now.

Ref=#669 Cmp=#672

bloom_filter_add_unique_size

[0] NVIDIA H100 80GB HBM3

Key Hash Word WordsPerBlock Distribution NumInputs FilterSizeMB Ref Time Ref Noise Cmp Time Cmp Noise Diff %Diff Status
I64 xxhash_64 U32 8 UNIQUE 1000000000 1 68.114 ms 0.00% 24.090 ms 0.07% -44023.930 us -64.63% FAST
I64 xxhash_64 U32 8 UNIQUE 1000000000 2 68.114 ms 0.00% 24.089 ms 0.05% -44025.249 us -64.63% FAST
I64 xxhash_64 U32 8 UNIQUE 1000000000 4 68.114 ms 0.00% 24.058 ms 0.07% -44055.963 us -64.68% FAST
I64 xxhash_64 U32 8 UNIQUE 1000000000 8 68.115 ms 0.00% 23.867 ms 0.08% -44247.179 us -64.96% FAST
I64 xxhash_64 U32 8 UNIQUE 1000000000 16 68.115 ms 0.00% 23.784 ms 0.06% -44331.287 us -65.08% FAST
I64 xxhash_64 U32 8 UNIQUE 1000000000 32 68.116 ms 0.00% 23.789 ms 0.04% -44326.856 us -65.08% FAST
I64 xxhash_64 U32 8 UNIQUE 1000000000 64 68.464 ms 0.00% 42.505 ms 0.06% -25959.471 us -37.92% FAST
I64 xxhash_64 U32 8 UNIQUE 1000000000 128 70.697 ms 0.00% 58.930 ms 0.08% -11767.102 us -16.64% FAST
I64 xxhash_64 U32 8 UNIQUE 1000000000 256 74.422 ms 0.00% 66.786 ms 0.05% -7635.832 us -10.26% FAST
I64 xxhash_64 U32 8 UNIQUE 1000000000 512 77.428 ms 0.00% 71.080 ms 0.06% -6347.654 us -8.20% FAST
I64 xxhash_64 U32 8 UNIQUE 1000000000 1024 79.222 ms 0.00% 73.591 ms 0.05% -5630.256 us -7.11% FAST
I64 xxhash_64 U32 8 UNIQUE 1000000000 2048 80.193 ms 0.00% 75.106 ms 0.05% -5086.486 us -6.34% FAST

bloom_filter_add_unique_hash

[0] NVIDIA H100 80GB HBM3

Key Hash Word WordsPerBlock Distribution NumInputs FilterSizeMB Ref Time Ref Noise Cmp Time Cmp Noise Diff %Diff Status
I64 identity_hash U32 8 UNIQUE 1000000000 2000 37.912 ms 0.00% 45.471 ms 0.14% 7.559 ms 19.94% SLOW
I64 xxhash_64 U32 8 UNIQUE 1000000000 2000 80.170 ms 0.00% 75.073 ms 0.06% -5097.115 us -6.36% FAST

bloom_filter_add_unique_block_dim

[0] NVIDIA H100 80GB HBM3

Key Hash Word WordsPerBlock Distribution NumInputs FilterSizeMB Ref Time Ref Noise Cmp Time Cmp Noise Diff %Diff Status
I64 xxhash_64 U32 1 UNIQUE 1000000000 2000 74.643 ms 0.01% 74.655 ms 0.01% 12.662 us 0.02% SLOW
I64 xxhash_64 U32 2 UNIQUE 1000000000 2000 74.784 ms 0.01% 74.905 ms 0.02% 120.429 us 0.16% SLOW
I64 xxhash_64 U32 4 UNIQUE 1000000000 2000 75.017 ms 0.01% 74.958 ms 0.02% -58.932 us -0.08% FAST
I64 xxhash_64 U32 8 UNIQUE 1000000000 2000 80.170 ms 0.00% 75.096 ms 0.09% -5073.846 us -6.33% FAST
I64 xxhash_64 U64 1 UNIQUE 1000000000 2000 74.652 ms 0.01% 74.661 ms 0.01% 9.326 us 0.01% SLOW
I64 xxhash_64 U64 2 UNIQUE 1000000000 2000 74.774 ms 0.01% 74.938 ms 0.04% 163.691 us 0.22% SLOW
I64 xxhash_64 U64 4 UNIQUE 1000000000 2000 75.041 ms 0.01% 75.046 ms 0.12% 4.667 us 0.01% SAME
I64 xxhash_64 U64 8 UNIQUE 1000000000 2000 90.665 ms 0.00% 89.510 ms 0.07% -1155.218 us -1.27% FAST

arrow_bloom_filter_add_unique_size

[0] NVIDIA H100 80GB HBM3

Key Distribution NumInputs FilterSizeMB Ref Time Ref Noise Cmp Time Cmp Noise Diff %Diff Status
I64 UNIQUE 1000000000 1 85.883 ms 0.01% 24.614 ms 0.01% -61269.035 us -71.34% FAST
I64 UNIQUE 1000000000 2 85.530 ms 0.01% 24.581 ms 0.01% -60948.816 us -71.26% FAST
I64 UNIQUE 1000000000 4 85.575 ms 0.01% 24.581 ms 0.01% -60994.548 us -71.28% FAST
I64 UNIQUE 1000000000 8 85.581 ms 0.01% 24.582 ms 0.01% -60999.083 us -71.28% FAST
I64 UNIQUE 1000000000 16 85.665 ms 0.01% 24.589 ms 0.05% -61076.038 us -71.30% FAST
I64 UNIQUE 1000000000 32 85.673 ms 0.01% 24.597 ms 0.01% -61075.715 us -71.29% FAST
I64 UNIQUE 1000000000 64 85.723 ms 0.01% 42.263 ms 0.09% -43459.756 us -50.70% FAST
I64 UNIQUE 1000000000 128 86.054 ms 0.01% 58.858 ms 0.06% -27195.454 us -31.60% FAST

void bloom_filter_add(nvbench::state& state,
                      nvbench::type_list<Key, Hash, Word, nvbench::enum_type<WordsPerBlock>, Dist>)
{
  using size_type = std::uint32_t;
Member


The runtime difference brought by the 100% occupancy is not worth the change IMO.

Collaborator Author

@sleeepyjack sleeepyjack Feb 13, 2025


size_type = uint32_t should be the default size type for bloom_filter IMO. The only things preventing us from setting this default in the public API are: (1) we use size_t everywhere else in our library to denote sizes (I could start my usual rant about how it was a mistake for the STL to choose uint64_t as the default size type); (2) a user might run into the inconvenience of a narrowing conversion if their input value is a size_t.

From an algorithmic standpoint, there is little to no need to use a wider type.

The benefit is that the kernel now achieves near-optimal occupancy, although it doesn't show any significant end-to-end effect for our particular benchmark setup. However, let's say a user wants to use a hasher that needs more registers than xxhash_64: with the 4 extra registers available, the compiler has more room to optimize the code before running out of resources.

Member


We are using size_t mainly to match the STL. When possible, cudf and the upstream CCCL algorithms use signed integers like int32_t/int64_t as size types, since those avoid the implicit unsigned overflow/wraparound behavior.

@sleeepyjack sleeepyjack merged commit 459d1d6 into NVIDIA:dev Feb 20, 2025
18 checks passed
@sleeepyjack sleeepyjack deleted the bf-device-bulk-add branch February 20, 2025 00:05