
Conversation

@sleeepyjack
Collaborator

@sleeepyjack sleeepyjack commented Feb 12, 2025

This PR is part 4/5 of the Bloom filter optimization project and must be merged in the correct order.

This PR introduces the following API:

template <class CG, class InputIt>
__device__ void add(CG const& group, InputIt first, InputIt last)

i.e., a device-bulk operation for adding multiple items to the filter using a cooperative group (CG).
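
A minimal usage sketch (hedged: the kernel name, tile size, and partitioning below are illustrative assumptions, not code from this PR) of how a kernel might hand each CG a contiguous slice of the input and invoke the new bulk overload:

#include <cooperative_groups.h>
#include <cstddef>

// FilterRef is assumed to expose the new add(group, first, last) overload and to
// operate with a CG size of 8 (matching words_per_block = 8 in the benchmarks below).
template <class FilterRef, class Key>
__global__ void device_bulk_add_example(FilterRef ref, Key const* keys, std::size_t n)
{
  namespace cg = cooperative_groups;
  constexpr unsigned tile_size = 8;
  auto const tile = cg::tiled_partition<tile_size>(cg::this_thread_block());

  // Hypothetical partitioning: one contiguous chunk of keys per tile.
  auto const tile_id   = (blockIdx.x * blockDim.x + threadIdx.x) / tile_size;
  auto const num_tiles = (gridDim.x * blockDim.x) / tile_size;
  auto const chunk     = (n + num_tiles - 1) / num_tiles;
  auto const begin     = tile_id * chunk < n ? tile_id * chunk : n;
  auto const end       = begin + chunk < n ? begin + chunk : n;

  ref.add(tile, keys + begin, keys + end);  // device-bulk add introduced by this PR
}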

Using this approach in the (host) bulk add kernel improves performance for the following reason: the current kernel uses one CG per input element, so each thread in the CG loads the same key and computes the same hash value and target block index. This computation is redundant and increases compute pipeline pressure, which becomes a bottleneck when the filter is small, i.e., when it fits into the L2 cache.
In the new kernel, each thread loads one key at a time, computes the hash value and block index, and then shuffles these values to the other threads in the group, which then perform the cooperative add operation (see the sketch below).
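
To make this concrete, here is a minimal sketch of the per-thread-hash-plus-shuffle strategy, assuming hypothetical Filter::block_index and Filter::cooperative_add helpers (illustrative names, not the actual cuco internals):

#include <cooperative_groups.h>

// Minimal sketch of the per-thread hash + shuffle strategy described above.
template <unsigned TileSize, class Key, class Hasher, class Filter>
__device__ void bulk_add_sketch(
  cooperative_groups::thread_block_tile<TileSize> const& tile,
  Key const* first,
  Key const* last,
  Hasher hash,
  Filter& filter)
{
  auto const n = static_cast<unsigned>(last - first);
  for (unsigned base = 0; base < n; base += TileSize) {
    unsigned const my_idx = base + tile.thread_rank();
    // Each thread loads and hashes its own key exactly once -> no redundant per-key work.
    auto const my_hash  = (my_idx < n) ? hash(first[my_idx]) : decltype(hash(*first)){};
    auto const my_block = (my_idx < n) ? filter.block_index(my_hash) : 0;

    // Broadcast each lane's precomputed hash/index to the whole tile; all threads
    // then cooperatively set the bits of that single target block.
    for (unsigned lane = 0; lane < TileSize && base + lane < n; ++lane) {
      auto const h   = tile.shfl(my_hash, lane);
      auto const blk = tile.shfl(my_block, lane);
      filter.cooperative_add(tile, blk, h);
    }
  }
}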

@sleeepyjack sleeepyjack added the helps: rapids, topic: performance, type: improvement, and topic: bloom_filter labels on Feb 12, 2025
@sleeepyjack sleeepyjack self-assigned this Feb 12, 2025
@sleeepyjack
Collaborator Author

Benchmark results:

Ref=#669 Cmp=#672

bloom_filter_add_unique_size

[0] NVIDIA H100 80GB HBM3

Key Hash Word WordsPerBlock Distribution NumInputs FilterSizeMB Ref Time Ref Noise Cmp Time Cmp Noise Diff %Diff Status
I64 xxhash_64 U32 8 UNIQUE 1000000000 1 68.114 ms 0.00% 24.030 ms 0.01% -44084.030 us -64.72% FAST
I64 xxhash_64 U32 8 UNIQUE 1000000000 2 68.114 ms 0.00% 24.029 ms 0.01% -44084.891 us -64.72% FAST
I64 xxhash_64 U32 8 UNIQUE 1000000000 4 68.114 ms 0.00% 24.029 ms 0.01% -44085.159 us -64.72% FAST
I64 xxhash_64 U32 8 UNIQUE 1000000000 8 68.115 ms 0.00% 24.030 ms 0.01% -44084.273 us -64.72% FAST
I64 xxhash_64 U32 8 UNIQUE 1000000000 16 68.115 ms 0.00% 24.032 ms 0.01% -44082.957 us -64.72% FAST
I64 xxhash_64 U32 8 UNIQUE 1000000000 32 68.116 ms 0.00% 24.042 ms 0.01% -44074.296 us -64.70% FAST
I64 xxhash_64 U32 8 UNIQUE 1000000000 64 68.464 ms 0.00% 42.372 ms 0.05% -26092.074 us -38.11% FAST
I64 xxhash_64 U32 8 UNIQUE 1000000000 128 70.697 ms 0.00% 58.947 ms 0.07% -11749.710 us -16.62% FAST
I64 xxhash_64 U32 8 UNIQUE 1000000000 256 74.422 ms 0.00% 66.796 ms 0.08% -7626.537 us -10.25% FAST
I64 xxhash_64 U32 8 UNIQUE 1000000000 512 77.428 ms 0.00% 71.248 ms 0.63% -6179.363 us -7.98% FAST
I64 xxhash_64 U32 8 UNIQUE 1000000000 1024 79.222 ms 0.00% 74.797 ms 0.73% -4424.728 us -5.59% FAST
I64 xxhash_64 U32 8 UNIQUE 1000000000 2048 80.193 ms 0.00% 75.954 ms 1.04% -4238.657 us -5.29% FAST

bloom_filter_add_unique_hash

[0] NVIDIA H100 80GB HBM3

Key Hash Word WordsPerBlock Distribution NumInputs FilterSizeMB Ref Time Ref Noise Cmp Time Cmp Noise Diff %Diff Status
I64 identity_hash U32 8 UNIQUE 1000000000 2000 37.912 ms 0.00% 44.700 ms 0.09% 6.788 ms 17.90% SLOW
I64 xxhash_64 U32 8 UNIQUE 1000000000 2000 80.170 ms 0.00% 75.621 ms 0.97% -4548.209 us -5.67% FAST

bloom_filter_add_unique_block_dim

[0] NVIDIA H100 80GB HBM3

Key Hash Word WordsPerBlock Distribution NumInputs FilterSizeMB Ref Time Ref Noise Cmp Time Cmp Noise Diff %Diff Status
I64 xxhash_64 U32 1 UNIQUE 1000000000 2000 74.643 ms 0.01% 74.648 ms 0.01% 4.845 us 0.01% SLOW
I64 xxhash_64 U32 2 UNIQUE 1000000000 2000 74.784 ms 0.01% 75.531 ms 0.08% 746.822 us 1.00% SLOW
I64 xxhash_64 U32 4 UNIQUE 1000000000 2000 75.017 ms 0.01% 75.815 ms 0.11% 797.789 us 1.06% SLOW
I64 xxhash_64 U32 8 UNIQUE 1000000000 2000 80.170 ms 0.00% 76.152 ms 0.94% -4017.865 us -5.01% FAST
I64 xxhash_64 U64 1 UNIQUE 1000000000 2000 74.652 ms 0.01% 74.652 ms 0.01% 0.370 us 0.00% SAME
I64 xxhash_64 U64 2 UNIQUE 1000000000 2000 74.774 ms 0.01% 75.570 ms 0.14% 795.296 us 1.06% SLOW
I64 xxhash_64 U64 4 UNIQUE 1000000000 2000 75.041 ms 0.01% 75.922 ms 0.11% 881.271 us 1.17% SLOW
I64 xxhash_64 U64 8 UNIQUE 1000000000 2000 90.665 ms 0.00% 90.758 ms 0.14% 92.846 us 0.10% SLOW

arrow_bloom_filter_add_unique_size

[0] NVIDIA H100 80GB HBM3

Key Distribution NumInputs FilterSizeMB Ref Time Ref Noise Cmp Time Cmp Noise Diff %Diff Status
I64 UNIQUE 1000000000 1 85.883 ms 0.01% 24.326 ms 0.00% -61556.527 us -71.67% FAST
I64 UNIQUE 1000000000 2 85.530 ms 0.01% 24.326 ms 0.01% -61203.391 us -71.56% FAST
I64 UNIQUE 1000000000 4 85.575 ms 0.01% 24.326 ms 0.01% -61248.779 us -71.57% FAST
I64 UNIQUE 1000000000 8 85.581 ms 0.01% 24.328 ms 0.01% -61253.098 us -71.57% FAST
I64 UNIQUE 1000000000 16 85.665 ms 0.01% 24.331 ms 0.01% -61334.338 us -71.60% FAST
I64 UNIQUE 1000000000 32 85.673 ms 0.01% 24.345 ms 0.01% -61328.047 us -71.58% FAST
I64 UNIQUE 1000000000 64 85.723 ms 0.01% 42.384 ms 0.06% -43338.812 us -50.56% FAST
I64 UNIQUE 1000000000 128 86.054 ms 0.01% 58.863 ms 0.09% -27190.534 us -31.60% FAST

@sleeepyjack
Collaborator Author

Profiling showed that we're only achieving 75% occupancy due to register pressure: we're using 36 registers while the optimum is <=32. I'm not sure whether getting the kernel down to 32 registers would improve performance, since the main bottleneck is MIO throttle, i.e., we're thrashing the memory subsystem with write requests.
However, if someone finds a trick that saves us those 4 registers, you're my hero of the day.

@PointKernel
Member

PointKernel commented Feb 13, 2025

Profiling showed that we're only achieving 75% occupancy due to register pressure: we're using 36 registers while the optimum is <=32. I'm not sure whether getting the kernel down to 32 registers would improve performance, since the main bottleneck is MIO throttle, i.e., we're thrashing the memory subsystem with write requests. However, if someone finds a trick that saves us those 4 registers, you're my hero of the day.

Have you tried adding launch bounds together with using

template <typename Kernel>
constexpr auto max_occupancy_grid_size(int32_t block_size,
                                       Kernel kernel,
                                       std::size_t dynamic_shm_size = 0)
{
  int32_t device = 0;
  CUCO_CUDA_TRY(cudaGetDevice(&device));
  int32_t num_multiprocessors = -1;
  CUCO_CUDA_TRY(
    cudaDeviceGetAttribute(&num_multiprocessors, cudaDevAttrMultiProcessorCount, device));
  int32_t max_active_blocks_per_multiprocessor;
  CUCO_CUDA_TRY(cudaOccupancyMaxActiveBlocksPerMultiprocessor(
    &max_active_blocks_per_multiprocessor, kernel, block_size, dynamic_shm_size));
  return max_active_blocks_per_multiprocessor * num_multiprocessors;
}
?
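
For reference, a hedged sketch of combining the two (kernel name, signature, and block size are illustrative assumptions, not the actual kernel in this PR):

#include <cstddef>
#include <cstdint>

constexpr int32_t block_size = 128;  // assumed block size

template <class InputIt, class FilterRef>
__global__ void __launch_bounds__(block_size) add_kernel(InputIt first, std::size_t n, FilterRef ref)
{
  // ... kernel body ...
}

// Host side: size the grid for maximum occupancy of this exact kernel instantiation.
// auto const grid_size = max_occupancy_grid_size(block_size, add_kernel<InputIt, FilterRef>);
// add_kernel<<<grid_size, block_size, 0, stream>>>(first, n, ref);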

@sleeepyjack
Collaborator Author

Have you tried adding launch bounds together with using

Yes, both launch bounds and max_occupancy_grid_size are used. ncu tells me that 36 registers are used by the add kernel, but when I go into SASS/source view I only see a maximum of 33 live registers at a time.
[Screenshot: ncu SASS/source view showing live register counts (2025-02-13)]

@sleeepyjack
Collaborator Author

sleeepyjack commented Feb 13, 2025

@PointKernel Ha! I found the 4 registers by setting the size_type to uint32_t instead of size_t in the benchmark. This is generally applicable: even in the worst-case setup (4-byte word type and 1 word per block), a 32-bit size type still lets us build a 16 GB filter (2^32 addressable blocks × 4 B per block), which is huge in real-world scenarios. Theoretical occupancy is now 100% (84% achieved) and initial results are slightly better. Will update the branch and benchmarks soon.
[Screenshot: ncu occupancy report after the size_type change (2025-02-13)]

@sleeepyjack sleeepyjack marked this pull request as ready for review February 13, 2025 16:05
@sleeepyjack
Collaborator Author

The last two commits improve the occupancy and reduce tail effects. However, these improvements only lead to slight speedups in a few cases. I'd say this is good enough for now.

Ref=#669 Cmp=#672

bloom_filter_add_unique_size

[0] NVIDIA H100 80GB HBM3

Key Hash Word WordsPerBlock Distribution NumInputs FilterSizeMB Ref Time Ref Noise Cmp Time Cmp Noise Diff %Diff Status
I64 xxhash_64 U32 8 UNIQUE 1000000000 1 68.114 ms 0.00% 24.090 ms 0.07% -44023.930 us -64.63% FAST
I64 xxhash_64 U32 8 UNIQUE 1000000000 2 68.114 ms 0.00% 24.089 ms 0.05% -44025.249 us -64.63% FAST
I64 xxhash_64 U32 8 UNIQUE 1000000000 4 68.114 ms 0.00% 24.058 ms 0.07% -44055.963 us -64.68% FAST
I64 xxhash_64 U32 8 UNIQUE 1000000000 8 68.115 ms 0.00% 23.867 ms 0.08% -44247.179 us -64.96% FAST
I64 xxhash_64 U32 8 UNIQUE 1000000000 16 68.115 ms 0.00% 23.784 ms 0.06% -44331.287 us -65.08% FAST
I64 xxhash_64 U32 8 UNIQUE 1000000000 32 68.116 ms 0.00% 23.789 ms 0.04% -44326.856 us -65.08% FAST
I64 xxhash_64 U32 8 UNIQUE 1000000000 64 68.464 ms 0.00% 42.505 ms 0.06% -25959.471 us -37.92% FAST
I64 xxhash_64 U32 8 UNIQUE 1000000000 128 70.697 ms 0.00% 58.930 ms 0.08% -11767.102 us -16.64% FAST
I64 xxhash_64 U32 8 UNIQUE 1000000000 256 74.422 ms 0.00% 66.786 ms 0.05% -7635.832 us -10.26% FAST
I64 xxhash_64 U32 8 UNIQUE 1000000000 512 77.428 ms 0.00% 71.080 ms 0.06% -6347.654 us -8.20% FAST
I64 xxhash_64 U32 8 UNIQUE 1000000000 1024 79.222 ms 0.00% 73.591 ms 0.05% -5630.256 us -7.11% FAST
I64 xxhash_64 U32 8 UNIQUE 1000000000 2048 80.193 ms 0.00% 75.106 ms 0.05% -5086.486 us -6.34% FAST

bloom_filter_add_unique_hash

[0] NVIDIA H100 80GB HBM3

Key Hash Word WordsPerBlock Distribution NumInputs FilterSizeMB Ref Time Ref Noise Cmp Time Cmp Noise Diff %Diff Status
I64 identity_hash U32 8 UNIQUE 1000000000 2000 37.912 ms 0.00% 45.471 ms 0.14% 7.559 ms 19.94% SLOW
I64 xxhash_64 U32 8 UNIQUE 1000000000 2000 80.170 ms 0.00% 75.073 ms 0.06% -5097.115 us -6.36% FAST

bloom_filter_add_unique_block_dim

[0] NVIDIA H100 80GB HBM3

Key Hash Word WordsPerBlock Distribution NumInputs FilterSizeMB Ref Time Ref Noise Cmp Time Cmp Noise Diff %Diff Status
I64 xxhash_64 U32 1 UNIQUE 1000000000 2000 74.643 ms 0.01% 74.655 ms 0.01% 12.662 us 0.02% SLOW
I64 xxhash_64 U32 2 UNIQUE 1000000000 2000 74.784 ms 0.01% 74.905 ms 0.02% 120.429 us 0.16% SLOW
I64 xxhash_64 U32 4 UNIQUE 1000000000 2000 75.017 ms 0.01% 74.958 ms 0.02% -58.932 us -0.08% FAST
I64 xxhash_64 U32 8 UNIQUE 1000000000 2000 80.170 ms 0.00% 75.096 ms 0.09% -5073.846 us -6.33% FAST
I64 xxhash_64 U64 1 UNIQUE 1000000000 2000 74.652 ms 0.01% 74.661 ms 0.01% 9.326 us 0.01% SLOW
I64 xxhash_64 U64 2 UNIQUE 1000000000 2000 74.774 ms 0.01% 74.938 ms 0.04% 163.691 us 0.22% SLOW
I64 xxhash_64 U64 4 UNIQUE 1000000000 2000 75.041 ms 0.01% 75.046 ms 0.12% 4.667 us 0.01% SAME
I64 xxhash_64 U64 8 UNIQUE 1000000000 2000 90.665 ms 0.00% 89.510 ms 0.07% -1155.218 us -1.27% FAST

arrow_bloom_filter_add_unique_size

[0] NVIDIA H100 80GB HBM3

Key Distribution NumInputs FilterSizeMB Ref Time Ref Noise Cmp Time Cmp Noise Diff %Diff Status
I64 UNIQUE 1000000000 1 85.883 ms 0.01% 24.614 ms 0.01% -61269.035 us -71.34% FAST
I64 UNIQUE 1000000000 2 85.530 ms 0.01% 24.581 ms 0.01% -60948.816 us -71.26% FAST
I64 UNIQUE 1000000000 4 85.575 ms 0.01% 24.581 ms 0.01% -60994.548 us -71.28% FAST
I64 UNIQUE 1000000000 8 85.581 ms 0.01% 24.582 ms 0.01% -60999.083 us -71.28% FAST
I64 UNIQUE 1000000000 16 85.665 ms 0.01% 24.589 ms 0.05% -61076.038 us -71.30% FAST
I64 UNIQUE 1000000000 32 85.673 ms 0.01% 24.597 ms 0.01% -61075.715 us -71.29% FAST
I64 UNIQUE 1000000000 64 85.723 ms 0.01% 42.263 ms 0.09% -43459.756 us -50.70% FAST
I64 UNIQUE 1000000000 128 86.054 ms 0.01% 58.858 ms 0.06% -27195.454 us -31.60% FAST

void bloom_filter_add(nvbench::state& state,
                      nvbench::type_list<Key, Hash, Word, nvbench::enum_type<WordsPerBlock>, Dist>)
{
  using size_type = std::uint32_t;
Member


The runtime difference brought by the 100% occupancy is not worth the change IMO.

Collaborator Author

@sleeepyjack sleeepyjack Feb 13, 2025


size_type = uint32_t should be the default size type for bloom_filter IMO. The only things preventing us from setting this default in the public API are: (1) we use size_t everywhere else in our library to denote sizes (I could start my usual rant about how it was a mistake for the STL to choose uint64_t as the default size type); (2) a user might run into the inconvenience of a narrowing conversion if their input value is a size_t.

From an algorithmic standpoint, there is little to no need to use a wider type.

The benefit is that the kernel now achieves near-optimal occupancy, although it doesn't show any significant end-to-end effect for our particular benchmark setup. However, let's say a user wants to use a hasher that needs more registers than xxhash_64: with the 4 extra registers available, the compiler has more room to optimize the code before running out of resources.

Member


We are using size_t mainly to match the STL. When possible, cudf and the upstream CCCL algorithms use signed integers like int32_t/int64_t as size types, since those avoid the implicit unsigned overflow/wraparound behavior.

@sleeepyjack sleeepyjack merged commit 459d1d6 into NVIDIA:dev Feb 20, 2025
18 checks passed
@sleeepyjack sleeepyjack deleted the bf-device-bulk-add branch February 20, 2025 00:05