Skip to content

Conversation

@sleeepyjack
Copy link
Collaborator

@sleeepyjack sleeepyjack commented Feb 12, 2025

This PR is part 2/5 of the Bloom filter optimization project and must be merged in the correct order.

The arrow_filter_policy requires the lookup of a static salt from an array of length 8.
The current implementation lead to excessive local memory traffic induced by this lookup.
This PR fixes this by replacing the array lookup with an inline 8-way switch-case.

@sleeepyjack sleeepyjack added helps: rapids Helps or needed by RAPIDS topic: performance Performance related issue topic: bloom_filter Issues related to bloom_filter labels Feb 12, 2025
@sleeepyjack sleeepyjack self-assigned this Feb 12, 2025
Copy link
Contributor

@mhaseeb123 mhaseeb123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Do we have any benchmarks that show the effect of this optimization?

@sleeepyjack
Copy link
Collaborator Author

sleeepyjack commented Feb 13, 2025

@mhaseeb123 here they are

Benchmark results:

Ref=#669 Cmp=#670

arrow_bloom_filter_add_unique_size

[0] NVIDIA H100 80GB HBM3

Key Distribution NumInputs FilterSizeMB Ref Time Ref Noise Cmp Time Cmp Noise Diff %Diff Status
I64 UNIQUE 1000000000 1 85.883 ms 0.01% 43.649 ms 0.00% -42233.580 us -49.18% FAST
I64 UNIQUE 1000000000 2 85.530 ms 0.01% 43.650 ms 0.00% -41879.809 us -48.97% FAST
I64 UNIQUE 1000000000 4 85.575 ms 0.01% 43.650 ms 0.00% -41924.872 us -48.99% FAST
I64 UNIQUE 1000000000 8 85.581 ms 0.01% 43.651 ms 0.00% -41929.564 us -48.99% FAST
I64 UNIQUE 1000000000 16 85.665 ms 0.01% 43.653 ms 0.00% -42012.336 us -49.04% FAST
I64 UNIQUE 1000000000 32 85.673 ms 0.01% 43.659 ms 0.00% -42013.794 us -49.04% FAST
I64 UNIQUE 1000000000 64 85.723 ms 0.01% 51.861 ms 0.02% -33862.236 us -39.50% FAST
I64 UNIQUE 1000000000 128 86.054 ms 0.01% 64.295 ms 0.00% -21759.247 us -25.29% FAST

arrow_bloom_filter_contains_unique_size

[0] NVIDIA H100 80GB HBM3

Key Distribution NumInputs FilterSizeMB Ref Time Ref Noise Cmp Time Cmp Noise Diff %Diff Status
I64 UNIQUE 1000000000 1 11.330 ms 0.03% 11.373 ms 0.21% 43.710 us 0.39% SLOW
I64 UNIQUE 1000000000 2 13.013 ms 0.14% 13.013 ms 0.07% 0.080 us 0.00% SAME
I64 UNIQUE 1000000000 4 13.883 ms 0.13% 13.875 ms 0.02% -7.219 us -0.05% FAST
I64 UNIQUE 1000000000 8 14.155 ms 0.24% 14.178 ms 0.02% 22.811 us 0.16% SLOW
I64 UNIQUE 1000000000 16 14.326 ms 0.15% 14.335 ms 0.03% 8.940 us 0.06% SLOW
I64 UNIQUE 1000000000 32 17.242 ms 0.11% 17.237 ms 0.06% -5.357 us -0.03% SAME
I64 UNIQUE 1000000000 64 22.634 ms 0.20% 22.657 ms 0.14% 22.970 us 0.10% SAME
I64 UNIQUE 1000000000 128 32.293 ms 0.15% 32.301 ms 0.06% 7.208 us 0.02% SAME

@sleeepyjack sleeepyjack marked this pull request as ready for review February 13, 2025 00:00
@sleeepyjack sleeepyjack merged commit e1203be into NVIDIA:dev Feb 13, 2025
18 checks passed
@sleeepyjack sleeepyjack deleted the bf-arrow-opt branch February 13, 2025 00:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

helps: rapids Helps or needed by RAPIDS topic: bloom_filter Issues related to bloom_filter topic: performance Performance related issue

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants