Fixup vectorized bitmap generator and add benchmark #582

alexbaden · 2023-07-14T16:42:19Z

A few fixes necessary:

added the compiler flags required to generate AVX-512 code to CMake
moved the implementation of the bitmap generators into local functions, facilitating appropriate dispatch (avx512 vs non-avx512)
removed unnecessary avx512f target - avx512bw is required to support _mm512_cmpneq_epi8_mask.
added a small microbenchmark

Closes #549

alexbaden · 2023-07-14T16:55:37Z

Vectorized Implementation:

2023-07-14T09:51:06-07:00
Running ./BitmapGeneratorBenchmark
Run on (16 X 3900 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x8)
  L1 Instruction 32 KiB (x8)
  L2 Unified 1024 KiB (x8)
  L3 Unified 16896 KiB (x1)
Load Average: 0.28, 0.35, 0.68
---------------------------------------------------------
Benchmark               Time             CPU   Iterations
---------------------------------------------------------
null_bitmap_8        57.9 ns         57.9 ns     12052210
null_bitmap_16       96.2 ns         96.1 ns      7298346
null_bitmap_32        330 ns          330 ns      2124463
null_bitmap_64        431 ns          431 ns      1628332

Non-vectorized

2023-07-14T09:53:00-07:00
Running ./BitmapGeneratorBenchmark
Run on (16 X 3900 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x8)
  L1 Instruction 32 KiB (x8)
  L2 Unified 1024 KiB (x8)
  L3 Unified 16896 KiB (x1)
Load Average: 2.90, 1.35, 1.00
---------------------------------------------------------
Benchmark               Time             CPU   Iterations
---------------------------------------------------------
null_bitmap_8        3477 ns         3477 ns       199886
null_bitmap_16       3455 ns         3455 ns       202416
null_bitmap_32       4848 ns         4848 ns       144132
null_bitmap_64       4346 ns         4346 ns       161325

If I set march=haswell, I can eke out a little more performance in the default case:

2023-07-14T09:54:42-07:00
Running ./BitmapGeneratorBenchmark
Run on (16 X 1916.51 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x8)
  L1 Instruction 32 KiB (x8)
  L2 Unified 1024 KiB (x8)
  L3 Unified 16896 KiB (x1)
Load Average: 4.57, 2.05, 1.27
---------------------------------------------------------
Benchmark               Time             CPU   Iterations
---------------------------------------------------------
null_bitmap_8        2986 ns         2986 ns       235293
null_bitmap_16       3023 ns         3020 ns       234153
null_bitmap_32       3928 ns         3928 ns       178244
null_bitmap_64       3990 ns         3989 ns       175483

might be worth it for client CPU users.

ienkovich · 2023-07-17T19:45:58Z

omniscidb/ResultSet/CMakeLists.txt

@@ -12,6 +12,10 @@ set(result_set_source_files
    TargetValue.cpp
 )

+if (NOT WIN32) 
+    set_source_files_properties(BitmapGenerators.cpp PROPERTIES COMPILE_FLAGS "-march=haswell -mavx512f -mavx512cd -mavx512vl -mavx512dq -mavx512bw")


Unconditional AVX512 code would make HDK build unusable on non-AVX512 platforms. This is the reason to use IFUNCs in the first place - to enable fast AVX512 versions on AVX512-enabled HW and have a fallback for other platforms.

I don't know if using -march=haswell is OK for conda builds, but it allows more extensions than the default -march=nocona (which doesn't even enable SSE4).

Anyway, if we are OK with enabling more ISA extensions, then it's better to do it globally to get more benefits from those extensions.
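The fallback arrangement described in this review comment (a fast AVX-512 version on capable hardware, a portable version everywhere else) can be sketched roughly as follows. This is a minimal illustration, not the code in BitmapGenerators.cpp: the function names, the signature, and the convention that a set bit means "not null" are all assumptions, and a function pointer chosen once via `__builtin_cpu_supports` stands in for a real IFUNC resolver. Only the AVX-512 function is compiled with the extension enabled (through a per-function `target` attribute), so the rest of the translation unit stays runnable on older CPUs.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

#if defined(__x86_64__) && defined(__GNUC__)
#include <immintrin.h>
#define HAVE_X86_DISPATCH 1
#endif

// Scalar fallback (hypothetical name): bit i is set when src[i] is not null.
// Returns the number of null elements.
static size_t bitmap8_scalar(uint8_t* dst, const int8_t* src, size_t n, int8_t null_val) {
  assert(dst != nullptr && src != nullptr);
  std::memset(dst, 0, (n + 7) / 8);
  size_t nulls = 0;
  for (size_t i = 0; i < n; ++i) {
    if (src[i] != null_val) {
      dst[i / 8] |= static_cast<uint8_t>(1u << (i % 8));
    } else {
      ++nulls;
    }
  }
  return nulls;
}

#ifdef HAVE_X86_DISPATCH
// AVX-512 version: only this function is built with avx512bw enabled.
static __attribute__((target("avx512bw")))
size_t bitmap8_avx512(uint8_t* dst, const int8_t* src, size_t n, int8_t null_val) {
  const __m512i null_vec = _mm512_set1_epi8(null_val);
  size_t nulls = 0;
  size_t i = 0;
  for (; i + 64 <= n; i += 64) {
    const __m512i v = _mm512_loadu_si512(src + i);
    // One mask bit per byte lane; a set bit means "differs from null_val".
    const __mmask64 valid = _mm512_cmpneq_epi8_mask(v, null_vec);
    std::memcpy(dst + i / 8, &valid, sizeof(valid));
    nulls += 64 - static_cast<size_t>(__builtin_popcountll(valid));
  }
  // Fewer than 64 elements left: finish with the scalar code.
  if (i < n) nulls += bitmap8_scalar(dst + i / 8, src + i, n - i, null_val);
  return nulls;
}
#endif

using Bitmap8Fn = size_t (*)(uint8_t*, const int8_t*, size_t, int8_t);

// Stand-in for an IFUNC resolver: choose an implementation from CPUID at runtime.
static Bitmap8Fn select_bitmap8() {
#ifdef HAVE_X86_DISPATCH
  __builtin_cpu_init();
  if (__builtin_cpu_supports("avx512bw")) return bitmap8_avx512;
#endif
  return bitmap8_scalar;
}

size_t gen_bitmap8(uint8_t* dst, const int8_t* src, size_t n, int8_t null_val) {
  static const Bitmap8Fn impl = select_bitmap8();  // resolved on first call
  return impl(dst, src, n, null_val);
}
```

With this shape, no file-wide `-mavx512*` flags are needed for the dispatch to work, which is the property the comment asks for.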

alexbaden · 2023-07-17

Yes, I misunderstood what those options were doing. I have everything working now with avx512 / non-avx512 overloads, but I would like to recapture some of the Alder Lake performance. I dumped the assembly with march=alderlake (and without) and the only differing instruction is a shlx vs sall. Apparently the latter makes the uop sequence more complicated. We do lose inlining, but the call overhead appears minimal.

w/ target_clones("arch=alderlake")

Running ./Tests/BitmapGeneratorBenchmark
Run on (16 X 2496.01 MHz CPU s)
CPU Caches:
  L1 Data 48 KiB (x8)
  L1 Instruction 32 KiB (x8)
  L2 Unified 1280 KiB (x8)
  L3 Unified 18432 KiB (x1)
Load Average: 2.00, 2.05, 1.39
---------------------------------------------------------
Benchmark               Time             CPU   Iterations
---------------------------------------------------------
null_bitmap_8        3816 ns         3522 ns       204416
null_bitmap_16       3933 ns         3630 ns       197720
null_bitmap_32       3744 ns         3456 ns       195213
null_bitmap_64       3842 ns         3546 ns       204737

vs default:

Running ./Tests/BitmapGeneratorBenchmark
Run on (16 X 2496.01 MHz CPU s)
CPU Caches:
  L1 Data 48 KiB (x8)
  L1 Instruction 32 KiB (x8)
  L2 Unified 1280 KiB (x8)
  L3 Unified 18432 KiB (x1)
Load Average: 2.71, 2.18, 1.41
---------------------------------------------------------
Benchmark               Time             CPU   Iterations
---------------------------------------------------------
null_bitmap_8        5065 ns         4675 ns       155218
null_bitmap_16       4300 ns         3970 ns       167711
null_bitmap_32       4406 ns         4067 ns       174202
null_bitmap_64       4340 ns         4006 ns       173580

Using a target like x86-64-v4 generates the same assembly, but for some reason the performance doesn't increase - is it possible the wrong function is selected at runtime? I need to investigate, but target_clones makes it a little harder. I suppose alderlake is fine, especially since it's the first modern cpuid without AVX512 (tigerlake has AVX512BW).

As an aside - benchmarking on my ADL laptop is horrible due to thermal throttling, but I consistently get much better results with an alderlake-specific target_clones and, again, the assembly differs.

Did some more digging and BMI2 appears to capture this desired behavior - the alderlake target is apparently still too new for some CI (and likely for some users' compilers, sadly).

This function can be selected on systems supporting BMI2, which introduces a shift instruction that does not modify flags. This simplifies the uop pipeline and improves performance.
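As a rough illustration of the BMI2 overload described here, the sketch below asks the compiler for two clones of a scalar generator, one built with BMI2 enabled and one default, and leaves the runtime choice to the generated resolver. The function name, signature, and bit convention (set bit means "not null") are assumptions for illustration, not the merged code. Whether the BMI2 clone actually uses `shlx` for the variable shift depends on the compiler and optimization level, so treat the comment about instruction selection as an expectation rather than a guarantee.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// target_clones needs IFUNC support, so restrict it to x86-64 Linux with a
// GCC-compatible compiler; elsewhere the macro expands to nothing.
#if defined(__x86_64__) && defined(__linux__) && defined(__GNUC__)
#define BITMAP_CLONES __attribute__((target_clones("bmi2", "default")))
#else
#define BITMAP_CLONES
#endif

// Hypothetical scalar generator for 32-bit values. Returns the null count.
BITMAP_CLONES
size_t bitmap32_cloned(uint8_t* dst, const int32_t* src, size_t n, int32_t null_val) {
  assert(n == 0 || (dst != nullptr && src != nullptr));
  size_t nulls = 0;
  for (size_t base = 0; base < n; base += 8) {
    const size_t count = (n - base < 8) ? (n - base) : 8;
    unsigned byte = 0;
    for (size_t j = 0; j < count; ++j) {
      const unsigned valid = src[base + j] != null_val;
      // Variable-count shift: the BMI2 clone may use shlx (flags untouched),
      // while the default clone falls back to sal/shl.
      byte |= valid << j;
      nulls += 1u - valid;
    }
    dst[base / 8] = static_cast<uint8_t>(byte);
  }
  return nulls;
}
```

Callers invoke `bitmap32_cloned` like any other function; the cost is that the call goes through the resolver's indirection and is not inlined, which matches the trade-off noted earlier in the thread.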

alexbaden requested a review from ienkovich July 14, 2023 16:58

alexbaden changed the title ~~Fixup vectorized bitmap generator and add benchmrak~~ Fixup vectorized bitmap generator and add benchmark Jul 14, 2023

ienkovich reviewed Jul 17, 2023

Add bitmap generator benchmark

193c3dc

alexbaden force-pushed the alex/vectorized_bitmap_generator branch from 4d47c99 to b3b3b23 Compare July 18, 2023 14:30

alexbaden marked this pull request as ready for review July 18, 2023 14:32

alexbaden requested a review from ienkovich July 18, 2023 14:32

alexbaden force-pushed the alex/vectorized_bitmap_generator branch from b3b3b23 to 4f8b3f5 Compare July 18, 2023 14:51

alexbaden added 2 commits July 18, 2023 07:52

Move vectored implementation function to be local for the resolver

beb074a

Generate optimized overload of non-vectorized bitmap generator

4f8b3f5

This function can be selected on systems supporting BMI2, which introduces a shift instruction that does not modify flags. This simplifies the uop pipeline and improves performance.

ienkovich approved these changes Jul 18, 2023

alexbaden merged commit 0e89ec1 into main Jul 18, 2023

alexbaden deleted the alex/vectorized_bitmap_generator branch July 18, 2023 19:44
