This repository was archived by the owner on May 9, 2024. It is now read-only.

Fixup vectorized bitmap generator and add benchmark #582

Merged
alexbaden merged 3 commits into main from alex/vectorized_bitmap_generator on Jul 18, 2023


alexbaden
Contributor

A few fixes were necessary:

  • added required compiler flags to generate code for AVX-512 extensions to CMake
  • moved the implementation of the bitmap generators into local functions, facilitating appropriate dispatch (avx512 vs non-avx512)
  • removed unnecessary avx512f target - avx512bw is required to support _mm512_cmpneq_epi8_mask.
  • added a small microbenchmark

Closes #549
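
For readers following along, here is a minimal sketch of the kind of dispatch described in the list above. It is not the code from this PR: the function names, the signature, and the convention that a set bit means "value is not null" are all assumptions made for illustration. The AVX-512 variant is compiled through a per-function target attribute and selected at runtime only when the CPU reports AVX512BW, so the binary stays usable elsewhere.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
#include <immintrin.h>
#define SKETCH_X86_DISPATCH 1
#endif

// Hypothetical scalar fallback: bit i of dst is set when src[i] != null_val.
// dst must hold at least (n + 7) / 8 bytes.
void bitmap_scalar(uint8_t* dst, const int8_t* src, size_t n, int8_t null_val) {
  std::memset(dst, 0, (n + 7) / 8);
  for (size_t i = 0; i < n; ++i) {
    if (src[i] != null_val) {
      dst[i / 8] |= static_cast<uint8_t>(1u << (i % 8));
    }
  }
}

#ifdef SKETCH_X86_DISPATCH
// Hypothetical AVX-512 variant. The target attribute enables AVX512BW for
// this function only, so no global -mavx512bw flag is needed.
__attribute__((target("avx512bw")))
void bitmap_avx512(uint8_t* dst, const int8_t* src, size_t n, int8_t null_val) {
  const __m512i nulls = _mm512_set1_epi8(null_val);
  size_t i = 0;
  for (; i + 64 <= n; i += 64) {
    const __m512i values = _mm512_loadu_si512(src + i);
    // One mask bit per byte lane: set when the lane differs from null_val.
    const uint64_t bits =
        static_cast<uint64_t>(_mm512_cmpneq_epi8_mask(values, nulls));
    std::memcpy(dst + i / 8, &bits, sizeof(bits));
  }
  // i is a multiple of 64 here, so the tail starts on a byte boundary.
  if (i < n) {
    bitmap_scalar(dst + i / 8, src + i, n - i, null_val);
  }
}
#endif

// Runtime dispatch: take the AVX-512 path only when the CPU supports it.
void bitmap(uint8_t* dst, const int8_t* src, size_t n, int8_t null_val) {
  assert(dst != nullptr && src != nullptr);
#ifdef SKETCH_X86_DISPATCH
  if (__builtin_cpu_supports("avx512bw")) {
    bitmap_avx512(dst, src, n, null_val);
    return;
  }
#endif
  bitmap_scalar(dst, src, n, null_val);
}
```

Both paths should produce identical bitmaps, which makes the scalar version a convenient reference when testing the vectorized one.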

@alexbaden
Contributor Author

Vectorized Implementation:

2023-07-14T09:51:06-07:00
Running ./BitmapGeneratorBenchmark
Run on (16 X 3900 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x8)
  L1 Instruction 32 KiB (x8)
  L2 Unified 1024 KiB (x8)
  L3 Unified 16896 KiB (x1)
Load Average: 0.28, 0.35, 0.68
---------------------------------------------------------
Benchmark               Time             CPU   Iterations
---------------------------------------------------------
null_bitmap_8        57.9 ns         57.9 ns     12052210
null_bitmap_16       96.2 ns         96.1 ns      7298346
null_bitmap_32        330 ns          330 ns      2124463
null_bitmap_64        431 ns          431 ns      1628332

Non-vectorized

2023-07-14T09:53:00-07:00
Running ./BitmapGeneratorBenchmark
Run on (16 X 3900 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x8)
  L1 Instruction 32 KiB (x8)
  L2 Unified 1024 KiB (x8)
  L3 Unified 16896 KiB (x1)
Load Average: 2.90, 1.35, 1.00
---------------------------------------------------------
Benchmark               Time             CPU   Iterations
---------------------------------------------------------
null_bitmap_8        3477 ns         3477 ns       199886
null_bitmap_16       3455 ns         3455 ns       202416
null_bitmap_32       4848 ns         4848 ns       144132
null_bitmap_64       4346 ns         4346 ns       161325

If I set -march=haswell, I can eke out a little more performance in the default case:

2023-07-14T09:54:42-07:00
Running ./BitmapGeneratorBenchmark
Run on (16 X 1916.51 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x8)
  L1 Instruction 32 KiB (x8)
  L2 Unified 1024 KiB (x8)
  L3 Unified 16896 KiB (x1)
Load Average: 4.57, 2.05, 1.27
---------------------------------------------------------
Benchmark               Time             CPU   Iterations
---------------------------------------------------------
null_bitmap_8        2986 ns         2986 ns       235293
null_bitmap_16       3023 ns         3020 ns       234153
null_bitmap_32       3928 ns         3928 ns       178244
null_bitmap_64       3990 ns         3989 ns       175483

That might be worth it for users on client CPUs.

@alexbaden alexbaden requested a review from ienkovich July 14, 2023 16:58
@alexbaden alexbaden changed the title Fixup vectorized bitmap generator and add benchmrak Fixup vectorized bitmap generator and add benchmark Jul 14, 2023
@@ -12,6 +12,10 @@ set(result_set_source_files
TargetValue.cpp
)

if (NOT WIN32)
set_source_files_properties(BitmapGenerators.cpp PROPERTIES COMPILE_FLAGS "-march=haswell -mavx512f -mavx512cd -mavx512vl -mavx512dq -mavx512bw")
Contributor

Unconditional AVX512 code would make the HDK build unusable on non-AVX512 platforms. This is the reason to use IFUNCs in the first place: to enable fast AVX512 versions on AVX512-enabled hardware and have a fallback for other platforms.

I don't know if using -march=haswell is OK for conda builds, but it allows more extensions than the default -march=nocona (which doesn't even enable SSE4).

Anyway, if we are OK with enabling more ISA extensions, then it's better to do it globally to get more benefits from those extensions.

Contributor Author

Yes, I misunderstood what those options were doing. I have everything working now with avx512 / non-avx512 overloads, but I would like to recapture some of the Alder Lake performance. I dumped the assembly with and without march=alderlake, and the only differing instruction is a shlx vs. a sall. Apparently the latter makes the uop sequence more complicated. We do lose inlining, but the call overhead appears minimal.

w/ target_clones("arch=alderlake")

Running ./Tests/BitmapGeneratorBenchmark
Run on (16 X 2496.01 MHz CPU s)
CPU Caches:
  L1 Data 48 KiB (x8)
  L1 Instruction 32 KiB (x8)
  L2 Unified 1280 KiB (x8)
  L3 Unified 18432 KiB (x1)
Load Average: 2.00, 2.05, 1.39
---------------------------------------------------------
Benchmark               Time             CPU   Iterations
---------------------------------------------------------
null_bitmap_8        3816 ns         3522 ns       204416
null_bitmap_16       3933 ns         3630 ns       197720
null_bitmap_32       3744 ns         3456 ns       195213
null_bitmap_64       3842 ns         3546 ns       204737

vs default:

Running ./Tests/BitmapGeneratorBenchmark
Run on (16 X 2496.01 MHz CPU s)
CPU Caches:
  L1 Data 48 KiB (x8)
  L1 Instruction 32 KiB (x8)
  L2 Unified 1280 KiB (x8)
  L3 Unified 18432 KiB (x1)
Load Average: 2.71, 2.18, 1.41
---------------------------------------------------------
Benchmark               Time             CPU   Iterations
---------------------------------------------------------
null_bitmap_8        5065 ns         4675 ns       155218
null_bitmap_16       4300 ns         3970 ns       167711
null_bitmap_32       4406 ns         4067 ns       174202
null_bitmap_64       4340 ns         4006 ns       173580

Using a target like x86-64-v4 generates the same assembly, but for some reason the performance doesn't increase - is it possible the wrong function is selected at runtime? I need to investigate, but target_clones makes it a little harder. I suppose alderlake is fine, especially since it's the first modern cpuid without AVX512 (tigerlake has AVX512BW).

As an aside - benchmarking on my ADL laptop is horrible due to thermal throttling, but I consistently get much better results with an alderlake-specific target_clones, and again, the assembly differs.

Contributor Author

@alexbaden alexbaden Jul 18, 2023

Did some more digging and BMI2 appears to capture this desired behavior - the alderlake target is apparently still too new for some CI (and likely for some users' compilers, sadly).

@alexbaden alexbaden force-pushed the alex/vectorized_bitmap_generator branch from 4d47c99 to b3b3b23 Compare July 18, 2023 14:30
@alexbaden alexbaden marked this pull request as ready for review July 18, 2023 14:32
@alexbaden alexbaden requested a review from ienkovich July 18, 2023 14:32
@alexbaden alexbaden force-pushed the alex/vectorized_bitmap_generator branch from b3b3b23 to 4f8b3f5 Compare July 18, 2023 14:51
This function can be selected on systems supporting BMI2, which introduces shift instructions that do not modify flags. This simplifies the uop pipeline and improves performance.
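
As a rough illustration of the BMI2 selection described in that commit message, the sketch below marks a scalar generator with target_clones so the compiler emits a BMI2 clone (where it is free to use shlx for the variable shift) alongside a default clone, with a resolver choosing between them at load time. The function name, signature, return value, and bit convention are assumptions for illustration, not the code merged here. The attribute is applied only on x86-64 with glibc, since it relies on IFUNC support.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// __GLIBC__ is defined once a libc header has been included (done above).
#if defined(__x86_64__) && defined(__GLIBC__) && defined(__GNUC__)
// One clone per listed target plus a resolver that picks a clone
// according to the features of the CPU the program runs on.
#define SKETCH_CLONES __attribute__((target_clones("bmi2", "default")))
#else
#define SKETCH_CLONES
#endif

// Hypothetical scalar generator: bit k of dst[b] is set when
// src[8 * b + k] != null_val. Returns the number of null values seen.
// dst must hold at least (n + 7) / 8 bytes.
SKETCH_CLONES
size_t gen_valid_bitmap_32(uint8_t* dst,
                           const int32_t* src,
                           size_t n,
                           int32_t null_val) {
  assert(dst != nullptr && src != nullptr);
  size_t null_count = 0;
  const size_t nbytes = (n + 7) / 8;
  for (size_t b = 0; b < nbytes; ++b) {
    const size_t base = b * 8;
    const size_t lanes = (n - base < 8) ? n - base : 8;
    uint8_t byte = 0;
    for (size_t k = 0; k < lanes; ++k) {
      const uint8_t valid = src[base + k] != null_val;
      // Variable-count shift: the instruction that differs between clones.
      byte |= static_cast<uint8_t>(valid << k);
      null_count += !valid;
    }
    dst[b] = byte;
  }
  return null_count;
}
```

Whether the compiler actually emits shlx in the BMI2 clone depends on the compiler version and optimization level, so it is worth checking the generated assembly rather than assuming it.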
@alexbaden alexbaden merged commit 0e89ec1 into main Jul 18, 2023
@alexbaden alexbaden deleted the alex/vectorized_bitmap_generator branch July 18, 2023 19:44
Development

Successfully merging this pull request may close these issues.

Enable vectorized bitmap generation
2 participants