-
Notifications
You must be signed in to change notification settings - Fork 14
Fixup vectorized bitmap generator and add benchmark #582
Conversation
Vectorized Implementation:
Non-vectorized
If I set
might be worth it for client CPU users. |
omniscidb/ResultSet/CMakeLists.txt
Outdated
@@ -12,6 +12,10 @@ set(result_set_source_files | |||
TargetValue.cpp | |||
) | |||
|
|||
if (NOT WIN32) | |||
set_source_files_properties(BitmapGenerators.cpp PROPERTIES COMPILE_FLAGS "-march=haswell -mavx512f -mavx512cd -mavx512vl -mavx512dq -mavx512bw") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unconditional AVX512 code would make HDK build unusable on non-AVX512 platforms. This is the reason to use IFUNCs in the first place - to enable fast AVX512 versions on AVX512-enabled HW and have a fallback for other platforms.
I don't know if using -march=haswell
is OK for conda builds, but it allows more extensions than the default -march=nocona
(which doesn't even enable SSE4).
Anyway, if we are OK with enabling more ISA extensions, then it's better to do it globally to get more benefits from those extensions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I misunderstood what those options were doing. I have everything working now with avx512 / non-avx512 overloads, but I would like to re-capture some of the alder lake performance. I dumped the AST with march=alderlake
(and not) and the only differing instruction is a shlx
vs sall
. Apparently the later makes the uop sequence more complicated. We do lose inlining, but the call overhead appears minimal.
w/ target_clones("arch=alderlake")
Running ./Tests/BitmapGeneratorBenchmark
Run on (16 X 2496.01 MHz CPU s)
CPU Caches:
L1 Data 48 KiB (x8)
L1 Instruction 32 KiB (x8)
L2 Unified 1280 KiB (x8)
L3 Unified 18432 KiB (x1)
Load Average: 2.00, 2.05, 1.39
---------------------------------------------------------
Benchmark Time CPU Iterations
---------------------------------------------------------
null_bitmap_8 3816 ns 3522 ns 204416
null_bitmap_16 3933 ns 3630 ns 197720
null_bitmap_32 3744 ns 3456 ns 195213
null_bitmap_64 3842 ns 3546 ns 204737
vs default
:
Running ./Tests/BitmapGeneratorBenchmark
Run on (16 X 2496.01 MHz CPU s)
CPU Caches:
L1 Data 48 KiB (x8)
L1 Instruction 32 KiB (x8)
L2 Unified 1280 KiB (x8)
L3 Unified 18432 KiB (x1)
Load Average: 2.71, 2.18, 1.41
---------------------------------------------------------
Benchmark Time CPU Iterations
---------------------------------------------------------
null_bitmap_8 5065 ns 4675 ns 155218
null_bitmap_16 4300 ns 3970 ns 167711
null_bitmap_32 4406 ns 4067 ns 174202
null_bitmap_64 4340 ns 4006 ns 173580
Using a target like x86-64-v4
generates the same assembly, but for some reason the performance doesn't increase - is it possible the wrong function is selected at runtime? I need to investigate, but target_clones
makes it a little harder. I suppose alderlake
is fine, especially since it's the first modern cpuid without AVX512 (tigerlake
has AVX512BW
).
As an aside - benchmarking on my ADL laptop is horrible due to thermal throttling, but I consistently get much better results with an alderlake
specific target_clone and again, the assembly differs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did some more digging and BMI2
appears to capture this desired behavior - the alderlake
target is apparently still too new for some CI (and likely for some users compilers, sadly).
4d47c99
to
b3b3b23
Compare
b3b3b23
to
4f8b3f5
Compare
This function can be selected on systems supporting BMI2, which introduces a shift instruction without modifying flags. This simplifes the uop pipeline and improves performance.
A few fixes necessary:
avx512f
target -avx512bw
is required to support_mm512_cmpneq_epi8_mask
.Closes #549