Conversation

@frankdjx
Contributor

@frankdjx frankdjx commented Jul 31, 2020

  1. Use BitBlockCounter to speed up the common case of data with a ~0.01% null probability.
  2. Enable compiler auto-vectorization (SIMD) of the no-nulls path for integer types. Float/Double use fmin/fmax to handle NaN, which the compiler cannot auto-vectorize (a rough sketch of both paths follows below).
  3. Also add test cases covering different null probabilities.
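
For illustration, a minimal sketch of the two no-nulls paths described in item 2 (the helper names here are hypothetical, not the kernel's actual internals): the integer loop is a plain reduction the compiler can auto-vectorize, while the float/double loop goes through std::fmin/std::fmax, which return the non-NaN operand when one input is NaN.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <utility>

// Hypothetical illustration only; assumes length > 0 and no null entries.
std::pair<int32_t, int32_t> MinMaxNoNullsInt32(const int32_t* values, int64_t length) {
  int32_t min = values[0];
  int32_t max = values[0];
  // Dependence-free loop over raw values: the compiler can turn this into
  // SIMD min/max instructions when the file is built with AVX2/AVX512 flags.
  for (int64_t i = 1; i < length; ++i) {
    min = std::min(min, values[i]);
    max = std::max(max, values[i]);
  }
  return {min, max};
}

std::pair<double, double> MinMaxNoNullsDouble(const double* values, int64_t length) {
  double min = values[0];
  double max = values[0];
  // std::fmin/std::fmax ignore a NaN operand and return the other value,
  // which gives the intended NaN handling but blocks auto-vectorization.
  for (int64_t i = 1; i < length; ++i) {
    min = std::fmin(min, values[i]);
    max = std::fmax(max, values[i]);
  }
  return {min, max};
}
```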

@frankdjx
Contributor Author

I can trigger a benchmark action once #7870 gets merged.

Below are the benchmark numbers for the integer types on my setup:

Before:
MinMaxKernelInt8/1048576/10000          847 us          845 us          828 bytes_per_second=1.15586G/s null_percent=0.01 size=1048.58k
MinMaxKernelInt8/1048576/0             43.9 us         43.8 us        15738 bytes_per_second=22.294G/s null_percent=0 size=1048.58k
MinMaxKernelInt16/1048576/10000         429 us          428 us         1637 bytes_per_second=2.28348G/s null_percent=0.01 size=1048.58k
MinMaxKernelInt16/1048576/0            42.4 us         42.4 us        15878 bytes_per_second=23.0572G/s null_percent=0 size=1048.58k
MinMaxKernelInt32/1048576/10000         295 us          294 us         2383 bytes_per_second=3.31751G/s null_percent=0.01 size=1048.58k
MinMaxKernelInt32/1048576/0            42.1 us         42.0 us        16620 bytes_per_second=23.2245G/s null_percent=0 size=1048.58k
MinMaxKernelInt64/1048576/10000         112 us          112 us         6309 bytes_per_second=8.70966G/s null_percent=0.01 size=1048.58k
MinMaxKernelInt64/1048576/0            82.2 us         82.1 us         8537 bytes_per_second=11.8992G/s null_percent=0 size=1048.58k

After (AVX2):
MinMaxKernelInt8/1048576/10000         92.9 us         92.6 us         7568 bytes_per_second=10.5421G/s null_percent=0.01 size=1048.58k
MinMaxKernelInt8/1048576/0             31.3 us         31.2 us        21832 bytes_per_second=31.2619G/s null_percent=0 size=1048.58k
MinMaxKernelInt16/1048576/10000        60.7 us         60.5 us        11501 bytes_per_second=16.1388G/s null_percent=0.01 size=1048.58k
MinMaxKernelInt16/1048576/0            31.5 us         31.4 us        22316 bytes_per_second=31.1085G/s null_percent=0 size=1048.58k
MinMaxKernelInt32/1048576/10000        51.0 us         50.9 us        13841 bytes_per_second=19.1853G/s null_percent=0.01 size=1048.58k
MinMaxKernelInt32/1048576/0            31.8 us         31.7 us        22111 bytes_per_second=30.8189G/s null_percent=0 size=1048.58k
MinMaxKernelInt64/1048576/10000        61.1 us         61.0 us        11610 bytes_per_second=16.016G/s null_percent=0.01 size=1048.58k
MinMaxKernelInt64/1048576/0            54.2 us         54.1 us        12935 bytes_per_second=18.0651G/s null_percent=0 size=1048.58k

AVX512:
MinMaxKernelInt32/1048576/10000       40.9 us         40.8 us        17151 bytes_per_second=23.9207G/s null_percent=0.01 size=1048.58k
MinMaxKernelInt32/1048576/0           25.6 us         25.6 us        26669 bytes_per_second=38.2196G/s null_percent=0 size=1048.58k
MinMaxKernelInt64/1048576/10000       34.5 us         34.4 us        20137 bytes_per_second=28.396G/s null_percent=0.01 size=1048.58k
MinMaxKernelInt64/1048576/0           23.7 us         23.7 us        25949 bytes_per_second=41.2537G/s null_percent=0 size=1048.58k

@frankdjx frankdjx marked this pull request as draft August 5, 2020 08:05
@frankdjx frankdjx marked this pull request as ready for review August 10, 2020 00:53
@frankdjx
Contributor Author

Ping. @wesm @pitrou

Could you help review this? It uses the same approach as the sum kernel: let the compiler vectorize the no-nulls part and use BitBlockCounter for data with ~0.01% nulls. #7870 adds the benchmark entries for the MinMax kernel.

Thanks.
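
For reference, a rough sketch of the BitBlockCounter-driven scan for mostly-valid data (the function name is hypothetical, and arrow::bit_util::GetBit assumes recent Arrow headers; older versions spell it arrow::BitUtil::GetBit): all-valid 64-bit blocks take the tight loop, all-null blocks are skipped, and only mixed blocks fall back to per-bit checks.

```cpp
#include <algorithm>
#include <cstdint>

#include "arrow/util/bit_block_counter.h"
#include "arrow/util/bit_util.h"

// Hypothetical sketch; assumes a non-null validity bitmap and pre-seeded min/max.
void MinMaxWithNulls(const int32_t* values, const uint8_t* validity, int64_t offset,
                     int64_t length, int32_t* min, int32_t* max) {
  arrow::internal::BitBlockCounter counter(validity, offset, length);
  int64_t pos = 0;
  while (pos < length) {
    const arrow::internal::BitBlockCount block = counter.NextWord();
    if (block.AllSet()) {
      // Fast path: no nulls in this block, plain loop (auto-vectorizable).
      for (int64_t i = pos; i < pos + block.length; ++i) {
        *min = std::min(*min, values[i]);
        *max = std::max(*max, values[i]);
      }
    } else if (!block.NoneSet()) {
      // Slow path: mixed block, check the validity bit for each value.
      for (int64_t i = pos; i < pos + block.length; ++i) {
        if (arrow::bit_util::GetBit(validity, offset + i)) {
          *min = std::min(*min, values[i]);
          *max = std::max(*max, values[i]);
        }
      }
    }
    pos += block.length;
  }
}
```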

@ursabot

ursabot commented Aug 13, 2020

no such option: --benchmark_filter

@frankdjx
Contributor Author

@ursabot benchmark --suite-filter=arrow-compute-aggregate-benchmark --benchmark-filter=MinMax

@frankdjx
Contributor Author

@ursabot benchmark --suite-filter=arrow-compute-aggregate-benchmark --benchmark-filter=MinMax

Below are the results for null_percent 0.01% and 0% from https://ci.ursalabs.org/#/builders/73/builds/101

                           benchmark         baseline         contender  change %                                           counters
3     MinMaxKernelInt8/1048576/10000  812.254 MiB/sec     7.952 GiB/sec   902.442  {'run_name': 'MinMaxKernelInt8/1048576/10000',...
31   MinMaxKernelInt16/1048576/10000    1.583 GiB/sec    12.895 GiB/sec   714.512  {'run_name': 'MinMaxKernelInt16/1048576/10000'...
16   MinMaxKernelInt32/1048576/10000    3.152 GiB/sec    16.605 GiB/sec   426.876  {'run_name': 'MinMaxKernelInt32/1048576/10000'...
2        MinMaxKernelInt64/1048576/0    5.289 GiB/sec    11.092 GiB/sec   109.708  {'run_name': 'MinMaxKernelInt64/1048576/0', 'r...
14   MinMaxKernelInt64/1048576/10000    6.222 GiB/sec    10.055 GiB/sec    61.610  {'run_name': 'MinMaxKernelInt64/1048576/10000'...
1        MinMaxKernelInt32/1048576/0   18.103 GiB/sec    26.301 GiB/sec    45.282  {'run_name': 'MinMaxKernelInt32/1048576/0', 'r...
15       MinMaxKernelInt16/1048576/0   18.086 GiB/sec    26.274 GiB/sec    45.269  {'run_name': 'MinMaxKernelInt16/1048576/0', 'r...
7         MinMaxKernelInt8/1048576/0   18.112 GiB/sec    26.210 GiB/sec    44.708  {'run_name': 'MinMaxKernelInt8/1048576/0', 'ru...
26  MinMaxKernelDouble/1048576/10000    1.063 GiB/sec     1.315 GiB/sec    23.759  {'run_name': 'MinMaxKernelDouble/1048576/10000...
23   MinMaxKernelFloat/1048576/10000  551.756 MiB/sec   674.455 MiB/sec    22.238  {'run_name': 'MinMaxKernelFloat/1048576/10000'...
0       MinMaxKernelDouble/1048576/0    1.205 GiB/sec     1.332 GiB/sec    10.600  {'run_name': 'MinMaxKernelDouble/1048576/0', '...
12       MinMaxKernelFloat/1048576/0  621.824 MiB/sec   607.146 MiB/sec    -2.361  {'run_name': 'MinMaxKernelFloat/1048576/0', 'r...

@pitrou
Member

pitrou commented Aug 25, 2020

@jianxind Sorry for the delay. Could you please rebase this PR? It looks like there are some conflicts now.

@frankdjx
Contributor Author

> @jianxind Sorry for the delay. Could you please rebase this PR? It looks like there are some conflicts now.

No problem at all. Rebased now. Thanks.

Member

@pitrou pitrou left a comment

Thanks for the updates. Some comments still.

@pitrou
Member

pitrou commented Sep 2, 2020

Passing ARROW_USER_SIMD_LEVEL=none doesn't seem to impact the results. Is something amiss?

@frankdjx
Contributor Author

frankdjx commented Sep 2, 2020

> ARROW_USER_SIMD_LEVEL=none

Below are the commands I used; compiler vectorization happens only on the integer types.

```sh
ARROW_USER_SIMD_LEVEL=avx2 ./release/arrow-compute-aggregate-benchmark --benchmark_filter=MinMaxKernelInt64
ARROW_USER_SIMD_LEVEL=none ./release/arrow-compute-aggregate-benchmark --benchmark_filter=MinMaxKernelInt64
```

@pitrou
Member

pitrou commented Sep 2, 2020

Ah, I also had -DARROW_SIMD_LEVEL=AVX2 in CMake. Without it I do see a difference.
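
A sketch of the distinction being discussed (command lines are illustrative): -DARROW_SIMD_LEVEL fixes the compile-time baseline applied to every source file, so when it is AVX2 even the SimdLevel::NONE kernels are built with AVX2, and the ARROW_USER_SIMD_LEVEL runtime cap then has nothing slower to fall back to.

```sh
# Compile-time baseline: with this, every translation unit (including the
# SimdLevel::NONE kernels) is built with AVX2 enabled.
cmake -DARROW_SIMD_LEVEL=AVX2 ..

# Runtime cap: only visible when the baseline is lower and the AVX2/AVX512
# kernels live in separately compiled, runtime-dispatched objects.
ARROW_USER_SIMD_LEVEL=none ./release/arrow-compute-aggregate-benchmark --benchmark_filter=MinMaxKernelInt64
```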

frankdjx and others added 8 commits September 2, 2020 13:21
Signed-off-by: Frank Du <frank.du@intel.com>
Signed-off-by: Frank Du <frank.du@intel.com>
Signed-off-by: Frank Du <frank.du@intel.com>
Signed-off-by: Frank Du <frank.du@intel.com>
This reverts commit 8b5b1a6aa491e76599c1988baf9c5df5a970e672.
Signed-off-by: Frank Du <frank.du@intel.com>
Signed-off-by: Frank Du <frank.du@intel.com>
@pitrou
Member

pitrou commented Sep 2, 2020

Rebased.

Member

@pitrou pitrou left a comment

+1

@pitrou
Member

pitrou commented Sep 2, 2020

Test failures are unrelated, will merge.

@felipecrv
Contributor

> Ah, I also had -DARROW_SIMD_LEVEL=AVX2 in CMake. Without it I do see a difference.

I came here after looking at the code and being confused. It sounds like there was never a need to instantiate templates with a SimdLevel parameter, given that the SIMD comes from compiler auto-vectorization rather than from the kernel code doing anything different per level.

I might be wrong; in that case I would love a pointer to the specialized code.

@pitrou
Member

pitrou commented Aug 15, 2024

@felipecrv I may be misunderstanding your question, but the SimdLevel template parameter is used as a disambiguator to avoid ODR issues (otherwise you would get the same function defined multiple times with different generated code, due to different compiler options).
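
A self-contained sketch of that pattern (these are not the actual Arrow declarations; the enum and file names below only mirror the real ones for illustration):

```cpp
#include <cstdint>

// Stand-in for arrow::compute::SimdLevel (illustrative only).
struct SimdLevel {
  enum type { NONE, AVX2, AVX512 };
};

// The body is identical for every Level. What differs is the .cc file that
// instantiates it: each file is compiled with different flags (none, -mavx2,
// -mavx512f), and the Level parameter keeps the resulting symbols distinct,
// so the differently compiled copies do not collide under the ODR.
template <typename T, SimdLevel::type Level>
struct MinMaxImpl {
  T min;
  T max;
  void Consume(const T* values, int64_t length) {
    for (int64_t i = 0; i < length; ++i) {
      if (values[i] < min) min = values[i];
      if (values[i] > max) max = values[i];
    }
  }
};

// aggregate_basic.cc         instantiates MinMaxImpl<int32_t, SimdLevel::NONE>
// aggregate_basic_avx2.cc    instantiates MinMaxImpl<int32_t, SimdLevel::AVX2>
// aggregate_basic_avx512.cc  instantiates MinMaxImpl<int32_t, SimdLevel::AVX512>
```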

@felipecrv
Contributor

> @felipecrv I may be misunderstanding your question, but the SimdLevel template parameter is used as a disambiguator to avoid ODR issues (otherwise you would get the same function defined multiple times with different generated code, due to different compiler options).

I understand that, but if no specialization exists for different SIMD levels, we don't need more than the SimdLevel::NONE variation of the kernel. There is no need to populate the kernel array for SimdLevel::AVX2 or AVX512 if we are not performing any sort of runtime dispatching based on runtime CPU capability checks.

@pitrou
Member

pitrou commented Aug 15, 2024

> I understand that, but if no specialization exists for different SIMD levels, we don't need more than the SimdLevel::NONE variation of the kernel.

The source code is usually the same for all variations, but the generated code (which matters for ODR) varies thanks to different compiler options.

```cmake
macro(append_runtime_avx512_src SRCS SRC)
  if(ARROW_HAVE_RUNTIME_AVX512)
    list(APPEND ${SRCS} ${SRC})
    set_source_files_properties(${SRC} PROPERTIES SKIP_PRECOMPILE_HEADERS ON)
    set_source_files_properties(${SRC} PROPERTIES COMPILE_FLAGS ${ARROW_AVX512_FLAG})
  endif()
endmacro()
```

> There is no need to populate the kernel array for SimdLevel::AVX2 or AVX512 if we are not performing any sort of runtime dispatching based on runtime CPU capability checks.

We do:

```cpp
// Dispatch as the CPU feature
#if defined(ARROW_HAVE_RUNTIME_AVX512) || defined(ARROW_HAVE_RUNTIME_AVX2)
  auto cpu_info = arrow::internal::CpuInfo::GetInstance();
#endif
#if defined(ARROW_HAVE_RUNTIME_AVX512)
  if (cpu_info->IsSupported(arrow::internal::CpuInfo::AVX512)) {
    if (kernel_matches[SimdLevel::AVX512]) {
      return kernel_matches[SimdLevel::AVX512];
    }
  }
#endif
#if defined(ARROW_HAVE_RUNTIME_AVX2)
  if (cpu_info->IsSupported(arrow::internal::CpuInfo::AVX2)) {
    if (kernel_matches[SimdLevel::AVX2]) {
      return kernel_matches[SimdLevel::AVX2];
    }
  }
#endif
```

@felipecrv
Contributor

> The source code is usually the same for all variations, but the generated code (which matters for ODR) varies thanks to different compiler options.

OK. Now I get it. The compiler options are source-file specific rather than global to the entire build.

@pitrou
Member

pitrou commented Aug 15, 2024

Right :-) I agree it's a bit difficult to follow.

pitrou added a commit that referenced this pull request Sep 3, 2024
…e same code in different compilation units (#43720)

### Rationale for this change

More than once I've been confused about how the `SimdLevel` template parameters on these kernel classes affect dispatching of kernels based on SIMD support detection at runtime [1] given that nothing in the code changes based on the parameters.

What matters is the compilation unit in which the templates are instantiated. Different compilation units get different compilation parameters. The SimdLevel parameters don't really affect the code that gets generated (!), they only serve as a way to avoid duplication of symbols in the compiled objects.

This PR organizes the code to make this more explicit.

[1] #7871 (comment)

### What changes are included in this PR?

 - Introduction of aggregate_basic-inl.h
 - Moving of the impls in `aggregate_basic-inl.h` to an anonymous namespace
 - Grouping of code based on the function they implement (`Sum`, `Mean`, and `MinMax`)

### Are these changes tested?

By the compilation process, existing tests, and benchmarks.

* GitHub Issue: #43719

Lead-authored-by: Felipe Oliveira Carvalho <felipekde@gmail.com>
Co-authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>
khwilson pushed a commit to khwilson/arrow that referenced this pull request Sep 14, 2024
…rom the same code in different compilation units (apache#43720)

QuietCraftsmanship pushed a commit to QuietCraftsmanship/arrow that referenced this pull request Jul 7, 2025
…e same code in different compilation units (#43720)
