@cyb70289
Contributor

@cyb70289 cyb70289 commented Oct 15, 2020

Improve variance kernel performance for integers by using the
textbook one-pass algorithm and integer arithmetic.
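
For readers unfamiliar with the textbook one-pass formula, here is a minimal sketch of the idea for int32 input. It is not the code from this PR: the function name `VarianceInt32OnePass` is hypothetical, it computes the population variance (ddof = 0), it assumes fewer than 2^32 values so the 64-bit sum cannot overflow, and it relies on the `__int128` extension available in GCC and Clang.

```cpp
#include <cstdint>
#include <vector>

// Assumption: GCC/Clang, which provide __int128 as an extension.
using int128_t = __int128;

// Hypothetical helper: population variance of int32 values in a single pass,
// using var = (n * sum(x^2) - sum(x)^2) / n^2 with exact integer accumulators.
double VarianceInt32OnePass(const std::vector<int32_t>& values) {
  const int64_t n = static_cast<int64_t>(values.size());
  if (n == 0) return 0.0;  // sketch only: the caller decides how to treat empty input

  int64_t sum = 0;          // safe for fewer than 2^32 int32 values
  int128_t square_sum = 0;  // each square is at most 2^62, so 128 bits is ample
  for (int32_t v : values) {
    sum += v;
    square_sum += static_cast<int64_t>(v) * v;
  }

  // Exact and non-negative: n * sum(x^2) >= sum(x)^2.
  const int128_t numerator =
      static_cast<int128_t>(n) * square_sum - static_cast<int128_t>(sum) * sum;

  // Only the final division is done in floating point.
  return static_cast<double>(numerator) /
         (static_cast<double>(n) * static_cast<double>(n));
}
```

Because the accumulators are exact integers, the cancellation that makes the textbook formula risky for floating-point input does not arise until the single final division.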

@cyb70289
Contributor Author

cyb70289 commented Oct 15, 2020

NOTE: Benchmark PR #8407 is not merged yet. You need to pull that PR manually to evaluate performance.

Tested on Xeon Gold 5218, clang-9.

                             benchmark         baseline        contender  change %
7        VarianceKernelInt32/1048576/0    1.805 GiB/sec    6.896 GiB/sec   282.033   'null_percent': 0.0
8    VarianceKernelInt32/1048576/10000    1.751 GiB/sec    5.113 GiB/sec   191.898  'null_percent': 0.01
16     VarianceKernelInt32/1048576/100    1.139 GiB/sec    2.748 GiB/sec   141.206   'null_percent': 1.0
18      VarianceKernelInt32/1048576/10  862.260 MiB/sec    1.547 GiB/sec    83.714  'null_percent': 10.0
1        VarianceKernelInt32/1048576/2  457.956 MiB/sec  769.333 MiB/sec    67.993  'null_percent': 50.0
9        VarianceKernelInt64/1048576/0    3.596 GiB/sec    4.949 GiB/sec    37.601   'null_percent': 0.0
5    VarianceKernelInt64/1048576/10000    3.484 GiB/sec    4.627 GiB/sec    32.832  'null_percent': 0.01
22     VarianceKernelInt64/1048576/100    2.285 GiB/sec    2.787 GiB/sec    21.955   'null_percent': 1.0
3        VarianceKernelFloat/1048576/2  397.394 MiB/sec  454.485 MiB/sec    14.366  'null_percent': 50.0
10      VarianceKernelFloat/1048576/10  790.793 MiB/sec  854.667 MiB/sec     8.077  'null_percent': 10.0
0        VarianceKernelInt64/1048576/1    1.220 TiB/sec    1.261 TiB/sec     3.353  'null_percent': 100.0
14       VarianceKernelInt32/1048576/1    1.222 TiB/sec    1.254 TiB/sec     2.647  'null_percent': 100.0
21      VarianceKernelDouble/1048576/1    1.206 TiB/sec    1.235 TiB/sec     2.379  'null_percent': 100.0
17       VarianceKernelFloat/1048576/1    1.180 TiB/sec    1.206 TiB/sec     2.208  'null_percent': 100.0
4   VarianceKernelDouble/1048576/10000    3.485 GiB/sec    3.475 GiB/sec    -0.277  'null_percent': 0.01
23      VarianceKernelDouble/1048576/0    3.595 GiB/sec    3.575 GiB/sec    -0.557   'null_percent': 0.0
2      VarianceKernelFloat/1048576/100    1.133 GiB/sec    1.126 GiB/sec    -0.632   'null_percent': 1.0
13       VarianceKernelFloat/1048576/0    1.804 GiB/sec    1.792 GiB/sec    -0.643   'null_percent': 0.0
20   VarianceKernelFloat/1048576/10000    1.750 GiB/sec    1.739 GiB/sec    -0.677  'null_percent': 0.01
11    VarianceKernelDouble/1048576/100    2.287 GiB/sec    2.262 GiB/sec    -1.092   'null_percent': 1.0
19      VarianceKernelDouble/1048576/2  866.210 MiB/sec  836.046 MiB/sec    -3.482  'null_percent': 50.0
6      VarianceKernelDouble/1048576/10    1.658 GiB/sec    1.597 GiB/sec    -3.672  'null_percent': 10.0
15      VarianceKernelInt64/1048576/10    1.687 GiB/sec    1.564 GiB/sec    -7.304  'null_percent': 10.0
12       VarianceKernelInt64/1048576/2  914.036 MiB/sec  789.209 MiB/sec   -13.657  'null_percent': 50.0

@cyb70289 cyb70289 marked this pull request as draft October 15, 2020 05:48
@cyb70289
Contributor Author

cyb70289 commented Oct 15, 2020

Turned to draft. Will add an optimization for 64-bit integers.

@cyb70289
Contributor Author

Added the int64 optimization and updated the benchmark results. Ready for review.
Big improvement for int32. Moderate improvement for int64 with few null values.
Some regression for int64 with many null values.

@cyb70289 cyb70289 marked this pull request as ready for review October 16, 2020 07:01
@pitrou
Member

pitrou commented Oct 21, 2020

Results on an AMD Zen 2 CPU:

VarianceKernelInt32/1048576/10000         140 us          140 us         5030 bytes_per_second=6.98658G/s null_percent=0.01 size=1048.58k
VarianceKernelInt32/1048576/100           216 us          216 us         3267 bytes_per_second=4.5294G/s null_percent=1 size=1048.58k
VarianceKernelInt32/1048576/10            397 us          397 us         1763 bytes_per_second=2.45765G/s null_percent=10 size=1048.58k
VarianceKernelInt32/1048576/2             974 us          974 us          718 bytes_per_second=1026.87M/s null_percent=50 size=1048.58k
VarianceKernelInt32/1048576/1           0.816 us        0.816 us       844145 bytes_per_second=1.1684T/s null_percent=100 size=1048.58k
VarianceKernelInt32/1048576/0             130 us          130 us         5414 bytes_per_second=7.51569G/s null_percent=0 size=1048.58k

VarianceKernelInt64/1048576/10000         135 us          135 us         5174 bytes_per_second=7.22877G/s null_percent=0.01 size=1048.58k
VarianceKernelInt64/1048576/100           260 us          260 us         2682 bytes_per_second=3.7503G/s null_percent=1 size=1048.58k
VarianceKernelInt64/1048576/10            440 us          440 us         1591 bytes_per_second=2.21931G/s null_percent=10 size=1048.58k
VarianceKernelInt64/1048576/2             884 us          884 us          783 bytes_per_second=1.10507G/s null_percent=50 size=1048.58k
VarianceKernelInt64/1048576/1           0.821 us        0.821 us       840316 bytes_per_second=1.16182T/s null_percent=100 size=1048.58k
VarianceKernelInt64/1048576/0             123 us          123 us         5620 bytes_per_second=7.94262G/s null_percent=0 size=1048.58k

VarianceKernelFloat/1048576/10000         366 us          366 us         1909 bytes_per_second=2.66576G/s null_percent=0.01 size=1048.58k
VarianceKernelFloat/1048576/100           751 us          751 us          909 bytes_per_second=1.3003G/s null_percent=1 size=1048.58k
VarianceKernelFloat/1048576/10           1097 us         1097 us          637 bytes_per_second=911.712M/s null_percent=10 size=1048.58k
VarianceKernelFloat/1048576/2            1803 us         1802 us          387 bytes_per_second=554.854M/s null_percent=50 size=1048.58k
VarianceKernelFloat/1048576/1           0.817 us        0.817 us       838993 bytes_per_second=1.1679T/s null_percent=100 size=1048.58k
VarianceKernelFloat/1048576/0             346 us          346 us         2021 bytes_per_second=2.82409G/s null_percent=0 size=1048.58k

VarianceKernelDouble/1048576/10000        184 us          184 us         3751 bytes_per_second=5.30153G/s null_percent=0.01 size=1048.58k
VarianceKernelDouble/1048576/100          372 us          372 us         1869 bytes_per_second=2.62218G/s null_percent=1 size=1048.58k
VarianceKernelDouble/1048576/10           549 us          549 us         1249 bytes_per_second=1.77993G/s null_percent=10 size=1048.58k
VarianceKernelDouble/1048576/2            909 us          909 us          741 bytes_per_second=1099.92M/s null_percent=50 size=1048.58k
VarianceKernelDouble/1048576/1          0.831 us        0.831 us       831173 bytes_per_second=1.14779T/s null_percent=100 size=1048.58k
VarianceKernelDouble/1048576/0            174 us          174 us         4050 bytes_per_second=5.62431G/s null_percent=0 size=1048.58k

I'm curious why Int64 would be faster than Double. Aren't they using the same algorithm? (and Int64 goes through an additional int-to-float conversion for each value)

@cyb70289
Contributor Author

cyb70289 commented Oct 22, 2020

> I'm curious why Int64 would be faster than Double. Aren't they using the same algorithm? (and Int64 goes through an additional int-to-float conversion for each value)

There's no int-to-float conversion in the Int64 summation loop (it sums into Int128), and that is faster than double summation.
https://quick-bench.com/q/-P9E6tgtXqnVBpVmmN6piaZHeUA
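
To make the point about the summation loop concrete, here is a rough sketch of accumulating int64 values into a 128-bit integer and converting to floating point only once, when the mean is taken. The names `SumInt64` and `MeanInt64` are hypothetical and this is not the kernel code from this PR; it assumes the `__int128` extension of GCC and Clang.

```cpp
#include <cstdint>
#include <vector>

// Assumption: GCC/Clang, which provide __int128 as an extension.
using int128_t = __int128;

// Hypothetical helper: sum int64 values exactly. The loop body is pure
// integer arithmetic, with no per-element int-to-float conversion.
int128_t SumInt64(const std::vector<int64_t>& values) {
  int128_t sum = 0;
  for (int64_t v : values) {
    sum += v;  // v is widened to 128 bits, so the sum cannot overflow here
  }
  return sum;
}

// Hypothetical helper: a single conversion to double after the loop.
// Note: an empty input divides by zero and yields NaN in this sketch.
double MeanInt64(const std::vector<int64_t>& values) {
  return static_cast<double>(SumInt64(values)) /
         static_cast<double>(values.size());
}
```

The 128-bit accumulator also keeps the sum exact for inputs whose total would not fit in an int64, for example two copies of `INT64_MAX`.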

@pitrou
Member

pitrou commented Oct 22, 2020

Ah, I hadn't noticed the SumType.

Member

@pitrou pitrou left a comment

+1
