GH-34322: [C++][Parquet] Encoding Microbench for ByteArray #34323

mapleFU · 2023-02-24T03:47:45Z

Rationale for this change

After #14293, we have DELTA_LENGTH_BYTE_ARRAY encoding for ByteArray, so I'd like to have encoding benchmarks for it.
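For readers unfamiliar with the two layouts being compared, here is a minimal sketch under stated assumptions. In the Parquet spec, PLAIN stores each BYTE_ARRAY as a 4-byte little-endian length followed by its bytes, while DELTA_LENGTH_BYTE_ARRAY stores all lengths first (DELTA_BINARY_PACKED) and then all value bytes concatenated. The names `PlainEncodeByteArray`, `SplitForDeltaLength` and `DeltaLengthSections` are hypothetical, not Arrow APIs, and the DELTA_BINARY_PACKED step for the lengths is deliberately left out.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper (not Arrow API): PLAIN layout for BYTE_ARRAY,
// i.e. a 4-byte little-endian length prefix before each value's bytes.
std::vector<uint8_t> PlainEncodeByteArray(const std::vector<std::string>& values) {
  std::vector<uint8_t> out;
  for (const std::string& v : values) {
    const uint32_t len = static_cast<uint32_t>(v.size());
    for (int shift = 0; shift < 32; shift += 8) {
      out.push_back(static_cast<uint8_t>((len >> shift) & 0xFF));  // little-endian
    }
    out.insert(out.end(), v.begin(), v.end());
  }
  return out;
}

// Hypothetical helper: the two sections of DELTA_LENGTH_BYTE_ARRAY before the
// lengths get DELTA_BINARY_PACKED (that step is omitted in this sketch).
struct DeltaLengthSections {
  std::vector<int32_t> lengths;  // would be DELTA_BINARY_PACKED on disk
  std::string data;              // all value bytes, concatenated
};

DeltaLengthSections SplitForDeltaLength(const std::vector<std::string>& values) {
  DeltaLengthSections s;
  s.lengths.reserve(values.size());
  for (const std::string& v : values) {
    s.lengths.push_back(static_cast<int32_t>(v.size()));
    s.data += v;
  }
  return s;
}
```

Keeping all value bytes in one contiguous section is one reason DELTA_LENGTH_BYTE_ARRAY may pair well with a general-purpose compressor.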

What changes are included in this PR?

Adds benchmark cases for ByteArray encoding and decoding.

Are these changes tested?

No

Are there any user-facing changes?

No

Closes: [C++][Parquet] Add benchmarks for micro benchmark for DELTA_LENGTH_BYTE_ARRAY #34322

github-actions · 2023-02-24T03:48:10Z

Closes: [C++][Parquet] Add benchmarks for micro benchmark for DELTA_LENGTH_BYTE_ARRAY #34322

github-actions · 2023-02-24T03:48:13Z

⚠️ GitHub issue #34322 has been automatically assigned in GitHub to PR creator.

mapleFU · 2023-02-24T03:48:17Z

/cc @rok

wgtmac · 2023-02-24T08:42:53Z

@mapleFU Any benchmark data?

mapleFU · 2023-02-24T09:02:48Z

I'll test it on my mac and PC tonight :)

rok · 2023-02-24T09:09:54Z

This looks pretty good.
Do you think it would make sense to add a benchmark with some nulls?

mapleFU · 2023-02-24T09:17:47Z

Spaced encoding will be added. By the way, I wonder whether it would be valuable to add benchmarks for strings of different lengths.
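As a rough illustration of what the spaced benchmarks exercise: Parquet C++ decoders have a `DecodeSpaced` variant that takes a null count and a validity bitmap and places the decoded non-null values into the slots marked valid. The sketch below is a hypothetical standalone helper (`SpreadSpaced`), not the library's implementation; it assumes an LSB-first bitmap as in Arrow and leaves null slots as empty strings.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper illustrating "spaced" output: `dense` holds only the
// non-null values (as a decoder would produce them); `valid_bits` is an
// LSB-first validity bitmap. Null slots are left as empty strings.
std::vector<std::string> SpreadSpaced(const std::vector<std::string>& dense,
                                      const std::vector<uint8_t>& valid_bits,
                                      size_t num_values) {
  std::vector<std::string> out(num_values);
  size_t next = 0;  // index of the next dense value to place
  for (size_t i = 0; i < num_values; ++i) {
    const bool is_valid = (valid_bits[i / 8] >> (i % 8)) & 1;
    if (is_valid) {
      assert(next < dense.size());
      out[i] = dense[next++];
    }
  }
  assert(next == dense.size());  // every dense value must be consumed
  return out;
}
```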

One tricky point: DELTA_LENGTH should be most powerful when compression is enabled, but the current encoding benchmark doesn't test compression.

rok · 2023-02-24T09:28:30Z

You are currently testing strings of length 0-1024 with a uniform length distribution. Longer ones could be interesting, or maybe a range of, say, 4-20, or just all strings of length 8. Or maybe a mix: 95% uniformly distributed over 0-10 and 5% uniformly distributed over 500-1000. Perhaps you can try locally to see whether you get interesting results, and only include the benchmarks that behave differently.
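A minimal sketch of the mixed distribution suggested above, assuming lengths are drawn independently per value with `std::mt19937`. The function name `MixedLengths` is hypothetical and not part of the benchmark.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Hypothetical generator for the mixed length distribution:
// ~95% of lengths uniform in [0, 10], ~5% uniform in [500, 1000].
std::vector<int32_t> MixedLengths(size_t n, uint32_t seed) {
  std::mt19937 rng(seed);
  std::bernoulli_distribution is_long(0.05);
  std::uniform_int_distribution<int32_t> short_len(0, 10);
  std::uniform_int_distribution<int32_t> long_len(500, 1000);
  std::vector<int32_t> lengths;
  lengths.reserve(n);
  for (size_t i = 0; i < n; ++i) {
    lengths.push_back(is_long(rng) ? long_len(rng) : short_len(rng));
  }
  return lengths;
}
```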

> By the way, I think it's tricky that DELTA_LENGTH would be powerful when compression is enabled. However, currently we don't test compression during the encoding benchmark.

Maybe that's worth opening another issue?

mapleFU · 2023-02-24T09:34:27Z

Okay, I'll add a spaced test and post results from different machines tonight.

By the way, I'd like to test whether #14293 (comment) would boost encoder performance.

mapleFU · 2023-02-24T15:39:00Z

To be honest, I found that decoding DELTA_LENGTH_BYTE_ARRAY is much slower than I expected...

After some trivial optimizations to the decoder, the two run at about the same speed. The previous implementation was too slow.

mapleFU · 2023-02-24T16:57:09Z

On my macOS machine, Release build (-O2), with the default memory allocator:

---------------------------------------------------------------------------------------------------
Benchmark                                         Time             CPU   Iterations UserCounters...
---------------------------------------------------------------------------------------------------
BM_PlainEncodingByteArray                    113427 ns       111588 ns         6496 byte_array_bytes=3.43905G items_per_second=9.17665M/s
BM_DeltaBitLengthEncodingByteArray           131604 ns       127634 ns         5368 byte_array_bytes=2.84188G items_per_second=8.02297M/s
BM_PlainDecodingByteArray                      3581 ns         3499 ns       193420 byte_array_bytes=102.399G items_per_second=292.683M/s
BM_DeltaBitLengthDecodingByteArray           183808 ns       180761 ns         3913 byte_array_bytes=2.07159G items_per_second=5.66495M/s
BM_PlainDecodingSpacedByteArray                3519 ns         3513 ns       191379 byte_array_bytes=99.4606G items_per_second=291.501M/s null_percent=2
BM_DeltaBitLengthDecodingSpacedByteArray     131179 ns       130943 ns         5341 byte_array_bytes=2.77574G items_per_second=7.8202M/s null_percent=2

And after #34336:

---------------------------------------------------------------------------------------------------
Benchmark                                         Time             CPU   Iterations UserCounters...
---------------------------------------------------------------------------------------------------
BM_PlainEncodingByteArray                    101366 ns        85499 ns         8185 byte_array_bytes=4.33323G items_per_second=11.9767M/s
BM_DeltaBitLengthEncodingByteArray           145266 ns       130694 ns         5332 byte_array_bytes=2.82282G items_per_second=7.83507M/s
BM_PlainDecodingByteArray                      3704 ns         3577 ns       185523 byte_array_bytes=98.2179G items_per_second=286.31M/s
BM_DeltaBitLengthDecodingByteArray             2666 ns         2649 ns       266830 byte_array_bytes=141.263G items_per_second=386.623M/s
BM_PlainDecodingSpacedByteArray                3537 ns         3532 ns       202847 byte_array_bytes=105.421G items_per_second=289.898M/s null_percent=2
BM_DeltaBitLengthDecodingSpacedByteArray       2752 ns         2705 ns       259968 byte_array_bytes=135.107G items_per_second=378.521M/s null_percent=2
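To put the before/after numbers in perspective, the speedup is just the ratio of CPU times. This snippet only does that arithmetic on values copied from the two tables above; the constant names are made up for illustration.

```cpp
#include <cassert>

// CPU times (ns) copied from the two tables above (macOS, Release -O2).
constexpr double kDecodeBeforeNs = 180761.0;        // BM_DeltaBitLengthDecodingByteArray
constexpr double kDecodeAfterNs = 2649.0;           // same benchmark after #34336
constexpr double kDecodeSpacedBeforeNs = 130943.0;  // BM_DeltaBitLengthDecodingSpacedByteArray
constexpr double kDecodeSpacedAfterNs = 2705.0;     // same benchmark after #34336

// Speedup is the ratio of CPU time before to CPU time after.
constexpr double Speedup(double before_ns, double after_ns) {
  return before_ns / after_ns;
}
```

That works out to roughly 68x for `BM_DeltaBitLengthDecodingByteArray` and roughly 48x for the spaced variant on this machine.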

mapleFU · 2023-02-24T17:05:32Z

After using UnsafeAppend for contiguous strings, on my macOS machine with a Release build (-O2):

BM_PlainEncodingByteArray                    104785 ns       104771 ns         6608 byte_array_bytes=3.49835G items_per_second=9.77369M/s
BM_DeltaBitLengthEncodingByteArray           104361 ns       104324 ns         7083 byte_array_bytes=3.74982G items_per_second=9.81556M/s
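Arrow's builders (for example `arrow::BinaryBuilder`) offer `UnsafeAppend` methods that skip capacity checks, which is only safe after reserving enough space up front (e.g. with `Reserve`/`ReserveData`). The sketch below mirrors that idea with the standard library only; it is an illustration under that assumption, not the change made in this PR, and `ConcatContiguous` is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical sketch of "reserve once, then append without checks":
// total bytes are computed up front, so the buffer is sized exactly once
// and each value is copied with a plain memcpy.
std::vector<char> ConcatContiguous(const std::vector<std::string>& values) {
  size_t total = 0;
  for (const std::string& v : values) total += v.size();
  std::vector<char> out(total);  // single allocation
  char* dst = out.data();
  for (const std::string& v : values) {
    if (!v.empty()) {
      std::memcpy(dst, v.data(), v.size());  // no per-append capacity check
      dst += v.size();
    }
  }
  assert(dst == out.data() + out.size());  // every byte was written
  return out;
}
```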

mapleFU · 2023-02-27T02:52:30Z

@pitrou @rok Mind taking a look?

wgtmac · 2023-02-27T05:34:08Z

cpp/src/parquet/encoding_benchmark.cc

+  // Using arrow generator to generate random data.
+  int32_t max_length = state.range(0);
+  auto array =
+      rag.String(/* size */ 1024, /* min_length */ 0, /* max_length */ max_length,


Should the size (1024 here) be adjustable? Reasonable options may include 256, 512, 4096.

I'd add different options only if there is nonlinear behavior (time complexity != O(n)).

Sometimes we need to adaptively determine the best batch size in the compute engine. Providing different batch sizes may give us better visibility on the encoding side. I suspect it will demonstrate linear behavior, as it has barriers including the block size or encoding pattern. @rok

But how many do we need to catch performance regressions?

cpp/src/parquet/encoding_benchmark.cc

rok

I'll do another pass tomorrow, but except for the CI issue it looks pretty good.

mapleFU · 2023-02-28T03:44:39Z

@wgtmac @rok

Dict decoding for ByteArray is added, but the current benchmark has no NDV (number of distinct values) parameter. I can add an NDV parameter for the dict benchmark in the future.
Different batch sizes are added.
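A hypothetical sketch of what an NDV knob would control: dictionary encoding maps each value to an index into a dictionary of distinct values, and with Parquet's RLE/bit-packed index encoding the index bit width grows as ceil(log2(NDV)). `DictEncode`, `DictEncoded` and `IndexBitWidth` are illustrative names, not Arrow APIs.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical result of dictionary-encoding one batch.
struct DictEncoded {
  std::vector<std::string> dictionary;  // distinct values, in first-seen order
  std::vector<int32_t> indices;         // one index per input value
};

DictEncoded DictEncode(const std::vector<std::string>& values) {
  DictEncoded out;
  std::unordered_map<std::string, int32_t> memo;  // value -> dictionary index
  out.indices.reserve(values.size());
  for (const std::string& v : values) {
    auto it = memo.find(v);
    if (it == memo.end()) {
      const auto idx = static_cast<int32_t>(out.dictionary.size());
      it = memo.emplace(v, idx).first;
      out.dictionary.push_back(v);
    }
    out.indices.push_back(it->second);
  }
  return out;
}

// Bits needed per index for a dictionary of `ndv` entries (0 when ndv <= 1).
int IndexBitWidth(size_t ndv) {
  int width = 0;
  while ((size_t{1} << width) < ndv) ++width;
  return width;
}
```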

mapleFU · 2023-02-28T03:49:14Z

It seems the batch size sometimes impacts performance a lot...

mapleFU · 2023-02-28T15:26:50Z

I guess that's expected: after applying #34336, they run much faster. Let me upload a flamegraph.

mapleFU · 2023-02-28T15:36:34Z

It spends a lot of time in unpack, because it uses BitReader and fetches bytes from the bit reader, which is a huge waste of time. @pitrou

So it's still expected
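For context on "unpack": Parquet bit-packs values LSB-first within each byte. The naive routine below extracts each value one bit at a time, which is the kind of per-value work that batch unpack routines are meant to avoid. It is a hypothetical sketch (`UnpackNaive`), not Arrow's BitReader or unpack code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical naive bit unpacking (LSB-first within each byte).
// Each output value is assembled one bit at a time.
std::vector<uint32_t> UnpackNaive(const std::vector<uint8_t>& packed,
                                  int bit_width, size_t num_values) {
  assert(bit_width >= 0 && bit_width <= 32);
  assert(packed.size() * 8 >= num_values * static_cast<size_t>(bit_width));
  std::vector<uint32_t> out(num_values, 0);
  size_t bit_pos = 0;
  for (size_t i = 0; i < num_values; ++i) {
    uint32_t value = 0;
    for (int b = 0; b < bit_width; ++b, ++bit_pos) {
      const uint32_t bit = (packed[bit_pos / 8] >> (bit_pos % 8)) & 1u;
      value |= bit << b;
    }
    out[i] = value;
  }
  return out;
}
```

With `bit_width = 3`, the bytes `0x88 0xC6 0xFA` unpack to 0 through 7, matching the bit-packing example in the Parquet encodings spec.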

wgtmac · 2023-02-28T16:23:30Z

IMHO, DeltaBitPackDecoder (which is used by DeltaLengthDecodingByteArray) has two possible optimizations.

The mini-block size is fixed. It could instead be chosen adaptively based on the data distribution, which in turn affects decoding time: https://github.com/apache/arrow/blob/main/cpp/src/parquet/encoding.cc#L2105

  static constexpr uint32_t kValuesPerBlock = 128;
  static constexpr uint32_t kMiniBlocksPerBlock = 4;
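The constants above imply 32 values per mini-block (128 / 4). In DELTA_BINARY_PACKED, each block stores a single min delta and each mini-block stores its own bit width, so the mini-block size sets how locally the bit width can adapt to the data. The sketch below computes those widths for one block; `MiniBlockBitWidths` and `BitsNeeded` are hypothetical helpers, and padding of the last mini-block is ignored.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Constants mirroring the ones quoted above; 128 / 4 = 32 values per mini-block.
constexpr uint32_t kValuesPerBlock = 128;
constexpr uint32_t kMiniBlocksPerBlock = 4;
constexpr uint32_t kValuesPerMiniBlock = kValuesPerBlock / kMiniBlocksPerBlock;

// Number of bits needed to represent `v` (0 for v == 0).
int BitsNeeded(uint64_t v) {
  int bits = 0;
  while (v != 0) { ++bits; v >>= 1; }
  return bits;
}

// Hypothetical sketch: subtract the block's min delta, then compute a
// separate bit width for each mini-block of the block.
std::vector<int> MiniBlockBitWidths(const std::vector<int64_t>& block_deltas) {
  assert(!block_deltas.empty() && block_deltas.size() <= kValuesPerBlock);
  const int64_t min_delta = *std::min_element(block_deltas.begin(), block_deltas.end());
  std::vector<int> widths;
  for (size_t start = 0; start < block_deltas.size(); start += kValuesPerMiniBlock) {
    const size_t end = std::min<size_t>(start + kValuesPerMiniBlock, block_deltas.size());
    uint64_t max_adjusted = 0;
    for (size_t i = start; i < end; ++i) {
      // Unsigned subtraction avoids signed overflow; the result is non-negative.
      const uint64_t adjusted =
          static_cast<uint64_t>(block_deltas[i]) - static_cast<uint64_t>(min_delta);
      max_adjusted = std::max(max_adjusted, adjusted);
    }
    widths.push_back(BitsNeeded(max_adjusted));
  }
  return widths;
}
```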

The codec computes deltas in the encoder and restores the values from them in the decoder. Both steps could be vectorized to speed up encoding and decoding:
https://github.com/apache/arrow/blob/main/cpp/src/parquet/encoding.cc#L2526
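A scalar sketch of the two loops being discussed, assuming plain `int64_t` values and ignoring the min-delta and bit-packing steps and overflow handling: the encoder's delta computation is independent per element, while the decoder's restore step is a prefix sum with a serial dependency, which is the harder part to vectorize. `ComputeDeltas` and `RestoreValues` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Encoder side: deltas[i - 1] = values[i] - values[i - 1]; the first value
// is stored separately. Each iteration is independent of the others.
std::vector<int64_t> ComputeDeltas(const std::vector<int64_t>& values) {
  std::vector<int64_t> deltas;
  for (size_t i = 1; i < values.size(); ++i) {
    deltas.push_back(values[i] - values[i - 1]);
  }
  return deltas;
}

// Decoder side: restoring values is an inclusive prefix sum seeded with the
// first value; each step depends on the previous one.
std::vector<int64_t> RestoreValues(int64_t first_value, const std::vector<int64_t>& deltas) {
  std::vector<int64_t> values;
  values.reserve(deltas.size() + 1);
  int64_t current = first_value;
  values.push_back(current);
  for (int64_t d : deltas) {
    current += d;
    values.push_back(current);
  }
  return values;
}
```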

@mapleFU @pitrou @rok

mapleFU · 2023-02-28T16:32:13Z

I've tested it here: #34323 (comment)

Let's do it patch by patch and get this benchmark merged first.

rok · 2023-02-28T23:17:35Z

Optimizations would be great but let's track them in separate issues and use ursabot commands (listed here https://ursalabs.org/blog/announcing-conbench/).
This looks good to me.

wgtmac · 2023-03-01T01:22:26Z

> Optimizations would be great but let's track them in separate issues and use ursabot commands (listed here https://ursalabs.org/blog/announcing-conbench/). This looks good to me.

Good to know we have the command in hand!

mapleFU · 2023-03-01T02:05:57Z

Waiting for pitrou to take a look and merge it.

mapleFU · 2023-03-06T02:36:26Z

Ping @pitrou :)

mapleFU · 2023-03-07T17:46:52Z

@wjones127 @rok It seems @pitrou doesn't have spare time. Should I wait for him, or for someone else to take a look at this patch?

cpp/src/parquet/encoding_benchmark.cc

rok

Looks good!

cpp/src/parquet/encoding_benchmark.cc

mapleFU · 2023-03-07T18:46:37Z

On my macOS machine, after changing "items processed" to count data + offsets:

---------------------------------------------------------------------------------------------------------------------------------------
Benchmark                                                                             Time             CPU   Iterations UserCounters...
---------------------------------------------------------------------------------------------------------------------------------------
BM_PlainEncodingByteArray/max-string-length:8/batch-size:512                       3659 ns         2745 ns       247512 bytes_per_second=1.38484G/s items_per_second=186.507M/s
BM_PlainEncodingByteArray/max-string-length:64/batch-size:512                      7606 ns         5272 ns       126288 bytes_per_second=3.2833G/s items_per_second=97.122M/s
BM_PlainEncodingByteArray/max-string-length:1024/batch-size:512                   85815 ns        62674 ns        12375 bytes_per_second=4.04123G/s items_per_second=8.1692M/s
BM_PlainEncodingByteArray/max-string-length:8/batch-size:2048                     13546 ns        12073 ns        51654 bytes_per_second=1.2649G/s items_per_second=169.637M/s
BM_PlainEncodingByteArray/max-string-length:64/batch-size:2048                    62743 ns        47676 ns        14871 bytes_per_second=1.42196G/s items_per_second=42.9562M/s
BM_PlainEncodingByteArray/max-string-length:1024/batch-size:2048                 239524 ns       208403 ns         3173 bytes_per_second=4.79258G/s items_per_second=9.82709M/s
BM_DeltaLengthEncodingByteArray/max-string-length:8/batch-size:512                 6249 ns         6093 ns       114491 bytes_per_second=638.957M/s items_per_second=84.0366M/s
BM_DeltaLengthEncodingByteArray/max-string-length:64/batch-size:512                8451 ns         7496 ns        98549 bytes_per_second=2.30901G/s items_per_second=68.3018M/s
BM_DeltaLengthEncodingByteArray/max-string-length:1024/batch-size:512             82397 ns        68114 ns         9823 bytes_per_second=3.7185G/s items_per_second=7.51683M/s
BM_DeltaLengthEncodingByteArray/max-string-length:8/batch-size:2048               26286 ns        25109 ns        27493 bytes_per_second=622.771M/s items_per_second=81.5632M/s
BM_DeltaLengthEncodingByteArray/max-string-length:64/batch-size:2048              55228 ns        52277 ns        13039 bytes_per_second=1.29682G/s items_per_second=39.176M/s
BM_DeltaLengthEncodingByteArray/max-string-length:1024/batch-size:2048           250357 ns       235479 ns         3064 bytes_per_second=4.24152G/s items_per_second=8.69715M/s
BM_PlainDecodingByteArray/max-string-length:8/batch-size:512                       1175 ns         1139 ns       621085 bytes_per_second=3.33662G/s items_per_second=449.369M/s
BM_PlainDecodingByteArray/max-string-length:64/batch-size:512                      1213 ns         1143 ns       612584 bytes_per_second=15.1485G/s items_per_second=448.102M/s
BM_PlainDecodingByteArray/max-string-length:1024/batch-size:512                    1167 ns         1140 ns       617573 bytes_per_second=222.27G/s items_per_second=449.311M/s
BM_PlainDecodingByteArray/max-string-length:8/batch-size:2048                      4436 ns         4248 ns       164415 bytes_per_second=3.59507G/s items_per_second=482.14M/s
BM_PlainDecodingByteArray/max-string-length:64/batch-size:2048                     4248 ns         4204 ns       156082 bytes_per_second=16.1272G/s items_per_second=487.192M/s
BM_PlainDecodingByteArray/max-string-length:1024/batch-size:2048                  13921 ns        13883 ns        49055 bytes_per_second=71.9418G/s items_per_second=147.515M/s
BM_DeltaLengthDecodingByteArray/max-string-length:8/batch-size:512                 1965 ns         1959 ns       356100 bytes_per_second=1.941G/s items_per_second=261.41M/s
BM_DeltaLengthDecodingByteArray/max-string-length:64/batch-size:512                5069 ns         5056 ns       139135 bytes_per_second=3.42333G/s items_per_second=101.264M/s
BM_DeltaLengthDecodingByteArray/max-string-length:1024/batch-size:512             84992 ns        82547 ns         8386 bytes_per_second=3.06834G/s items_per_second=6.20254M/s
BM_DeltaLengthDecodingByteArray/max-string-length:8/batch-size:2048                5911 ns         5891 ns       119511 bytes_per_second=2.59217G/s items_per_second=347.639M/s
BM_DeltaLengthDecodingByteArray/max-string-length:64/batch-size:2048              26904 ns        26252 ns        26404 bytes_per_second=2.58242G/s items_per_second=78.0131M/s
BM_DeltaLengthDecodingByteArray/max-string-length:1024/batch-size:2048           325519 ns       315624 ns         2253 bytes_per_second=3.16449G/s items_per_second=6.48873M/s
BM_PlainDecodingSpacedByteArray/max-string-length:8/batch-size:512                 1177 ns         1168 ns       607813 bytes_per_second=3.2343G/s items_per_second=438.381M/s null_percent=2
BM_PlainDecodingSpacedByteArray/max-string-length:64/batch-size:512                1257 ns         1187 ns       594091 bytes_per_second=14.3557G/s items_per_second=431.357M/s null_percent=2
BM_PlainDecodingSpacedByteArray/max-string-length:1024/batch-size:512              1182 ns         1174 ns       594813 bytes_per_second=212.096G/s items_per_second=436.195M/s null_percent=2
BM_PlainDecodingSpacedByteArray/max-string-length:8/batch-size:2048                4985 ns         4728 ns       149907 bytes_per_second=3.20624G/s items_per_second=433.189M/s null_percent=2
BM_PlainDecodingSpacedByteArray/max-string-length:64/batch-size:2048               4612 ns         4604 ns       150540 bytes_per_second=14.4954G/s items_per_second=444.8M/s null_percent=2
BM_PlainDecodingSpacedByteArray/max-string-length:1024/batch-size:2048            13125 ns        13113 ns        52987 bytes_per_second=74.9127G/s items_per_second=156.179M/s null_percent=2
BM_DeltaLengthDecodingSpacedByteArray/max-string-length:8/batch-size:512           1647 ns         1647 ns       418708 bytes_per_second=2.29284G/s items_per_second=310.775M/s null_percent=2
BM_DeltaLengthDecodingSpacedByteArray/max-string-length:64/batch-size:512          4548 ns         4547 ns       153156 bytes_per_second=3.74715G/s items_per_second=112.594M/s null_percent=2
BM_DeltaLengthDecodingSpacedByteArray/max-string-length:1024/batch-size:512       55928 ns        55923 ns        12517 bytes_per_second=4.45176G/s items_per_second=9.15545M/s null_percent=2
BM_DeltaLengthDecodingSpacedByteArray/max-string-length:8/batch-size:2048          5834 ns         5833 ns       120996 bytes_per_second=2.59892G/s items_per_second=351.135M/s null_percent=2
BM_DeltaLengthDecodingSpacedByteArray/max-string-length:64/batch-size:2048        17598 ns        17568 ns        39706 bytes_per_second=3.79911G/s items_per_second=116.578M/s null_percent=2
BM_DeltaLengthDecodingSpacedByteArray/max-string-length:1024/batch-size:2048     220807 ns       220793 ns         3181 bytes_per_second=4.44916G/s items_per_second=9.27566M/s null_percent=2
BM_DictDecodingByteArray/max-string-length:8/batch-size:512                        3380 ns         3380 ns       206645 bytes_per_second=1.12489G/s items_per_second=151.498M/s
BM_DictDecodingByteArray/max-string-length:64/batch-size:512                       4823 ns         4822 ns       145563 bytes_per_second=3.58935G/s items_per_second=106.175M/s
BM_DictDecodingByteArray/max-string-length:1024/batch-size:512                    32011 ns        31901 ns        21840 bytes_per_second=7.93962G/s items_per_second=16.0497M/s
BM_DictDecodingByteArray/max-string-length:8/batch-size:2048                      10664 ns        10618 ns        65184 bytes_per_second=1.43825G/s items_per_second=192.886M/s
BM_DictDecodingByteArray/max-string-length:64/batch-size:2048                     29204 ns        29138 ns        23638 bytes_per_second=2.32661G/s items_per_second=70.2853M/s
BM_DictDecodingByteArray/max-string-length:1024/batch-size:2048                  139016 ns       138991 ns         5120 bytes_per_second=7.18603G/s items_per_second=14.7348M/s
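To read the table above, note that `items_per_second` equals batch size divided by CPU time (for example, 512 / 2745 ns ≈ 186.5M/s in the first row). Normalizing CPU time per item then shows whether a benchmark scales linearly with batch size. The helpers below are hypothetical and only do arithmetic on numbers copied from the table.

```cpp
#include <cassert>

// Items per second implied by a batch size and a CPU time in nanoseconds.
double ItemsPerSecond(double batch_size, double cpu_ns) {
  return batch_size / (cpu_ns * 1e-9);
}

// CPU nanoseconds spent per item.
double NsPerItem(double batch_size, double cpu_ns) {
  return cpu_ns / batch_size;
}

// Per-item cost at the large batch relative to the small batch;
// a value near 1.0 means time grows linearly with batch size.
double ScalingRatio(double small_batch, double small_ns,
                    double large_batch, double large_ns) {
  return NsPerItem(large_batch, large_ns) / NsPerItem(small_batch, small_ns);
}
```

For `BM_PlainEncodingByteArray` with max length 8, the per-item ratio between batch sizes 2048 and 512 is about 1.1 (close to linear), while `BM_PlainDecodingByteArray` with max length 1024 gives about 3.0, consistent with the earlier observation that batch size sometimes impacts performance a lot.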

ursabot · 2023-03-08T18:35:43Z

Benchmark runs are scheduled for baseline = bd80051 and contender = 22f2980. 22f2980 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed ⬇️1.45% ⬆️0.06%] test-mac-arm
[Finished ⬇️0.26% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.25% ⬆️0.28%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] 22f2980e ec2-t3-xlarge-us-east-2
[Failed] 22f2980e test-mac-arm
[Finished] 22f2980e ursa-i9-9960x
[Finished] 22f2980e ursa-thinkcentre-m75q
[Finished] bd800515 ec2-t3-xlarge-us-east-2
[Failed] bd800515 test-mac-arm
[Finished] bd800515 ursa-i9-9960x
[Finished] bd800515 ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

ursabot · 2023-03-14T21:34:01Z

['Python', 'R'] benchmarks have high level of regressions.
test-mac-arm

…34955) ### Rationale for this change According to #34323 . DELTA_LENGTH_BYTE_ARRAY is much more slower. So do some optimizations. ### What changes are included in this PR? Some tiny changes ### Are these changes tested? ### Are there any user-facing changes? * Closes: #34335 Authored-by: mwish <[email protected]> Signed-off-by: Will Jones <[email protected]>


Parquet: Introduce basic benchmark

966e877

mapleFU requested a review from wjones127 as a code owner February 24, 2023 03:47

github-actions bot added Component: C++ Component: Parquet labels Feb 24, 2023

Merge branch 'main' into parquet/benchmark-delta-length-byte

2818762

tiny benchmark spaced

5320eec

mapleFU mentioned this pull request Feb 24, 2023

GH-34335: [C++][Parquet] Optimize Decoding DELTA_LENGTH_BYTE_ARRAY #34336

Closed

add stats

86290dd

mapleFU added 2 commits February 26, 2023 23:43

Merge branch 'main' into parquet/benchmark-delta-length-byte

213d4d0

[ADD] adding range for ByteArray benchmark

cac0875

wgtmac reviewed Feb 27, 2023


rok reviewed Feb 28, 2023


cpp/src/parquet/encoding_benchmark.cc (outdated, resolved)

rok requested changes Feb 28, 2023


mapleFU added 2 commits February 28, 2023 10:43

Merge branch 'main' into parquet/benchmark-delta-length-byte

2679cd3

add ndv and Dict encoding for ByteArray

e0e78cd

github-actions bot added the "awaiting changes" label and removed the "awaiting committer review" label Feb 28, 2023

mapleFU requested review from pitrou and removed request for wjones127 March 1, 2023 09:42

rok reviewed Mar 7, 2023


cpp/src/parquet/encoding_benchmark.cc (resolved)

rok approved these changes Mar 7, 2023


wjones127 approved these changes Mar 7, 2023


cpp/src/parquet/encoding_benchmark.cc (outdated, resolved)

github-actions bot added the "awaiting merge" label and removed the "awaiting changes" label Mar 7, 2023

mapleFU added 2 commits March 8, 2023 02:37

Merge branch 'main' into parquet/benchmark-delta-length-byte

71b0d9a

address comment

803de33

wjones127 merged commit 22f2980 into apache:main Mar 7, 2023

mapleFU deleted the parquet/benchmark-delta-length-byte branch March 8, 2023 02:12

mapleFU mentioned this pull request Apr 7, 2023

GH-34335: [C++][Parquet] Optimize Decoding DELTA_LENGTH_BYTE_ARRAY #34955

Merged
