Member

@mapleFU mapleFU commented Feb 24, 2023

Rationale for this change

After #14293, we have DELTA_LENGTH_BYTE_ARRAY for encoding ByteArray, so I'd like to add encoding benchmarks for it.

What changes are included in this PR?

Adds some benchmark cases.

Are these changes tested?

No

Are there any user-facing changes?

No

@mapleFU mapleFU requested a review from wjones127 as a code owner February 24, 2023 03:47
@github-actions

⚠️ GitHub issue #34322 has been automatically assigned in GitHub to PR creator.

Member Author

mapleFU commented Feb 24, 2023

/cc @rok

Member

wgtmac commented Feb 24, 2023

@mapleFU Any benchmark data?

Member Author

mapleFU commented Feb 24, 2023

I'll test it on my mac and PC tonight :)

Member

rok commented Feb 24, 2023

This looks pretty good.
Do you think it would make sense to add a benchmark with some nulls?

Member Author

mapleFU commented Feb 24, 2023

Spaced encoding will be added. By the way, I wonder whether it's valuable to add benchmarks for strings with different lengths.

Also, I think it's tricky: DELTA_LENGTH would be powerful when compression is enabled. However, currently we don't test compression in the encoding benchmark.
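
For context on why DELTA_LENGTH may pair well with compression: the encoding stores all lengths together, followed by the value bytes back to back, while PLAIN puts a 4-byte length in front of every value. The following is only a minimal sketch of the two layouts. `DeltaLengthPage`, `SplitLengthsAndData`, and `PlainLayout` are hypothetical names, and the lengths are kept as plain integers here rather than being DELTA_BINARY_PACKED as real Parquet would do.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical container: lengths and value bytes live in two separate runs.
struct DeltaLengthPage {
  std::vector<int32_t> lengths;  // real Parquet would delta-bit-pack these
  std::string data;              // all value bytes, concatenated
};

// Splits the values into a run of lengths and a run of raw bytes.
DeltaLengthPage SplitLengthsAndData(const std::vector<std::string>& values) {
  DeltaLengthPage page;
  page.lengths.reserve(values.size());
  for (const std::string& v : values) {
    page.lengths.push_back(static_cast<int32_t>(v.size()));
    page.data += v;
  }
  return page;
}

// PLAIN-style layout for comparison: a 4-byte little-endian length
// followed by the bytes, repeated for every value.
std::string PlainLayout(const std::vector<std::string>& values) {
  std::string out;
  for (const std::string& v : values) {
    const uint32_t len = static_cast<uint32_t>(v.size());
    for (int shift = 0; shift < 32; shift += 8) {
      out.push_back(static_cast<char>((len >> shift) & 0xFF));
    }
    out += v;
  }
  return out;
}
```

In the split layout the text bytes form one uninterrupted run, which is the property a general-purpose compressor might exploit; the sketch does not measure that effect.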

Member

rok commented Feb 24, 2023

You are currently testing strings of length 0 - 1024 with a uniform length distribution. Longer ones could be interesting, or a range from, say, 4 - 20, or all strings of length 8. Or maybe a mix of 95% uniformly distributed over 0 - 10 and 5% uniformly distributed over 500 - 1000. Perhaps you can try these locally and only include the benchmarks that give interesting, different results.

> By the way, I think it's tricky: DELTA_LENGTH would be powerful when compression is enabled. However, currently we don't test compression in the encoding benchmark.

Maybe that's worth opening another issue?
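
The mixed length distribution suggested above (mostly short strings plus a small share of long ones) could be generated roughly as follows. This is a standalone sketch using only the standard library, not the Arrow random generator the benchmark uses; `MixedLengths` and its parameters are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Draws `count` string lengths. Each length comes from [500, 1000] with
// probability `long_fraction`, and from [0, 10] otherwise.
std::vector<int32_t> MixedLengths(std::size_t count, double long_fraction,
                                  uint32_t seed) {
  std::mt19937 gen(seed);
  std::bernoulli_distribution pick_long(long_fraction);
  std::uniform_int_distribution<int32_t> short_len(0, 10);
  std::uniform_int_distribution<int32_t> long_len(500, 1000);

  std::vector<int32_t> lengths;
  lengths.reserve(count);
  for (std::size_t i = 0; i < count; ++i) {
    lengths.push_back(pick_long(gen) ? long_len(gen) : short_len(gen));
  }
  return lengths;
}
```

Calling `MixedLengths(1024, 0.05, seed)` gives the 95% / 5% mix; a fixed seed keeps benchmark inputs reproducible between runs.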

Member Author

mapleFU commented Feb 24, 2023

Okay, I'd like to add a spaced test and submit a report from different machines tonight.

By the way, I'd like to test whether #14293 (comment) would boost encoder performance.

Member Author

mapleFU commented Feb 24, 2023

To be honest, I found that decoding DELTA_LENGTH_BYTE_ARRAY is much slower than I expected...

After some trivial optimizations to the decoder, the two run at about the same speed. The previous implementation was too slow.

Member Author

mapleFU commented Feb 24, 2023

On my macOS machine, Release build (O2), with the default memory allocator:

---------------------------------------------------------------------------------------------------
Benchmark                                         Time             CPU   Iterations UserCounters...
---------------------------------------------------------------------------------------------------
BM_PlainEncodingByteArray                    113427 ns       111588 ns         6496 byte_array_bytes=3.43905G items_per_second=9.17665M/s
BM_DeltaBitLengthEncodingByteArray           131604 ns       127634 ns         5368 byte_array_bytes=2.84188G items_per_second=8.02297M/s
BM_PlainDecodingByteArray                      3581 ns         3499 ns       193420 byte_array_bytes=102.399G items_per_second=292.683M/s
BM_DeltaBitLengthDecodingByteArray           183808 ns       180761 ns         3913 byte_array_bytes=2.07159G items_per_second=5.66495M/s
BM_PlainDecodingSpacedByteArray                3519 ns         3513 ns       191379 byte_array_bytes=99.4606G items_per_second=291.501M/s null_percent=2
BM_DeltaBitLengthDecodingSpacedByteArray     131179 ns       130943 ns         5341 byte_array_bytes=2.77574G items_per_second=7.8202M/s null_percent=2

And after #34336 :

---------------------------------------------------------------------------------------------------
Benchmark                                         Time             CPU   Iterations UserCounters...
---------------------------------------------------------------------------------------------------
BM_PlainEncodingByteArray                    101366 ns        85499 ns         8185 byte_array_bytes=4.33323G items_per_second=11.9767M/s
BM_DeltaBitLengthEncodingByteArray           145266 ns       130694 ns         5332 byte_array_bytes=2.82282G items_per_second=7.83507M/s
BM_PlainDecodingByteArray                      3704 ns         3577 ns       185523 byte_array_bytes=98.2179G items_per_second=286.31M/s
BM_DeltaBitLengthDecodingByteArray             2666 ns         2649 ns       266830 byte_array_bytes=141.263G items_per_second=386.623M/s
BM_PlainDecodingSpacedByteArray                3537 ns         3532 ns       202847 byte_array_bytes=105.421G items_per_second=289.898M/s null_percent=2
BM_DeltaBitLengthDecodingSpacedByteArray       2752 ns         2705 ns       259968 byte_array_bytes=135.107G items_per_second=378.521M/s null_percent=2

Member Author

mapleFU commented Feb 24, 2023

After using UnsafeAppend for contiguous strings, on my macOS machine with Release (O2):

BM_PlainEncodingByteArray                    104785 ns       104771 ns         6608 byte_array_bytes=3.49835G items_per_second=9.77369M/s
BM_DeltaBitLengthEncodingByteArray           104361 ns       104324 ns         7083 byte_array_bytes=3.74982G items_per_second=9.81556M/s
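
The exact encoder change is not shown in this thread. As a general illustration of the reserve-once, append-unchecked pattern that an `UnsafeAppend` call implies, here is a toy sketch. `ToySink` and `ConcatReserved` are hypothetical and are not Arrow classes; only the method names mirror the builder style.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

// Toy byte sink, for illustration only.
class ToySink {
 public:
  // Ensures room for `extra` more bytes beyond the current size.
  void Reserve(std::size_t extra) {
    if (size_ + extra > buf_.size()) buf_.resize(size_ + extra);
  }
  // Checked append: pays for a capacity check on every call.
  void Append(const char* p, std::size_t n) {
    Reserve(n);
    UnsafeAppend(p, n);
  }
  // Unchecked append: the caller must have reserved enough space.
  void UnsafeAppend(const char* p, std::size_t n) {
    if (n == 0) return;
    std::memcpy(buf_.data() + size_, p, n);
    size_ += n;
  }
  std::string ToString() const {
    return std::string(buf_.begin(),
                       buf_.begin() + static_cast<std::ptrdiff_t>(size_));
  }

 private:
  std::vector<char> buf_;
  std::size_t size_ = 0;
};

// Sums the sizes first, reserves once, then appends without checks.
std::string ConcatReserved(const std::vector<std::string>& values) {
  std::size_t total = 0;
  for (const std::string& v : values) total += v.size();
  ToySink sink;
  sink.Reserve(total);
  for (const std::string& v : values) sink.UnsafeAppend(v.data(), v.size());
  return sink.ToString();
}
```

The trade-off is one extra pass to compute the total size in exchange for removing the per-value capacity check from the hot loop.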

Member Author

mapleFU commented Feb 27, 2023

@pitrou @rok Mind taking a look?

// Using arrow generator to generate random data.
int32_t max_length = state.range(0);
auto array =
rag.String(/* size */ 1024, /* min_length */ 0, /* max_length */ max_length,
Member

Should the size (1024 here) be adjustable? Reasonable options may include 256, 512, 4096.

Member

I'd add different options only if there is nonlinear behavior (time complexity != O(n)).

Member

Sometimes we need to adaptively determine a best batch size in the compute engine. Providing different batch sizes may give us better visibility on the encoding side. I suspect it will demonstrate a linear behavior as it has barriers including the block size or encoding pattern. @rok

Member

But how many do we need to catch performance regressions?

Member

@rok rok left a comment

I'll do another pass tomorrow, but except for the CI issue it looks pretty good.

Member Author

mapleFU commented Feb 28, 2023

@wgtmac @rok

  1. Dict decoding for ByteArray is added, but the current benchmark has no NDV (number of distinct values) parameter. It seems I can add an NDV parameter for the dict benchmark in the future.
  2. Different batch sizes are added.

Member Author

mapleFU commented Feb 28, 2023

It seems the batch size sometimes impacts the performance a lot...

Member Author

mapleFU commented Feb 28, 2023

I guess that's expected after applying #34336. They run much faster. Let me upload a flamegraph.

Member Author

mapleFU commented Feb 28, 2023

[Image: flamegraph]

It spends a lot of time in unpack, because it uses BitReader and fetches bytes through the bit reader, which is a huge waste of time. @pitrou

So it's still expected
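
One way to avoid reading value bytes through a bit reader, sketched under the assumption that the lengths have already been decoded and the value bytes sit in one contiguous buffer, is to slice the buffer by running offset. This is not necessarily what #34336 does; `SliceValues` is a hypothetical helper.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string_view>
#include <vector>

// Builds zero-copy views into `data`, one per decoded length.
// Returns an empty vector if the lengths do not fit the buffer
// (a real decoder would report an error instead).
std::vector<std::string_view> SliceValues(const std::vector<int32_t>& lengths,
                                          std::string_view data) {
  std::vector<std::string_view> out;
  out.reserve(lengths.size());
  std::size_t offset = 0;
  for (int32_t len : lengths) {
    if (len < 0) return {};
    const std::size_t n = static_cast<std::size_t>(len);
    if (offset + n > data.size()) return {};
    out.push_back(data.substr(offset, n));  // no byte-by-byte copy
    offset += n;
  }
  return out;
}
```

Each value costs one bounds check and one pointer/length pair, independent of how many bytes it holds.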

Member

wgtmac commented Feb 28, 2023

IMHO, DeltaBitPackDecoder (which is used by DeltaLengthDecodingByteArray) has two possible optimizations.

  1. The mini-block size is fixed. It could be chosen adaptively based on the data distribution, which in turn can affect the decoding time: https://github.com/apache/arrow/blob/main/cpp/src/parquet/encoding.cc#L2105
       static constexpr uint32_t kValuesPerBlock = 128;
       static constexpr uint32_t kMiniBlocksPerBlock = 4;
  2. It involves a procedure to compute deltas in the encoder and restore the values in the decoder. This can be vectorized to speed up encoding and decoding.
     https://github.com/apache/arrow/blob/main/cpp/src/parquet/encoding.cc#L2526

@mapleFU @pitrou @rok
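
The delta computation and restoration mentioned in the second point can be sketched in scalar form as below. This is a simplified illustration, not the code in `encoding.cc`: it skips the min-delta subtraction and bit packing, assumes the arithmetic does not overflow, and `ComputeDeltas` and `RestoreValues` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Encoder side: delta[i] = values[i + 1] - values[i].
// Every iteration is independent, so this loop is easy to vectorize.
std::vector<int64_t> ComputeDeltas(const std::vector<int64_t>& values) {
  std::vector<int64_t> deltas;
  if (values.size() < 2) return deltas;
  deltas.reserve(values.size() - 1);
  for (std::size_t i = 1; i < values.size(); ++i) {
    deltas.push_back(values[i] - values[i - 1]);
  }
  return deltas;
}

// Decoder side: a running sum starting from the first value.
// Each output depends on the previous one, so vectorizing it
// needs a prefix-sum scheme rather than a plain element-wise loop.
std::vector<int64_t> RestoreValues(int64_t first,
                                   const std::vector<int64_t>& deltas) {
  std::vector<int64_t> values;
  values.reserve(deltas.size() + 1);
  values.push_back(first);
  for (int64_t d : deltas) values.push_back(values.back() + d);
  return values;
}
```

For the input 7, 5, 3, 1, 2, 3, 4, 5 the deltas are -2, -2, -2, 1, 1, 1, 1, and restoring from the first value 7 reproduces the input.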

Member Author

mapleFU commented Feb 28, 2023

I've tested it here: #34323 (comment)

Let us do it patch by patch and make this benchmark merged first.

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting committer review Awaiting committer review labels Feb 28, 2023
Member

rok commented Feb 28, 2023

Optimizations would be great but let's track them in separate issues and use ursabot commands (listed here https://ursalabs.org/blog/announcing-conbench/).
This looks good to me.

Member

wgtmac commented Mar 1, 2023

> Optimizations would be great but let's track them in separate issues and use ursabot commands (listed here https://ursalabs.org/blog/announcing-conbench/). This looks good to me.

Good to know we have the command in hand!

Member Author

mapleFU commented Mar 1, 2023

Waiting for pitrou to take a look and merge it.

@mapleFU mapleFU requested review from pitrou and removed request for wjones127 March 1, 2023 09:42
Member Author

mapleFU commented Mar 6, 2023

Ping @pitrou :)

Member Author

mapleFU commented Mar 7, 2023

@wjones127 @rok It seems @pitrou doesn't have spare time. Should I wait for him, or wait for others to take a look at this patch?

Member

@rok rok left a comment

Looks good!

@github-actions github-actions bot added awaiting merge Awaiting merge and removed awaiting changes Awaiting changes labels Mar 7, 2023
Member Author

mapleFU commented Mar 7, 2023

On my macOS machine, after changing what is counted as processed to data + offset:

---------------------------------------------------------------------------------------------------------------------------------------
Benchmark                                                                             Time             CPU   Iterations UserCounters...
---------------------------------------------------------------------------------------------------------------------------------------
BM_PlainEncodingByteArray/max-string-length:8/batch-size:512                       3659 ns         2745 ns       247512 bytes_per_second=1.38484G/s items_per_second=186.507M/s
BM_PlainEncodingByteArray/max-string-length:64/batch-size:512                      7606 ns         5272 ns       126288 bytes_per_second=3.2833G/s items_per_second=97.122M/s
BM_PlainEncodingByteArray/max-string-length:1024/batch-size:512                   85815 ns        62674 ns        12375 bytes_per_second=4.04123G/s items_per_second=8.1692M/s
BM_PlainEncodingByteArray/max-string-length:8/batch-size:2048                     13546 ns        12073 ns        51654 bytes_per_second=1.2649G/s items_per_second=169.637M/s
BM_PlainEncodingByteArray/max-string-length:64/batch-size:2048                    62743 ns        47676 ns        14871 bytes_per_second=1.42196G/s items_per_second=42.9562M/s
BM_PlainEncodingByteArray/max-string-length:1024/batch-size:2048                 239524 ns       208403 ns         3173 bytes_per_second=4.79258G/s items_per_second=9.82709M/s
BM_DeltaLengthEncodingByteArray/max-string-length:8/batch-size:512                 6249 ns         6093 ns       114491 bytes_per_second=638.957M/s items_per_second=84.0366M/s
BM_DeltaLengthEncodingByteArray/max-string-length:64/batch-size:512                8451 ns         7496 ns        98549 bytes_per_second=2.30901G/s items_per_second=68.3018M/s
BM_DeltaLengthEncodingByteArray/max-string-length:1024/batch-size:512             82397 ns        68114 ns         9823 bytes_per_second=3.7185G/s items_per_second=7.51683M/s
BM_DeltaLengthEncodingByteArray/max-string-length:8/batch-size:2048               26286 ns        25109 ns        27493 bytes_per_second=622.771M/s items_per_second=81.5632M/s
BM_DeltaLengthEncodingByteArray/max-string-length:64/batch-size:2048              55228 ns        52277 ns        13039 bytes_per_second=1.29682G/s items_per_second=39.176M/s
BM_DeltaLengthEncodingByteArray/max-string-length:1024/batch-size:2048           250357 ns       235479 ns         3064 bytes_per_second=4.24152G/s items_per_second=8.69715M/s
BM_PlainDecodingByteArray/max-string-length:8/batch-size:512                       1175 ns         1139 ns       621085 bytes_per_second=3.33662G/s items_per_second=449.369M/s
BM_PlainDecodingByteArray/max-string-length:64/batch-size:512                      1213 ns         1143 ns       612584 bytes_per_second=15.1485G/s items_per_second=448.102M/s
BM_PlainDecodingByteArray/max-string-length:1024/batch-size:512                    1167 ns         1140 ns       617573 bytes_per_second=222.27G/s items_per_second=449.311M/s
BM_PlainDecodingByteArray/max-string-length:8/batch-size:2048                      4436 ns         4248 ns       164415 bytes_per_second=3.59507G/s items_per_second=482.14M/s
BM_PlainDecodingByteArray/max-string-length:64/batch-size:2048                     4248 ns         4204 ns       156082 bytes_per_second=16.1272G/s items_per_second=487.192M/s
BM_PlainDecodingByteArray/max-string-length:1024/batch-size:2048                  13921 ns        13883 ns        49055 bytes_per_second=71.9418G/s items_per_second=147.515M/s
BM_DeltaLengthDecodingByteArray/max-string-length:8/batch-size:512                 1965 ns         1959 ns       356100 bytes_per_second=1.941G/s items_per_second=261.41M/s
BM_DeltaLengthDecodingByteArray/max-string-length:64/batch-size:512                5069 ns         5056 ns       139135 bytes_per_second=3.42333G/s items_per_second=101.264M/s
BM_DeltaLengthDecodingByteArray/max-string-length:1024/batch-size:512             84992 ns        82547 ns         8386 bytes_per_second=3.06834G/s items_per_second=6.20254M/s
BM_DeltaLengthDecodingByteArray/max-string-length:8/batch-size:2048                5911 ns         5891 ns       119511 bytes_per_second=2.59217G/s items_per_second=347.639M/s
BM_DeltaLengthDecodingByteArray/max-string-length:64/batch-size:2048              26904 ns        26252 ns        26404 bytes_per_second=2.58242G/s items_per_second=78.0131M/s
BM_DeltaLengthDecodingByteArray/max-string-length:1024/batch-size:2048           325519 ns       315624 ns         2253 bytes_per_second=3.16449G/s items_per_second=6.48873M/s
BM_PlainDecodingSpacedByteArray/max-string-length:8/batch-size:512                 1177 ns         1168 ns       607813 bytes_per_second=3.2343G/s items_per_second=438.381M/s null_percent=2
BM_PlainDecodingSpacedByteArray/max-string-length:64/batch-size:512                1257 ns         1187 ns       594091 bytes_per_second=14.3557G/s items_per_second=431.357M/s null_percent=2
BM_PlainDecodingSpacedByteArray/max-string-length:1024/batch-size:512              1182 ns         1174 ns       594813 bytes_per_second=212.096G/s items_per_second=436.195M/s null_percent=2
BM_PlainDecodingSpacedByteArray/max-string-length:8/batch-size:2048                4985 ns         4728 ns       149907 bytes_per_second=3.20624G/s items_per_second=433.189M/s null_percent=2
BM_PlainDecodingSpacedByteArray/max-string-length:64/batch-size:2048               4612 ns         4604 ns       150540 bytes_per_second=14.4954G/s items_per_second=444.8M/s null_percent=2
BM_PlainDecodingSpacedByteArray/max-string-length:1024/batch-size:2048            13125 ns        13113 ns        52987 bytes_per_second=74.9127G/s items_per_second=156.179M/s null_percent=2
BM_DeltaLengthDecodingSpacedByteArray/max-string-length:8/batch-size:512           1647 ns         1647 ns       418708 bytes_per_second=2.29284G/s items_per_second=310.775M/s null_percent=2
BM_DeltaLengthDecodingSpacedByteArray/max-string-length:64/batch-size:512          4548 ns         4547 ns       153156 bytes_per_second=3.74715G/s items_per_second=112.594M/s null_percent=2
BM_DeltaLengthDecodingSpacedByteArray/max-string-length:1024/batch-size:512       55928 ns        55923 ns        12517 bytes_per_second=4.45176G/s items_per_second=9.15545M/s null_percent=2
BM_DeltaLengthDecodingSpacedByteArray/max-string-length:8/batch-size:2048          5834 ns         5833 ns       120996 bytes_per_second=2.59892G/s items_per_second=351.135M/s null_percent=2
BM_DeltaLengthDecodingSpacedByteArray/max-string-length:64/batch-size:2048        17598 ns        17568 ns        39706 bytes_per_second=3.79911G/s items_per_second=116.578M/s null_percent=2
BM_DeltaLengthDecodingSpacedByteArray/max-string-length:1024/batch-size:2048     220807 ns       220793 ns         3181 bytes_per_second=4.44916G/s items_per_second=9.27566M/s null_percent=2
BM_DictDecodingByteArray/max-string-length:8/batch-size:512                        3380 ns         3380 ns       206645 bytes_per_second=1.12489G/s items_per_second=151.498M/s
BM_DictDecodingByteArray/max-string-length:64/batch-size:512                       4823 ns         4822 ns       145563 bytes_per_second=3.58935G/s items_per_second=106.175M/s
BM_DictDecodingByteArray/max-string-length:1024/batch-size:512                    32011 ns        31901 ns        21840 bytes_per_second=7.93962G/s items_per_second=16.0497M/s
BM_DictDecodingByteArray/max-string-length:8/batch-size:2048                      10664 ns        10618 ns        65184 bytes_per_second=1.43825G/s items_per_second=192.886M/s
BM_DictDecodingByteArray/max-string-length:64/batch-size:2048                     29204 ns        29138 ns        23638 bytes_per_second=2.32661G/s items_per_second=70.2853M/s
BM_DictDecodingByteArray/max-string-length:1024/batch-size:2048                  139016 ns       138991 ns         5120 bytes_per_second=7.18603G/s items_per_second=14.7348M/s
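
The counters in the table above can be cross-checked with a little arithmetic. The sketch below assumes that `items_per_second` is the batch size divided by the CPU time of one iteration, and that the `G` in `bytes_per_second` means 2^30 bytes; both are assumptions about how the benchmark library reports rates, and the helper names are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Items per second implied by one iteration's CPU time in nanoseconds.
double ItemsPerSecond(int64_t batch_size, double cpu_ns) {
  return static_cast<double>(batch_size) / (cpu_ns * 1e-9);
}

// Average bytes counted per item, derived from a reported rate in GiB/s
// (assumption: 1 G = 2^30 bytes).
double BytesPerItem(double gib_per_second, double cpu_ns, int64_t batch_size) {
  const double bytes_per_iteration =
      gib_per_second * 1024.0 * 1024.0 * 1024.0 * cpu_ns * 1e-9;
  return bytes_per_iteration / static_cast<double>(batch_size);
}
```

For the `BM_PlainDecodingByteArray/max-string-length:8/batch-size:512` row (CPU 1139 ns, 3.33662G/s), `ItemsPerSecond(512, 1139)` lands within about 0.1% of the reported 449.369M/s, and `BytesPerItem(3.33662, 1139, 512)` comes out at roughly 8 bytes. That would be consistent with an average of 4 data bytes plus a 4-byte offset per value, although the thread does not spell out the offset width.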

@wjones127 wjones127 merged commit 22f2980 into apache:main Mar 7, 2023
@mapleFU mapleFU deleted the parquet/benchmark-delta-length-byte branch March 8, 2023 02:12

ursabot commented Mar 8, 2023

Benchmark runs are scheduled for baseline = bd80051 and contender = 22f2980. 22f2980 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed ⬇️1.45% ⬆️0.06%] test-mac-arm
[Finished ⬇️0.26% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.25% ⬆️0.28%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] 22f2980e ec2-t3-xlarge-us-east-2
[Failed] 22f2980e test-mac-arm
[Finished] 22f2980e ursa-i9-9960x
[Finished] 22f2980e ursa-thinkcentre-m75q
[Finished] bd800515 ec2-t3-xlarge-us-east-2
[Failed] bd800515 test-mac-arm
[Finished] bd800515 ursa-i9-9960x
[Finished] bd800515 ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java


ursabot commented Mar 14, 2023

['Python', 'R'] benchmarks have high level of regressions.
test-mac-arm

wjones127 pushed a commit that referenced this pull request Apr 8, 2023
…34955)

### Rationale for this change

According to #34323, DELTA_LENGTH_BYTE_ARRAY is much slower, so this change makes some optimizations.

### What changes are included in this PR?

Some tiny changes

### Are these changes tested?

### Are there any user-facing changes?

* Closes: #34335

Authored-by: mwish <[email protected]>
Signed-off-by: Will Jones <[email protected]>
liujiacheng777 pushed a commit to LoongArch-Python/arrow that referenced this pull request May 11, 2023
…RAY (apache#34955)
ArgusLi pushed a commit to Bit-Quill/arrow that referenced this pull request May 15, 2023
…RAY (apache#34955)
Development

Successfully merging this pull request may close these issues.

[C++][Parquet] Add benchmarks for micro benchmark for DELTA_LENGTH_BYTE_ARRAY