GH-33024: [C++][Parquet] Add DELTA_LENGTH_BYTE_ARRAY encoder to Parquet writer #14293
Conversation
This is not really review-ready yet and needs #14191 to merge first.
mapleFU
left a comment
I'm not so familiar with Arrow; here is just some advice.
Thanks for the review @mapleFU!
Thanks for the review @wgtmac, could you please do another pass of the open questions?
cpp/src/parquet/encoding.cc
Uh? You can't blindly cast the offsets buffer of an Arrow binary array to a ByteArray*...
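For illustration: a safe conversion works per element rather than by reinterpreting buffers, since an Arrow binary array holds an offsets buffer plus a values buffer, while `parquet::ByteArray` is a `{length, pointer}` view. A minimal sketch (variable names are assumed, not the actual implementation):

```cpp
// Sketch: build ByteArray views element by element; a reinterpret_cast of the
// offsets buffer would be wrong because the two layouts are unrelated.
const auto& strings =
    ::arrow::internal::checked_cast<const ::arrow::StringArray&>(values);
std::vector<ByteArray> out;
out.reserve(strings.length());
for (int64_t i = 0; i < strings.length(); ++i) {
  if (strings.IsNull(i)) continue;
  auto view = strings.GetView(i);  // view into the array's values buffer
  out.emplace_back(static_cast<uint32_t>(view.size()),
                   reinterpret_cast<const uint8_t*>(view.data()));
}
encoder->Put(out.data(), static_cast<int>(out.size()));
```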
I presume you don't have any tests for this?
I suppose I'll be adding them :D.
FYI this function isn't hit by the tests in parquet-internals-test. One thing worth doing is attaching a debugger and setting a breakpoint in each new branch added in your changes, then verifying each breakpoint is hit. It's not perfect, but it gives you a sort of basic code coverage. :)
> It's not perfect, but it gives you a sort of basic code coverage. :)
Oh, cool!
The data type in the arrow::Array (String/LargeString/Binary/LargeBinary) is actually std::string_view. Casting vector of std::string_view to parquet::ByteArray makes it easy to reuse the code of Put(const T* src, int num_values) but it looks inefficient to me.
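One possible alternative that avoids materializing a temporary vector is to append bytes and lengths directly while iterating the array; a rough sketch, where `sink_` and `length_encoder_` are assumed member names rather than the actual implementation:

```cpp
// Rough sketch of encoding an Arrow binary-like array directly: append each
// value's bytes to the data sink and its length to the DELTA_BINARY_PACKED
// length encoder, with no intermediate vector of converted values.
template <typename ArrayType>
void PutBinaryArray(const ArrayType& array) {
  for (int64_t i = 0; i < array.length(); ++i) {
    if (array.IsNull(i)) continue;
    auto view = array.GetView(i);  // string_view into the values buffer
    const int32_t len = static_cast<int32_t>(view.size());
    length_encoder_.Put(&len, 1);  // hypothetical member encoder for lengths
    PARQUET_THROW_NOT_OK(
        sink_.Append(reinterpret_cast<const uint8_t*>(view.data()),
                     static_cast<int64_t>(view.size())));  // assumed member
  }
}
```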
cpp/src/parquet/encoding.cc
Why not accept other types such as String, LargeString...? See https://github.com/apache/arrow/blob/1c3b7d72c92989cea6af78f6a46938b0315d97f6/cpp/src/parquet/encoding.cc#L248 for example.
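For context, dispatch over the four binary-like types might look roughly like this (reusing the hypothetical `PutBinaryArray` helper sketched earlier):

```cpp
// Sketch of accepting all four Arrow binary-like types in Put(const ::arrow::Array&).
switch (values.type_id()) {
  case ::arrow::Type::BINARY:
    PutBinaryArray(::arrow::internal::checked_cast<const ::arrow::BinaryArray&>(values));
    break;
  case ::arrow::Type::STRING:
    PutBinaryArray(::arrow::internal::checked_cast<const ::arrow::StringArray&>(values));
    break;
  case ::arrow::Type::LARGE_BINARY:
    PutBinaryArray(::arrow::internal::checked_cast<const ::arrow::LargeBinaryArray&>(values));
    break;
  case ::arrow::Type::LARGE_STRING:
    PutBinaryArray(::arrow::internal::checked_cast<const ::arrow::LargeStringArray&>(values));
    break;
  default:
    throw ParquetException("Only binary-like Arrow arrays are supported");
}
```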
Changed. Please check if it makes sense.
wjones127
left a comment
Added some suggestions for further testing :)
pitrou
left a comment
Thanks for the update @rok ! This is starting to look good :-)
cpp/src/parquet/encoding_test.cc
```cpp
auto CheckSeed = [&](int seed, int64_t size) {
  ::arrow::random::RandomArrayGenerator rag(seed);
  auto values = rag.String(size, min_length, max_length, null_probability);
```
We should be a bit more thorough and check that it works for all four binary types: Binary, String, LargeBinary and LargeString.
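A sketch of such coverage, with `CheckRoundtrip` as a hypothetical helper:

```cpp
// Generate random data for each binary-like type and round-trip it.
for (const auto& type : {::arrow::binary(), ::arrow::utf8(),
                         ::arrow::large_binary(), ::arrow::large_utf8()}) {
  ::arrow::random::RandomArrayGenerator rag(seed);
  auto values = rag.ArrayOf(type, size, null_probability);
  CheckRoundtrip(*values);  // hypothetical: encode, decode, compare
}
```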
I added a roundtrip test and an encode/decode check. The issue here is that the physical type decoded with EncodingTraits<ByteArrayType> will always have a 32-bit index (BinaryArray/StringArray), while we might be expecting a 64-bit index (LargeBinaryArray/LargeStringArray) back. Returning a 32-bit index kind of makes sense since that is what is encoded and it avoids additional compute cost. I'm adjusting with a cast in the test, but we might want to change the decoder to cast at decode time?
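The test-side adjustment could be a cast of the decoded array before comparison; a sketch using `arrow::compute::Cast`, where `decoded` and `expected` are placeholders:

```cpp
#include "arrow/compute/cast.h"

// The decoder produced a 32-bit-offset StringArray; widen it to LargeString
// so it can be compared against the LargeString input.
ASSERT_OK_AND_ASSIGN(auto decoded_large,
                     ::arrow::compute::Cast(*decoded, ::arrow::large_utf8()));
::arrow::AssertArraysEqual(*expected, *decoded_large);
```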
Co-authored-by: Antoine Pitrou <[email protected]>
cpp/src/parquet/encoding_test.cc
Correct me if I'm wrong, but it doesn't seem like there is a test that looks at the actual values produced, only that it round-trips correctly. Could we add at least a simple test that verifies the values?
For example (this is somewhat pseudo-code):
```cpp
TEST(DeltaLengthByteArrayEncodingAdHoc, Example) {
  auto values = ::arrow::ArrayFromJSON(
      ::arrow::utf8(), R"(["Hello", "World", "Foobar", "ADBCEF"])");
  auto lengths = ::arrow::ArrayFromJSON(::arrow::int32(), R"([5, 5, 6, 6])");

  auto encoder = MakeTypedEncoder<ByteArrayType>(Encoding::DELTA_LENGTH_BYTE_ARRAY);
  ASSERT_NO_THROW(encoder->Put(*values));
  auto buf = encoder->FlushValues();

  auto lengths_encoder = MakeTypedEncoder<Int32Type>(Encoding::DELTA_BINARY_PACKED);
  ASSERT_NO_THROW(lengths_encoder->Put(*lengths));
  auto lengths_buf = lengths_encoder->FlushValues();

  // The output should be the DELTA_BINARY_PACKED lengths followed by the
  // concatenated string data.
  ASSERT_EQ(0, memcmp(buf->data(), lengths_buf->data(), lengths_buf->size()));
  ASSERT_EQ(std::string(reinterpret_cast<const char*>(buf->data()) + lengths_buf->size(),
                        buf->size() - lengths_buf->size()),
            "HelloWorldFoobarADBCEF");
}
```
@rok do you have any tests that validate the actual encoded values (not just that they round-trip within our implementation)?
I added approximately what you suggested :)
Does this need another round of reviews, or can I slowly merge it?
wgtmac
left a comment
Looking good on my side. Thanks @rok
mapleFU
left a comment
I think it's ok, waiting for it to be merged.
cpp/src/parquet/encoding_test.cc
No need to reopen; let's resolve this comment.
wjones127
left a comment
Thanks for pushing this forward, Rok!
@wjones127 @pitrou What's the status of this patch? Should it be merged later?
If there are no objections I can merge it tomorrow.
Co-authored-by: Will Jones <[email protected]>
Co-authored-by: Gang Wu <[email protected]>
Benchmark runs are scheduled for baseline = 863cdd4 and contender = 939567b. 939567b is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
['Python', 'R'] benchmarks have a high level of regressions.
### Rationale for this change

After #14293 we have `DELTA_LENGTH_BYTE_ARRAY` for encoding ByteArray, so I'd like to have an encoding benchmark for it.

### What changes are included in this PR?

The benchmark adds some cases.

### Are these changes tested?

No

### Are there any user-facing changes?

No

* Closes: #34322

Authored-by: mwish <[email protected]>
Signed-off-by: Will Jones <[email protected]>
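As a rough illustration of such a benchmark case (names and sizes are illustrative, not the actual benchmark code):

```cpp
static void BM_DeltaLengthByteArrayEncode(::benchmark::State& state) {
  // Encode a fixed batch of short strings repeatedly.
  const std::string data = "HelloWorld";
  std::vector<ByteArray> values(
      1024, ByteArray(5, reinterpret_cast<const uint8_t*>(data.data())));
  auto encoder = MakeTypedEncoder<ByteArrayType>(Encoding::DELTA_LENGTH_BYTE_ARRAY);
  for (auto _ : state) {
    encoder->Put(values.data(), static_cast<int>(values.size()));
    encoder->FlushValues();
  }
  state.SetItemsProcessed(state.iterations() * values.size());
}
BENCHMARK(BM_DeltaLengthByteArrayEncode);
```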
This is to add the DELTA_LENGTH_BYTE_ARRAY encoder. (ARROW-17799)