GH-33024: [C++][Parquet] Add DELTA_LENGTH_BYTE_ARRAY encoder to Parquet writer #14293
Conversation
This is not really review-ready yet and needs #14191 to merge first.
mapleFU
left a comment
I'm not so familiar with Arrow; here is just some advice.
Thanks for the review @mapleFU!
Thanks for the review @wgtmac, could you please do another pass of the open questions?
cpp/src/parquet/encoding.cc
Uh? You can't blindly cast the offsets buffer of an Arrow binary array to a ByteArray*...
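For illustration: a safe conversion works per element rather than by reinterpreting buffers, since an Arrow binary array holds an offsets buffer plus a values buffer, while `parquet::ByteArray` is a `{length, pointer}` view. A minimal sketch (variable names are assumed, not the actual implementation):

```cpp
// Sketch: build ByteArray views element by element; a reinterpret_cast of the
// offsets buffer would be wrong because the two layouts are unrelated.
const auto& strings =
    ::arrow::internal::checked_cast<const ::arrow::StringArray&>(values);
std::vector<ByteArray> out;
out.reserve(strings.length());
for (int64_t i = 0; i < strings.length(); ++i) {
  if (strings.IsNull(i)) continue;
  auto view = strings.GetView(i);  // view into the array's values buffer
  out.emplace_back(static_cast<uint32_t>(view.size()),
                   reinterpret_cast<const uint8_t*>(view.data()));
}
encoder->Put(out.data(), static_cast<int>(out.size()));
```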
I presume you don't have any tests for this?
I suppose I'll be adding them :D.
FYI this function isn't hit by the tests in parquet-internals-test. One thing worth doing is attaching a debugger and setting a breakpoint in each new branch added in your changes, then verifying each breakpoint is hit. It's not perfect, but it gives you a sort of basic code coverage. :)
> It's not perfect, but it gives you a sort of basic code coverage. :)
Oh, cool!
The data type in the arrow::Array (String/LargeString/Binary/LargeBinary) is actually std::string_view. Casting vector of std::string_view to parquet::ByteArray makes it easy to reuse the code of Put(const T* src, int num_values) but it looks inefficient to me.
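One possible alternative that avoids materializing a temporary vector is to append bytes and lengths directly while iterating the array; a rough sketch, where `sink_` and `length_encoder_` are assumed member names rather than the actual implementation:

```cpp
// Rough sketch of encoding an Arrow binary-like array directly: append each
// value's bytes to the data sink and its length to the DELTA_BINARY_PACKED
// length encoder, with no intermediate vector of converted values.
template <typename ArrayType>
void PutBinaryArray(const ArrayType& array) {
  for (int64_t i = 0; i < array.length(); ++i) {
    if (array.IsNull(i)) continue;
    auto view = array.GetView(i);  // string_view into the values buffer
    const int32_t len = static_cast<int32_t>(view.size());
    length_encoder_.Put(&len, 1);  // hypothetical member encoder for lengths
    PARQUET_THROW_NOT_OK(
        sink_.Append(reinterpret_cast<const uint8_t*>(view.data()),
                     static_cast<int64_t>(view.size())));  // assumed member
  }
}
```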
cpp/src/parquet/encoding.cc
Why not accept other types such as String, LargeString...? See https://github.com/apache/arrow/blob/1c3b7d72c92989cea6af78f6a46938b0315d97f6/cpp/src/parquet/encoding.cc#L248 for example.
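For context, dispatch over the four binary-like types might look roughly like this (reusing the hypothetical `PutBinaryArray` helper sketched earlier):

```cpp
// Sketch of accepting all four Arrow binary-like types in Put(const ::arrow::Array&).
switch (values.type_id()) {
  case ::arrow::Type::BINARY:
    PutBinaryArray(::arrow::internal::checked_cast<const ::arrow::BinaryArray&>(values));
    break;
  case ::arrow::Type::STRING:
    PutBinaryArray(::arrow::internal::checked_cast<const ::arrow::StringArray&>(values));
    break;
  case ::arrow::Type::LARGE_BINARY:
    PutBinaryArray(::arrow::internal::checked_cast<const ::arrow::LargeBinaryArray&>(values));
    break;
  case ::arrow::Type::LARGE_STRING:
    PutBinaryArray(::arrow::internal::checked_cast<const ::arrow::LargeStringArray&>(values));
    break;
  default:
    throw ParquetException("Only binary-like Arrow arrays are supported");
}
```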
Changed. Please check if it makes sense.
wjones127
left a comment
Added some suggestions for further testing :)
pitrou
left a comment
Thanks for the update @rok ! This is starting to look good :-)
cpp/src/parquet/encoding_test.cc
```cpp
auto CheckSeed = [&](int seed, int64_t size) {
  ::arrow::random::RandomArrayGenerator rag(seed);
  auto values = rag.String(size, min_length, max_length, null_probability);
```
We should be a bit more thorough and check that it works for all four binary types: Binary, String, LargeBinary and LargeString.
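A sketch of such coverage, with `CheckRoundtrip` as a hypothetical helper:

```cpp
// Generate random data for each binary-like type and round-trip it.
for (const auto& type : {::arrow::binary(), ::arrow::utf8(),
                         ::arrow::large_binary(), ::arrow::large_utf8()}) {
  ::arrow::random::RandomArrayGenerator rag(seed);
  auto values = rag.ArrayOf(type, size, null_probability);
  CheckRoundtrip(*values);  // hypothetical: encode, decode, compare
}
```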
I added a roundtrip test and an encode/decode check. The issue here is that the physical type decoded with EncodingTraits<ByteArrayType> will always have a 32-bit index (BinaryArray/StringArray), while we might be expecting a 64-bit index (LargeBinaryArray/LargeStringArray) back. Returning a 32-bit index kind of makes sense since that is what is encoded and it avoids additional compute cost. I'm adjusting with a cast in the test, but we might want to change the decoder to cast at decode time?
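The test-side adjustment could be a cast of the decoded array before comparison; a sketch using `arrow::compute::Cast`, where `decoded` and `expected` are placeholders:

```cpp
#include "arrow/compute/cast.h"

// The decoder produced a 32-bit-offset StringArray; widen it to LargeString
// so it can be compared against the LargeString input.
ASSERT_OK_AND_ASSIGN(auto decoded_large,
                     ::arrow::compute::Cast(*decoded, ::arrow::large_utf8()));
::arrow::AssertArraysEqual(*expected, *decoded_large);
```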
Co-authored-by: Antoine Pitrou <[email protected]>
cpp/src/parquet/encoding_test.cc
Correct me if I'm wrong, but it doesn't seem like there is a test that looks at the actual values produced, only that it round-trips correctly. Could we add at least a simple test that verifies the values?
For example (this is somewhat pseudo-code):
```cpp
TEST(DeltaLengthByteArrayEncodingAdHoc, Example) {
  auto values = ::arrow::ArrayFromJSON(
      ::arrow::utf8(), R"(["Hello", "World", "Foobar", "ADBCEF"])");
  auto lengths = ::arrow::ArrayFromJSON(::arrow::int32(), R"([5, 5, 6, 6])");

  auto encoder = MakeTypedEncoder<ByteArrayType>(Encoding::DELTA_LENGTH_BYTE_ARRAY);
  ASSERT_NO_THROW(encoder->Put(*values));
  auto buf = encoder->FlushValues();

  auto lengths_encoder = MakeTypedEncoder<Int32Type>(Encoding::DELTA_BINARY_PACKED);
  ASSERT_NO_THROW(lengths_encoder->Put(*lengths));
  auto lengths_buf = lengths_encoder->FlushValues();

  // The output should be the DELTA_BINARY_PACKED lengths followed by the
  // concatenated string data.
  ASSERT_EQ(0, memcmp(buf->data(), lengths_buf->data(), lengths_buf->size()));
  ASSERT_EQ(std::string(reinterpret_cast<const char*>(buf->data()) + lengths_buf->size(),
                        buf->size() - lengths_buf->size()),
            "HelloWorldFoobarADBCEF");
}
```
@rok do you have any tests that validate the actual encoded values (not just that they round-trip within our implementation)?
I added approximately what you suggested :)
Does this need another round of reviews, or can I slowly merge it?
wgtmac
left a comment
Looking good on my side. Thanks @rok
mapleFU
left a comment
I think it's ok, waiting for it to be merged.
cpp/src/parquet/encoding_test.cc
No need to reopen; let's resolve this comment.
wjones127
left a comment
Thanks for pushing this forward, Rok!
@wjones127 @pitrou What's the status of this patch? Should it be merged later?
If there are no objections I can merge it tomorrow.
Co-authored-by: Will Jones <[email protected]>
Co-authored-by: Gang Wu <[email protected]>
Benchmark runs are scheduled for baseline = 863cdd4 and contender = 939567b. 939567b is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
['Python', 'R'] benchmarks have a high level of regressions.
### Rationale for this change

After #14293 we have `DELTA_LENGTH_BYTE_ARRAY` for encoding ByteArray, so I'd like to have an encoding benchmark for it.

### What changes are included in this PR?

The benchmark adds some cases.

### Are these changes tested?

No

### Are there any user-facing changes?

No

* Closes: #34322

Authored-by: mwish <[email protected]>
Signed-off-by: Will Jones <[email protected]>
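As a rough illustration of such a benchmark case (names and sizes are illustrative, not the actual benchmark code):

```cpp
static void BM_DeltaLengthByteArrayEncode(::benchmark::State& state) {
  // Encode a fixed batch of short strings repeatedly.
  const std::string data = "HelloWorld";
  std::vector<ByteArray> values(
      1024, ByteArray(5, reinterpret_cast<const uint8_t*>(data.data())));
  auto encoder = MakeTypedEncoder<ByteArrayType>(Encoding::DELTA_LENGTH_BYTE_ARRAY);
  for (auto _ : state) {
    encoder->Put(values.data(), static_cast<int>(values.size()));
    encoder->FlushValues();
  }
  state.SetItemsProcessed(state.iterations() * values.size());
}
BENCHMARK(BM_DeltaLengthByteArrayEncode);
```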
This is to add the DELTA_LENGTH_BYTE_ARRAY encoder. (ARROW-17799)