
Conversation

@rok rok commented Oct 3, 2022

This is to add DELTA_LENGTH_BYTE_ARRAY encoder. (ARROW-17799)



rok commented Oct 3, 2022

This is not really review-ready yet and needs #14191 to be merged first.

@rok rok force-pushed the ARROW-17799 branch 5 times, most recently from a861714 to 51b9535 Compare November 24, 2022 12:21
@rok rok force-pushed the ARROW-17799 branch 2 times, most recently from 035f1ca to 588c9bf Compare December 15, 2022 22:08
@rok rok marked this pull request as ready for review December 15, 2022 22:09

rok commented Dec 16, 2022

ARROW-17798 is merged. @pitrou could you please review the C++ part?
I've not yet figured out the Python issue - it segfaults here, but if the table is 3 rows long it fails here with: OSError: Not yet implemented: DecodeArrow for DeltaLengthByteArrayDecoder.

cc @wgtmac @winval @mapleFU

@rok rok requested a review from pitrou December 16, 2022 12:03

@mapleFU mapleFU left a comment


I'm not so familiar with Arrow; here is just some advice.


rok commented Dec 19, 2022

Thanks for the review @mapleFU !


rok commented Dec 20, 2022

Thanks for the review @wgtmac, could you please do another pass over the open questions?


Uh? You can't blindly cast the offsets buffer of an Arrow binary array to a ByteArray*...


I presume you don't have any tests for this?


I suppose I'll be adding them :D.


FYI this function isn't hit by the tests in parquet-internals-test. One thing worth doing is attaching a debugger and setting a breakpoint in each new branch added in your changes, then verifying each breakpoint is hit. It's not perfect, but it gives you a sort of basic code coverage. :)


> It's not perfect, but it gives you a sort of basic code coverage. :)

Oh, cool!


The data type in the arrow::Array (String/LargeString/Binary/LargeBinary) is actually std::string_view. Casting vector of std::string_view to parquet::ByteArray makes it easy to reuse the code of Put(const T* src, int num_values) but it looks inefficient to me.



Changed. Please check if it makes sense.

@wjones127 wjones127 left a comment

Added some suggestions for further testing :)



@pitrou pitrou left a comment


Thanks for the update @rok ! This is starting to look good :-)


auto CheckSeed = [&](int seed, int64_t size) {
::arrow::random::RandomArrayGenerator rag(seed);
auto values = rag.String(size, min_length, max_length, null_probability);

We should be a bit more thorough and check that it works for all four binary types: Binary, String, LargeBinary and LargeString.

@rok rok Feb 10, 2023


I added a roundtrip test and an encode/decode check. The issue here is that the physical type decoded with EncodingTraits<ByteArrayType> will always have a 32-bit index (BinaryArray/StringArray), while we might be expecting a 64-bit index (LargeBinaryArray/LargeStringArray) back. Returning a 32-bit index kind of makes sense since that is what is encoded and it doesn't incur additional compute cost. I'm adjusting with a cast in the test, but we might want to change the decoder to cast at decode time?


Correct me if I'm wrong, but it doesn't seem like there is a test that looks at the actual values produced, only that it round-trips correctly. Could we add at least a simple test that verifies the values?

For example (this is somewhat pseudo-code):

TEST(DeltaLengthByteArrayEncodingAdHoc, Example) {
  auto values = ArrayFromJSON(R"(["Hello", "World", "Foobar", "ABCDEF"])");
  auto lengths = ArrayFromJSON(R"([5, 5, 6, 6])");
  auto encoder = MakeTypedEncoder<ByteArrayType>(Encoding::DELTA_LENGTH_BYTE_ARRAY);
  ASSERT_NO_THROW(encoder->Put(*values));
  auto buf = encoder->FlushValues();
  auto lengths_encoder = MakeTypedEncoder<ByteArrayType>(Encoding::DELTA_BINARY_PACKED);
  ASSERT_NO_THROW(lengths_encoder->Put(*lengths));
  auto lengths_buf = lengths_encoder->FlushValues();
  ASSERT_EQ(buf[0:lengths_buf.size()], lengths_buf);
  ASSERT_EQ(buf[lengths_buf.size():], "HelloWorldFoobarABCDEF");
}


@rok do you have any tests that validate the actual encoded values? (not just that it round-trips within our implementation)


I added approximately what you suggested :)


rok commented Feb 15, 2023

Does this need another round of reviews or can I slowly merge it?

@wgtmac wgtmac left a comment


Looking good on my side. Thanks @rok

@mapleFU mapleFU left a comment


I think it's OK; waiting for it to be merged.


No need to reopen; let's resolve this comment.

@wjones127 wjones127 left a comment


Thanks for pushing this forward, Rok!


mapleFU commented Feb 22, 2023

@wjones127 @pitrou What's the status of this patch? Should it be merged later?


rok commented Feb 22, 2023

If there are no objections I can merge it tomorrow.

@rok rok changed the title ARROW-17799: [C++][Parquet] Add DELTA_LENGTH_BYTE_ARRAY encoder to Parquet writer GH-33024: [C++][Parquet] Add DELTA_LENGTH_BYTE_ARRAY encoder to Parquet writer Feb 23, 2023

@rok rok merged commit 939567b into apache:main Feb 23, 2023
@rok rok deleted the ARROW-17799 branch February 23, 2023 13:56

ursabot commented Feb 23, 2023

Benchmark runs are scheduled for baseline = 863cdd4 and contender = 939567b. 939567b is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed ⬇️0.52% ⬆️0.0%] test-mac-arm
[Finished ⬇️7.65% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.16% ⬆️0.03%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] 939567b5 ec2-t3-xlarge-us-east-2
[Failed] 939567b5 test-mac-arm
[Finished] 939567b5 ursa-i9-9960x
[Finished] 939567b5 ursa-thinkcentre-m75q
[Finished] 863cdd42 ec2-t3-xlarge-us-east-2
[Failed] 863cdd42 test-mac-arm
[Finished] 863cdd42 ursa-i9-9960x
[Finished] 863cdd42 ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java


ursabot commented Feb 23, 2023

['Python', 'R'] benchmarks have a high level of regressions.
ursa-i9-9960x

wjones127 pushed a commit that referenced this pull request Mar 7, 2023
### Rationale for this change

After #14293, we have `DELTA_LENGTH_BYTE_ARRAY` for encoding ByteArray, so I'd like to have an encoding benchmark for it.

### What changes are included in this PR?

Add some cases to the encoding benchmark.

### Are these changes tested?

No

### Are there any user-facing changes?

No

* Closes: #34322

Authored-by: mwish <[email protected]>
Signed-off-by: Will Jones <[email protected]>

Successfully merging this pull request may close these issues.

[C++][Parquet] Add DELTA_LENGTH_BYTE_ARRAY encoder to Parquet writer
