-
Notifications
You must be signed in to change notification settings - Fork 786
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support Parquet BYTE_STREAM_SPLIT
for INT32, INT64, and FIXED_LEN_BYTE_ARRAY primitive types
#6159
Merged
Merged
Changes from 16 commits
Commits
Show all changes
24 commits
Select commit
Hold shift + click to select a range
71ae965
add todos to help trace flow
etseidl dddf8a0
add support for byte_stream_split encoding for INT32 and INT64 data
etseidl 1c1af32
byte_stream_split encoding for fixed_len_byte_array
etseidl 75ea319
revert changes to Decoder and add VariableWidthByteStreamSplitDecoder
etseidl 7ce40ae
remove set_type_width as it is now unused
etseidl e7829c3
begin implementing roundtrip test
etseidl c0eb828
move test
etseidl fec8001
clean up some documentation
etseidl b9d4baf
add test of byte_stream_split with flba
etseidl f8ee320
add check for and test of mismatched sizes
etseidl 29c5119
remove type_length from Encoder and add VaribleWidthByteStreamSplitEn…
etseidl 3e598be
Merge remote-tracking branch 'origin/master' into bss
etseidl ef14c7d
fix clippy error
etseidl 6513ffb
change type of argument to new()
etseidl 3a650a7
formatting
etseidl c63a1ce
add another test
etseidl 09f467d
add variable to split/join streams for FLBA
etseidl 3fd6bc5
more informative error message
etseidl c6bb2ef
avoid buffer copies in decoder per suggestion from review
etseidl 3f6d944
add roundtrip test
etseidl 97d159b
optimized version...but clippy complains
etseidl 340eab4
clippy was right...replace loop with copy_from_slice
etseidl b2d90ce
fix test
etseidl 104b72e
optimize split_streams_variable for long type widths
etseidl File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 I see here we did have the test 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that tests one path, but this bypasses the BSS decoder in
encodings::decoding::byte_stream_split_decoder
. parquet-read exercises that path, so I hope to recreate that path (goes through serialized file reader) in an additional test.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see this test implements the suggestion from https://github.com/apache/parquet-testing/blob/master/data/README.md#additional-types
However, when I double checked the vaues with what
pyarrow
python says they didn't seem to match 🤔I printed out the f16 column:
Here is what python told me:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the pyarrow output looks like my parquet-read output (with the exception of the f16 columns). I'm not sure what happened with the
f32_col
above, but I did find those values further down in the output. Weird batching?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The fact that the existing, non BSS columns (not changes by this PR) come back the same gives me confidence that the code is doing the right thing. I just found it straange that python seemed to give me a different result