Implement directly build byte view array on top of parquet buffer #5972
Looks good to me -- thank you @XiangpengHao -- let me know if you think this PR is ready to merge
}

let mut buffer = ViewBuffer::default();
let mut decoder = ByteViewArrayDecoderPlain::new(
does this mean we could have DictionaryArray<Int32, StringView>
(as in a dictionary array whose value array is a dictionary?)
Do you mean "whose value array is a string view"?
Yes, I think for optimal performance, the value buffer of the dictionary should also be in string view type to avoid double copying.
if self.validate_utf8 {
    check_valid_utf8(unsafe { buf.get_unchecked(start_offset..end_offset) })?;
}
Another way that might be faster would be to defer the checking until after all the views are built, then take a second pass through the views to validate them. Maybe that would be faster than doing it inline here.
I have a plan for very fast UTF-8 validation, but I don't want to complicate this PR. I'll file a follow-up PR that addresses the validation issue; after that, we will hopefully see that loading StringViewArray performs similarly to loading BinaryViewArray.
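The two-pass idea from the review comment above could be sketched roughly as follows. This is only an illustration, not the arrow-rs implementation and not the approach planned for the follow-up PR. It assumes a PLAIN-encoded page where each value is a 4-byte little-endian length followed by the value bytes; `collect_value_ranges` and `validate_value_ranges` are hypothetical names.

```rust
/// Pass 1 (hypothetical helper): walk the PLAIN-encoded page and record the
/// byte range of every value, without looking at the value contents.
fn collect_value_ranges(page: &[u8]) -> Result<Vec<(usize, usize)>, String> {
    let mut ranges = Vec::new();
    let mut pos = 0usize;
    while pos < page.len() {
        // Each value starts with a 4-byte little-endian length prefix.
        let start = pos
            .checked_add(4)
            .filter(|&e| e <= page.len())
            .ok_or("truncated length prefix")?;
        let mut len_bytes = [0u8; 4];
        len_bytes.copy_from_slice(&page[pos..start]);
        let len = u32::from_le_bytes(len_bytes) as usize;
        let end = start
            .checked_add(len)
            .filter(|&e| e <= page.len())
            .ok_or("truncated value")?;
        ranges.push((start, end));
        pos = end;
    }
    Ok(ranges)
}

/// Pass 2 (hypothetical helper): validate UTF-8 for every recorded range.
/// Keeping this separate lets the first pass stay a tight decoding loop.
fn validate_value_ranges(page: &[u8], ranges: &[(usize, usize)]) -> Result<(), String> {
    for (i, &(start, end)) in ranges.iter().enumerate() {
        std::str::from_utf8(&page[start..end])
            .map_err(|e| format!("value {} is not valid UTF-8: {}", i, e))?;
    }
    Ok(())
}
```

Whether the separate pass is actually faster than inline checking would need to be measured with the benchmarks; the sketch only shows the shape of the split.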
Co-authored-by: Andrew Lamb <[email protected]>
I think this PR is good to go now @alamb
This reverts commit 5e68870.
Thanks @XiangpengHao
Which issue does this PR close?
This PR is not ready for review until #5970 is merged. Part of #5904; sequel to #5968 and #5970.
Rationale for this change
This PR does the real work: it transforms the Parquet page buffer directly into a view buffer without an extra copy.
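To illustrate the idea, here is a minimal standalone sketch that uses only the standard library, not the arrow-rs types touched by this PR. It assumes the PLAIN layout for BYTE_ARRAY (a 4-byte little-endian length followed by the value bytes) and the Arrow 16-byte view layout (length, then either up to 12 inline bytes, or a 4-byte prefix plus buffer index and offset). The names `make_view` and `plain_page_to_views` are hypothetical.

```rust
/// Build one 16-byte view (hypothetical helper). Values of at most 12 bytes
/// are stored inline; longer values store a 4-byte prefix plus a
/// (buffer index, offset) pair pointing back into the page buffer.
fn make_view(data: &[u8], buffer_index: u32, offset: u32) -> u128 {
    let len = data.len() as u32;
    let mut bytes = [0u8; 16];
    bytes[0..4].copy_from_slice(&len.to_le_bytes());
    if len <= 12 {
        bytes[4..4 + data.len()].copy_from_slice(data);
    } else {
        bytes[4..8].copy_from_slice(&data[0..4]);
        bytes[8..12].copy_from_slice(&buffer_index.to_le_bytes());
        bytes[12..16].copy_from_slice(&offset.to_le_bytes());
    }
    u128::from_le_bytes(bytes)
}

/// Walk a PLAIN-encoded page and emit one view per value. Long values are
/// never copied: their views only reference the page at `offset`, and the
/// page itself is assumed to be registered as buffer `buffer_index`.
fn plain_page_to_views(page: &[u8], buffer_index: u32) -> Result<Vec<u128>, String> {
    let mut views = Vec::new();
    let mut pos = 0usize;
    while pos < page.len() {
        let start = pos
            .checked_add(4)
            .filter(|&e| e <= page.len())
            .ok_or("truncated length prefix")?;
        let mut len_bytes = [0u8; 4];
        len_bytes.copy_from_slice(&page[pos..start]);
        let len = u32::from_le_bytes(len_bytes) as usize;
        let end = start
            .checked_add(len)
            .filter(|&e| e <= page.len())
            .ok_or("truncated value")?;
        // The sketch assumes the page is smaller than 4 GiB so the offset fits in u32.
        views.push(make_view(&page[start..end], buffer_index, start as u32));
        pos = end;
    }
    Ok(views)
}
```

The point of the design is that the page buffer can be kept alive and shared as the view array's data buffer, so only the 16-byte views need to be written during decoding.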
To see the performance difference, you can run:
You should find that with this PR, reading BinaryViewArray is faster than reading BinaryArray -- a milestone toward making StringViewArray faster than StringArray.
Once this set of PRs is merged, the last piece is to make UTF-8 validation fast, so that string view can maintain the advantage.
What changes are included in this PR?
This PR only includes decoding of plain-encoded data, for ease of review. Support for RLE/dictionary encoding will be filed soon.
Are there any user-facing changes?