feat: add method for async read bloom filter #4917
This looks good, left some minor comments, but I think all this needs is a test
let buffer = self
    .input
    .0
    .get_bytes(offset..offset + SBBF_HEADER_SIZE_ESTIMATE)
There is a new bloom_filter_length that may be present and would avoid needing to guess here
Thanks, I checked the bloom_filter module and then updated this part.
let bitset = self
    .input
    .0
    .get_bytes(bitset_offset..bitset_offset + length)
I think it would be ideal if we could avoid this extra roundtrip in the common case, by fetching enough data in the first call
The first call is used to parse bloom_filter_length, and the second call is used to read bloom_filter_data. We can save one call if bloom_filter_length is already known. Thanks, I have updated the PR. Could you review it again?
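The round-trip trade-off discussed above can be sketched roughly as follows. This is not the code from the PR: `CountingReader`, `read_bitset`, and the 4-byte length-prefixed layout are hypothetical stand-ins (the real header is Thrift-encoded and the real reader is async), chosen only to show how a known total length lets a single fetch cover both header and bitset.

```rust
use std::cell::Cell;
use std::ops::Range;

/// Hypothetical stand-in for a remote file that counts range requests.
struct CountingReader {
    data: Vec<u8>,
    requests: Cell<usize>,
}

impl CountingReader {
    fn new(data: Vec<u8>) -> Self {
        Self { data, requests: Cell::new(0) }
    }

    /// Returns the requested bytes, clamped to the end of the data.
    fn get_bytes(&self, range: Range<usize>) -> &[u8] {
        self.requests.set(self.requests.get() + 1);
        let end = range.end.min(self.data.len());
        let start = range.start.min(end);
        &self.data[start..end]
    }
}

/// Assumed toy layout: a 4-byte little-endian bitset length, then the bitset.
const HEADER_SIZE: usize = 4;

/// Reads the bitset stored at `offset`.
/// `known_length` plays the role of bloom_filter_length (header plus bitset).
/// When it is present, one request fetches everything; otherwise only the
/// header is fetched first and a second request reads the bitset.
fn read_bitset(reader: &CountingReader, offset: usize, known_length: Option<usize>) -> Vec<u8> {
    let first_len = known_length.unwrap_or(HEADER_SIZE);
    let buffer = reader.get_bytes(offset..offset + first_len);
    assert!(buffer.len() >= HEADER_SIZE, "header is truncated");

    let bitset_len = u32::from_le_bytes([buffer[0], buffer[1], buffer[2], buffer[3]]) as usize;
    let bitset_end = HEADER_SIZE + bitset_len;

    if buffer.len() >= bitset_end {
        // Common case: the first fetch already contains the whole bitset.
        return buffer[HEADER_SIZE..bitset_end].to_vec();
    }

    // Fallback: a second round trip for the bitset.
    let bitset_offset = offset + HEADER_SIZE;
    reader
        .get_bytes(bitset_offset..bitset_offset + bitset_len)
        .to_vec()
}
```

With this sketch, calling `read_bitset(&reader, offset, None)` issues two requests, while passing `Some(total_len)` issues one, which is the saving the review comment asks for in the common case.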
@tustvold Sure, I will try to add two test cases:
@tustvold Can I create two test parquet files and commit them to https://github.com/apache/parquet-testing/?
You could, but I don't have merge rights there so it may take some time. A quicker option might be to use an existing file for 1., and to write a file to a |
@tustvold Thanks, I will use |
Would you mind taking a look at |
Looks good to me, thank you
Which issue does this PR close?
Impl #3851
We want to filter row_groups in DataFusion, but there is no async API for reading the bloom filter.
What changes are included in this PR?
Implemented a function get_row_group_column_bloom_filter for ParquetRecordBatchStreamBuilder to support reading the bloom filter outside arrow.
Are there any user-facing changes?
Adds a function get_row_group_column_bloom_filter to ParquetRecordBatchStreamBuilder.