Support get_row_group in AsyncFileReader #3851
Comments
I don't think it is that simple, as the IO to fetch the bloom filter data needs to be done ahead of time. I'll have a think over the next couple of days and write up how to support this. |
btw, I think we should support reading a specific bloom filter by |
@Ted-Jiang Suppose we first use SerializedFileReader to read all the bloom filter data in the parquet file, and then call the prune_row_groups_by_bloom_filter method to prune. PruningPredicate::prune first creates a RecordBatch and then calls self.predicate_expr.evaluate(&statistics_batch). How can a RecordBatch be built from bloom filter data and then evaluated, or is there some other way to evaluate it? |
Sorry, please communicate in English. |
@Ted-Jiang Perhaps we can prune row groups by invoking the prune_row_groups_by_bloom_filter method with all the bloom filter data read via SerializedFileReader. In PruningPredicate::prune, a RecordBatch is first created and then evaluated using self.predicate_expr.evaluate(&statistics_batch). Could you please advise how to construct a RecordBatch from bloom filter data and then evaluate it, or whether there is another way to evaluate? |
@tustvold Any suggestions for this? We want to read the bloom filter in DataFusion. apache/datafusion#4512 |
The simplest option is likely to add an async function to ParquetRecordBatchStreamBuilder that fetches and decodes the bloom filters for a given column and row group, and returns this. We likely don't want to eagerly fetch and decode bloom filters as the formulation parquet uses tends to be very large, and so we should only pay for this after other forms of pruning and if we have an equality predicate on that column. |
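The ordering described in that comment (cheap pruning first, bloom filter fetch only for the survivors of an equality predicate) can be sketched roughly as follows. This is a minimal, synchronous, std-only illustration and not the parquet crate's API: `RowGroup`, `ToyFilter`, `FilterSource`, and `prune_for_equality` are hypothetical names, and an exact `HashSet` stands in for a decoded bloom filter so that the example is deterministic. A real bloom filter can also report false positives, and the real fetch would be async IO.

```rust
use std::collections::HashSet;

/// Hypothetical per-row-group metadata for a single i64 column.
pub struct RowGroup {
    pub min: i64,
    pub max: i64,
    /// Values in the row group; only used to build the stand-in filter.
    pub values: Vec<i64>,
}

/// Stand-in for a decoded bloom filter (assumption: exact set, no false positives).
pub struct ToyFilter(HashSet<i64>);

impl ToyFilter {
    /// `false` means definitely absent; `true` means possibly present.
    pub fn might_contain(&self, value: i64) -> bool {
        self.0.contains(&value)
    }
}

/// Simulates the expensive fetch-and-decode step and counts how often it runs.
pub struct FilterSource<'a> {
    row_groups: &'a [RowGroup],
    pub fetches: usize,
}

impl<'a> FilterSource<'a> {
    pub fn new(row_groups: &'a [RowGroup]) -> Self {
        Self { row_groups, fetches: 0 }
    }

    /// Builds the filter for one row group on demand.
    pub fn fetch(&mut self, row_group_idx: usize) -> ToyFilter {
        self.fetches += 1;
        ToyFilter(self.row_groups[row_group_idx].values.iter().copied().collect())
    }
}

/// Returns the indices of row groups that may match `column = needle`.
pub fn prune_for_equality(source: &mut FilterSource<'_>, needle: i64) -> Vec<usize> {
    // Copy the shared slice reference so `source` can be borrowed mutably below.
    let row_groups = source.row_groups;
    let mut keep = Vec::new();
    for (idx, rg) in row_groups.iter().enumerate() {
        // Cheap pruning first: min/max statistics need no extra IO.
        if needle < rg.min || needle > rg.max {
            continue;
        }
        // Only row groups that survive statistics pruning pay for the filter.
        if source.fetch(idx).might_contain(needle) {
            keep.push(idx);
        }
    }
    keep
}
```

With three row groups where one is excluded by its min/max range, only two filters are fetched, which is the cost saving the comment is after.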
Closed by #4917 |
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
While implementing apache/datafusion#4512, I found that AsyncFileReader (which DataFusion uses) cannot return a specific RowGroupReader.
If I could get the RowGroupReader, calling get_column_bloom_filter on it would return the bloom filter:
arrow-rs/parquet/src/file/reader.rs
Lines 77 to 94 in f78a9be
Async version:
arrow-rs/parquet/src/arrow/async_reader/mod.rs
Lines 128 to 148 in de9f826
I think the two should be consistent. Is there a reason this is not supported?
Describe the solution you'd like
I am trying to create a new struct, AsyncRowGroupReader.
Describe alternatives you've considered
Additional context