-
Notifications
You must be signed in to change notification settings - Fork 729
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add parquet StatisticsConverter
for arrow reader
#6046
Add parquet StatisticsConverter
for arrow reader
#6046
Conversation
There is a pending PR in DataFusion that will need to be ported: https://github.com/apache/datafusion/pull/11289/files#diff-7110f4709c105a18ef74a212396444d62052179a735d148fb62470a8b157fb40 We could hold off merging this until that work is complete and I'll update this PR or I can do it as a separate PR. |
Amazing @efredine -- thank you. I am working through this but may not finish until tomorrow |
I recommend we merge this PR, and then port/fix up the struct array statistics directly in arrow-rs apache/datafusion#11289 (cc @Lordworms ) My rationale is that we are more likely to find some struct array expertise in the arrow-rs repo than the datafusion repo. |
I agree, should I port the struct related function now? |
@Lordworms I don't think it should be part of this PR which is already huge. I think the easiest thing to do is to wait until this PR is merged (which should happen soon) then open a new PR in this repository with the struct changes. |
/// underlying statistics value (stored as a parquet value) into the | ||
/// corresponding Arrow value. For example, Decimals are stored as binary in | ||
/// parquet files. | ||
/// |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As part of the port, I changed the visibility of parquet_column
from pub(crate)
to pub
because the pub(crate)
caused a failure in documentation tests. But I'm not sure this was the right way to resolve that issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think having parquet_column
pub is a good change and it will be useful for others.
However, since it is more applicable than just statistics, think it should be moved to the main arrow.rs
(I will do so shortly)
// in the parquet schema | ||
return None; | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How important is addressing the efficiency consideration here? For a table with many columns it would be a lot of linear searches.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should file a follow on ticket to improve the situation. I think we have something functional and then we can always make it better as a follow on
Got it, I'll wait for it to be merged, thanks for your work. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
First of all, thank you so much @efredine -- this is epic (even though I know most of it was just moving code). It was quite easy to read and I found nothing in need of changes.
I know @tustvold has concerns about ignoring / not handling the ColumnOrder statistics correctly: apache/datafusion#10586 Once we sort out what the practical implementations (which is not at all clear to me know) I will make a PR to update the documentation
The one final thing I want to do before merging this PR is to make a draft PR in DataFusion to use it and verify that everything works. Doing so now
// in the parquet schema | ||
return None; | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should file a follow on ticket to improve the situation. I think we have something functional and then we can always make it better as a follow on
I merged this branch up from master and moved where I plan to draft a "use upstream arrow version" PR in DataFusion, file a follow on PR for improving the performance of |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also found that the benchmarks in https://github.com/apache/datafusion/blob/main/datafusion/core/benches/parquet_statistic.rs were not ported. I will do so now
FYI @marvinlanhenke |
No worries, I already pushed it to your branch (i had the code checked out anyways) |
Updates
|
Ok, I pushed some commits to this branch:
|
I think this PR is ready to go -- I am just going to try to make sure I can get apache/datafusion#11479 working with this as one final test |
StatisticsConverter
for arrow reader
Thanks again so much @efredine -- I plan to merge this later today unless anyone else would like time to review or comment |
🚀 -- thanks again |
Which issue does this PR close?
Closes #4328.
Rationale for this change
Ports StatisticsConverter implementation and tests from Data Fusion.
What changes are included in this PR?
The StatisticsConverter and all tests. It is functionally unchanged from the DataFusion implementation.
Changes:
log::debug
statements as it seemed to me these aren't used in arrow crate?arrow_err
macro.For the tests I only moved over the the code actually used by the statistics tests.
Are there any user-facing changes?
Yes, exposes the StatisticsConverter.