Refine statistics extraction API and tests #118
Merged
Note: this PR targets another PR, apache#10537 from @NGA-TRAN, rather than main.
This PR proposes a different API than what is described on apache#10453, based on my working through the example in apache#10549. I am sorry; I should have done this first.
The major difference is that the min/max extraction is not done in a single call, but only on demand, which matches what the actual pruning predicate needs. I also think the new API has a natural way to extract column index statistics.
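To make the "on demand" idea concrete, here is a rough, self-contained sketch. It is not the code in this PR: `RowGroupStats`, `RequestedStat`, `StatsExtractor`, and `row_groups_to_scan_gt` are hypothetical names invented for illustration, and plain `i64` values stand in for Arrow arrays and Parquet metadata. The point is only that a caller evaluating `x > threshold` asks for the maxes and never pays for extracting the mins.

```rust
/// Hypothetical per-row-group statistics for a single i64 column.
/// `None` means the statistic is missing from the file metadata.
#[derive(Debug, Clone, Copy)]
struct RowGroupStats {
    min: Option<i64>,
    max: Option<i64>,
}

/// Which statistic the caller wants extracted (illustrative only).
#[derive(Debug, Clone, Copy)]
enum RequestedStat {
    Min,
    Max,
}

/// Hypothetical extractor configured for exactly one statistic,
/// so nothing else is computed unless it is asked for.
struct StatsExtractor {
    stat: RequestedStat,
}

impl StatsExtractor {
    fn new(stat: RequestedStat) -> Self {
        Self { stat }
    }

    /// Returns one value per row group for the requested statistic only.
    fn extract<'a, I>(&self, row_groups: I) -> Vec<Option<i64>>
    where
        I: IntoIterator<Item = &'a RowGroupStats>,
    {
        row_groups
            .into_iter()
            .map(|rg| match self.stat {
                RequestedStat::Min => rg.min,
                RequestedStat::Max => rg.max,
            })
            .collect()
    }
}

/// Pruning for the predicate `x > threshold`: only the max is needed.
/// A row group is kept when its max is unknown or exceeds the threshold.
fn row_groups_to_scan_gt(row_groups: &[RowGroupStats], threshold: i64) -> Vec<bool> {
    StatsExtractor::new(RequestedStat::Max)
        .extract(row_groups)
        .into_iter()
        .map(|max| max.map_or(true, |m| m > threshold))
        .collect()
}
```

In this sketch a row group with missing statistics is always kept, since it cannot be proven to contain no matching rows.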
I actually found there is a version of this API and tests for it already here: https://github.com/apache/datafusion/blob/d2fb05ed5ba71fd0f1d440baca12897413c2a8af/datafusion/core/src/datasource/physical_plan/parquet/statistics.rs#L214-L922
It turns out there was enough code to actually hook it up using the existing (production) statistics extraction code, so I did that as well. This is far from efficient, but it is a start.
If we like this API, perhaps we can complete the test coverage and then make it more efficient.
However, the existing version is not currently exposed publicly, I don't think its tests are great (as it isn't a public API), and its performance is not great.