Refine statistics extraction API and tests #118

Merged: 3 commits into NGA-TRAN:ntran/rg_stats_api from alamb/stats_api_refine on May 17, 2024

@alamb alamb commented May 17, 2024

Note this PR targets another PR apache#10537 from @NGA-TRAN rather than main

This PR proposes a different API than what is described on apache#10453, based on my working through the example in apache#10549. I am sorry; I should have done this first.

The major difference is that the min/max extraction is not done in a single call, but only on demand, which matches what the actual pruning predicate needs. I also think the new API has a natural way to extract column index statistics.
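To illustrate the on-demand shape, here is a minimal, self-contained sketch. It is not the API in this PR: `ColumnStats`, `StatsExtractor`, and `keep_for_greater_than` are hypothetical names, and the statistics are simplified to a single `i64` column. The point is only that a predicate such as `col > 15` asks for the maxes and never touches the mins.

```rust
/// Hypothetical per-row-group statistics for one i64 column.
/// `None` means no statistic was recorded for that row group.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct ColumnStats {
    pub min: Option<i64>,
    pub max: Option<i64>,
}

/// Hypothetical extractor (an assumption for illustration, not DataFusion's type).
/// It borrows the row group metadata and produces each statistic only on request.
pub struct StatsExtractor<'a> {
    row_groups: &'a [ColumnStats],
}

impl<'a> StatsExtractor<'a> {
    pub fn new(row_groups: &'a [ColumnStats]) -> Self {
        Self { row_groups }
    }

    /// Minimum per row group; calling this does no work for maxes.
    pub fn min(&self) -> Vec<Option<i64>> {
        self.row_groups.iter().map(|rg| rg.min).collect()
    }

    /// Maximum per row group; calling this does no work for mins.
    pub fn max(&self) -> Vec<Option<i64>> {
        self.row_groups.iter().map(|rg| rg.max).collect()
    }
}

/// Pruning for `col > threshold` needs only the maxes: a row group can be
/// skipped when its max is known and is <= threshold.
/// Returns `true` for row groups that must still be read.
pub fn keep_for_greater_than(extractor: &StatsExtractor<'_>, threshold: i64) -> Vec<bool> {
    extractor
        .max()
        .into_iter()
        .map(|max| match max {
            Some(m) => m > threshold,
            None => true, // unknown statistics: cannot prune
        })
        .collect()
}
```

With row groups whose (min, max) are (1, 10), (20, 30), and unknown, `keep_for_greater_than(&extractor, 15)` keeps only the second and third: the first is pruned because its max is at most 15, and the third is kept because nothing is known about it.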

I actually found there is a version of this API and tests for it already here: https://github.com/apache/datafusion/blob/d2fb05ed5ba71fd0f1d440baca12897413c2a8af/datafusion/core/src/datasource/physical_plan/parquet/statistics.rs#L214-L922

It turns out there was enough code to actually hook it up using the existing (production) statistics extraction code, so I did that as well. This is far from efficient, but it is a start.

If we like this API, perhaps we can complete the test coverage and then make it more efficient.

However, the existing version linked above is not currently exposed publicly, I don't think its tests are great (as it isn't a public API), and its performance is not great.

@github-actions github-actions bot added the core label May 17, 2024
@NGA-TRAN NGA-TRAN merged commit 71ca4b1 into NGA-TRAN:ntran/rg_stats_api May 17, 2024
1 check passed
@alamb alamb deleted the alamb/stats_api_refine branch May 17, 2024 16:14
This pull request was closed.
2 participants