Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor parquet row group pruning into a struct (use new statistics API, part 1) #10607

Merged
merged 4 commits into from
May 23, 2024

Conversation

alamb
Copy link
Contributor

@alamb alamb commented May 21, 2024

Which issue does this PR close?

Part of #10453 and #9929

Rationale for this change

This PR prepares switching the parquet reader to use the new statistics extraction API added in #10537.

To avoid a massive, I broke it into its own PR for easier review. This PR refactors the code around, but doesn't change any behavior. The next PR will change the pruning API to use the new statistics extraction API.

I also think this change makes the code easier to understand and maintain as it uses a documented struct to represent the set of RowGroups to scan rather than a Vec<usize>.

What changes are included in this PR?

  1. Add RowGroupSet to encapsulate the set of row groups which is being pruned
  2. Move pruning code into RowGroupSet methods
  3. Update code + tests to use RowGroupSet

Are these changes tested?

Yes, existing tests are updated to use the new structures

Are there any user-facing changes?

No these are all internal changes

@github-actions github-actions bot added the core Core DataFusion crate label May 21, 2024
@alamb alamb changed the title Refactor parquet row group pruning into a struct Refactor parquet row group pruning into a struct (use new statistics API, pt 1) May 21, 2024
@alamb alamb changed the title Refactor parquet row group pruning into a struct (use new statistics API, pt 1) Refactor parquet row group pruning into a struct (use new statistics API, part 1) May 21, 2024
Copy link
Contributor Author

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewing with whitespace blind diff makes the changes in this PR more obvious I think

https://github.com/apache/datafusion/pull/10607/files?w=1

);
let rg_metadata = file_metadata.row_groups();
// track which row groups to actually read
let mut row_groups = RowGroupSet::new(rg_metadata.len());
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The whole point of this PR is to move the set of RowGroups that are still under consideration into a RowGroupSet rather than a Vec<usize>

This makes it clearer what is going on in my opinion, as well as sets up up for vectorized pruning evaluation in the next PR

///
/// Updates this set to mark row groups that should not be scanned
pub fn prune_by_range(&mut self, groups: &[RowGroupMetaData], range: &FileRange) {
for (idx, metadata) in groups.iter().enumerate() {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I moved the range filtering into its own function to make it clearer what is going on (it has nothing to do with statistics pruning)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 this looks more readable

&metrics
),
vec![1]
let mut row_groups = RowGroupSet::new(2);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated the tests to use RowGroupSet which I think actually makes them easier to read

@@ -1196,14 +1223,60 @@ mod tests {
.await
}

// What row groups are expected to be left after pruning
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was finding vec![0] hard to understand what was being tested, so I made this structure to encapsulate it

@alamb alamb marked this pull request as ready for review May 21, 2024 21:11
@alamb
Copy link
Contributor Author

alamb commented May 22, 2024

@Ted-Jiang I wonder if you might have time to review this PR at some point?

@Ted-Jiang
Copy link
Member

@Ted-Jiang I wonder if you might have time to review this PR at some point?

Thanks ❤️ @alamb i will review this carefully later today

Copy link
Member

@Ted-Jiang Ted-Jiang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM add RowGroupSet really make this pruning code more easy to read 👍 Thanks @alamb

@alamb
Copy link
Contributor Author

alamb commented May 23, 2024

Thank you very much for the review @Ted-Jiang 🙏

@alamb alamb merged commit 656da83 into apache:main May 23, 2024
23 checks passed
alamb added a commit to alamb/datafusion that referenced this pull request May 23, 2024
…API, part 1) (apache#10607)

* Refactor parquet row group pruning into a struct

* Port tests

* improve docs

* fix msrv
/// Prune remaining row groups to only those within the specified range.
///
/// Updates this set to mark row groups that should not be scanned
pub fn prune_by_range(&mut self, groups: &[RowGroupMetaData], range: &FileRange) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for being late and nit-picking here.
There is a potential for misuse of this API if the number of row groups provided to RowGroupSet does not align with the passed groups. Should we first verify that the number of row groups matches, or do you believe the row group metadata array should be included as a field in RowGroupSet?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for being late and nit-picking here.

No that is great. Thank you . I appreciate the comments.

should we first verify that the number of row groups matches,

Yes this is a good idea. I will do so in the next PR

or do you believe the row group metadata array should be included as a field in RowGroupSet?

I actually tried this in an earlier version of this code. One challenge I found is that prune_by_bloom_filter needs to have the builder (as it may need to potentially fetch bloom filters) which also has the metadata. So if the RowGroupSet had the metadata then there was an opportunity for inconsistency (between the metadata in the RowGroupSet and the builder)

I do think there might be an opportunity for imporvement here though

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is a PR to add some asserts: #10641

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes this is a good idea. I will do so in the next PR

Thanks for the quick update. I agree we can at least add check first.

I actually tried this in an earlier version of this code. One challenge I found is that prune_by_bloom_filter needs to have the builder (as it may need to potentially fetch bloom filters) which also has the metadata. So if the RowGroupSet had the metadata then there was an opportunity for inconsistency (between the metadata in the RowGroupSet and the builder)

I do think there might be an opportunity for improvement here though

Then, I think we can leave it as it is for now. We can revise the impl later if some issues popped up.

@alamb alamb deleted the alamb/use_new_api branch May 23, 2024 16:05
findepi pushed a commit to findepi/datafusion that referenced this pull request Jul 16, 2024
…API, part 1) (apache#10607)

* Refactor parquet row group pruning into a struct

* Port tests

* improve docs

* fix msrv
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants