
Prune Parquet RowGroup in a single call to PruningPredicate::prune, update StatisticsExtractor API #10802

Merged: 3 commits into apache:main on Jun 8, 2024

Conversation

@alamb (Contributor) commented Jun 5, 2024

Which issue does this PR close?

Part of #10453 and #9929

Follow on to #10607

Rationale for this change

The primary benefit of this PR is to start using the new API introduced in #10537 in the ParquetExec path. I plan a follow on project to use the same basic API to extract and prune pages within row groups.

The current ParquetExec prunes one row group at a time, creating one-row ArrayRefs for each required min/max/count statistic. It would be better to create a single array with the data for multiple row groups and make a single call to the vectorized pruning that PruningPredicate performs.

We recently made a similar change in InfluxDB IOx and saw a significant performance improvement for queries that accessed many row groups.

I expect this to be a performance improvement, but I am not sure it will be measurable unless there are an extremely large number of row groups in a file.
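The shape of the change above can be sketched with plain Rust. This is an illustrative mock only: `RowGroupStats` and `prune_eq` are hypothetical names invented here, not the real DataFusion types. The point is that instead of evaluating the predicate once per row group on one-row arrays, the min/max values for all row groups are gathered into parallel arrays and the predicate is evaluated over them in a single pass.

```rust
/// Hypothetical per-row-group statistics for one column
/// (stand-in for the arrays PruningPredicate::prune operates on).
struct RowGroupStats {
    mins: Vec<i64>,
    maxes: Vec<i64>,
}

/// Evaluate `col = value` against all row groups in one vectorized pass.
/// `true` means the row group might contain matching rows and must be scanned.
fn prune_eq(stats: &RowGroupStats, value: i64) -> Vec<bool> {
    stats
        .mins
        .iter()
        .zip(&stats.maxes)
        // A row group can contain `value` only if min <= value <= max.
        .map(|(&min, &max)| min <= value && value <= max)
        .collect()
}

fn main() {
    // Three row groups with value ranges [0,9], [10,19], [20,29].
    let stats = RowGroupStats {
        mins: vec![0, 10, 20],
        maxes: vec![9, 19, 29],
    };
    // Only the middle row group can contain the value 15.
    let keep = prune_eq(&stats, 15);
    assert_eq!(keep, vec![false, true, false]);
    println!("{keep:?}");
}
```

The per-row-group approach would instead call a predicate three times, each on length-1 arrays; the single vectorized call avoids that per-group overhead, which is why the win grows with the number of row groups.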

What changes are included in this PR?

  1. Call PruningPredicate::prune once per file (rather than once per row group)
  2. Switch to the StatisticsExtractor API introduced in #10537 (feat: API for collecting statistics/index for metadata of a parquet file + tests)
  3. Update the StatisticsExtractor API so it extracts a specified set of row groups rather than all of them

The changes to the StatisticsExtractor API return min/max statistics via separate functions rather than an enum. This will allow the same basic API to extract min/max statistics for pages as well (page_mins(), page_maxs(), page_counts(), etc.).
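A minimal mock of that API shape, using invented types (the method names mirror the PR, but `ColumnStats` and `Converter` are stand-ins for illustration, not the real StatisticsConverter): one accessor per statistic, each returning an array with one entry per row group, instead of a single method parameterized by an enum.

```rust
/// Toy stand-in for one row group's column statistics.
#[derive(Clone, Copy)]
struct ColumnStats {
    min: i64,
    max: i64,
    row_count: u64,
}

/// Toy stand-in for the converter: one function per statistic kind.
struct Converter<'a> {
    row_groups: &'a [ColumnStats],
}

impl<'a> Converter<'a> {
    fn row_group_mins(&self) -> Vec<i64> {
        self.row_groups.iter().map(|s| s.min).collect()
    }
    fn row_group_maxes(&self) -> Vec<i64> {
        self.row_groups.iter().map(|s| s.max).collect()
    }
    fn row_group_row_counts(&self) -> Vec<u64> {
        self.row_groups.iter().map(|s| s.row_count).collect()
    }
}

fn main() {
    let groups = [
        ColumnStats { min: 0, max: 9, row_count: 100 },
        ColumnStats { min: 10, max: 19, row_count: 50 },
    ];
    let converter = Converter { row_groups: &groups };
    // Each accessor yields one value per row group.
    assert_eq!(converter.row_group_mins(), vec![0, 10]);
    assert_eq!(converter.row_group_maxes(), vec![9, 19]);
    assert_eq!(converter.row_group_row_counts(), vec![100, 50]);
}
```

The function-per-statistic design extends naturally to page-level variants: adding page_mins() and friends later requires new methods rather than new enum variants threaded through every caller.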

Are these changes tested?

Covered by existing CI tests.

I ran some benchmark tests and this doesn't seem to meaningfully change performance.

Are there any user-facing changes?

The StatisticsExtractor API has changed, but since this API has not yet been released, this is not strictly a breaking API change.

@github-actions github-actions bot added the core Core DataFusion crate label Jun 5, 2024
@alamb alamb changed the title Optimize Parquet RowGroup pruning, update StatisticsExtractor API Prune RowGroup in a single call to PruningPredicate::prune, update StatisticsExtractor API Jun 5, 2024
@alamb alamb changed the title Prune RowGroup in a single call to PruningPredicate::prune, update StatisticsExtractor API Prune Parquet RowGroup in a single call to PruningPredicate::prune, update StatisticsExtractor API Jun 5, 2024
@alamb alamb marked this pull request as ready for review June 5, 2024 13:14

// Extract the min/max values for each row group from the statistics
let row_counts = StatisticsConverter::row_counts(reader.metadata())?;
let value_column_mins = StatisticsConverter::try_new(
let converter = StatisticsConverter::try_new(
@alamb (author):
This is a pretty good example of how the statistics API changed. FYI @NGA-TRAN

Ok(values) => {
// NB: false means don't scan row group
if !values[0] {
// Indexes of row groups still to scan
@alamb (author):
Here is the change to prune all row groups with one call to PruningPredicate::prune rather than one call per row group

let iter = metadatas
.into_iter()
.map(|x| x.column(parquet_index).statistics());
max_statistics(data_type, iter)
Contributor:
The min_statistics and max_statistics changes in another PR could still be used here...

@alamb (author):
Yes, indeed -- you are exactly correct. I purposely didn't change min_statistics and max_statistics as I knew you were working on them already

.extract(reader.metadata())?;
reader.parquet_schema(),
)?;
let row_counts = StatisticsConverter::row_group_row_counts(row_groups.iter())?;
Contributor:
This looks like a user-facing change; should it be ok at this stage?

@alamb (author):
Yes, it is a user-facing change, but we haven't released a version of DataFusion yet that includes StatisticsConverter (it was only added a week or two ago), so this will not be an API change for anyone using released versions.

.unwrap();

let _ = StatisticsConverter::row_counts(reader.metadata()).unwrap();
let _ = converter.row_group_mins(row_groups.iter()).unwrap();
Contributor:
This is more clear than using an enum IMO :)

@NGA-TRAN (Contributor) left a comment:
Nice. Thanks Andrew

)?;
let row_counts = StatisticsConverter::row_group_row_counts(row_groups.iter())?;
let value_column_mins = converter.row_group_mins(row_groups.iter())?;
let value_column_maxes = converter.row_group_maxes(row_groups.iter())?;
Contributor:
❤️

/// Null Count, returned as a [`UInt64Array`])
NullCount,
}

Contributor:
I agree we do not need this. We store each min/max/row_count as an array instead.

@alamb alamb mentioned this pull request Jun 7, 2024
@waynexia (Member) left a comment:
I gave it a quick skim and it looks good in general 👍

@alamb (author) commented Jun 8, 2024

Thank you @NGA-TRAN @xinlifoobar and @waynexia for the reviews

@alamb alamb merged commit 90f89e0 into apache:main Jun 8, 2024
11 checks passed
@alamb alamb deleted the alamb/vectorized_stats branch June 8, 2024 12:23
findepi pushed a commit to findepi/datafusion that referenced this pull request Jul 16, 2024