
Prune Parquet RowGroup in a single call to PruningPredicate::prune, update StatisticsExtractor API #10802

Merged: 3 commits into apache:main on Jun 8, 2024

Conversation

@alamb (Contributor) commented Jun 5, 2024

Which issue does this PR close?

Part of #10453 and #9929

Follow on to #10607

Rationale for this change

The primary benefit of this PR is to start using the new API introduced in #10537 in the ParquetExec path. I plan a follow on project to use the same basic API to extract and prune pages within row groups.

The current ParquetExec prunes one row group at a time, creating one-row ArrayRefs for each required min/max/count statistic. It would be better to create a single array with the data for multiple row groups and make a single call to the vectorized pruning that PruningPredicate performs.

We recently made a similar change in InfluxDB IOx and saw a significant performance improvement for queries that accessed many row groups.

I expect this to be a performance improvement, but I am not sure it will be measurable unless there are an extremely large number of row groups in a file.
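The shape of the change above can be sketched with plain Rust. This is an illustrative mock only: `RowGroupStats` and `prune_eq` are hypothetical names invented here, not the real DataFusion types. The point is that instead of evaluating the predicate once per row group on one-row arrays, the min/max values for all row groups are gathered into parallel arrays and the predicate is evaluated over them in a single pass.

```rust
/// Hypothetical per-row-group statistics for one column
/// (stand-in for the arrays PruningPredicate::prune operates on).
struct RowGroupStats {
    mins: Vec<i64>,
    maxes: Vec<i64>,
}

/// Evaluate `col = value` against all row groups in one vectorized pass.
/// `true` means the row group might contain matching rows and must be scanned.
fn prune_eq(stats: &RowGroupStats, value: i64) -> Vec<bool> {
    stats
        .mins
        .iter()
        .zip(&stats.maxes)
        // A row group can contain `value` only if min <= value <= max.
        .map(|(&min, &max)| min <= value && value <= max)
        .collect()
}

fn main() {
    // Three row groups with value ranges [0,9], [10,19], [20,29].
    let stats = RowGroupStats {
        mins: vec![0, 10, 20],
        maxes: vec![9, 19, 29],
    };
    // Only the middle row group can contain the value 15.
    let keep = prune_eq(&stats, 15);
    assert_eq!(keep, vec![false, true, false]);
    println!("{keep:?}");
}
```

The per-row-group approach would instead call a predicate three times, each on length-1 arrays; the single vectorized call avoids that per-group overhead, which is why the win grows with the number of row groups.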

What changes are included in this PR?

  1. Call PruningPredicate::prune once per file (rather than once per row group)
  2. Switch to the StatisticsExtractor API introduced in #10537 (feat: API for collecting statistics/index for metadata of a parquet file + tests)
  3. Update the StatisticsExtractor API so it extracts a specified set of row groups rather than all of them

The changes to the StatisticsExtractor API return min/max statistics via separate functions rather than an enum. This will allow the same basic API to extract min/max statistics for pages as well (page_mins(), page_maxs(), page_counts(), etc.).
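A minimal mock of that API shape, using invented types (the method names mirror the PR, but `ColumnStats` and `Converter` are stand-ins for illustration, not the real StatisticsConverter): one accessor per statistic, each returning an array with one entry per row group, instead of a single method parameterized by an enum.

```rust
/// Toy stand-in for one row group's column statistics.
#[derive(Clone, Copy)]
struct ColumnStats {
    min: i64,
    max: i64,
    row_count: u64,
}

/// Toy stand-in for the converter: one function per statistic kind.
struct Converter<'a> {
    row_groups: &'a [ColumnStats],
}

impl<'a> Converter<'a> {
    fn row_group_mins(&self) -> Vec<i64> {
        self.row_groups.iter().map(|s| s.min).collect()
    }
    fn row_group_maxes(&self) -> Vec<i64> {
        self.row_groups.iter().map(|s| s.max).collect()
    }
    fn row_group_row_counts(&self) -> Vec<u64> {
        self.row_groups.iter().map(|s| s.row_count).collect()
    }
}

fn main() {
    let groups = [
        ColumnStats { min: 0, max: 9, row_count: 100 },
        ColumnStats { min: 10, max: 19, row_count: 50 },
    ];
    let converter = Converter { row_groups: &groups };
    // Each accessor yields one value per row group.
    assert_eq!(converter.row_group_mins(), vec![0, 10]);
    assert_eq!(converter.row_group_maxes(), vec![9, 19]);
    assert_eq!(converter.row_group_row_counts(), vec![100, 50]);
}
```

The function-per-statistic design extends naturally to page-level variants: adding page_mins() and friends later requires new methods rather than new enum variants threaded through every caller.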

Are these changes tested?

Covered by existing CI tests.

I ran some benchmark tests and this doesn't seem to meaningfully change performance.

Are there any user-facing changes?

The StatisticsExtractor API has changed, but since this API has not yet been released, this is not strictly a breaking API change.

@github-actions github-actions bot added the core Core DataFusion crate label Jun 5, 2024
@alamb alamb changed the title Optimize Parquet RowGroup pruning, update StatisticsExtractor API Prune RowGroup in a single call to PruningPredicate::prune, update StatisticsExtractor API Jun 5, 2024
@alamb alamb changed the title Prune RowGroup in a single call to PruningPredicate::prune, update StatisticsExtractor API Prune Parquet RowGroup in a single call to PruningPredicate::prune, update StatisticsExtractor API Jun 5, 2024
@alamb alamb marked this pull request as ready for review June 5, 2024 13:14

// Extract the min/max values for each row group from the statistics
let row_counts = StatisticsConverter::row_counts(reader.metadata())?;
let value_column_mins = StatisticsConverter::try_new(
let converter = StatisticsConverter::try_new(
@alamb (author):
This is a pretty good example of how the statistics API changed. FYI @NGA-TRAN

Ok(values) => {
// NB: false means don't scan row group
if !values[0] {
// Indexes of row groups still to scan
@alamb (author):
Here is the change to prune all row groups with one call to PruningPredicate::prune rather than one call per row group

let iter = metadatas
.into_iter()
.map(|x| x.column(parquet_index).statistics());
max_statistics(data_type, iter)
Contributor:
The min_statistics and max_statistics changes in another PR could still be used here...

@alamb (author):
Yes, indeed -- you are exactly correct. I purposely didn't change min_statistics and max_statistics as I knew you were working on them already

.extract(reader.metadata())?;
reader.parquet_schema(),
)?;
let row_counts = StatisticsConverter::row_group_row_counts(row_groups.iter())?;
Contributor:
This looks like a user-facing change; should it be ok at this stage?

@alamb (author):
Yes, it is a user-facing change, but we haven't released a version of DataFusion yet that includes StatisticsConverter (it was only added a week or two ago), so this will not be an API change for anyone using released versions.

.unwrap();

let _ = StatisticsConverter::row_counts(reader.metadata()).unwrap();
let _ = converter.row_group_mins(row_groups.iter()).unwrap();
Contributor:
This is more clear than using an enum IMO :)

@NGA-TRAN (Contributor) left a comment:
Nice. Thanks Andrew

)?;
let row_counts = StatisticsConverter::row_group_row_counts(row_groups.iter())?;
let value_column_mins = converter.row_group_mins(row_groups.iter())?;
let value_column_maxes = converter.row_group_maxes(row_groups.iter())?;
Contributor:
❤️

/// Null Count, returned as a [`UInt64Array`])
NullCount,
}

Contributor:
I agree we do not need this. We store each min/max/row_count as an array instead.

@alamb alamb mentioned this pull request Jun 7, 2024
@waynexia (Member) left a comment:
I gave it a quick skim and it looks good in general 👍

@alamb (author) commented Jun 8, 2024

Thank you @NGA-TRAN @xinlifoobar and @waynexia for the reviews

@alamb alamb merged commit 90f89e0 into apache:main Jun 8, 2024
11 checks passed
@alamb alamb deleted the alamb/vectorized_stats branch June 8, 2024 12:23
findepi pushed a commit to findepi/datafusion that referenced this pull request Jul 16, 2024