Update parquet page pruning code to use the `StatisticsExtractor` #11483

alamb · 2024-07-15T22:06:15Z

Which issue does this PR close?

Rationale for this change

Let's use the nice API added in #10922, which is better tested, more performant, and handles more data types than the current code

What changes are included in this PR?

Rewrite the page pruning code to use StatisticsExtractor
Remove the previous data page extraction code
Improve comments and debug logging
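
For readers unfamiliar with page pruning, here is a minimal sketch of the idea, not the code in this PR. It assumes per-page min/max values are already available (in DataFusion they come from the Parquet page index via the statistics extraction code). `PageStats`, `Predicate`, and `pages_to_keep` are hypothetical names used only for illustration.

```rust
/// Hypothetical per-page statistics; real code reads these from the
/// Parquet page index rather than constructing them by hand.
#[derive(Debug, Clone, Copy)]
struct PageStats {
    min: i64,
    max: i64,
}

/// Hypothetical single-column predicate, kept tiny on purpose.
enum Predicate {
    GreaterThan(i64),
    LessThan(i64),
}

/// Returns one flag per page: `true` if the page might contain matching
/// rows and must be read, `false` if the statistics prove it cannot.
fn pages_to_keep(pages: &[PageStats], predicate: &Predicate) -> Vec<bool> {
    pages
        .iter()
        .map(|p| match predicate {
            // `col > v` can only match if the page maximum exceeds `v`.
            Predicate::GreaterThan(v) => p.max > *v,
            // `col < v` can only match if the page minimum is below `v`.
            Predicate::LessThan(v) => p.min < *v,
        })
        .collect()
}
```

A page whose flag is `false` can be skipped entirely, which is where the potential speedup comes from.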

Are these changes tested?

Yes, by existing tests

Here are the integration tests for page pruning
https://github.com/apache/datafusion/blob/77352b2411b5d9340374c30e21b861b0d0d46f82/datafusion/core/tests/parquet/page_pruning.rs#L83-L82

Also, the code for statistics extraction is quite well tested

Are there any user-facing changes?

No (though some queries might go faster as they will be better able to take advantage of the page index)

alamb · 2024-07-16T12:56:58Z

datafusion/core/src/datasource/physical_plan/parquet/mod.rs

@@ -225,7 +223,7 @@ pub struct ParquetExec {
    /// Optional predicate for pruning row groups (derived from `predicate`)
    pruning_predicate: Option<Arc<PruningPredicate>>,
    /// Optional predicate for pruning pages (derived from `predicate`)
-    page_pruning_predicate: Option<Arc<PagePruningPredicate>>,
+    page_pruning_predicate: Option<Arc<PagePruningAccessPlanFilter>>,


I renamed this to be consistent with what this is -- it isn't a pruning predicate per se

alamb · 2024-07-16T12:57:23Z

datafusion/core/src/datasource/physical_plan/parquet/mod.rs

@@ -749,26 +740,6 @@ fn should_enable_page_index(
            .unwrap_or(false)
 }

-// Convert parquet column schema to arrow data type, and just consider the


This is now handled entirely in the StatisticsConverter

alamb · 2024-07-16T12:58:18Z

datafusion/core/src/datasource/physical_plan/parquet/statistics.rs

@@ -1136,6 +1136,16 @@ pub struct StatisticsConverter<'a> {
 }

 impl<'a> StatisticsConverter<'a> {
+    /// Return the index of the column in the parquet file, if any


These are two new APIs I found I needed to add to the statistics converter API that is being ported upstream from @efredine in apache/arrow-rs#6046 (I'll do so later today)

alamb · 2024-07-16T12:58:33Z

datafusion/core/src/physical_optimizer/pruning.rs

-            .map(|(c, _s, _f)| c)
-            .collect::<HashSet<_>>()
-            .len()
+    /// Returns Some(column) if this is a single column predicate.


this was an easier API to work with

Update: I think @Dandandan's suggestion has also made it faster (it avoids the HashSet)

alamb · 2024-07-16T12:59:06Z

datafusion/core/src/datasource/physical_plan/parquet/page_filter.rs

-                        Some(Ok(p))
-                    }
-                    _ => None,
+                let pp =


I also added more logging for the cases when predicates can't be used for pruning

alamb · 2024-07-16T13:00:20Z

datafusion/core/src/datasource/physical_plan/parquet/page_filter.rs

        let row_group_indexes = access_plan.row_group_indexes();
-        for r in row_group_indexes {
+        for row_group_index in row_group_indexes {


I think this is now easier to read -- more of the index manipulation is captured in PagesPruningStatistics

alamb · 2024-07-16T13:01:58Z

datafusion/core/src/datasource/physical_plan/parquet/page_filter.rs

            debug!("Error evaluating page index predicate values {e}");
            metrics.predicate_evaluation_errors.add(1);
            return None;
        }
    };

+    // Convert the information of which pages to skip into a RowSelection
+    // that describes the ranges of rows to skip.
+    let Some(page_row_counts) = pruning_stats.page_row_counts() else {


I renamed row_vec to page_row_counts to make the logic clearer, and also added logging for when the row counts can't be constructed
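
As an illustration of the conversion described in the code comment above, here is a minimal sketch of turning per-page row counts plus per-page keep/skip decisions into row ranges. It is not the PR's implementation: `RowSelector` below is a simplified, hypothetical stand-in for the parquet crate's type of the same name, and `selectors_from_pages` is an invented helper.

```rust
/// Simplified stand-in for a row selector: a run of `row_count` rows
/// that is either skipped or selected. Hypothetical, for illustration.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct RowSelector {
    row_count: usize,
    skip: bool,
}

/// Builds selectors from per-page row counts and per-page keep flags,
/// merging adjacent pages that share the same decision into one run.
fn selectors_from_pages(page_row_counts: &[usize], keep: &[bool]) -> Vec<RowSelector> {
    let mut out: Vec<RowSelector> = Vec::new();
    for (&row_count, &keep_page) in page_row_counts.iter().zip(keep) {
        let skip = !keep_page;
        // Extend the previous run when the decision is unchanged.
        if let Some(last) = out.last_mut() {
            if last.skip == skip {
                last.row_count += row_count;
                continue;
            }
        }
        out.push(RowSelector { row_count, skip });
    }
    out
}
```

With page row counts `[100, 100, 50, 25]` and keep flags `[true, false, false, true]`, this yields a select of 100 rows, a skip of 150 rows, and a select of 25 rows.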

alamb · 2024-07-16T13:02:24Z

datafusion/core/src/datasource/physical_plan/parquet/page_filter.rs

@@ -378,206 +354,143 @@ fn prune_pages_in_one_row_group(
    Some(RowSelection::from(vec))
 }

-fn create_row_count_in_each_page(


Moved into a function on PagesPruningStatistics

alamb · 2024-07-16T13:03:16Z

datafusion/core/src/datasource/physical_plan/parquet/page_filter.rs

 }

-// Extract the min or max value calling `func` from page idex
-macro_rules! get_min_max_values_for_page_index {


This code is replaced by StatisticsConverter which we have now tested quite thoroughly (kudos to @marvinlanhenke and others)

alamb · 2024-07-16T13:42:07Z

@liukun4515 or @Ted-Jiang I wonder if you have time to review this code?

Ted-Jiang · 2024-07-17T07:19:39Z

@alamb thanks for pinging me, I will carefully review this.

Ted-Jiang

LGTM 👍 thanks @alamb the code looks more elegant

Ted-Jiang · 2024-07-17T08:09:20Z

datafusion/core/src/datasource/physical_plan/parquet/page_filter.rs

+    let Some(page_row_counts) = pruning_stats.page_row_counts() else {
+        debug!(
+            "Can not determine page row counts for row group {row_group_index}, skipping"
+        );


I think we need to add metrics.predicate_evaluation_errors.add(1); here,
as above: the doc says it returns None if there is an error evaluating the predicate

I added this in 79ecd80

I think it is somewhat debatable if missing row counts is a predicate evaluation error, but signaling that something went wrong will certainly help debug issues.
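
The pattern under discussion (bail out with `None` and bump an error counter when the row counts are unavailable) can be sketched roughly as follows. This is an assumption-laden illustration, not the code from 79ecd80: `Metrics`, `errors`, and `total_rows` are hypothetical, and the counter is a plain atomic rather than DataFusion's metrics type.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Hypothetical stand-in for the metrics struct; only the error counter
/// mentioned in the review is modeled.
#[derive(Default)]
struct Metrics {
    predicate_evaluation_errors: AtomicUsize,
}

impl Metrics {
    /// Current value of the error counter.
    fn errors(&self) -> usize {
        self.predicate_evaluation_errors.load(Ordering::Relaxed)
    }
}

/// Sums the page row counts, or records an error and returns `None`
/// when the counts could not be determined.
fn total_rows(page_row_counts: Option<Vec<usize>>, metrics: &Metrics) -> Option<usize> {
    let Some(counts) = page_row_counts else {
        // Signal that something went wrong so it shows up in metrics.
        metrics
            .predicate_evaluation_errors
            .fetch_add(1, Ordering::Relaxed);
        return None;
    };
    Some(counts.iter().sum())
}
```

Counting the failure keeps the early return visible in metrics instead of silently disabling pruning for that row group.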

alamb · 2024-07-17T12:03:06Z

Thank you very much for the review @Ted-Jiang

FYI @thinkharderdev as I think you use this feature as well. I don't expect this PR to have any effect but positive but wanted to give you a heads up

Dandandan · 2024-07-17T14:08:42Z

datafusion/core/src/physical_optimizer/pruning.rs

+    /// * `a > 5 OR b < 10` returns `None`
+    /// * `true` returns None
+    pub(crate) fn single_column(&self) -> Option<&phys_expr::Column> {
+        let cols = self.iter().map(|(c, _s, _f)| c).collect::<HashSet<_>>();


It seems a bit wasteful to collect into a HashSet only to decide whether it's a single column?
We can do e.g. self.columns.windows(2).all(|[x, y]| x.0 == y.0)

This is an excellent call. I did it in 7886e29. Thank you for the suggestion
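
For reference, a minimal sketch of the `windows(2)` idea, not the code from 7886e29. It assumes the columns are stored as a slice of tuples whose first element identifies the column; the `(String, String)` element type and the free function are hypothetical. The sketch indexes into each window rather than destructuring it in the closure parameter, since a slice pattern like `|[x, y]|` is refutable there.

```rust
/// Returns the column name if every entry refers to the same column,
/// and `None` if there are no entries or more than one distinct column.
/// The tuple layout here is a hypothetical simplification.
fn single_column(columns: &[(String, String)]) -> Option<&str> {
    // No columns at all (e.g. the predicate `true`) yields `None`.
    let first = columns.first()?;
    columns
        .windows(2)
        // Each window is a slice of exactly two adjacent entries.
        .all(|pair| pair[0].0 == pair[1].0)
        .then(|| first.0.as_str())
}
```

Comparing adjacent entries needs no allocation, whereas collecting into a `HashSet` allocates and hashes every column just to check the set's size.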

alamb · 2024-07-18T10:04:22Z

🚀

…ache#11483) * Update the parquet code prune_pages_in_one_row_group to use the `StatisticsExtractor` * fix doc * Increase evaluation error counter if error determining data page row counts * Optimize `single_column`

alamb changed the title ~~Update the parquet code prune_pages_in_one_row_group to use the StatisticsExtractor~~ WIP: Update the parquet code prune_pages_in_one_row_group to use the StatisticsExtractor Jul 15, 2024

github-actions bot added the core Core DataFusion crate label Jul 16, 2024

alamb force-pushed the alamb/prune_pages_statistics_converter branch from c9bd0af to 23f3efd Compare July 16, 2024 12:43

Update the parquet code prune_pages_in_one_row_group to use the `StatisticsExtractor`

62bacdd

alamb force-pushed the alamb/prune_pages_statistics_converter branch from 23f3efd to 62bacdd Compare July 16, 2024 12:55

alamb changed the title ~~WIP: Update the parquet code prune_pages_in_one_row_group to use the StatisticsExtractor~~ Update the parquet page pruning code to use the StatisticsExtractor Jul 16, 2024

fix doc

cc6fe99

alamb marked this pull request as ready for review July 16, 2024 13:41

alamb requested a review from Ted-Jiang July 16, 2024 13:41

This was referenced Jul 16, 2024

Add parquet StatisticsConverter for arrow reader apache/arrow-rs#6046

Merged

Use upstream StatisticsConverter from arrow-rs in DataFusion #11479

Merged

Merge remote-tracking branch 'apache/main' into alamb/prune_pages_statistics_converter

8be0e10

Ted-Jiang approved these changes Jul 17, 2024

Ted-Jiang reviewed Jul 17, 2024

alamb added 2 commits July 17, 2024 07:52

Merge remote-tracking branch 'apache/main' into alamb/prune_pages_statistics_converter

93fd08c

Increase evaluation error counter if error determining data page row counts

79ecd80

alamb changed the title ~~Update the parquet page pruning code to use the StatisticsExtractor~~ Update parquet page pruning code to use the StatisticsExtractor Jul 17, 2024

Dandandan reviewed Jul 17, 2024

alamb added 2 commits July 17, 2024 15:43

Merge remote-tracking branch 'apache/main' into alamb/prune_pages_statistics_converter

3024a8a

Optimize single_column

7886e29

Dandandan approved these changes Jul 17, 2024

alamb merged commit b197449 into apache:main Jul 18, 2024
23 checks passed

alamb deleted the alamb/prune_pages_statistics_converter branch July 18, 2024 10:04
