Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use upstream StatisticsConverter from arrow-rs in DataFusion #11479

Merged
merged 1 commit into from
Aug 1, 2024

Conversation

alamb
Copy link
Contributor

@alamb alamb commented Jul 15, 2024

Which issue does this PR close?

Closes #10922
Closes #11000

Rationale for this change

Now that @efredine has ported this code to arrow in apache/arrow-rs#6046 we can remove the copy in DataFusion

What changes are included in this PR?

Remove code from DataFusion and use the upstream arrow-rs version

Are these changes tested?

Are there any user-facing changes?

@github-actions github-actions bot added the core Core DataFusion crate label Jul 15, 2024
@@ -356,20 +356,24 @@ impl<'a> RowGroupPruningStatistics<'a> {
&'a self,
column: &'b Column,
) -> Result<StatisticsConverter<'a>> {
StatisticsConverter::try_new(&column.name, self.arrow_schema, self.parquet_schema)
Ok(StatisticsConverter::try_new(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is required to get the errors to convert

@@ -521,6 +518,31 @@ macro_rules! get_min_max_values_for_page_index {
}};
}

// Copy from arrow-rs
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is needed because the PagesPruningStatistics is not in terms of the StatisticsConverter code yet. I will try and do that in a separate PR and avoid the need for this

@alamb alamb force-pushed the alamb/use_upstream_statistics branch from 00eef87 to 409023e Compare July 29, 2024 11:06
use datafusion::datasource::TableProvider;
use datafusion::execution::object_store::ObjectStoreUrl;
use datafusion::parquet::arrow::arrow_reader::statistics::StatisticsConverter;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

code has been ported to arrow-rs (🙏 @efredine )

}
}

impl<'a> PruningStatistics for RowGroupPruningStatistics<'a> {
fn min_values(&self, column: &Column) -> Option<ArrayRef> {
self.statistics_converter(column)
.and_then(|c| c.row_group_mins(self.metadata_iter()))
.and_then(|c| Ok(c.row_group_mins(self.metadata_iter())?))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is necessary to convert from ArrowError to DataFusionError

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we use map_err? or into?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the suggestion.

I played around with a few alternatives and I concluded they were not easier to understand, so I plan to leave it as is. If you feel strongly I will make a follow on PR to change.

I couldn't figure out a way to use into()

Option 1: using map_err

    fn min_values(&self, column: &Column) -> Option<ArrayRef> {
        self.statistics_converter(column)
            .and_then(|c| {
                c.row_group_mins(self.metadata_iter())
                    .map_err(DataFusionError::from)
            })
            .ok()
    }

Option 2: discard error earlier with ok()

Now there are two nested ok()s

    fn min_values(&self, column: &Column) -> Option<ArrayRef> {
        self.statistics_converter(column)
            .ok()
            .and_then(|c| c.row_group_mins(self.metadata_iter()).ok())
    }

@@ -1,287 +0,0 @@
// Licensed to the Apache Software Foundation (ASF) under one
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

moved to arrow

@alamb alamb marked this pull request as ready for review July 30, 2024 10:08
@alamb
Copy link
Contributor Author

alamb commented Jul 30, 2024

@Ted-Jiang I wonder if you have a moment to review this PR?

Copy link
Contributor

@comphead comphead left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm thanks @alamb and @efredine for porting this

Copy link
Member

@Ted-Jiang Ted-Jiang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make sense to me.

@alamb
Copy link
Contributor Author

alamb commented Aug 1, 2024

Thank you for the reviews @Ted-Jiang and @comphead

@alamb alamb merged commit 4884c08 into apache:main Aug 1, 2024
27 checks passed
@alamb alamb deleted the alamb/use_upstream_statistics branch August 1, 2024 11:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate
Projects
None yet
3 participants