Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prune columns are all null in ParquetExec by row_counts , handle IS NOT NULL #9989

Merged
merged 7 commits into from
Apr 10, 2024

Conversation

Ted-Jiang
Copy link
Member

…atistics

Which issue does this PR close?

Part #9961 .

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

@github-actions github-actions bot added the core Core DataFusion crate label Apr 8, 2024
@@ -1026,15 +1028,17 @@ mod tests {
column_statistics: Vec<ParquetStatistics>,
) -> RowGroupMetaData {
let mut columns = vec![];
let number_row = 1000;
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Before all unit test set each col with default 0 row, will trigger num_rows == num_nulls

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was wondering if this could cause problems in real files (for example, if the row counts were not included in the statistics that were written into the file).

However, I double checked the code and it seems like ColumnChunkMetaData::num_values is non nullable so I think we are good.

@Ted-Jiang
Copy link
Member Author

@alamb @waynexia PTAL

@Ted-Jiang Ted-Jiang requested review from alamb and waynexia April 8, 2024 03:50
Copy link
Member

@waynexia waynexia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me 👍 Thanks @Ted-Jiang

datafusion/core/tests/parquet/row_group_pruning.rs Outdated Show resolved Hide resolved
None
fn row_counts(&self, column: &Column) -> Option<ArrayRef> {
let (c, _) = self.column(&column.name)?;
let scalar = ScalarValue::UInt64(Some(c.num_values() as u64));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member Author

@Ted-Jiang Ted-Jiang Apr 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alamb
Copy link
Contributor

alamb commented Apr 8, 2024

Thanks @Ted-Jiang -- this looks awesome -- I look forward to reviewing it later today

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @Ted-Jiang -- and @waynexia . I think the code looks good, though I think a few more tests would improve the coverage we could add them as a follow on PR.

Do you plan to add support in page_filter.rs as well (maybe that is why the PR is marked "Part #9961 ")?

@@ -1026,15 +1028,17 @@ mod tests {
column_statistics: Vec<ParquetStatistics>,
) -> RowGroupMetaData {
let mut columns = vec![];
let number_row = 1000;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was wondering if this could cause problems in real files (for example, if the row counts were not included in the statistics that were written into the file).

However, I double checked the code and it seems like ColumnChunkMetaData::num_values is non nullable so I think we are good.

datafusion/core/tests/parquet/mod.rs Outdated Show resolved Hide resolved
// After pruning, only row group 1,3 should be selected
RowGroupPruningTest::new()
.with_scenario(Scenario::AllNullValues)
.with_query("SELECT * FROM t WHERE \"i8\" is Null")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you also add a tests:

  1. i16 IS NOT NULL (to cover the opposite)
  2. i32 > 7 (prune via nulls and some via counts)

Copy link
Member Author

@Ted-Jiang Ted-Jiang Apr 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alamb thanks! Add test in 11567d9, and support the isNotNull

Do you plan to add support in page_filter.rs as well (maybe that is why the PR is marked "Part
#9961 ")?

As the page level prune i prefer in next pr to keep this pr short and clean.

None
fn row_counts(&self, column: &Column) -> Option<ArrayRef> {
let (c, _) = self.column(&column.name)?;
let scalar = ScalarValue::UInt64(Some(c.num_values() as u64));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alamb alamb changed the title Prune columns are all null in ParquetExec by row_counts in pruning st… Prune columns are all null in ParquetExec by row_counts , handle IS NOT NULL Apr 10, 2024
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @Ted-Jiang -- the support for IS NOT NULL is also quite neat.

@alamb alamb merged commit ed37467 into apache:main Apr 10, 2024
25 checks passed
@Ted-Jiang
Copy link
Member Author

Ted-Jiang commented Apr 12, 2024

Thanks @Ted-Jiang -- the support for IS NOT NULL is also quite neat.

@alamb sorry there is a bug here i will fix this it should

/// If set `with_not` to true: which means is not null
/// datafusion use false flag to prune unit (row group, page ..)
/// Given an expression reference to `expr`, if `expr` is a column expression,
/// returns a pruning expression in terms of IsNotNull that will evaluate to true
/// if the column may contain any nonnull values, and false if definitely does not contain
/// nonnull values which any null values.

This pull request was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants