[SPARK-41017][SQL] Support column pruning with multiple nondeterministic Filters #38511
Conversation
also cc @wangyum @ulysses-you
TBH, the comment here is hard to understand.
So for filter pushdown, we will use the last filter. For schema pruning, we will use all the filters.
I wonder if we should return both `allFilters` and `pushdownFilters` to make the semantics clearer.
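As a rough sketch of that suggestion (the type and field names here are hypothetical, not Spark's actual API), the match result could carry both lists explicitly so callers cannot confuse them:

```scala
// Hypothetical result shape for the suggestion above; names are
// illustrative, not Spark's actual ScanOperation API.
case class ScanResult(
    projects: Seq[String],
    allFilters: Seq[String],      // every adjacent Filter condition: used for schema pruning
    pushdownFilters: Seq[String]) // only the bottom-most Filter: safe to push into the scan

object ScanResultDemo extends App {
  val result = ScanResult(
    projects = Seq("a"),
    allFilters = Seq("rand() > 0.5", "rand() < 0.8"),
    pushdownFilters = Seq("rand() < 0.8"))

  // Schema pruning consults allFilters; source pushdown only pushdownFilters.
  println(result.allFilters.size)       // 2
  println(result.pushdownFilters.head)  // rand() < 0.8
}
```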
Nit: add a simple comment
Previously, this was `filters.forall(_.deterministic)`; why is it relaxed here too? I think it is not under the `canKeepMultipleFilters` condition below.
This is the core change of this PR. `PhysicalOperation` returns a single filter condition, which means it combines filters, and we have to make sure all the filters are deterministic. `ScanOperation` returns multiple filter conditions and does not have this restriction.
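The distinction can be sketched with a standalone toy model (plain case classes, not Spark's Catalyst classes): merging adjacent `Filter`s into one condition forces the matcher to stop at any non-deterministic `Filter`, while collecting the conditions as a list does not.

```scala
// Toy plan model; not Spark's LogicalPlan classes.
sealed trait Plan
case class Relation(output: Seq[String]) extends Plan
case class Filter(condition: String, deterministic: Boolean, child: Plan) extends Plan

object MatcherDemo extends App {
  // PhysicalOperation-style: combines adjacent Filters into one condition,
  // so it must stop at the first non-deterministic Filter it meets.
  def collectCombined(plan: Plan): (Seq[String], Plan) = plan match {
    case Filter(cond, true, child) =>
      val (rest, leaf) = collectCombined(child)
      (cond +: rest, leaf)
    case other => (Nil, other)
  }

  // ScanOperation-style: keeps each adjacent Filter's condition separately,
  // so non-deterministic Filters no longer block the match.
  def collectAll(plan: Plan): (Seq[String], Plan) = plan match {
    case Filter(cond, _, child) =>
      val (rest, leaf) = collectAll(child)
      (cond +: rest, leaf)
    case other => (Nil, other)
  }

  val plan = Filter("rand() > 0.5", false,
    Filter("rand() < 0.8", false, Relation(Seq("a", "b"))))

  println(collectCombined(plan)._1) // List(): blocked by non-determinism
  println(collectAll(plan)._1)      // List(rand() > 0.5, rand() < 0.8)
}
```

In real Spark the determinism rules are more nuanced than this toy, but the shape of the change is the same: the new matcher reaches the relation underneath both `Filter`s, which is what enables column pruning.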
Hmm, is that the same as before?
This constructs a new `Filter` with the projected predicates (reduced by `And`). But this change reduces all projected predicates from all adjoining `Filter`s, which can be non-deterministic?
pushed a refactor to make the code easier to understand, please take another look, thanks!
Merged to master.
### What changes were proposed in this pull request?
This is a followup of #38511 to fix a mistake: we should respect the original `Filter` operator order when re-constructing the query plan.
### Why are the changes needed?
Bug fix.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
N/A
Closes #38684 from cloud-fan/column-pruning.
Lead-authored-by: Wenchen Fan <[email protected]>
Co-authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
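The order mistake can be illustrated with a standalone toy (plain case classes, not Spark code): the matched filter conditions come out in top-down order, so re-wrapping the leaf must put the first collected condition back on the outside.

```scala
// Toy Filter node; not Spark's LogicalPlan.
case class Filter(condition: String, child: Any)

object RebuildDemo extends App {
  val collected = Seq("rand() > 0.5", "rand() < 0.8") // top-down, as matched
  val relation: Any = "Relation"

  // foldRight wraps innermost-first, preserving the original operator order...
  val correct = collected.foldRight(relation)((cond, child) => Filter(cond, child))

  // ...while foldLeft silently swaps the two Filters.
  val reversed = collected.foldLeft(relation)((child, cond) => Filter(cond, child))

  println(correct)  // Filter(rand() > 0.5,Filter(rand() < 0.8,Relation))
  println(reversed) // Filter(rand() < 0.8,Filter(rand() > 0.5,Relation))
}
```

Order matters here precisely because these predicates are non-deterministic: evaluating `rand() < 0.8` before `rand() > 0.5` filters a different sample of rows than the original plan would.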
…nondeterministic predicates
### What changes were proposed in this pull request?
This PR fixes a regression caused by #38511. For `FROM t WHERE rand() > 0.5 AND col = 1`, we can still push down `col = 1` because we don't guarantee the predicate evaluation order within a `Filter`. This PR updates `ScanOperation` to consider this case and bring back the previous pushdown behavior.
### Why are the changes needed?
Fix perf regression.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
New tests.
Closes #38746 from cloud-fan/filter.
Lead-authored-by: Wenchen Fan <[email protected]>
Co-authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
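A toy illustration of that rule (plain strings standing in for Catalyst expressions, with a crude determinism check): within a single `Filter`, the conjuncts of an `AND` have no guaranteed evaluation order, so the deterministic conjuncts can still be pushed down past a non-deterministic sibling.

```scala
object ConjunctDemo extends App {
  // Split a conjunction into its conjuncts (toy string model).
  def splitConjuncts(predicate: String): Seq[String] =
    predicate.split(" AND ").map(_.trim).toSeq

  // Crude stand-in for Expression.deterministic.
  def isDeterministic(p: String): Boolean = !p.contains("rand()")

  val (pushable, remaining) =
    splitConjuncts("rand() > 0.5 AND col = 1").partition(isDeterministic)

  println(pushable)  // List(col = 1)       -> pushed down to the source
  println(remaining) // List(rand() > 0.5)  -> evaluated in the Filter
}
```

Note the contrast with separate `Filter` operators: distinct non-deterministic `Filter`s must keep their relative order, but conjuncts inside one `Filter` may be reordered, which is what makes this pushdown safe.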
…tic Filters
### What changes were proposed in this pull request?
Today, Spark does column pruning in 3 steps:
1. The rule `PushDownPredicates` pushes down `Filter`s as close to the scan node as possible.
2. The rule `ColumnPruning` generates `Project` below many operators, to prune columns before evaluating these operators. One exception is `Filter`: we do not generate `Project` below `Filter`, as it conflicts with `PushDownPredicates`.
3. After the above 2 steps, we should have a plan pattern like `Project(..., Filter(..., Relation))`, and we have rules (DS v1 and v2 have different rules) to match this pattern using `PhysicalOperation`, then apply filter pushdown and column pruning.
This works fine in most cases, but we cannot always combine adjacent `Filter`s into one, due to non-deterministic predicates. For example, in `Project(a, Filter(rand() > 0.5, Filter(rand() < 0.8, Relation)))`, `PhysicalOperation` can only match `Filter(rand() < 0.8, Relation)` and we can't do column pruning today.
This PR fixes this problem by adding a variant of `PhysicalOperation`: `ScanOperation`. It keeps all the adjacent `Filter`s, so that it can match more plan patterns and do column pruning better. The caller sides are also updated to restore the `Filter`s w.r.t. their original order in the query plan.
### Why are the changes needed?
Apply column pruning in more cases.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
New tests.
Closes apache#38511 from cloud-fan/column-pruning.
Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Yuming Wang <[email protected]>
What changes were proposed in this pull request?
Today, Spark does column pruning in 3 steps:
1. The rule `PushDownPredicates` pushes down `Filter`s as close to the scan node as possible.
2. The rule `ColumnPruning` generates `Project` below many operators, to prune columns before evaluating these operators. One exception is `Filter`: we do not generate `Project` below `Filter`, as it conflicts with `PushDownPredicates`.
3. After the above 2 steps, we should have a plan pattern like `Project(..., Filter(..., Relation))`, and we have rules (DS v1 and v2 have different rules) to match this pattern using `PhysicalOperation`, then apply filter pushdown and column pruning.
This works fine in most cases, but we cannot always combine adjacent `Filter`s into one, due to non-deterministic predicates. For example, in `Project(a, Filter(rand() > 0.5, Filter(rand() < 0.8, Relation)))`, `PhysicalOperation` can only match `Filter(rand() < 0.8, Relation)` and we can't do column pruning today.
This PR fixes this problem by adding a variant of `PhysicalOperation`: `ScanOperation`. It keeps all the adjacent `Filter`s, so that it can match more plan patterns and do column pruning better. The caller sides are also updated to restore the `Filter`s w.r.t. their original order in the query plan.
Why are the changes needed?
Apply column pruning in more cases.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
New tests.
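The payoff described above can be shown with a self-contained toy (not Spark code): once both adjacent `Filter`s are matched, only the columns referenced by the project list or by some filter need to survive pruning.

```scala
object PruningDemo extends App {
  val relationColumns = Seq("a", "b", "c")

  // Columns referenced by Project(a, ...) and by each rand() filter
  // (the rand() predicates reference no columns at all).
  val projectRefs = Set("a")
  val filterRefs = Seq(Set.empty[String], Set.empty[String])

  // Union all references, then keep only the needed relation columns.
  val needed = filterRefs.foldLeft(projectRefs)(_ union _)
  val pruned = relationColumns.filter(needed)

  println(pruned) // List(a): columns b and c never need to be read
}
```

This is exactly the case the PR unlocks: with `PhysicalOperation` the outer non-deterministic `Filter` blocked the match, so the scan had to produce all three columns even though only `a` is ever used.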