[SPARK-41112][SQL] RuntimeFilter should apply ColumnPruning eagerly with in-subquery filter #38619

ulysses-you · 2022-11-11T07:25:58Z

What changes were proposed in this pull request?

Apply ColumnPruning to the in-subquery filter.

Note that the bloom filter side has already been fixed by #36047.

Why are the changes needed?

The inferred in-subquery filter should apply ColumnPruning before getting the plan statistics and checking whether the plan can be broadcast. Otherwise, the final physical plan will differ from the expected one.
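To illustrate why the order matters, here is a minimal sketch in plain Scala. It is a toy model, not Spark's API: `Plan`, `Scan`, `Aggregate`, `sizeInBytes`, `prune` and `canBroadcast` are hypothetical names, and the estimator is assumed to size a node by the data its input carries. Under that assumption, the same aggregate fails the broadcast check before pruning and passes it afterwards.

```scala
// Toy model only: none of these classes are Spark's LogicalPlan types.
object BroadcastEstimateSketch {
  sealed trait Plan
  // A scan carries named columns, each with an assumed average width in bytes.
  final case class Scan(widths: Map[String, Long], rows: Long) extends Plan
  final case class Aggregate(keys: Seq[String], child: Plan) extends Plan

  // Assumed estimator: a node is as large as the data its input carries.
  def sizeInBytes(plan: Plan): Long = plan match {
    case Scan(widths, rows)  => rows * widths.values.sum
    case Aggregate(_, child) => sizeInBytes(child)
  }

  // Minimal pruning: keep only the columns the aggregate references.
  def prune(plan: Plan): Plan = plan match {
    case Aggregate(keys, Scan(widths, rows)) =>
      val kept = widths.filter { case (name, _) => keys.contains(name) }
      Aggregate(keys, Scan(kept, rows))
    case other => other
  }

  def canBroadcast(plan: Plan, thresholdBytes: Long): Boolean =
    sizeInBytes(plan) <= thresholdBytes

  // Hypothetical data: a narrow key column next to a wide payload column.
  val scan: Plan = Scan(Map("id" -> 8L, "payload" -> 1000L), rows = 100000L)
  val unpruned: Plan = Aggregate(Seq("id"), scan)
  val threshold: Long = 10L * 1024 * 1024
}
```

In this sketch, checking `canBroadcast` on `unpruned` and on `prune(unpruned)` gives different answers for the same query, which is the kind of mismatch the description refers to.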

Does this PR introduce any user-facing change?

no

How was this patch tested?

Added a test.

ulysses-you · 2022-11-11T07:29:36Z

sql/core/src/test/scala/org/apache/spark/sql/InjectRuntimeFilterSuite.scala

+          case Join(_, agg: Aggregate, LeftSemi, _, _) => agg
+        }
+        assert(agg.size == 1)
+        assert(agg.head.fastEquals(ColumnPruning(agg.head)))


This test can pass without this PR because

spark/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkOptimizer.scala

Lines 68 to 79 in 38897b1

    Batch("Extract Python UDFs", Once,
      ExtractPythonUDFFromJoinCondition,
      // `ExtractPythonUDFFromJoinCondition` can convert a join to a cartesian product.
      // Here, we rerun cartesian product check.
      CheckCartesianProducts,
      ExtractPythonUDFFromAggregate,
      // This must be executed after `ExtractPythonUDFFromAggregate` and before `ExtractPythonUDFs`.
      ExtractGroupingPythonUDFFromAggregate,
      ExtractPythonUDFs,
      // The eval-python node may be between Project/Filter and the scan node, which breaks
      // column pruning and filter push-down. Here we rerun the related optimizer rules.
      ColumnPruning,

I think it is just a coincidence, since we converted the subquery to a left semi join.
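The assertion in the test, `agg.head.fastEquals(ColumnPruning(agg.head))`, treats a plan as pruned when re-applying the rule leaves it unchanged. A minimal sketch of that idea in plain Scala follows. The plan classes, `prune` and `isAlreadyPruned` are hypothetical stand-ins, not Spark's types or its `ColumnPruning` rule.

```scala
// Toy model only: a fixed-point check for a single rewrite rule.
object IdempotenceCheckSketch {
  sealed trait Plan
  final case class Scan(columns: Seq[String]) extends Plan
  final case class Project(columns: Seq[String], child: Plan) extends Plan
  final case class Aggregate(keys: Seq[String], child: Plan) extends Plan

  // Simplified pruning: if the scan under an aggregate exposes columns the
  // aggregate does not use, insert a Project that keeps only the keys.
  def prune(plan: Plan): Plan = plan match {
    case Aggregate(keys, Scan(cols)) if cols.exists(c => !keys.contains(c)) =>
      Aggregate(keys, Project(keys, Scan(cols)))
    case other => other
  }

  // A plan counts as pruned when the rule has nothing left to change.
  def isAlreadyPruned(plan: Plan): Boolean = prune(plan) == plan

  val raw: Plan = Aggregate(Seq("id"), Scan(Seq("id", "payload")))
}
```

Such a check only inspects the plan it is given. As the comment above points out, when it runs against the final optimized plan it can pass even if pruning happened late, in a later batch, rather than eagerly.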

ulysses-you · 2022-11-11T07:30:44Z

cc @wangyum @cloud-fan @sigmod thank you

cloud-fan · 2022-11-11T09:13:37Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/InjectRuntimeFilter.scala

    val actualFilterKeyExpr = mayWrapWithHash(filterCreationSideExp)
    val alias = Alias(actualFilterKeyExpr, actualFilterKeyExpr.toString)()
-    val aggregate = Aggregate(Seq(alias), Seq(alias), filterCreationSidePlan)
+    val aggregate = ColumnPruning(Aggregate(Seq(alias), Seq(alias), filterCreationSidePlan))


This looks a bit hacky. Can we use the catalyst framework to optimize it?

I'm afraid applying the whole set of optimizer rules would be overkill. The filterCreationSidePlan is simple enough: it can only contain Project and Filter nodes. Applying ColumnPruning here seems safer?

How about the filter pushdown rule?

The filterCreationSidePlan has already been optimized, so any filter that can be pushed down should already have been pushed.

I think one more useful rule is CollapseProject, but it should be fine not to apply it here, since PhysicalOperation supports collecting adjacent projects.
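For readers unfamiliar with the idea of collapsing adjacent projects, here is a minimal sketch in plain Scala. It assumes projects that only select columns by name, with no expressions or aliases; Spark's real CollapseProject rule also has to substitute aliases. The classes and the `collapse` function are hypothetical, not Spark's.

```scala
// Toy model only: merge directly nested projections into one.
object CollapseProjectSketch {
  sealed trait Plan
  final case class Scan(columns: Seq[String]) extends Plan
  final case class Project(columns: Seq[String], child: Plan) extends Plan

  // When one Project sits directly on another, the outer column list alone
  // decides the output, so the inner Project can be dropped.
  def collapse(plan: Plan): Plan = plan match {
    case Project(outer, Project(_, child)) => collapse(Project(outer, child))
    case Project(cols, child)              => Project(cols, collapse(child))
    case scan: Scan                        => scan
  }

  val nested: Plan =
    Project(Seq("a"), Project(Seq("a", "b"), Scan(Seq("a", "b", "c"))))
}
```

Here `collapse(nested)` yields a single `Project(Seq("a"), ...)` directly over the scan.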

dongjoon-hyun

+1, LGTM.

cloud-fan · 2022-11-15T04:54:26Z

thanks, merging to master!

ulysses-you · 2022-11-15T09:50:48Z

thank you @cloud-fan @dongjoon-hyun

Closes apache#38619 from ulysses-you/SPARK-41112.

Authored-by: ulysses-you

Signed-off-by: Wenchen Fan

RuntimeFilter should apply ColumnPruning eagerly with in-subquery filter

30780ea

github-actions bot added the SQL label Nov 11, 2022

cloud-fan reviewed Nov 11, 2022

cloud-fan mentioned this pull request Nov 14, 2022: [SPARK-38959][SQL][FOLLOWUP] Do not optimize subqueries twice #38626 (Closed)

cloud-fan approved these changes Nov 14, 2022

dongjoon-hyun approved these changes Nov 15, 2022

cloud-fan closed this in bd29ca7 Nov 15, 2022

ulysses-you deleted the SPARK-41112 branch November 15, 2022 09:50
