Skip to content

Conversation

@ulysses-you
Copy link
Contributor

What changes were proposed in this pull request?

Apply ColumnPruning for in subquery filter.

Note that, the bloom filter side has already fixed by #36047

Why are the changes needed?

The inferred in-subquery filter should apply ColumnPruning before get plan statistics and check if can be broadcasted. Otherwise, the final physical plan will be different from expected.

Does this PR introduce any user-facing change?

no

How was this patch tested?

add test

@github-actions github-actions bot added the SQL label Nov 11, 2022
case Join(_, agg: Aggregate, LeftSemi, _, _) => agg
}
assert(agg.size == 1)
assert(agg.head.fastEquals(ColumnPruning(agg.head)))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this test can pass without this pr because

Batch("Extract Python UDFs", Once,
ExtractPythonUDFFromJoinCondition,
// `ExtractPythonUDFFromJoinCondition` can convert a join to a cartesian product.
// Here, we rerun cartesian product check.
CheckCartesianProducts,
ExtractPythonUDFFromAggregate,
// This must be executed after `ExtractPythonUDFFromAggregate` and before `ExtractPythonUDFs`.
ExtractGroupingPythonUDFFromAggregate,
ExtractPythonUDFs,
// The eval-python node may be between Project/Filter and the scan node, which breaks
// column pruning and filter push-down. Here we rerun the related optimizer rules.
ColumnPruning,

I think it is just a coincidence since we converted the subquery to left semi join ..

@ulysses-you
Copy link
Contributor Author

cc @wangyum @cloud-fan @sigmod thank you

val actualFilterKeyExpr = mayWrapWithHash(filterCreationSideExp)
val alias = Alias(actualFilterKeyExpr, actualFilterKeyExpr.toString)()
val aggregate = Aggregate(Seq(alias), Seq(alias), filterCreationSidePlan)
val aggregate = ColumnPruning(Aggregate(Seq(alias), Seq(alias), filterCreationSidePlan))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks a bit hacky. Can we use the catalyst framework to optimize it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm afraid that will be overkill to apply the whole optimizer rules. The filterCreationSidePlan is simple enough, it can only contain project and filter. Apply ColumnPruning here seems more safer ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how about the filter push rule?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The filterCreationSidePlan has been optimized , so it should be done if there is a filter can be pushed.

I think one more useful rule is CollapseProject, but it should be fine not to apply here since PhysicalOperation support collect adjacent projects.

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, LGTM.

@cloud-fan
Copy link
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in bd29ca7 Nov 15, 2022
@ulysses-you
Copy link
Contributor Author

thank you @cloud-fan @dongjoon-hyun

@ulysses-you ulysses-you deleted the SPARK-41112 branch November 15, 2022 09:50
SandishKumarHN pushed a commit to SandishKumarHN/spark that referenced this pull request Dec 12, 2022
…ith in-subquery filter

### What changes were proposed in this pull request?

Apply ColumnPruning for in subquery filter.

Note that, the bloom filter side has already fixed by apache#36047

### Why are the changes needed?

The inferred in-subquery filter should apply ColumnPruning before get plan statistics and check if can be broadcasted. Otherwise, the final physical plan will be different from expected.

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

add test

Closes apache#38619 from ulysses-you/SPARK-41112.

Authored-by: ulysses-you <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants