[SPARK-23172][SQL] Expand the ReorderJoin rule to handle Project nodes #20345

maropu · 2018-01-21T18:52:21Z

What changes were proposed in this pull request?

The current ReorderJoin optimizer rule cannot flatten the pattern Join -> Project -> Join because ExtractFiltersAndInnerJoins doesn't handle Project nodes. As a result, the current master cannot reorder the joins in the query below:

val df1 = spark.range(100).selectExpr("id % 10 AS k0", s"id % 10 AS k1", s"id % 10 AS k2", "id AS v1")
val df2 = spark.range(10).selectExpr("id AS k0", "id AS v2")
val df3 = spark.range(10).selectExpr("id AS k1", "id AS v3")
val df4 = spark.range(10).selectExpr("id AS k2", "id AS v4")
df1.join(df2, "k0").join(df3, "k1").join(df4, "k2").explain(true)

== Analyzed Logical Plan ==
k2: bigint, k1: bigint, k0: bigint, v1: bigint, v2: bigint, v3: bigint, v4: bigint
Project [k2#5L, k1#4L, k0#3L, v1#6L, v2#16L, v3#24L, v4#32L]
+- Join Inner, (k2#5L = k2#31L)
   :- Project [k1#4L, k0#3L, k2#5L, v1#6L, v2#16L, v3#24L]
   :  +- Join Inner, (k1#4L = k1#23L)
   :     :- Project [k0#3L, k1#4L, k2#5L, v1#6L, v2#16L]
   :     :  +- Join Inner, (k0#3L = k0#15L)
   :     :     :- Project [(id#0L % cast(10 as bigint)) AS k0#3L, (id#0L % cast(10 as bigint)) AS k1#4L, (id#0L % cast(10 as bigint)) AS k2#5L, id#0L AS v1#6L]
   :     :     :  +- Range (0, 100, step=1, splits=Some(4))
   :     :     +- Project [id#12L AS k0#15L, id#12L AS v2#16L]
   :     :        +- Range (0, 10, step=1, splits=Some(4))
   :     +- Project [id#20L AS k1#23L, id#20L AS v3#24L]
   :        +- Range (0, 10, step=1, splits=Some(4))
   +- Project [id#28L AS k2#31L, id#28L AS v4#32L]
      +- Range (0, 10, step=1, splits=Some(4))

To reorder the joins in this query, this PR adds code to handle Project nodes in ExtractFiltersAndInnerJoins.
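The idea of looking through a Project while flattening a join tree can be illustrated with a minimal sketch. This is a toy model, not the code in this PR: `Plan`, `Relation`, `Project`, `Join`, and `flatten` are hypothetical stand-ins written for illustration and are not Catalyst's classes.

```scala
// Toy model only: hypothetical stand-ins, not Catalyst's LogicalPlan API.
object FlattenSketch {
  sealed trait Plan
  case class Relation(name: String) extends Plan
  case class Project(columns: Seq[String], child: Plan) extends Plan
  case class Join(left: Plan, right: Plan, condition: String) extends Plan

  // Collects the inputs and conditions of a left-deep inner-join tree.
  // A Project that sits directly on top of a Join is looked through,
  // so Join -> Project -> Join no longer stops the traversal.
  def flatten(plan: Plan): (Seq[Plan], Seq[String]) = plan match {
    case Join(left, right, condition) =>
      val (inputs, conditions) = flatten(left)
      (inputs :+ right, conditions :+ condition)
    case Project(_, child: Join) => flatten(child)
    case other => (Seq(other), Seq.empty)
  }

  val t1 = Relation("t1")
  val t2 = Relation("t2")
  val t3 = Relation("t3")

  // Join -> Project -> Join, the shape shown in the analyzed plan above.
  val plan: Plan = Join(
    Project(Seq("k0", "k1", "v1", "v2"), Join(t1, t2, "t1.k0 = t2.k0")),
    t3,
    "t1.k1 = t3.k1")

  val (inputs, conditions) = flatten(plan)
}
```

In this sketch, `inputs` holds the three relations and `conditions` holds both join predicates, which is the flat form a reordering step would need. The toy ignores everything a real rule must also check, such as whether the Project contains only plain attributes.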

This PR also fixes an output attribute ordering problem that appears when joins are reordered: it checks whether the reordered plan and the original plan have the same output attribute order. If not, ReorderJoin adds a Project on top of the reordered plan.
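The output-order check described above can be sketched as follows. This is a simplified illustration under stated assumptions: a plan is reduced to a description string plus its output column names, and `Plan` and `restoreOutputOrder` are hypothetical names, not the ones used in joins.scala.

```scala
// Toy model only: a plan is represented by a description and its output columns.
object OutputOrderSketch {
  case class Plan(description: String, output: Seq[String])

  // If reordering changed the column order, wrap the reordered plan in a
  // projection that lists the original columns in their original order.
  // Otherwise return the reordered plan unchanged.
  def restoreOutputOrder(original: Plan, reordered: Plan): Plan =
    if (reordered.output == original.output) reordered
    else Plan(
      s"Project [${original.output.mkString(", ")}] over (${reordered.description})",
      original.output)

  val original = Plan("(t1 JOIN t2) JOIN t3", Seq("k0", "v1", "v2", "v3"))
  val reordered = Plan("(t1 JOIN t3) JOIN t2", Seq("k0", "v1", "v3", "v2"))
  val fixed = restoreOutputOrder(original, reordered)
}
```

Here `fixed.output` matches `original.output` even though the joins underneath were reordered, and a plan whose order is already correct is returned as is, so no extra Project is added in that case.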

How was this patch tested?

This PR adds new tests in JoinOptimizationSuite and modifies some existing tests in StarJoinReorderSuite to check that ReorderJoin handles Project nodes correctly. It also modifies the existing tests in JoinReorderSuite for the output attribute ordering issue.

SparkQA · 2018-01-21T20:43:17Z

Test build #86449 has finished for PR 20345 at commit 8ad6a81.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-01-22T17:40:17Z

Test build #86485 has finished for PR 20345 at commit ca65b9d.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

jiangxb1987 · 2018-01-22T23:04:27Z

retest this please

jiangxb1987 · 2018-01-23T00:32:56Z

This does not really "respect" Project nodes; it actually expands the ReorderJoin rule so that it can handle project-over-join nodes.

jiangxb1987 · 2018-01-23T00:34:34Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala

It took me a little while to understand that this can handle (a join b) join c versus a join (b join c) correctly. Would be great if we can explain how it works in the function comment.

jiangxb1987 · 2018-01-23T00:41:20Z

sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/JoinOptimizationSuite.scala

nit:

def testExtractInnerJoins(
    plan: LogicalPlan,
    expected: Option[(Seq[(LogicalPlan, InnerLike)], Seq[Expression])]) {

jiangxb1987 · 2018-01-23T00:43:43Z

sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/JoinOptimizationSuite.scala

Won't this ignore the plans sequence?

jiangxb1987 · 2018-01-23T00:45:17Z

sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/JoinOptimizationSuite.scala

nit: adds -> may add

maropu · 2018-01-23T01:39:54Z

Thanks, @jiangxb1987! I'll address your comments; could you check again afterwards?

SparkQA · 2018-01-23T04:07:55Z

Test build #86499 has finished for PR 20345 at commit ca65b9d.

This patch fails from timeout after a configured wait of `300m`.
This patch merges cleanly.
This patch adds no public classes.

maropu · 2018-01-23T04:36:13Z

retest this please.

SparkQA · 2018-01-23T04:58:39Z

Test build #86504 has finished for PR 20345 at commit f1a6558.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-01-23T07:52:28Z

Test build #86511 has finished for PR 20345 at commit f1a6558.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

gatorsmile · 2018-01-27T19:54:31Z

Also cc @wzhfy. Do you have the bandwidth to review PRs?

maropu · 2018-03-06T05:47:33Z

ping

maropu · 2018-03-13T09:24:08Z

NVM, I understood: conditions.nonEmpty guards this case.
When I re-checked the code of the ReorderJoin rule, I found that ExtractFiltersAndInnerJoins is applied to the same join tree multiple times. IIUC we could use OrderedJoin to avoid this. Is there any reason not to do so? (I haven't checked the previous discussion on that yet.) I made a trivial patch for it and checked the metrics for the rule:

scala> import org.apache.spark.sql.catalyst.rules.RuleExecutor
scala> :paste
RuleExecutor.resetMetrics()
val numJoins = 9
spark.range(1).selectExpr((0 until numJoins).map { i => s"id AS k$i" }: _*).write.saveAsTable("t")
(0 until numJoins).foreach { i =>
  spark.range(1).selectExpr(s"id AS k$i").write.saveAsTable(s"t$i")
}
val joinSql = s"""
  SELECT *
    FROM t, ${ (0 until numJoins).map(i => s"t$i").mkString(", ") }
    WHERE ${(0 until numJoins).map(i => s"t.k$i = t$i.k$i").mkString(" AND ")}
"""
sql(joinSql).explain
println(RuleExecutor.dumpTimeSpent())

-- master
Rule                                                 Effective Time / Total Time  Effective Runs / Total Runs    
org.apache.spark.sql.catalyst.optimizer.ReorderJoin  97010505 / 126269245         2 / 26  

-- w/ the patch
Rule                                                 Effective Time / Total Time  Effective Runs / Total Runs    
org.apache.spark.sql.catalyst.optimizer.ReorderJoin  20498471 / 34859643          2 / 26

wzhfy · 2018-03-20T09:17:43Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/patterns.scala

If we want to make sure the project has attributes only, should it be p.projectList.forall(_.isInstanceOf[Attribute])?
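The suggested predicate can be tried out on a small toy expression hierarchy. The classes below are hypothetical and only mirror the names used in Catalyst; they are not the real `Attribute` or `Alias` types.

```scala
// Toy model only: hypothetical expression classes, not Catalyst's.
object AttributeOnlySketch {
  sealed trait Expr
  case class Attribute(name: String) extends Expr
  case class Add(left: Expr, right: Expr) extends Expr
  case class Alias(child: Expr, name: String) extends Expr

  // True only when every entry of the project list is a plain attribute,
  // i.e. the projection selects or reorders columns without computing anything.
  def hasAttributesOnly(projectList: Seq[Expr]): Boolean =
    projectList.forall(_.isInstanceOf[Attribute])

  val pureSelection = Seq[Expr](Attribute("k0"), Attribute("v1"))
  val withComputation = Seq[Expr](
    Attribute("k0"),
    Alias(Add(Attribute("v1"), Attribute("v2")), "total"))
}
```

With these definitions, `hasAttributesOnly(pureSelection)` holds and `hasAttributesOnly(withComputation)` does not. Note that `forall` is also true for an empty list, so a caller that cares about that case would need to handle it separately.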

wzhfy · 2018-03-20T09:18:10Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/patterns.scala

nit: when projects having attributes only => when the project has attributes only

wzhfy · 2018-03-20T09:19:03Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/patterns.scala

skip projections with attributes only

wzhfy · 2018-03-20T11:00:31Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala

Is this check necessary? I think checking originalPlan.output != orderedJoins.output is enough, and faster.

If we don't have this check, operatorOptimizationRuleSet hits the fixedPoint iteration limit because ReorderJoin is re-applied to the same join trees every time the optimization rule batch is invoked. This does not happen in master because reordered joins have Project in internal nodes (added by later optimization rules, e.g., ColumnPruning), and this plan structure guards against the case.

ah, right, thanks!

wzhfy · 2018-03-20T11:09:43Z

sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/JoinOptimizationSuite.scala

Could you add a test case which would fail to reorder joins before the fix?

maropu · 2018-03-21T00:28:39Z

@wzhfy Thanks for the review and I'll update in a few days!

SparkQA · 2018-03-21T05:57:15Z

Test build #88442 has finished for PR 20345 at commit 895b6a1.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

maropu · 2018-03-21T09:56:44Z

sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/JoinOptimizationSuite.scala

@wzhfy Added this test.

The case can also happen without star schema enabled, right? Is it possible to use a simpler case like the one in the PR description?

IIUC join reorder only happens when star schema is enabled now? I think this test checks the simpler case?

SparkQA · 2018-03-21T10:04:16Z

Test build #88464 has finished for PR 20345 at commit 9b8935d.

This patch fails to build.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-03-21T13:43:47Z

Test build #88465 has finished for PR 20345 at commit 6d9947b.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

maropu · 2018-03-22T03:20:10Z

ping @gatorsmile @wzhfy

SparkQA · 2018-03-22T07:05:01Z

Test build #88503 has finished for PR 20345 at commit a7ae183.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

maropu · 2018-03-22T07:21:24Z

retest this please

SparkQA · 2018-03-22T10:54:50Z

Test build #88511 has finished for PR 20345 at commit a7ae183.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

maropu · 2018-03-28T15:21:51Z

kindly ping

maropu · 2018-04-01T21:40:07Z

ping

maropu · 2018-08-21T23:58:42Z

retest this please

SparkQA · 2018-08-22T00:11:22Z

Test build #95065 has finished for PR 20345 at commit 39462fb.

This patch fails to build.
This patch merges cleanly.
This patch adds no public classes.

maropu · 2018-08-22T06:43:23Z

retest this please

SparkQA · 2018-08-22T07:05:01Z

Test build #95087 has finished for PR 20345 at commit 39462fb.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

maropu · 2018-08-22T14:03:07Z

retest this please

SparkQA · 2018-08-22T17:08:27Z

Test build #95104 has finished for PR 20345 at commit 39462fb.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

maropu · 2018-08-23T00:53:04Z

retest this please

SparkQA · 2018-08-23T04:37:24Z

Test build #95131 has finished for PR 20345 at commit 39462fb.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-11-21T03:42:13Z

Test build #114189 has finished for PR 20345 at commit 025c540.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-12-15T08:05:02Z

Test build #115350 has finished for PR 20345 at commit f7f3451.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

maropu · 2019-12-15T08:54:58Z

retest this please

SparkQA · 2019-12-15T10:27:18Z

Test build #115354 has finished for PR 20345 at commit f7f3451.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-12-16T04:28:28Z

Test build #115369 has finished for PR 20345 at commit f63bee3.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

maropu · 2020-01-15T07:35:00Z

retest this please

SparkQA · 2020-01-15T08:05:01Z

Test build #116762 has finished for PR 20345 at commit f63bee3.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-01-21T18:57:11Z

Test build #117188 has finished for PR 20345 at commit 37e5fe2.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

github-actions · 2020-05-01T00:11:13Z

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

jiangxb1987 reviewed Jan 23, 2018

maropu changed the title ~~[SPARK-23172][SQL] Respect Project nodes in ReorderJoin~~ [SPARK-23172][SQL] Expand the ReorderJoin rule to handle Project nodes Jan 23, 2018

wzhfy reviewed Mar 20, 2018

maropu commented Mar 21, 2018

maropu force-pushed the FixFlattenJoins branch from 9b8935d to 6d9947b Compare March 21, 2018 10:30

dongjoon-hyun added the SQL label Jun 14, 2019

maropu force-pushed the FixFlattenJoins branch from 39462fb to 025c540 Compare November 21, 2019 02:22

maropu force-pushed the FixFlattenJoins branch from 025c540 to 1df7d2f Compare December 15, 2019 06:55

maropu changed the title ~~[SPARK-23172][SQL] Expand the ReorderJoin rule to handle Project nodes~~ [WIP][SPARK-23172][SQL] Expand the ReorderJoin rule to handle Project nodes Dec 15, 2019

maropu force-pushed the FixFlattenJoins branch from 1df7d2f to f7f3451 Compare December 15, 2019 06:58

maropu force-pushed the FixFlattenJoins branch from f7f3451 to f63bee3 Compare December 16, 2019 00:32

maropu changed the title ~~[WIP][SPARK-23172][SQL] Expand the ReorderJoin rule to handle Project nodes~~ [SPARK-23172][SQL] Expand the ReorderJoin rule to handle Project nodes Dec 16, 2019

maropu added 3 commits January 21, 2020 23:39

Fix

fffd0fd

Fix

85d2435

Fix

37e5fe2

maropu force-pushed the FixFlattenJoins branch from f63bee3 to 37e5fe2 Compare January 21, 2020 14:39

github-actions bot added the Stale label May 1, 2020

maropu closed this May 1, 2020
