Skip to content

Conversation

@maropu
Copy link
Member

@maropu maropu commented Jan 21, 2018

What changes were proposed in this pull request?

The current ReorderJoin optimizer rule cannot flatten a pattern Join -> Project -> Join because ExtractFiltersAndInnerJoins doesn't handle Project nodes. So, the current master cannot reorder joins in a query below;

val df1 = spark.range(100).selectExpr("id % 10 AS k0", s"id % 10 AS k1", s"id % 10 AS k2", "id AS v1")
val df2 = spark.range(10).selectExpr("id AS k0", "id AS v2")
val df3 = spark.range(10).selectExpr("id AS k1", "id AS v3")
val df4 = spark.range(10).selectExpr("id AS k2", "id AS v4")
df1.join(df2, "k0").join(df3, "k1").join(df4, "k2").explain(true)

== Analyzed Logical Plan ==
k2: bigint, k1: bigint, k0: bigint, v1: bigint, v2: bigint, v3: bigint, v4: bigint
Project [k2#5L, k1#4L, k0#3L, v1#6L, v2#16L, v3#24L, v4#32L]
+- Join Inner, (k2#5L = k2#31L)
   :- Project [k1#4L, k0#3L, k2#5L, v1#6L, v2#16L, v3#24L]
   :  +- Join Inner, (k1#4L = k1#23L)
   :     :- Project [k0#3L, k1#4L, k2#5L, v1#6L, v2#16L]
   :     :  +- Join Inner, (k0#3L = k0#15L)
   :     :     :- Project [(id#0L % cast(10 as bigint)) AS k0#3L, (id#0L % cast(10 as bigint)) AS k1#4L, (id#0L % cast(10 as bigint)) AS k2#5L, id#0
L AS v1#6L]
   :     :     :  +- Range (0, 100, step=1, splits=Some(4))
   :     :     +- Project [id#12L AS k0#15L, id#12L AS v2#16L]
   :     :        +- Range (0, 10, step=1, splits=Some(4))
   :     +- Project [id#20L AS k1#23L, id#20L AS v3#24L]
   :        +- Range (0, 10, step=1, splits=Some(4))
   +- Project [id#28L AS k2#31L, id#28L AS v4#32L]
      +- Range (0, 10, step=1, splits=Some(4))

To reorder the query, this pr added code to handle Project in ExtractFiltersAndInnerJoins.

This pr also fixed an output attribute reorder problem when joins reordered; it checks if a join reordered plan and an original plan have the same output attribute order with each other. If not, ReorderJoin adds Project in the top of the join reordered plan.

How was this patch tested?

This pr added new tests in JoinOptimizationSuite and modified some existing tests in StarJoinReorderSuite to check if ReorderJoin can handle Project nodes correctly. Also, it modified the existing tests in JoinReorderSuite for the output attribute reorder issue.

@SparkQA
Copy link

SparkQA commented Jan 21, 2018

Test build #86449 has finished for PR 20345 at commit 8ad6a81.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jan 22, 2018

Test build #86485 has finished for PR 20345 at commit ca65b9d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jiangxb1987
Copy link
Contributor

retest this please

@jiangxb1987
Copy link
Contributor

This is not respect project nodes, this actually expand the ReorderJoin rule to allow handle the project-over-join nodes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It took me a little while to understand that this can handle (a join b) join c versus a join (b join c) correctly. Would be great if we can explain how it works in the function comment.

Copy link
Contributor

@jiangxb1987 jiangxb1987 Jan 23, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit:

def testExtractInnerJoins(
    plan: LogicalPlan,
    expected: Option[(Seq[(LogicalPlan, InnerLike)], Seq[Expression])]) {

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Won't this ignore the plans sequence?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: adds -> may add

@maropu maropu changed the title [SPARK-23172][SQL] Respect Project nodes in ReorderJoin [SPARK-23172][SQL] Expand the ReorderJoin rule to handle Project nodes Jan 23, 2018
@maropu
Copy link
Member Author

maropu commented Jan 23, 2018

Thanks! @jiangxb1987 I'll address your comments and check again?

@SparkQA
Copy link

SparkQA commented Jan 23, 2018

Test build #86499 has finished for PR 20345 at commit ca65b9d.

  • This patch fails from timeout after a configured wait of `300m`.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Copy link
Member Author

maropu commented Jan 23, 2018

retest this please.

@SparkQA
Copy link

SparkQA commented Jan 23, 2018

Test build #86504 has finished for PR 20345 at commit f1a6558.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jan 23, 2018

Test build #86511 has finished for PR 20345 at commit f1a6558.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Copy link
Member

Also cc @wzhfy Do you have a bandwidth to review PRs?

@maropu
Copy link
Member Author

maropu commented Mar 6, 2018

ping

@maropu
Copy link
Member Author

maropu commented Mar 13, 2018

NVM, I understood: conditions.nonEmpty guards this case.
When I re-checked the code of the ReorderJoin rule, I found ExtractFiltersAndInnerJoins was applied into a join tree multiple times. IIUC we can use OrderedJoin to avoid this case though, any reason not to do so (I didn't check the previous discussion for that yet)? I just made a trivial patch for that and checked the metrics for the rule;

scala> import org.apache.spark.sql.catalyst.rules.RuleExecutor
scala> :paste
RuleExecutor.resetMetrics()
val numJoins = 9
spark.range(1).selectExpr((0 until numJoins).map { i => s"id AS k$i" }: _*).write.saveAsTable("t")
(0 until numJoins).foreach { i =>
  spark.range(1).selectExpr(s"id AS k$i").write.saveAsTable(s"t$i")
}
val joinSql = s"""
  SELECT *
    FROM t, ${ (0 until numJoins).map(i => s"t$i").mkString(", ") }
    WHERE ${(0 until numJoins).map(i => s"t.k$i = t$i.k$i").mkString(" AND ")}
"""
sql(joinSql).explain
println(RuleExecutor.dumpTimeSpent())

-- master
Rule                                                 Effective Time / Total Time  Effective Runs / Total Runs    
org.apache.spark.sql.catalyst.optimizer.ReorderJoin  97010505 / 126269245         2 / 26  

-- w/ the patch
Rule                                                 Effective Time / Total Time  Effective Runs / Total Runs    
org.apache.spark.sql.catalyst.optimizer.ReorderJoin  20498471 / 34859643          2 / 26 

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we want to make sure the project has attributes only, should it be p.projectList.forall(_.isInstanceOf[Attribute])?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: when projects having attributes only => when the project has attributes only

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

skip projections with attributes only

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this check necessary? I think check originalPlan.output != orderedJoins.output is enough, and faster.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we don't have this check, operatorOptimizationRuleSet reaches fixedPoint because ReorderJoin is re-applied in the same join trees every time the optimization rule batch invoked. This case does not happen in the master because reordered joins have Project in internal nodes (Project added by following optimization rules, e.g., ColumnPruning) and this plan structure guards this case.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah, right, thanks!

Copy link
Contributor

@wzhfy wzhfy Mar 20, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add a test case which would fail to reorder joins before the fix?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

@maropu
Copy link
Member Author

maropu commented Mar 21, 2018

@wzhfy Thanks for the review and I'll update in a few days!

@SparkQA
Copy link

SparkQA commented Mar 21, 2018

Test build #88442 has finished for PR 20345 at commit 895b6a1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wzhfy Added this test.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The case can also happen without star schema enabled, right? Is it possible to use a simpler case like the one in pr description?

Copy link
Member Author

@maropu maropu Apr 11, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC join reorder only happens when star schema enabled now? I think this test checks the simper case?

@SparkQA
Copy link

SparkQA commented Mar 21, 2018

Test build #88464 has finished for PR 20345 at commit 9b8935d.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Mar 21, 2018

Test build #88465 has finished for PR 20345 at commit 6d9947b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Copy link
Member Author

maropu commented Mar 22, 2018

ping @gatorsmile @wzhfy

@SparkQA
Copy link

SparkQA commented Mar 22, 2018

Test build #88503 has finished for PR 20345 at commit a7ae183.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Copy link
Member Author

maropu commented Mar 22, 2018

retest this please

@SparkQA
Copy link

SparkQA commented Mar 22, 2018

Test build #88511 has finished for PR 20345 at commit a7ae183.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Copy link
Member Author

maropu commented Mar 28, 2018

kindly ping

@maropu
Copy link
Member Author

maropu commented Apr 1, 2018

ping

@maropu
Copy link
Member Author

maropu commented Aug 21, 2018

retest this please

@SparkQA
Copy link

SparkQA commented Aug 22, 2018

Test build #95065 has finished for PR 20345 at commit 39462fb.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Copy link
Member Author

maropu commented Aug 22, 2018

retest this please

@SparkQA
Copy link

SparkQA commented Aug 22, 2018

Test build #95087 has finished for PR 20345 at commit 39462fb.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Copy link
Member Author

maropu commented Aug 22, 2018

retest this please

@SparkQA
Copy link

SparkQA commented Aug 22, 2018

Test build #95104 has finished for PR 20345 at commit 39462fb.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Copy link
Member Author

maropu commented Aug 23, 2018

retest this please

@SparkQA
Copy link

SparkQA commented Aug 23, 2018

Test build #95131 has finished for PR 20345 at commit 39462fb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Nov 21, 2019

Test build #114189 has finished for PR 20345 at commit 025c540.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu maropu changed the title [SPARK-23172][SQL] Expand the ReorderJoin rule to handle Project nodes [WIP][SPARK-23172][SQL] Expand the ReorderJoin rule to handle Project nodes Dec 15, 2019
@SparkQA
Copy link

SparkQA commented Dec 15, 2019

Test build #115350 has finished for PR 20345 at commit f7f3451.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Copy link
Member Author

maropu commented Dec 15, 2019

retest this please

@SparkQA
Copy link

SparkQA commented Dec 15, 2019

Test build #115354 has finished for PR 20345 at commit f7f3451.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Dec 16, 2019

Test build #115369 has finished for PR 20345 at commit f63bee3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu maropu changed the title [WIP][SPARK-23172][SQL] Expand the ReorderJoin rule to handle Project nodes [SPARK-23172][SQL] Expand the ReorderJoin rule to handle Project nodes Dec 16, 2019
@maropu
Copy link
Member Author

maropu commented Jan 15, 2020

retest this please

@SparkQA
Copy link

SparkQA commented Jan 15, 2020

Test build #116762 has finished for PR 20345 at commit f63bee3.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jan 21, 2020

Test build #117188 has finished for PR 20345 at commit 37e5fe2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@github-actions
Copy link

github-actions bot commented May 1, 2020

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label May 1, 2020
@maropu maropu closed this May 1, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants