[SPARK-27225][SQL] Implement join strategy hints #24164

maryannxue · 2019-03-21T03:31:34Z

What changes were proposed in this pull request?

This PR extends the existing BROADCAST join hint (for both broadcast-hash join and broadcast-nested-loop join) by implementing other join strategy hints corresponding to the rest of Spark's existing join strategies: shuffle-hash, sort-merge, cartesian-product. The hint names: SHUFFLE_MERGE, SHUFFLE_HASH, SHUFFLE_REPLICATE_NL are partly different from the code names in order to make them clearer to users and reflect the actual algorithms better.

The hinted strategy will be used for the join with which it is associated if it is applicable/doable.

Conflict resolving rules in case of multiple hints:

1. Conflicts within either side of the join: take the first strategy hint specified in the query, or the top hint node in Dataset. For example, in "select /*+ merge(t1) */ /*+ broadcast(t1) */ k1, v2 from t1 join t2 on t1.k1 = t2.k2", take "merge(t1)"; in df1.hint("merge").hint("shuffle_hash").join(df2), take "shuffle_hash". This is a general hint conflict resolving strategy, not specific to join strategy hint.
2. Conflicts between two sides of the join:
   a) In case of different strategy hints, hints are prioritized as BROADCAST over SHUFFLE_MERGE over SHUFFLE_HASH over SHUFFLE_REPLICATE_NL.
   b) In case of same strategy hints but conflicts in build side, choose the build side based on join type and size.
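
To make the two rules above concrete, here is a minimal, self-contained sketch in plain Scala. It is not the code in this PR: the object name `JoinHintConflictSketch`, the `Strategy` type, its `rank` field, and the two `resolve*` functions are hypothetical names chosen only for illustration, and rule 2b (build-side selection by join type and size) is left out.

```scala
object JoinHintConflictSketch {
  // Hypothetical model of the four strategy hints; a lower rank means a higher priority.
  sealed abstract class Strategy(val rank: Int)
  case object Broadcast extends Strategy(0)
  case object ShuffleMerge extends Strategy(1)
  case object ShuffleHash extends Strategy(2)
  case object ShuffleReplicateNL extends Strategy(3)

  // Rule 1: within one side of the join, the first hint wins.
  // The input is assumed to be ordered as written in the SQL text,
  // or from the top hint node downwards for a Dataset.
  def resolveSide(hintsInOrder: Seq[Strategy]): Option[Strategy] =
    hintsInOrder.headOption

  // Rule 2a: across the two sides, the strategy with the higher priority wins.
  def resolveJoin(left: Option[Strategy], right: Option[Strategy]): Option[Strategy] = {
    val candidates = left.toList ++ right.toList
    if (candidates.isEmpty) None else Some(candidates.minBy(_.rank))
  }
}
```

With this model, `resolveSide(Seq(ShuffleMerge, Broadcast))` picks the merge hint, mirroring the "merge(t1)" example, and `resolveJoin(Some(ShuffleHash), Some(ShuffleMerge))` picks the merge hint because SHUFFLE_MERGE is prioritized over SHUFFLE_HASH.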

How was this patch tested?

Added new UTs.

SparkQA · 2019-03-21T07:05:02Z

Test build #103754 has finished for PR 24164 at commit 1426294.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds the following public classes (experimental):
class ResolveJoinStrategyHints(conf: SQLConf) extends Rule[LogicalPlan]
case class HintInfo(strategy: Option[JoinStrategyHint] = None)
sealed abstract class JoinStrategyHint

maropu · 2019-03-21T10:04:04Z

retest this please

SparkQA · 2019-03-21T10:06:05Z

Test build #103759 has finished for PR 24164 at commit 1426294.

This patch fails to build.
This patch merges cleanly.
This patch adds the following public classes (experimental):
class ResolveJoinStrategyHints(conf: SQLConf) extends Rule[LogicalPlan]
case class HintInfo(strategy: Option[JoinStrategyHint] = None)
sealed abstract class JoinStrategyHint

gatorsmile · 2019-03-21T15:58:43Z

retest this please

gatorsmile · 2019-03-21T16:42:44Z

ok to test

gatorsmile · 2019-03-21T16:42:50Z

test this please

SparkQA · 2019-03-21T16:46:16Z

Test build #103776 has started for PR 24164 at commit 1426294.

shaneknapp · 2019-03-21T19:57:18Z

test this please

SparkQA · 2019-03-22T00:16:35Z

Test build #103777 has finished for PR 24164 at commit 1426294.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
class ResolveJoinStrategyHints(conf: SQLConf) extends Rule[LogicalPlan]
case class HintInfo(strategy: Option[JoinStrategyHint] = None)
sealed abstract class JoinStrategyHint

maropu · 2019-03-22T01:37:28Z

In the conflict case, do we need to resolve it implicitly? For complicated queries, I think it becomes difficult for users to understand the hint behaviours.

sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala

maropu · 2019-03-22T01:46:51Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/hints.scala

+case object SHUFFLE_REPLICATE_NL extends JoinStrategyHint {
+  override def displayName: String = "shuffle-replicate-nested-loop"
+  override def hintAliases: Set[String] = Set(
+    "SHUFFLE_REPLICATE_NL")


Is this hint for cartesian products useful for users?

Yes. In the default logic, broadcast-nl is prioritized over shuffle-replicate-nl (cartesian-product), so this can be used for special cases where shuffle-replicate-nl is favored.

I think we might need a code comment to explain that SHUFFLE_REPLICATE_NL is the cartesian product.
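
For readers following this thread, a rough sketch of how such a hint object, with the suggested comment, could be looked up by alias. Only `displayName`, `hintAliases`, and the `SHUFFLE_REPLICATE_NL` values come from the diff above; the wrapper object, the `all` list, the `lookup` function, and the case-insensitive matching are assumptions made for illustration, not the PR's code.

```scala
object JoinHintAliasSketch {
  // Simplified stand-in for the hint hierarchy quoted in the diff.
  sealed abstract class JoinStrategyHint {
    def displayName: String
    def hintAliases: Set[String]
  }

  // This hint selects the shuffle-and-replicate nested loop join,
  // i.e. the cartesian-product join strategy.
  case object SHUFFLE_REPLICATE_NL extends JoinStrategyHint {
    override def displayName: String = "shuffle-replicate-nested-loop"
    override def hintAliases: Set[String] = Set("SHUFFLE_REPLICATE_NL")
  }

  // Hypothetical registry; a real one would list every strategy hint.
  val all: Seq[JoinStrategyHint] = Seq(SHUFFLE_REPLICATE_NL)

  // Assumption: hint names are matched case-insensitively by upper-casing them.
  def lookup(name: String): Option[JoinStrategyHint] = {
    val upper = name.toUpperCase(java.util.Locale.ROOT)
    all.find(_.hintAliases.contains(upper))
  }
}
```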

maropu · 2019-03-22T01:49:43Z

We also need to update the document;
https://spark.apache.org/docs/latest/sql-performance-tuning.html#broadcast-hint-for-sql-queries

maryannxue · 2019-03-22T02:09:52Z

@maropu

In the conflict case, we need to implicitly resolve it

We'll need to log warnings for ignored and overridden hints, like you did in #24055. So would you mind holding that off and implementing a more complete solution after this PR?

maryannxue · 2019-03-22T17:29:08Z

@maropu

We also need to update the document;
https://spark.apache.org/docs/latest/sql-performance-tuning.html#broadcast-hint-for-sql-queries

Thank you for pointing this out! I'll work on that.

maryannxue · 2019-03-22T17:29:56Z

cc @cloud-fan @hvanhovell @gatorsmile

sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala

SparkQA · 2019-03-26T07:00:24Z

Test build #103937 has finished for PR 24164 at commit e77c9f3.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

gatorsmile · 2019-03-27T06:54:30Z

sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala


-      // broadcast hints were not specified, so need to infer it from size and configuration.
+      // broadcast hints specified with no equi-join keys, use broadcast-nested-loop
+      case j @ logical.Join(left, right, joinType, condition, hint)


remove 'j @' ?

gatorsmile · 2019-03-27T07:07:51Z

How about the Dataset Hint API? For example, df.hint("broadcast")? Do we have test cases for these APIs? We can also add some test cases for complex queries (e.g., multi-way joins, persistent views, and CTEs).

maryannxue · 2019-04-03T20:21:32Z

@gatorsmile Added more tests. Please review.

SparkQA · 2019-04-03T23:54:11Z

Test build #104259 has finished for PR 24164 at commit f198dfb.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

dilipbiswal · 2019-04-04T00:41:57Z

retest this please

SparkQA · 2019-04-04T04:16:09Z

Test build #104264 has finished for PR 24164 at commit f198dfb.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

gatorsmile · 2019-04-04T22:12:42Z

sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala

+   *     Supports both equi-joins and non-equi-joins.
+   *     Supports only inner like joins.
+   *
+   * First, look at applicable join strategies hints:


Add "based on the following precedence".

cloud-fan · 2019-04-08T14:56:43Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/hints.scala

+  }

  override def toString: String = {
    val hints = scala.collection.mutable.ArrayBuffer.empty[String]


nit: we don't need to create an array buffer here.
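
One way to read this nit, as a hedged sketch: with a single optional field, `toString` can be built from the `Option` directly. The `HintInfo` below is a simplified stand-in (the strategy is a plain `String` rather than a `JoinStrategyHint`), and the `"(strategy=...)"` and `"none"` output formats are assumptions, not necessarily what the PR prints.

```scala
object HintInfoToStringSketch {
  // Simplified stand-in: in the PR the field is Option[JoinStrategyHint].
  case class HintInfo(strategy: Option[String] = None) {
    // No mutable buffer needed: map over the Option and fall back when it is empty.
    override def toString: String =
      strategy.map(s => s"(strategy=$s)").getOrElse("none")
  }
}
```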

SparkQA · 2019-04-08T17:40:06Z

Test build #104401 has finished for PR 24164 at commit 407c63f.

This patch fails to build.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-04-08T19:19:13Z

Test build #104392 has finished for PR 24164 at commit 4a13ffe.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

maryannxue · 2019-04-09T04:14:15Z

@maropu I added "hint error handling point" in the hint resolving and hint-node elimination stage, with default behavior as logging warnings. You can refine them and probably add configuration-based error handling in your other PR.

SparkQA · 2019-04-09T05:34:24Z

Test build #104418 has finished for PR 24164 at commit 6bd7f56.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2019-04-09T06:27:18Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/EliminateResolvedHint.scala

+    logWarning(s"A join hint $hint is specified but it is not part of a join relation.")
+  }
+
+  private def handleOverriddenHintInfo(hint: HintInfo): Unit = {


it's a little weird to see this method being defined twice. Can we just log the message inside HintInfo.merge?

I was thinking to have a centralized handler for all kinds of hint events/errors, and the action, whether to log warnings/errors or to throw exceptions, can be configurable. WDYT?

cloud-fan · 2019-04-09T06:47:47Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/hints.scala

-   * in this [[HintInfo]] if defined, otherwise the strategy in the other [[HintInfo]].
+   * Combine this [[HintInfo]] with another [[HintInfo]] and return the new [[HintInfo]].
+   * @param other the other [[HintInfo]]
+   * @param hintOverriddenCallback a callback to notify if any [[HintInfo]] has been overridden


if we create a hint merging strategy framework, I think it will not be an arbitrary callback. Shall we make it simple now and leave it for future design? Then we can just log message inside this method.
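
For context, a minimal sketch of the callback-based `merge` under discussion, assuming the semantics stated in the doc comment (keep the strategy in this `HintInfo` if defined, otherwise take the other's) and assuming the callback receives the `HintInfo` that lost. `HintInfo` is again a simplified stand-in with a `String` strategy; the callback's exact signature in the PR is not shown here, so `HintInfo => Unit` is an assumption.

```scala
object HintInfoMergeSketch {
  // Simplified stand-in: in the PR the field is Option[JoinStrategyHint].
  case class HintInfo(strategy: Option[String] = None) {
    // Combine this HintInfo with another one and return the new HintInfo.
    // hintOverriddenCallback is notified when the other hint is overridden.
    def merge(other: HintInfo, hintOverriddenCallback: HintInfo => Unit): HintInfo = {
      if (this.strategy.isDefined && other.strategy.isDefined) {
        hintOverriddenCallback(other)
      }
      HintInfo(strategy = this.strategy.orElse(other.strategy))
    }
  }
}
```

The simpler alternative suggested in the comment would drop the callback parameter and log the warning directly inside `merge`.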

SparkQA · 2019-04-09T07:05:01Z

Test build #104420 has finished for PR 24164 at commit e533ac2.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-04-10T01:08:14Z

Test build #104457 has finished for PR 24164 at commit 7342fbd.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
class MockAppender extends AppenderSkeleton

SparkQA · 2019-04-10T07:05:01Z

Test build #104468 has finished for PR 24164 at commit 0912997.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

dilipbiswal · 2019-04-10T07:09:13Z

retest this please

SparkQA · 2019-04-10T10:25:54Z

Test build #104474 has finished for PR 24164 at commit 0912997.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-04-10T16:22:14Z

Test build #104481 has finished for PR 24164 at commit c0b217c.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-04-11T07:05:02Z

Test build #104500 has finished for PR 24164 at commit a9634c4.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-04-11T07:05:02Z

Test build #104498 has finished for PR 24164 at commit 4a48286.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2019-04-11T11:42:16Z

retest this please

SparkQA · 2019-04-11T16:00:36Z

Test build #104511 has finished for PR 24164 at commit a9634c4.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2019-04-11T16:14:47Z

thanks, merging to master!

## What changes were proposed in this pull request?

This PR extends the existing BROADCAST join hint (for both broadcast-hash join and broadcast-nested-loop join) by implementing other join strategy hints corresponding to the rest of Spark's existing join strategies: shuffle-hash, sort-merge, cartesian-product. The hint names: SHUFFLE_MERGE, SHUFFLE_HASH, SHUFFLE_REPLICATE_NL are partly different from the code names in order to make them clearer to users and reflect the actual algorithms better.

The hinted strategy will be used for the join with which it is associated if it is applicable/doable.

Conflict resolving rules in case of multiple hints:
1. Conflicts within either side of the join: take the first strategy hint specified in the query, or the top hint node in Dataset. For example, in "select /*+ merge(t1) */ /*+ broadcast(t1) */ k1, v2 from t1 join t2 on t1.k1 = t2.k2", take "merge(t1)"; in `df1.hint("merge").hint("shuffle_hash").join(df2)`, take "shuffle_hash". This is a general hint conflict resolving strategy, not specific to join strategy hint.
2. Conflicts between two sides of the join:
   a) In case of different strategy hints, hints are prioritized as `BROADCAST` over `SHUFFLE_MERGE` over `SHUFFLE_HASH` over `SHUFFLE_REPLICATE_NL`.
   b) In case of same strategy hints but conflicts in build side, choose the build side based on join type and size.

## How was this patch tested?

Added new UTs.

Closes apache#24164 from maryannxue/join-hints.

Lead-authored-by: maryannxue <[email protected]>
Co-authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>

implement join strategy hints

1426294

maropu reviewed Mar 22, 2019

sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala

maropu reviewed Mar 22, 2019

gatorsmile reviewed Mar 24, 2019

sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala

gatorsmile reviewed Mar 24, 2019

sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala

address review comments

e77c9f3

gatorsmile reviewed Mar 27, 2019

add more tests

f198dfb

Merge remote-tracking branch 'origin/master' into join-hints

d25822d

gatorsmile reviewed Apr 4, 2019

cloud-fan reviewed Apr 8, 2019

refactor

407c63f

cloud-fan and others added 3 commits April 9, 2019 10:29

fix

6bd7f56

add hint event handling

c535d36

Merge branch 'join-hints' of https://github.com/maryannxue/spark into…

e533ac2

… join-hints

cloud-fan reviewed Apr 9, 2019

add more tests

7342fbd

fix tests

0912997

fix test

c0b217c

cloud-fan added 2 commits April 11, 2019 13:53

fix behaviors

4a48286

code cleanup

a9634c4

cloud-fan closed this in 43da473 Apr 11, 2019

cloud-fan mentioned this pull request Apr 15, 2019

[SPARK-27430][SQL] broadcast hint should be respected for broadcast nested loop join #24376

Closed
