Skip to content

Conversation

@maryannxue
Copy link
Contributor

What changes were proposed in this pull request?

This PR extends the existing BROADCAST join hint (for both broadcast-hash join and broadcast-nested-loop join) by implementing other join strategy hints corresponding to the rest of Spark's existing join strategies: shuffle-hash, sort-merge, cartesian-product. The hint names: SHUFFLE_MERGE, SHUFFLE_HASH, SHUFFLE_REPLICATE_NL are partly different from the code names in order to make them clearer to users and reflect the actual algorithms better.

The hinted strategy will be used for the join with which it is associated if it is applicable/doable.

Conflict resolving rules in case of multiple hints:

  1. Conflicts within either side of the join: take the first strategy hint specified in the query, or the top hint node in Dataset. For example, in "select /*+ merge(t1) / /+ broadcast(t1) */ k1, v2 from t1 join t2 on t1.k1 = t2.k2", take "merge(t1)"; in df1.hint("merge").hint("shuffle_hash").join(df2), take "shuffle_hash". This is a general hint conflict resolving strategy, not specific to join strategy hint.
  2. Conflicts between two sides of the join:
    a) In case of different strategy hints, hints are prioritized as BROADCAST over SHUFFLE_MERGE over SHUFFLE_HASH over SHUFFLE_REPLICATE_NL.
    b) In case of same strategy hints but conflicts in build side, choose the build side based on join type and size.

How was this patch tested?

Added new UTs.

@SparkQA
Copy link

SparkQA commented Mar 21, 2019

Test build #103754 has finished for PR 24164 at commit 1426294.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class ResolveJoinStrategyHints(conf: SQLConf) extends Rule[LogicalPlan]
  • case class HintInfo(strategy: Option[JoinStrategyHint] = None)
  • sealed abstract class JoinStrategyHint

@maropu
Copy link
Member

maropu commented Mar 21, 2019

retest this please

@SparkQA
Copy link

SparkQA commented Mar 21, 2019

Test build #103759 has finished for PR 24164 at commit 1426294.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class ResolveJoinStrategyHints(conf: SQLConf) extends Rule[LogicalPlan]
  • case class HintInfo(strategy: Option[JoinStrategyHint] = None)
  • sealed abstract class JoinStrategyHint

@gatorsmile
Copy link
Member

retest this please

@gatorsmile
Copy link
Member

ok to test

@gatorsmile
Copy link
Member

test this please

@SparkQA
Copy link

SparkQA commented Mar 21, 2019

Test build #103776 has started for PR 24164 at commit 1426294.

@shaneknapp
Copy link
Contributor

test this please

@SparkQA
Copy link

SparkQA commented Mar 22, 2019

Test build #103777 has finished for PR 24164 at commit 1426294.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class ResolveJoinStrategyHints(conf: SQLConf) extends Rule[LogicalPlan]
  • case class HintInfo(strategy: Option[JoinStrategyHint] = None)
  • sealed abstract class JoinStrategyHint

@maropu
Copy link
Member

maropu commented Mar 22, 2019

In the conflict case, we need to implicitly resolve it? In case of complicated queries, it seems to become difficult that uses understand the hint behaviours, I think.

case object SHUFFLE_REPLICATE_NL extends JoinStrategyHint {
override def displayName: String = "shuffle-replicate-nested-loop"
override def hintAliases: Set[String] = Set(
"SHUFFLE_REPLICATE_NL")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This hint for cartesian products is useful for users?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. In the default logic, broadcast-nl is prioritized over shuffle-replicate-nl (cartesian-product), so this can be used for special cases where shuffle-replicate-nl is favored.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we might need a code comment to explain SHUFFLE_REPLICATE_NL is cartesian products.

@maropu
Copy link
Member

maropu commented Mar 22, 2019

@maryannxue
Copy link
Contributor Author

maryannxue commented Mar 22, 2019

@maropu

In the conflict case, we need to implicitly resolve it

We'll need to log warnings for ignored and overridden hints, like you did in #24055. So would you mind holding that off and implementing a more complete solution after this PR?

@maryannxue
Copy link
Contributor Author

@maropu

We also need to update the document;
https://spark.apache.org/docs/latest/sql-performance-tuning.html#broadcast-hint-for-sql-queries

Thank you for pointing this out! I'll work on that.

@maryannxue
Copy link
Contributor Author

@SparkQA
Copy link

SparkQA commented Mar 26, 2019

Test build #103937 has finished for PR 24164 at commit e77c9f3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


// broadcast hints were not specified, so need to infer it from size and configuration.
// broadcast hints specified with no equi-join keys, use broadcast-nested-loop
case j @ logical.Join(left, right, joinType, condition, hint)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove 'j @' ?

@gatorsmile
Copy link
Member

gatorsmile commented Mar 27, 2019

How about the Dataset Hint API? For example, df.hint("broadcast")? Do we have test cases for these APIs? We also can add some test cases for complex queries (e.g., multi way join, with persistent views, and CTEs)

@maryannxue
Copy link
Contributor Author

@gatorsmile Added more tests. Please review.

@SparkQA
Copy link

SparkQA commented Apr 3, 2019

Test build #104259 has finished for PR 24164 at commit f198dfb.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dilipbiswal
Copy link
Contributor

retest this please

@SparkQA
Copy link

SparkQA commented Apr 4, 2019

Test build #104264 has finished for PR 24164 at commit f198dfb.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

* Supports both equi-joins and non-equi-joins.
* Supports only inner like joins.
*
* First, look at applicable join strategies hints:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add based on the following precedence

}

override def toString: String = {
val hints = scala.collection.mutable.ArrayBuffer.empty[String]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: we don't need to create an array buffer here.

@SparkQA
Copy link

SparkQA commented Apr 8, 2019

Test build #104401 has finished for PR 24164 at commit 407c63f.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Apr 8, 2019

Test build #104392 has finished for PR 24164 at commit 4a13ffe.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maryannxue
Copy link
Contributor Author

@maropu I added "hint error handling point" in the hint resolving and hint-node elimination stage, with default behavior as logging warnings. You can refine them and probably add configuration-based error handling in your other PR.

@SparkQA
Copy link

SparkQA commented Apr 9, 2019

Test build #104418 has finished for PR 24164 at commit 6bd7f56.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

logWarning(s"A join hint $hint is specified but it is not part of a join relation.")
}

private def handleOverriddenHintInfo(hint: HintInfo): Unit = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's a little weird to see this method being defined twice. Can we just log the message inside HintInfo.merge?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking to have a centralized handler for all kinds of hint events/errors, and the action, whether to log warnings/errors or to throw exceptions, can be configurable. WDYT?

* in this [[HintInfo]] if defined, otherwise the strategy in the other [[HintInfo]].
* Combine this [[HintInfo]] with another [[HintInfo]] and return the new [[HintInfo]].
* @param other the other [[HintInfo]]
* @param hintOverriddenCallback a callback to notify if any [[HintInfo]] has been overridden
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we create a hint merging strategy framework, I think it will not be an arbitrary callback. Shall we make it simple now and leave it for future design? Then we can just log message inside this method.

@SparkQA
Copy link

SparkQA commented Apr 9, 2019

Test build #104420 has finished for PR 24164 at commit e533ac2.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Apr 10, 2019

Test build #104457 has finished for PR 24164 at commit 7342fbd.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class MockAppender extends AppenderSkeleton

@SparkQA
Copy link

SparkQA commented Apr 10, 2019

Test build #104468 has finished for PR 24164 at commit 0912997.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dilipbiswal
Copy link
Contributor

retest this please

@SparkQA
Copy link

SparkQA commented Apr 10, 2019

Test build #104474 has finished for PR 24164 at commit 0912997.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Apr 10, 2019

Test build #104481 has finished for PR 24164 at commit c0b217c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Apr 11, 2019

Test build #104500 has finished for PR 24164 at commit a9634c4.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Apr 11, 2019

Test build #104498 has finished for PR 24164 at commit 4a48286.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor

retest this please

@SparkQA
Copy link

SparkQA commented Apr 11, 2019

Test build #104511 has finished for PR 24164 at commit a9634c4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in 43da473 Apr 11, 2019
j-baker pushed a commit to palantir/spark that referenced this pull request Jan 25, 2020
## What changes were proposed in this pull request?

This PR extends the existing BROADCAST join hint (for both broadcast-hash join and broadcast-nested-loop join) by implementing other join strategy hints corresponding to the rest of Spark's existing join strategies: shuffle-hash, sort-merge, cartesian-product. The hint names: SHUFFLE_MERGE, SHUFFLE_HASH, SHUFFLE_REPLICATE_NL are partly different from the code names in order to make them clearer to users and reflect the actual algorithms better.

The hinted strategy will be used for the join with which it is associated if it is applicable/doable.

Conflict resolving rules in case of multiple hints:
1. Conflicts within either side of the join: take the first strategy hint specified in the query, or the top hint node in Dataset. For example, in "select /*+ merge(t1) */ /*+ broadcast(t1) */ k1, v2 from t1 join t2 on t1.k1 = t2.k2", take "merge(t1)"; in ```df1.hint("merge").hint("shuffle_hash").join(df2)```, take "shuffle_hash". This is a general hint conflict resolving strategy, not specific to join strategy hint.
2. Conflicts between two sides of the join:
  a) In case of different strategy hints, hints are prioritized as ```BROADCAST``` over ```SHUFFLE_MERGE``` over ```SHUFFLE_HASH``` over ```SHUFFLE_REPLICATE_NL```.
  b) In case of same strategy hints but conflicts in build side, choose the build side based on join type and size.

## How was this patch tested?

Added new UTs.

Closes apache#24164 from maryannxue/join-hints.

Lead-authored-by: maryannxue <[email protected]>
Co-authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants