Skip to content

Comments

[SPARK-30524] [SQL] Disable OptimizeSkewedJoin rule when introducing additional shuffle#27226

Closed
JkSelf wants to merge 6 commits intoapache:masterfrom
JkSelf:followup-skewedoptimization
Closed

[SPARK-30524] [SQL] Disable OptimizeSkewedJoin rule when introducing additional shuffle#27226
JkSelf wants to merge 6 commits intoapache:masterfrom
JkSelf:followup-skewedoptimization

Conversation

@JkSelf
Copy link

@JkSelf JkSelf commented Jan 16, 2020

What changes were proposed in this pull request?

OptimizeSkewedJoin rule change the outputPartitioning after inserting PartialShuffleReaderExec or SkewedPartitionReaderExec. So it may need to introduce additional to ensure the right result. This PR disable OptimizeSkewedJoin rule when introducing additional shuffle.

Why are the changes needed?

bug fix

Does this PR introduce any user-facing change?

No

How was this patch tested?

Add new ut

@JkSelf JkSelf changed the title Disable OptimizeSkewedJoin rule when introducing additional shuffle [SPARK-30524] [SQL] Disable OptimizeSkewedJoin rule when introducing additional shuffle Jan 16, 2020
@JkSelf
Copy link
Author

JkSelf commented Jan 16, 2020

@cloud-fan @hvanhovell @maryannxue Please help review if you have available time. Thanks for your help.

@SparkQA
Copy link

SparkQA commented Jan 16, 2020

Test build #116807 has finished for PR 27226 at commit 4bed602.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

private def containShuffleQueryStage(plan : SparkPlan): (Boolean, ShuffleQueryStageExec) =
plan match {
case stage: ShuffleQueryStageExec => (true, stage)
case sort: SortExec if (sort.child.isInstanceOf[ShuffleQueryStageExec]) =>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: case SortExec(_, _, s: ShuffleQueryStageExec, _)

private def reOptimizeChild(
skewedReader: SkewedPartitionReaderExec,
child: SparkPlan): SparkPlan = child match {
case sort: SortExec if (sort.child.isInstanceOf[ShuffleQueryStageExec]) =>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

|Try to optimize skewed join.
|Left side partition size: $leftSizeInfo
|Right side partition size: $rightSizeInfo
|Try to optimize skewed join.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the previous indentation seems corrected.

s1 @ SortExec(_, _, left: ShuffleQueryStageExec, _),
s2 @ SortExec(_, _, right: ShuffleQueryStageExec, _))
if supportedJoinTypes.contains(joinType) =>
private def containShuffleQueryStage(plan : SparkPlan): (Boolean, ShuffleQueryStageExec) =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not just return Option[ShuffleQueryStageExec]? we can rename the method to getShuffleQueryStage

child: SparkPlan): SparkPlan = child match {
case sort @ SortExec(_, _, s: ShuffleQueryStageExec, _) =>
sort.copy(child = skewedReader)
case _ => child
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't this be: case _: ShuffleQueryStageExec => skewedReader?

logDebug(s"number of skewed partitions is ${skewedPartitions.size}")
if (skewedPartitions.nonEmpty) {
val visitedStages = HashSet.empty[Int]
val optimizedSmj = smj.transformDown {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how about transformUp? Then we don't need the visitedStages

val optimizedSmj = smj.transformDown {
case sort @ SortExec(_, _, shuffleStage: ShuffleQueryStageExec, _) =>
sort.copy(child = PartialShuffleReaderExec(shuffleStage, skewedPartitions.toSet))
case shuffleStage: ShuffleQueryStageExec if !visitedStages.contains(shuffleStage.id) =>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to be safe, we should do case s: ShuffleQueryStageExec if s.id == left.id || s.id == right.id

}
}

def handleSkewJoin(plan: SparkPlan): SparkPlan = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is not a long method, maybe just inline it in apply?

}
}

test("SPARK-30524: AQE should disable OptimizeSkewedJoin rule" +
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: SPARK-30524: Do not optimize skew join if introduce additional shuffle

@SparkQA
Copy link

SparkQA commented Jan 16, 2020

Test build #116810 has finished for PR 27226 at commit 903309f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jan 16, 2020

Test build #116816 has finished for PR 27226 at commit 81ad999.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jan 16, 2020

Test build #116819 has finished for PR 27226 at commit c37f397.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jan 16, 2020

Test build #116808 has finished for PR 27226 at commit c1c05d4.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor

retest this please

@SparkQA
Copy link

SparkQA commented Jan 16, 2020

Test build #116835 has finished for PR 27226 at commit c37f397.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in 6e5b4bf Jan 16, 2020
handleSkewJoin(plan)
// When multi table join, there will be too many complex combination to consider.
// Currently we only handle 2 table join like following two use cases.
// SMJ SMJ
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry that my previous comment was wrong. Once we have shuffle, there should always be a sort. So we don't need to match this.

s"${getClass.getSimpleName} should not take $x as the JoinType")
}

override def requiredChildDistribution: Seq[Distribution] =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should probably make this a flag to indicate it's a partial SMJ. This whole matching is too tightly coupled with the skew join rule itself.

@maryannxue
Copy link
Contributor

@JkSelf can you do a quick follow up for the comments above as well as this one:
https://github.com/apache/spark/pull/26434/files#r367506078 ?

cloud-fan pushed a commit that referenced this pull request Jan 19, 2020
### What changes were proposed in this pull request?
Resolve the remaining comments in [PR#27226](#27226).

### Why are the changes needed?
Resolve the comments.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Existing unit tests.

Closes #27253 from JkSelf/followup-skewjoinoptimization2.

Authored-by: jiake <ke.a.jia@intel.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants