[SPARK-30524] [SQL] Disable OptimizeSkewedJoin rule when introducing additional shuffle #27226
JkSelf wants to merge 6 commits into apache:master
@cloud-fan @hvanhovell @maryannxue Please help review if you have available time. Thanks for your help.
Test build #116807 has finished for PR 27226 at commit
| private def containShuffleQueryStage(plan : SparkPlan): (Boolean, ShuffleQueryStageExec) = | ||
| plan match { | ||
| case stage: ShuffleQueryStageExec => (true, stage) | ||
| case sort: SortExec if (sort.child.isInstanceOf[ShuffleQueryStageExec]) => |
nit: case SortExec(_, _, s: ShuffleQueryStageExec, _)
| private def reOptimizeChild( | ||
| skewedReader: SkewedPartitionReaderExec, | ||
| child: SparkPlan): SparkPlan = child match { | ||
| case sort: SortExec if (sort.child.isInstanceOf[ShuffleQueryStageExec]) => |
| |Try to optimize skewed join. | ||
| |Left side partition size: $leftSizeInfo | ||
| |Right side partition size: $rightSizeInfo | ||
| |Try to optimize skewed join. |
the previous indentation seems correct.
| s1 @ SortExec(_, _, left: ShuffleQueryStageExec, _), | ||
| s2 @ SortExec(_, _, right: ShuffleQueryStageExec, _)) | ||
| if supportedJoinTypes.contains(joinType) => | ||
| private def containShuffleQueryStage(plan : SparkPlan): (Boolean, ShuffleQueryStageExec) = |
why not just return Option[ShuffleQueryStageExec]? we can rename the method to getShuffleQueryStage
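A minimal sketch of the suggested signature, assuming simplified stand-in node types (`StageNode`, `SortNode`, `OtherNode` are hypothetical and are not Spark's `ShuffleQueryStageExec` or `SortExec`). It only illustrates how an `Option` return replaces the `(Boolean, ShuffleQueryStageExec)` pair, and how a typed extractor pattern avoids the `isInstanceOf` guard:

```scala
// Hypothetical, simplified stand-ins for physical plan nodes (not Spark classes).
sealed trait Node
final case class StageNode(id: Int) extends Node
final case class SortNode(child: Node) extends Node
final case class OtherNode(name: String) extends Node

object GetShuffleStageSketch {
  // Returns the shuffle stage when the plan is a stage, or a sort directly
  // on top of a stage; otherwise None. No separate Boolean is needed.
  def getShuffleQueryStage(plan: Node): Option[StageNode] = plan match {
    case stage: StageNode           => Some(stage)
    case SortNode(stage: StageNode) => Some(stage) // typed pattern binds the child
    case _                          => None
  }

  // Callers use the Option directly instead of checking a flag first.
  def describe(plan: Node): String =
    getShuffleQueryStage(plan).map(s => s"stage ${s.id}").getOrElse("no stage")
}
```

With this shape the caller cannot read a stage from a result whose flag is false, because the absent case carries no stage at all.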
| child: SparkPlan): SparkPlan = child match { | ||
| case sort @ SortExec(_, _, s: ShuffleQueryStageExec, _) => | ||
| sort.copy(child = skewedReader) | ||
| case _ => child |
shouldn't this be: case _: ShuffleQueryStageExec => skewedReader?
| logDebug(s"number of skewed partitions is ${skewedPartitions.size}") | ||
| if (skewedPartitions.nonEmpty) { | ||
| val visitedStages = HashSet.empty[Int] | ||
| val optimizedSmj = smj.transformDown { |
how about transformUp? Then we don't need the visitedStages
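To illustrate the point about traversal order, here is a hedged sketch on a toy tree. `MiniPlan`, `MiniStage`, `MiniReader` and `MiniSort` are hypothetical types, and the two traversal methods only model the pre-order and post-order behaviour of Catalyst's `transformDown` and `transformUp`. It assumes the inserted reader keeps the stage as its child, which is why a top-down rewrite would meet the same stage again:

```scala
import scala.collection.mutable

// Hypothetical mini plan tree; not Spark's TreeNode.
sealed trait MiniPlan {
  def children: Seq[MiniPlan]
  def withChildren(cs: Seq[MiniPlan]): MiniPlan

  // Pre-order: rewrite this node, then descend into the children of the result.
  def transformDown(rule: PartialFunction[MiniPlan, MiniPlan]): MiniPlan = {
    val after = rule.applyOrElse(this, (p: MiniPlan) => p)
    after.withChildren(after.children.map(_.transformDown(rule)))
  }

  // Post-order: rewrite the children first, then this node. The children of a
  // freshly produced node are not visited again.
  def transformUp(rule: PartialFunction[MiniPlan, MiniPlan]): MiniPlan = {
    val rebuilt = withChildren(children.map(_.transformUp(rule)))
    rule.applyOrElse(rebuilt, (p: MiniPlan) => p)
  }
}

final case class MiniStage(id: Int) extends MiniPlan {
  def children: Seq[MiniPlan] = Nil
  def withChildren(cs: Seq[MiniPlan]): MiniPlan = this
}
final case class MiniReader(child: MiniPlan) extends MiniPlan {
  def children: Seq[MiniPlan] = Seq(child)
  def withChildren(cs: Seq[MiniPlan]): MiniPlan = copy(child = cs.head)
}
final case class MiniSort(child: MiniPlan) extends MiniPlan {
  def children: Seq[MiniPlan] = Seq(child)
  def withChildren(cs: Seq[MiniPlan]): MiniPlan = copy(child = cs.head)
}

object TraversalSketch {
  // Top-down needs the visited set: after a stage is wrapped, the traversal
  // descends into the new reader and would wrap the same stage again.
  def wrapDown(plan: MiniPlan): MiniPlan = {
    val visited = mutable.HashSet.empty[Int]
    plan.transformDown {
      case s: MiniStage if !visited.contains(s.id) =>
        visited += s.id
        MiniReader(s)
    }
  }

  // Bottom-up needs no bookkeeping.
  def wrapUp(plan: MiniPlan): MiniPlan = plan.transformUp {
    case s: MiniStage => MiniReader(s)
  }
}
```

In this toy model, dropping the `visited` guard from `wrapDown` would recurse without end, which is the bookkeeping the bottom-up variant avoids.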
| val optimizedSmj = smj.transformDown { | ||
| case sort @ SortExec(_, _, shuffleStage: ShuffleQueryStageExec, _) => | ||
| sort.copy(child = PartialShuffleReaderExec(shuffleStage, skewedPartitions.toSet)) | ||
| case shuffleStage: ShuffleQueryStageExec if !visitedStages.contains(shuffleStage.id) => |
to be safe, we should do case s: ShuffleQueryStageExec if s.id == left.id || s.id == right.id
| } | ||
| } | ||
|
|
||
| def handleSkewJoin(plan: SparkPlan): SparkPlan = { |
this is not a long method, maybe just inline it in apply?
| } | ||
| } | ||
|
|
||
| test("SPARK-30524: AQE should disable OptimizeSkewedJoin rule" + |
nit: SPARK-30524: Do not optimize skew join if introduce additional shuffle
Test build #116810 has finished for PR 27226 at commit
Test build #116816 has finished for PR 27226 at commit
Test build #116819 has finished for PR 27226 at commit
Test build #116808 has finished for PR 27226 at commit
retest this please
Test build #116835 has finished for PR 27226 at commit
thanks, merging to master!
| handleSkewJoin(plan) | ||
| // When multi table join, there will be too many complex combination to consider. | ||
| // Currently we only handle 2 table join like following two use cases. | ||
| // SMJ SMJ |
Sorry that my previous comment was wrong. Once we have shuffle, there should always be a sort. So we don't need to match this.
| s"${getClass.getSimpleName} should not take $x as the JoinType") | ||
| } | ||
|
|
||
| override def requiredChildDistribution: Seq[Distribution] = |
We should probably make this a flag to indicate it's a partial SMJ. This whole matching is too tightly coupled with the skew join rule itself.
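One way to read this suggestion, as a sketch only: the join node carries an explicit flag, and its distribution requirement depends on that flag rather than on pattern-matching the shape of its children. All names below (`Dist`, `Unspecified`, `Clustered`, `SmjSketch`, `isPartial`) are hypothetical and are not taken from this PR:

```scala
// Hypothetical distribution requirements (not Spark's Distribution classes).
sealed trait Dist
case object Unspecified extends Dist
final case class Clustered(keys: Seq[String]) extends Dist

// Hypothetical join node: the flag is set by whichever rule rewrote the
// children, so the join does not need to know about specific reader nodes.
final case class SmjSketch(
    leftKeys: Seq[String],
    rightKeys: Seq[String],
    isPartial: Boolean = false) {

  def requiredChildDistribution: Seq[Dist] =
    if (isPartial) Seq(Unspecified, Unspecified) // children were already split per partition
    else Seq(Clustered(leftKeys), Clustered(rightKeys))
}
```

The skew rule would then produce `join.copy(isPartial = true)` alongside the rewritten children, keeping the coupling in one place.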
@JkSelf can you do a quick follow up for the comments above as well as this one:
### What changes were proposed in this pull request?
Resolve the remaining comments in [PR#27226](#27226).

### Why are the changes needed?
Resolve the comments.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Existing unit tests.

Closes #27253 from JkSelf/followup-skewjoinoptimization2.

Authored-by: jiake <ke.a.jia@intel.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
What changes were proposed in this pull request?
The OptimizeSkewedJoin rule changes the outputPartitioning after inserting PartialShuffleReaderExec or SkewedPartitionReaderExec, so an additional shuffle may be needed to ensure the right result. This PR disables the OptimizeSkewedJoin rule when it would introduce an additional shuffle.
Why are the changes needed?
Bug fix.
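The idea of falling back when the rewrite would cost an extra shuffle can be sketched on a toy plan as follows. Every type and function here (`Op`, `StageOp`, `SkewReaderOp`, `JoinOp`, `AggOp`, `ExchangeOp`, and the helper functions) is a hypothetical simplification, not Spark's API; in particular the `partitioned` flag stands in for outputPartitioning, and only the toy aggregate requires a partitioned child:

```scala
// Hypothetical toy operators (not Spark classes).
sealed trait Op {
  def children: Seq[Op]
  def partitioned: Boolean // stand-in for "output satisfies the parent's requirement"
}
final case class StageOp(id: Int) extends Op {
  def children: Seq[Op] = Nil
  def partitioned: Boolean = true
}
final case class SkewReaderOp(stage: StageOp) extends Op {
  def children: Seq[Op] = Seq(stage)
  def partitioned: Boolean = false // splitting skewed partitions breaks the partitioning
}
final case class JoinOp(left: Op, right: Op) extends Op {
  def children: Seq[Op] = Seq(left, right)
  def partitioned: Boolean = left.partitioned && right.partitioned
}
final case class AggOp(child: Op) extends Op {
  def children: Seq[Op] = Seq(child)
  def partitioned: Boolean = true
}
final case class ExchangeOp(child: Op) extends Op {
  def children: Seq[Op] = Seq(child)
  def partitioned: Boolean = true
}

object SkewFallbackSketch {
  // Toy requirement check: an aggregate gets an exchange under it
  // whenever its child is not partitioned.
  def ensureRequirements(op: Op): Op = op match {
    case AggOp(c) =>
      val fixed = ensureRequirements(c)
      AggOp(if (fixed.partitioned) fixed else ExchangeOp(fixed))
    case JoinOp(l, r) => JoinOp(ensureRequirements(l), ensureRequirements(r))
    case other        => other
  }

  def countExchanges(op: Op): Int =
    (op match { case _: ExchangeOp => 1; case _ => 0 }) + op.children.map(countExchanges).sum

  // Toy skew rewrite: wrap both sides of a stage-to-stage join in readers.
  def optimizeSkew(op: Op): Op = op match {
    case JoinOp(l: StageOp, r: StageOp) => JoinOp(SkewReaderOp(l), SkewReaderOp(r))
    case AggOp(c)                       => AggOp(optimizeSkew(c))
    case other                          => other
  }

  // Keep the rewrite only if re-checking requirements adds no exchange.
  def apply(plan: Op): Op = {
    val optimized = optimizeSkew(plan)
    if (countExchanges(ensureRequirements(optimized)) > 0) plan else optimized
  }
}
```

In this sketch a bare join keeps the skew rewrite, while the same join under an aggregate is left unchanged because the rewrite would force an exchange below the aggregate.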
Does this PR introduce any user-facing change?
No
How was this patch tested?
Added a new unit test.