
Conversation

@manuzhang
Member

@manuzhang manuzhang commented Jun 24, 2020

What changes were proposed in this pull request?

This PR creates one partition spec in ShufflePartitionsUtil if all inputs are empty, which avoids launching as many unnecessary tasks as there are shuffle partitions in the following stages.

Why are the changes needed?

For SQL like

```sql
SELECT b, COUNT(t1.a) AS cnt
FROM t1
INNER JOIN t2
ON t1.id = t2.id
WHERE t1.id > 10
GROUP BY b
```

when all ids of t1 are smaller than 10, no tasks are launched for the join since its empty input is coalesced to 0 partitions. However, many unnecessary tasks could still be launched for the following aggregate execution. Hence, I'm proposing to coalesce to one partition when all partitions are empty.
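A hedged repro sketch (table names and data are assumed for illustration, not taken from the PR):

```scala
// Hypothetical repro (assumed data): every t1.id is <= 10, so the filtered
// join input is empty, yet the GROUP BY stage can still launch as many
// tasks as there are shuffle partitions.
import spark.implicits._
spark.conf.set("spark.sql.adaptive.enabled", "true")
Seq((1, "x"), (2, "y")).toDF("id", "a").createOrReplaceTempView("t1")
Seq((1, "u"), (2, "v")).toDF("id", "b").createOrReplaceTempView("t2")
spark.sql(
  """SELECT b, COUNT(t1.a) AS cnt
    |FROM t1 INNER JOIN t2 ON t1.id = t2.id
    |WHERE t1.id > 10
    |GROUP BY b""".stripMargin).collect()
```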

Before

[screenshot]

After

[screenshot]

Does this PR introduce any user-facing change?

No.

How was this patch tested?

updated tests.

@manuzhang
Member Author

cc @cloud-fan @maryannxue @JkSelf

```scala
def createPartitionSpec(last: Boolean = false): Unit = {
  // Skip empty inputs, as it is a waste to launch an empty task,
  // unless all inputs are empty.
  if (coalescedSize > 0 || (last && partitionSpecs.isEmpty)) {
```
Contributor

@cloud-fan cloud-fan Jun 24, 2020


so you want to create at least one partition? This doesn't match the PR description.

Member Author


Yes, one partition if all partitions are empty. This creates one partition spec at the end, when no partition specs have been created so far.

@SparkQA

SparkQA commented Jun 24, 2020

Test build #124473 has finished for PR 28916 at commit 84031c1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

> which avoids launching as many unnecessary tasks as `spark.sql.adaptive.coalescePartitions.initialPartitionNum`

Why would having no partitions cause this?

@manuzhang
Member Author

IIUC, stages after coalescing will be submitted in a separate job with the default number of partitions when the input has 0 partitions:
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/ShuffleExchangeExec.scala#L68
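For reference, a paraphrase of the linked code (a sketch from memory of the Spark source at the time, not an exact quote):

```scala
// ShuffleExchangeExec (paraphrased): with a 0-partition input there is no
// map stage to submit, so the future completes with null and
// ShuffleQueryStageExec#mapStats ends up None downstream.
@transient lazy val mapOutputStatisticsFuture: Future[MapOutputStatistics] = {
  if (inputRDD.getNumPartitions == 0) {
    Future.successful(null)
  } else {
    sparkContext.submitMapStage(shuffleDependency)
  }
}
```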

@dongjoon-hyun
Member

Retest this please.

@SparkQA

SparkQA commented Jun 28, 2020

Test build #124588 has finished for PR 28916 at commit 84031c1.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member

Retest this please

@viirya
Member

viirya commented Jun 28, 2020

Same question. When partitionSpecs is empty, I think CustomShuffleReaderExec produces an RDD with empty partitions. Why would there be many unnecessary tasks?

@SparkQA

SparkQA commented Jun 29, 2020

Test build #124607 has finished for PR 28916 at commit 84031c1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


@cloud-fan
Contributor

Ideally we should launch no task for empty partitions. Launching one task is still not the best solution.

@viirya
Member

viirya commented Jun 29, 2020

It's because ShuffledRowRDD is created with the default number of shuffle partitions here: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/ShuffleExchangeExec.scala#L98

When partitionSpecs is empty, CustomShuffleReaderExec creates a ShuffledRowRDD with empty partitionSpecs:

```scala
new ShuffledRowRDD(
  stage.shuffle.shuffleDependency, stage.shuffle.readMetrics, partitionSpecs.toArray)
```

```scala
override def getPartitions: Array[Partition] = {
  Array.tabulate[Partition](partitionSpecs.length) { i =>
    ShuffledRowRDDPartition(i, partitionSpecs(i))
  }
}
```

The shuffle is changed by AQE, and CustomShuffleReaderExec replaces the original ShuffleExchangeExec, so I think the code you pointed to is superseded by the code path above. It should therefore produce empty partitions.

@cloud-fan
Contributor

@manuzhang can you check the Spark web UI and make sure AQE does launch tasks for empty partitions?

@manuzhang
Member Author

@cloud-fan Yes, also seen in the UT log before this change (I enabled the lineage log):

===== TEST OUTPUT FOR o.a.s.sql.execution.adaptive.AdaptiveQueryExecSuite: 'Empty stage coalesced to 0-partition RDD' =====
...
22:06:00.535 ScalaTest-run-running-AdaptiveQueryExecSuite INFO ShufflePartitionsUtil: For shuffle(0, 1), advisory target size: 67108864, actual target size 16.
22:06:00.683 ScalaTest-run-running-AdaptiveQueryExecSuite INFO CodeGenerator: Code generated in 88.071331 ms
22:06:00.700 ScalaTest-run-running-AdaptiveQueryExecSuite INFO CodeGenerator: Code generated in 11.333313 ms
22:06:00.752 ScalaTest-run-running-AdaptiveQueryExecSuite INFO CodeGenerator: Code generated in 9.213245 ms
22:06:00.801 ScalaTest-run-running-AdaptiveQueryExecSuite INFO CodeGenerator: Code generated in 12.257591 ms
22:06:00.855 ScalaTest-run-running-AdaptiveQueryExecSuite INFO SparkContext: Starting job: apply at OutcomeOf.scala:85
22:06:00.858 ScalaTest-run-running-AdaptiveQueryExecSuite INFO SparkContext: RDD's recursive dependencies:
(5) MapPartitionsRDD[43] at apply at OutcomeOf.scala:85 []
 |  SQLExecutionRDD[42] at apply at OutcomeOf.scala:85 []
 |  MapPartitionsRDD[41] at apply at OutcomeOf.scala:85 []
 |  MapPartitionsRDD[40] at apply at OutcomeOf.scala:85 []
 |  ShuffledRowRDD[39] at apply at OutcomeOf.scala:85 []
 +-(0) MapPartitionsRDD[38] at apply at OutcomeOf.scala:85 []
    |  MapPartitionsRDD[37] at apply at OutcomeOf.scala:85 []
    |  ZippedPartitionsRDD2[36] at apply at OutcomeOf.scala:85 []
    |  MapPartitionsRDD[33] at apply at OutcomeOf.scala:85 []
    |  ShuffledRowRDD[32] at apply at OutcomeOf.scala:85 []
    +-(2) MapPartitionsRDD[27] at apply at OutcomeOf.scala:85 []
       |  MapPartitionsRDD[26] at apply at OutcomeOf.scala:85 []
       |  MapPartitionsRDD[25] at apply at OutcomeOf.scala:85 []
       |  ParallelCollectionRDD[24] at apply at OutcomeOf.scala:85 []
    |  MapPartitionsRDD[35] at apply at OutcomeOf.scala:85 []
    |  ShuffledRowRDD[34] at apply at OutcomeOf.scala:85 []
    +-(2) MapPartitionsRDD[31] at apply at OutcomeOf.scala:85 []
       |  MapPartitionsRDD[30] at apply at OutcomeOf.scala:85 []
       |  MapPartitionsRDD[29] at apply at OutcomeOf.scala:85 []
       |  ParallelCollectionRDD[28] at apply at OutcomeOf.scala:85 []
22:06:00.860 dag-scheduler-event-loop INFO DAGScheduler: Registering RDD 38 (apply at OutcomeOf.scala:85) as input to shuffle 2
22:06:00.861 dag-scheduler-event-loop INFO DAGScheduler: Got job 2 (apply at OutcomeOf.scala:85) with 5 output partitions
22:06:00.861 dag-scheduler-event-loop INFO DAGScheduler: Final stage: ResultStage 5 (apply at OutcomeOf.scala:85)
22:06:00.861 dag-scheduler-event-loop INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 4)
22:06:00.862 dag-scheduler-event-loop INFO DAGScheduler: Missing parents: List()
22:06:00.863 dag-scheduler-event-loop INFO DAGScheduler: Submitting ResultStage 5 (MapPartitionsRDD[43] at apply at OutcomeOf.scala:85), which has no missing parents

@cloud-fan
Contributor

@manuzhang can you check the web UI as well?

@manuzhang
Member Author

@cloud-fan here it is:

[Spark web UI screenshot]

@cloud-fan
Contributor

I checked the related code and came to the same conclusion as @viirya. Can you elaborate on how this happens?

@manuzhang
Member Author

The above log is from this UT:

```scala
df1.where('a > 10).join(df2.where('b > 10), "id").groupBy('a).count()
```

I was not saying that tasks were launched for the stage with the coalesced empty partition, but for the stage consuming the output of the empty partition, which I believe is the execution of the groupBy('a).count() part.

@manuzhang
Member Author

@viirya @cloud-fan
I've updated the PR description with an example. This is more of an improvement I'm proposing for certain cases. Please let me know whether it makes sense.

@cloud-fan
Contributor

I think the key problem is that we skip CoalesceShufflePartitions when ShuffleQueryStageExec#mapStats is None. This can happen when the input RDD of the shuffle has 0 partitions. I think we should still apply CoalesceShufflePartitions in this case and wrap ShuffleQueryStageExec with a CustomShuffleReaderExec whose partitionSpecs is Nil.
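A rough sketch of that suggestion (hypothetical shape; the rule structure and constructor arguments are approximated, not the actual patch):

```scala
// Hypothetical sketch, not the actual Spark patch: instead of skipping the
// rule when mapStats is None (0-partition input), still wrap the stage in a
// shuffle reader, with an empty spec list so that downstream stages see zero
// partitions rather than the default number.
plan.transformUp {
  case stage: ShuffleQueryStageExec if stage.mapStats.isEmpty =>
    CustomShuffleReaderExec(stage, partitionSpecs = Nil) // arguments approximated
}
```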

@manuzhang
Member Author

Thanks for pointing that out. Let me try with a new PR.

@manuzhang
Member Author

@cloud-fan @viirya Please help review the new PR #28954.

@manuzhang manuzhang closed this Jun 30, 2020
cloud-fan added a commit that referenced this pull request Jul 31, 2020
### What changes were proposed in this pull request?

This PR updates the AQE framework to at least return one partition during coalescing.

This PR also updates `ShuffleExchangeExec.canChangeNumPartitions` to not coalesce for `SinglePartition`.

### Why are the changes needed?

It's a bit risky to return 0 partitions, as sometimes it's different from empty data. For example, a global aggregate will return one result row even if the input table is empty. If there are 0 partitions, no tasks will be run and no result will be returned. More specifically, the global aggregate requires `AllTuples`, and we can't coalesce to 0 partitions.
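To illustrate (a minimal example with assumed setup, not from this PR): a global aggregate over an empty input still returns exactly one row, so at least one task must run.

```scala
// Minimal illustration (assumed setup): count over an empty DataFrame
// returns one row containing 0, so the final stage needs at least one task.
import org.apache.spark.sql.functions.{col, count}
val empty = spark.range(10).filter("id > 10") // no rows survive the filter
empty.agg(count(col("id"))).show()
// +---------+
// |count(id)|
// +---------+
// |        0|
// +---------+
```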

This is not a real bug for now. The global aggregate will be planned as partial and final physical agg nodes. The partial agg will return at least one row, so the shuffle still has data. But it's better to fix this issue to avoid potential bugs in the future.

According to #28916, this change also fixes some perf problems.

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

updated test.

Closes #29307 from cloud-fan/aqe.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
cloud-fan added a commit to cloud-fan/spark that referenced this pull request Jul 31, 2020
a0x8o added a commit to a0x8o/spark that referenced this pull request Jul 31, 2020
cloud-fan added a commit to cloud-fan/spark that referenced this pull request Aug 3, 2020