[SPARK-33933][SQL] Materialize BroadcastQueryStage first to avoid broadcast timeout in AQE #30998
Conversation
cc @maryannxue and @cloud-fan FYI
Hi @cloud-fan @viirya any other concerns? Can you approve the test for this PR (The old PR
ok to test
viirya left a comment:
looks okay.
Kubernetes integration test starting
Test build #133726 has finished for PR 30998 at commit
Kubernetes integration test status success
Kubernetes integration test starting
Kubernetes integration test status failure
Test build #133729 has finished for PR 30998 at commit
retest this please
let's see if the new test is flaky or not.
Kubernetes integration test starting
Kubernetes integration test status success
Test build #133748 has finished for PR 30998 at commit
thanks, merging to master/3.1!
…adcast timeout in AQE

### What changes were proposed in this pull request?
In AdaptiveSparkPlanExec.getFinalPhysicalPlan, when newStages are generated, sort the new stages by class type to make sure BroadcastQueryStage precedes the others. This ensures the broadcast jobs are submitted before the map jobs, avoiding a wait on job scheduling that can cause a broadcast timeout.

### Why are the changes needed?
With AQE enabled, getFinalPhysicalPlan traverses the physical plan bottom up, creates query stages for the materialized parts via createQueryStages, and materializes those newly created query stages to submit map stages or broadcasts. When a ShuffleQueryStage is materialized before a BroadcastQueryStage, the map job and the broadcast job are submitted almost at the same time, but the map job will hold all the computing resources. If the map job runs slowly (when there is a lot of data to process and resources are limited), the broadcast job cannot be started (and finished) before spark.sql.broadcastTimeout, which causes the whole job to fail (introduced in SPARK-31475).
The workaround of increasing spark.sql.broadcastTimeout is neither sensible nor graceful, because the data to broadcast is very small.

### Does this PR introduce _any_ user-facing change?
NO

### How was this patch tested?
1. Add a UT.
2. Test the code in a dev environment, as described in https://issues.apache.org/jira/browse/SPARK-33933.

Closes #30998 from zhongyu09/aqe-broadcast.

Authored-by: Yu Zhong <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit d36cdd5)
Signed-off-by: Wenchen Fan <[email protected]>
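To make the reordering idea concrete, here is a minimal, self-contained sketch of sorting the new stages so that broadcast stages are materialized first. It is not the actual AdaptiveSparkPlanExec code; the QueryStage, BroadcastStage, and ShuffleStage types below are hypothetical stand-ins.

```scala
// Hypothetical stand-in types; the real code works on Spark's QueryStageExec subclasses.
sealed trait QueryStage { def materialize(): Unit }

case class BroadcastStage(id: Int) extends QueryStage {
  override def materialize(): Unit = println(s"submit broadcast job for stage $id")
}

case class ShuffleStage(id: Int) extends QueryStage {
  override def materialize(): Unit = println(s"submit shuffle map stage $id")
}

object ReorderSketch extends App {
  val newStages: Seq[QueryStage] = Seq(ShuffleStage(1), BroadcastStage(2), ShuffleStage(3))

  // Sort so that broadcast stages come first, mirroring the PR's
  // "sort the new stages by class type" idea.
  val reordered = newStages.sortBy {
    case _: BroadcastStage => 0
    case _                 => 1
  }

  // Broadcast jobs are now triggered before the map stages.
  reordered.foreach(_.materialize())
}
```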
Thanks @cloud-fan, why not merge to branch 3.0?
@zhongyu09 can you open a backport PR for 3.0? There are many AQE code changes in this release.
Sure, just like the old one #30962, right? I am puzzled why it is not directly merged to branch 3.0, though.
The 3.0 code base is very different from 3.1, and I'm afraid the tests may fail. It's safer to make a PR to make sure all tests pass.
test("SPARK-33933: AQE broadcast should not timeout with slow map tasks") {
It seems that this test case fails fairly frequently:
- https://amplab.cs.berkeley.edu/jenkins/job/spark-branch-3.1-test-maven-hadoop-2.7-jdk-11/95/
- https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-3.2/1854/
- https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-2.7-jdk-11-scala-2.13/147/#showFailuresLink
- https://amplab.cs.berkeley.edu/jenkins/job/spark-branch-3.1-test-maven-hadoop-3.2/80/
Hmm, so it is still flaky.
I tried the test tens of times and it failed twice. As we discussed, the guarantee is not strict, since the broadcast job and the shuffle map job are submitted from different threads, so there is still a risk that the shuffle map job is scheduled before the broadcast job. I wonder whether we should remove the UT until we resolve the issue thoroughly.
I ran "SPARK-33933: AQE broadcast should not timeout with slow map tasks" locally 5 times and all 5 runs failed as follows:
- SPARK-33933: AQE broadcast should not timeout with slow map tasks *** FAILED ***
1751 was not greater than 2000 (AdaptiveQueryExecSuite.scala:1454)
> Hmm, so it is still flaky.

Yes.
The Spark conf was changed to local[2], so the test runs faster than before.
This shows the test is unreliable...
Checking the Spark job submission order should be easy to do and fast to run, and with a retry it should be unlikely to fail. It's better to check the stage submission order directly, if we can figure out how to do it.
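A rough sketch of what that listener-based check could look like is below; the session setup, the query placeholder, and the way the order is inspected are assumptions for illustration, not the test that was actually added.

```scala
import scala.collection.mutable.ArrayBuffer

import org.apache.spark.scheduler.{SparkListener, SparkListenerStageSubmitted}
import org.apache.spark.sql.SparkSession

object StageOrderSketch extends App {
  val spark = SparkSession.builder().master("local[2]").appName("stage-order-sketch").getOrCreate()

  // Record the id of every stage as it is submitted.
  val submittedStages = ArrayBuffer[Int]()
  spark.sparkContext.addSparkListener(new SparkListener {
    override def onStageSubmitted(stageSubmitted: SparkListenerStageSubmitted): Unit =
      submittedStages.synchronized { submittedStages += stageSubmitted.stageInfo.stageId }
  })

  // ... run an AQE query here that triggers both a broadcast job and a shuffle map stage ...

  // Afterwards, inspect the recorded order; mapping stage ids back to the broadcast
  // job vs. the map stage is the part that still needs figuring out.
  println(submittedStages.mkString(", "))
  spark.stop()
}
```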
Yes, I know; the failure reported by @LuciferYang is easy to solve.
But the problem is that the job submission order may not be correct, like in the test in #31084.
@cloud-fan You mean that with a retry the test should be unlikely to fail, and that handles the edge case?
I am trying to create a new PR. Sorry for the inconvenience.
We can get the stage submission time using a SparkListener. But after trying several times, the stage submission time is not stable, so the UT cannot always pass. I suggest removing the UT until we completely solve the issue. #31099
Guys, let me revert this one. It causes test failures too often, and it blocks RC preparation. The flakiness is more obvious when you check the jobs here: https://amplab.cs.berkeley.edu/jenkins/, and this blocks me from checking the test results from PySpark or SparkR, at least for the RC.
+1 for reverting it.
+1
Created another PR: #31167
Thanks @zhongyu09
…oid broadcast timeout in AQE

### What changes were proposed in this pull request?
This PR is the same as #30998, but with a better UT.
In AdaptiveSparkPlanExec.getFinalPhysicalPlan, when newStages are generated, sort the new stages by class type to make sure BroadcastQueryStage precedes the others. This partial fix only guarantees that the materialization of BroadcastQueryStage starts before the others; because the collect job for broadcasting is submitted from another thread, the issue is not completely solved.

### Why are the changes needed?
With AQE enabled, getFinalPhysicalPlan traverses the physical plan bottom up, creates query stages for the materialized parts via createQueryStages, and materializes those newly created query stages to submit map stages or broadcasts. When a ShuffleQueryStage is materialized before a BroadcastQueryStage, the map stage (job) and the broadcast job are submitted almost at the same time, but the map stage will hold all the computing resources. If the map stage runs slowly (when there is a lot of data to process and resources are limited), the broadcast job cannot be started (and finished) before spark.sql.broadcastTimeout, which causes the whole job to fail (introduced in SPARK-31475). The workaround of increasing spark.sql.broadcastTimeout is neither sensible nor graceful, because the data to broadcast is very small.
The order of calling materialize determines the order in which jobs are scheduled under normal circumstances, but the guarantee is not strict, since the broadcast job and the shuffle map job are submitted from different threads:
1. For the broadcast job, doPrepare() is called in the main thread, and then the real materialization starts in the "broadcast-exchange-0" thread pool, where getByteArrayRdd().collect() is called to submit the collect job.
2. For the shuffle map job, ShuffleExchangeExec.mapOutputStatisticsFuture() is called, which calls sparkContext.submitMapStage() directly in the main thread to submit the map stage.
Step 1 is triggered before step 2, so in normal cases the broadcast job will be submitted first. However, we cannot control how fast the two threads run, so the "broadcast-exchange-0" thread could run a little slower than the main thread, resulting in the map stage being submitted first. So there is still a risk that the shuffle map job is scheduled before the broadcast job.
Since completely fixing the issue is complex and might introduce major changes, we need more time to follow up. This partial fix is better than doing nothing; it resolves most of the cases in SPARK-33933.

### Does this PR introduce _any_ user-facing change?
NO

### How was this patch tested?
Add a UT.

Closes #31269 from zhongyu09/aqe-broadcast-partial-fix.

Authored-by: Yu Zhong <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
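A small illustrative sketch of the race described above, using plain JVM threading rather than Spark code: the broadcast work is handed off to a separate thread pool while the map-stage submission stays on the calling thread, so which one reaches the scheduler first depends on thread timing.

```scala
import java.util.concurrent.Executors

object SubmissionRaceSketch extends App {
  val broadcastPool = Executors.newSingleThreadExecutor()

  // Triggered first, but runs on the pool thread (like "broadcast-exchange-0" in Spark).
  broadcastPool.submit(new Runnable {
    override def run(): Unit = println("broadcast job submitted")
  })

  // Triggered second, but runs on the main thread (like sparkContext.submitMapStage()).
  println("shuffle map stage submitted")

  broadcastPool.shutdown()
  // Even though the broadcast submission is triggered first, the pool thread may
  // start late, so the map-stage line can print first.
}
```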