[SPARK-33933][SQL] Materialize BroadcastQueryStage first to try to avoid broadcast timeout in AQE #31269
Conversation
ok to test
retest this please
It seems the test failure is irrelevant this time.
retest this please
thanks, merging to master!
#31307 merge to branch 3.0 |
It's more of an improvement, so usually we don't backport.
Isn't it a bug though?
I think it's a bug (partial) fix rather than an improvement.
It's only a problem when the cluster doesn't have sufficient resources, which makes the scheduling time very long. Usually, it's fine to ignore scheduling time.
But with the trend toward cost saving, clusters with insufficient resources will become more and more common.
Hi @cloud-fan @HyukjinKwon @viirya, I have some follow-up work to resolve the issue completely. Is it better to reuse the JIRA SPARK-33933 or to create a new one?
@zhongyu09 please open a new JIRA, thanks! |
What changes were proposed in this pull request?
This PR is the same as #30998, but with a better UT.
In AdaptiveSparkPlanExec.getFinalPhysicalPlan, when newStages are generated, sort the new stages by class type to make sure BroadcastQueryStage precedes the others.
This partial fix only guarantees that the materialization of a BroadcastQueryStage starts before the others; because the collect job for broadcasting is submitted from another thread, the issue is not completely solved.
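A minimal, self-contained sketch of the reordering idea (the class names are simplified stand-ins for Spark's internal BroadcastQueryStageExec and ShuffleQueryStageExec; this is illustrative, not the exact merged code):

```scala
// Simplified stand-ins for Spark's internal query stage classes.
sealed trait QueryStage
case class BroadcastQueryStage(id: Int) extends QueryStage
case class ShuffleQueryStage(id: Int) extends QueryStage

// Stable sort by class type: broadcast stages move to the front while the
// relative order within each group is preserved.
def reorder(newStages: Seq[QueryStage]): Seq[QueryStage] =
  newStages.sortWith {
    case (_: BroadcastQueryStage, _: BroadcastQueryStage) => false
    case (_: BroadcastQueryStage, _)                      => true
    case _                                                => false
  }

// Broadcast stages are now materialized before the shuffle stage.
val stages = Seq(ShuffleQueryStage(1), BroadcastQueryStage(2), BroadcastQueryStage(3))
assert(reorder(stages) ==
  Seq(BroadcastQueryStage(2), BroadcastQueryStage(3), ShuffleQueryStage(1)))
```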
Why are the changes needed?
When AQE is enabled, getFinalPhysicalPlan traverses the physical plan bottom-up, creates query stages for the parts that are ready to be materialized via createQueryStages, and materializes those newly created query stages, which submits map stages or broadcast jobs. When a ShuffleQueryStage is materialized before a BroadcastQueryStage, the map stage (job) and the broadcast job are submitted at almost the same time, but the map stage will hold all the computing resources. If the map stage runs slowly (when there is a lot of data to process and resources are limited), the broadcast job cannot be started (and finished) before spark.sql.broadcastTimeout, which causes the whole job to fail (introduced in SPARK-31475).
The workaround of increasing spark.sql.broadcastTimeout is neither reasonable nor graceful, because the data to broadcast is very small.
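For reference, the workaround looks like this (assuming an active SparkSession named spark; 1200 is an arbitrary value), and it only masks the symptom:

```scala
// Workaround only: raise the broadcast timeout (default is 300 seconds).
// This hides the scheduling problem instead of fixing the submission order.
spark.conf.set("spark.sql.broadcastTimeout", "1200")
```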
The order of the materialize calls can guarantee the order in which tasks are scheduled under normal circumstances, but the guarantee is not strict, since the broadcast job and the shuffle map job are submitted from different threads:
1. For the broadcast job, doPrepare() is called in the main thread, and then the real materialization starts in the "broadcast-exchange-0" thread pool, which calls getByteArrayRdd().collect() to submit the collect job.
2. For the shuffle map job, ShuffleExchangeExec.mapOutputStatisticsFuture() calls sparkContext.submitMapStage() directly in the main thread to submit the map stage.
Step 1 is triggered before step 2, so in normal cases the broadcast job is submitted first. However, we cannot control how fast the two threads run, so the "broadcast-exchange-0" thread could run a little slower than the main thread, resulting in the map stage being submitted first. So there is still a risk that the shuffle map job is scheduled before the broadcast job; the toy model below illustrates the two submission paths.
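Here is that toy model (the submitJob helper is hypothetical; it only mimics the two submission paths, not Spark's actual API):

```scala
import java.util.concurrent.Executors
import scala.concurrent.{ExecutionContext, Future}

// Stand-in for the "broadcast-exchange-0" thread pool.
val broadcastPool: ExecutionContext =
  ExecutionContext.fromExecutor(Executors.newFixedThreadPool(1))

// Hypothetical helper standing in for a job reaching the scheduler.
def submitJob(name: String): Unit = println(s"submitted: $name")

// Step 1: the broadcast collect job is only scheduled onto the pool here;
// the actual submission happens later, on the pool thread.
Future { submitJob("broadcast collect job") }(broadcastPool)

// Step 2: the shuffle map stage is submitted synchronously on the main
// thread and can reach the scheduler first if the pool thread starts late.
submitJob("shuffle map stage")
```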
Since completely fixing the issue is complex and might introduce major changes, we need more time to follow up. This partial fix is better than doing nothing; it resolves most of the cases in SPARK-33933.
Does this PR introduce any user-facing change?
NO
How was this patch tested?
Added a unit test.