[SPARK-33933][SQL] Materialize BroadcastQueryStage first to avoid broadcast timeout in AQE #31167

zhongyu09 · 2021-01-13T08:55:09Z

What changes were proposed in this pull request?

In AdaptiveSparkPlanExec.getFinalPhysicalPlan, when new stages are generated, materialize the BroadcastQueryStage instances first and wait for their materialization to finish before materializing the other (ShuffleQueryStage) stages.
This ensures that the broadcast jobs are scheduled and finished before the map jobs, so that a broadcast job is not left waiting to be scheduled until it times out. This is the same behavior as non-AQE queries.
Ideally, we would only control scheduling so that the broadcast jobs are scheduled before the map jobs. However, that is difficult to control and may require large changes to spark-core. So the trade-off is to wait for the broadcast jobs to finish before materializing any ShuffleQueryStage.

Consider the following case: a is a large table, and b and c are very small in-memory dimension tables.

SELECT a.id, a.name, b.name, c.name, count(a.value) 
FROM a 
JOIN b on a.id = b.id 
JOIN c on a.name = c.name 
GROUP BY a.id, a.name

For non-AQE:

run collect on b, then broadcast b
run collect on c, then broadcast c
submit a job that contains 2 stages

For current AQE:

submit 3 jobs (shuffle map stage for a; collect b and broadcast b; collect c and broadcast c) at almost the same time
when all have finished, run the result stage

For AQE with this PR:

submit 2 jobs (collect b and broadcast b; collect c and broadcast c) at the same time
wait for the broadcasts of b and c to finish
run the shuffle map stage
run the result stage
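
The ordering for "AQE with this PR" can be sketched in plain Scala, without Spark. This is only a minimal illustration under stated assumptions: Stage, BroadcastStage, ShuffleStage, and materializeAll are hypothetical stand-ins, not the real BroadcastQueryStageExec / ShuffleQueryStageExec API, and each "materialization" is simulated by a trivial Future.

```scala
import scala.collection.mutable.ArrayBuffer
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

// Hypothetical stand-ins for the query stage classes; not Spark's real API.
sealed trait Stage { def name: String }
final case class BroadcastStage(name: String) extends Stage
final case class ShuffleStage(name: String) extends Stage

object BroadcastFirst {
  // Materializes broadcast stages first, waits for them, then materializes the rest.
  // Returns the recorded event order so the sequencing can be inspected.
  def materializeAll(stages: Seq[Stage])(implicit ec: ExecutionContext): List[String] = {
    val events = ArrayBuffer.empty[String]
    def record(event: String): Unit = events.synchronized { events += event; () }

    // Simulated materialization: a real stage would submit a Spark job here.
    def materialize(stage: Stage): Future[Unit] = Future(record(s"done:${stage.name}"))

    val (broadcasts, others) = stages.partition(_.isInstanceOf[BroadcastStage])

    // 1. Start all broadcast stages concurrently.
    val broadcastFutures = broadcasts.map(materialize)
    // 2. Block until every broadcast has finished.
    broadcastFutures.foreach(f => Await.ready(f, Duration.Inf))
    record("broadcasts-finished")

    // 3. Only now start the remaining (shuffle) stages.
    val otherFutures = others.map(materialize)
    otherFutures.foreach(f => Await.ready(f, Duration.Inf))

    events.synchronized(events.toList)
  }
}
```

Calling BroadcastFirst.materializeAll(Seq(ShuffleStage("a"), BroadcastStage("b"), BroadcastStage("c"))) with an implicit ExecutionContext in scope always records both broadcast completions before the "broadcasts-finished" marker, and the shuffle stage after it. The cost of this design is the blocking wait in step 2, which is the trade-off described above.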

Why are the changes needed?

In non-AQE, we always wait for the broadcast to finish before submitting shuffle map tasks.

When AQE is enabled, in getFinalPhysicalPlan, Spark traverses the physical plan bottom-up, creates query stages for the materialized parts via createQueryStages, and materializes the newly created query stages to submit map stages or broadcasts. When a ShuffleQueryStage is materialized before a BroadcastQueryStage, the map stage (job) and the broadcast job are submitted at almost the same time, but the map stage will hold all the computing resources. If the map stage runs slowly (when there is a lot of data to process and resources are limited), the broadcast job cannot be started (and finished) before spark.sql.broadcastTimeout expires, which causes the whole job to fail (introduced in SPARK-31475).

The workaround of increasing spark.sql.broadcastTimeout is neither sensible nor graceful, because the data to broadcast is very small.

#30998 gave a solution that sorts the new stages by class type to make sure that materialize() is called for BroadcastQueryStage before the others. However, that solution is not perfect, and it was reverted because of a flaky UT. The order of the materialize() calls guarantees the order in which the tasks are scheduled under normal circumstances, but the guarantee is not strict, because the broadcast job and the shuffle map job are submitted from different threads:

1. For a broadcast job, doPrepare() is called in the main thread, and the real materialization then starts in the "broadcast-exchange-0" thread pool, which calls getByteArrayRdd().collect() to submit the collect job.
2. For a shuffle map job, ShuffleExchangeExec.mapOutputStatisticsFuture() is called, which calls sparkContext.submitMapStage() directly in the main thread to submit the map stage.

Step 1 is triggered before step 2, so in normal cases the broadcast job is submitted first.
However, we cannot control how fast the two threads run, so the "broadcast-exchange-0" thread could run slightly slower than the main thread, resulting in the map stage being submitted first. So there is still a risk that the shuffle map job is scheduled before the broadcast job.
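
The race described above can be reproduced deterministically in plain Scala. This is a simulation under stated assumptions, not Spark code: the single-thread executor merely stands in for the "broadcast-exchange-0" thread, submit is a hypothetical helper that records which job reaches the scheduler first, and the latch forces the "slow broadcast thread" interleaving that in reality happens only occasionally.

```scala
import java.util.concurrent.{CountDownLatch, Executors}
import scala.collection.mutable.ArrayBuffer

object SubmissionRace {
  // Returns the order in which the two simulated jobs reach the scheduler.
  def simulate(): List[String] = {
    val submitted = ArrayBuffer.empty[String]
    def submit(job: String): Unit = submitted.synchronized { submitted += job; () }

    // Stands in for the "broadcast-exchange-0" thread.
    val broadcastPool = Executors.newSingleThreadExecutor()
    val mapStageSubmitted = new CountDownLatch(1)
    val broadcastSubmitted = new CountDownLatch(1)

    // Step 1: materialize() for the broadcast stage is called first, but the job
    // itself is submitted later, from the pool thread. The latch models that
    // thread running slower than the main thread.
    broadcastPool.execute(new Runnable {
      override def run(): Unit = {
        mapStageSubmitted.await()
        submit("broadcast job")
        broadcastSubmitted.countDown()
      }
    })

    // Step 2: the shuffle map stage is submitted directly on the calling thread.
    submit("shuffle map stage")
    mapStageSubmitted.countDown()

    broadcastSubmitted.await()
    broadcastPool.shutdown()
    submitted.synchronized(submitted.toList)
  }
}
```

Even though the broadcast work is handed off first, SubmissionRace.simulate() records the shuffle map stage ahead of the broadcast job. Call order alone therefore cannot give a strict guarantee on submission order.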

Does this PR introduce any user-facing change?

NO

How was this patch tested?

Added a UT.
Tested the code in the dev environment described in https://issues.apache.org/jira/browse/SPARK-33933

cloud-fan · 2021-01-13T09:14:29Z

With this PR, I think there will be an AQE perf regression if the cluster has sufficient resources.

zhongyu09 · 2021-01-13T13:59:53Z

Hi @HyukjinKwon @dongjoon-hyun @cloud-fan @viirya @LuciferYang @maryannxue, please help review. Let's test it sufficiently this time.

LuciferYang · 2021-01-13T14:43:08Z

sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/AdaptiveSparkPlanExec.scala

+          val broadcastMaterializationFutures = result.newStages
+            .filter(_.isInstanceOf[BroadcastQueryStageExec])
+            .map { stage =>
+            var future: Future[Any] = null


Indentation: lines 201~216

I am not sure whether lines 201~215 should be indented by 2 more spaces. I just kept them the same as lines 225~236 (old code).

They should be :)

LuciferYang · 2021-01-13T15:15:46Z

sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/AdaptiveSparkPlanExec.scala

+          }
+
+          // Wait for the materialization of all broadcast stages finish
+          broadcastMaterializationFutures.foreach(ThreadUtils.awaitReady(_, Duration.Inf))


Is it necessary to wait until all BroadcastQueryStageExec instances are materialized? This may waste resources, as @cloud-fan said.

Indeed, there will be a small waste of resources. This is the same behavior as non-AQE. Given how lightweight a broadcast is, it should not take much time, normally a few seconds. I think that's acceptable.
If we do not wait, there is still a chance that the situation in #30998 will occur and cause a broadcast timeout.

@zhongyu09 It might be better to provide a benchmark that compares the performance before and after.

Yes. Do we have a benchmark testing framework?

A micro benchmark can be based on BenchmarkBase or SqlBasedBenchmark, like DataSourceReadBenchmark. But for this scenario, I would prefer that you give a description of the test process and a comparison of the benchmark numbers, perhaps with some screenshots.

I am fine with a partial fix like #30998. I wonder whether it is too heavy to add a new event just for the UT?
I would prefer to fix the problem without a perf regression. But we can also let the partial fix go in first.

We can also log the stage submission and then write a test to verify the log.

That's an idea. I will look into how to do this. Do we have any UT that verifies the log?

Yeah, a lot, e.g. AdaptiveQueryExecSuite.test log level

I put up a partial fix, as discussed, in #31269. cc @viirya
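
For reference, the ordering-only idea behind the partial fix (sorting the new stages by class type so that materialize() is called for broadcast stages first, as described for #30998) could look roughly like the sketch below. It is an illustration under stated assumptions, not the code of either PR: QueryStage, Broadcast, Shuffle, and broadcastFirst are hypothetical names.

```scala
object StageOrdering {
  // Hypothetical stand-ins for the query stage classes; not Spark's real API.
  sealed trait QueryStage { def id: Int }
  final case class Broadcast(id: Int) extends QueryStage
  final case class Shuffle(id: Int) extends QueryStage

  // Stable sort: broadcast stages move to the front, and the relative order
  // within each group is preserved.
  def broadcastFirst(stages: Seq[QueryStage]): Seq[QueryStage] =
    stages.sortBy {
      case _: Broadcast => 0
      case _: Shuffle   => 1
    }
}
```

For example, StageOrdering.broadcastFirst(Seq(Shuffle(0), Broadcast(1), Shuffle(2), Broadcast(3))) puts Broadcast(1) and Broadcast(3) ahead of Shuffle(0) and Shuffle(2). This only fixes the order of the materialize() calls. It does not wait for the broadcasts, so it avoids the perf regression but does not remove the cross-thread race.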

HyukjinKwon · 2021-01-14T00:37:13Z

ok to test

SparkQA · 2021-01-14T01:27:01Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38617/

SparkQA · 2021-01-14T01:52:27Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38617/

SparkQA · 2021-01-14T05:05:35Z

Test build #134030 has finished for PR 31167 at commit 6bc38f0.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

dongjoon-hyun · 2021-03-20T04:54:14Z

This PR seems to be superseded by the author at #31269

[SPARK-33933][SQL] Materialize BroadcastQueryStage first to try to avoid broadcast timeout in AQE

I'll close this. Please feel free to reopen this if I'm wrong.

zhongyu09 · 2021-03-22T06:39:59Z

This PR seems to be superseded by the author at #31269
[SPARK-33933][SQL] Materialize BroadcastQueryStage first to try to avoid broadcast timeout in AQE
I'll close this. Please feel free to reopen this if I'm wrong.

Thanks, it is superseded.

SPARK-33933: materialize broadcast stages first and wait the material…

8bfb1e5

…ization finish before materialize other stages

zhongyu09 mentioned this pull request Jan 13, 2021

[SPARK-33933][SQL] Materialize BroadcastQueryStage first to avoid broadcast timeout in AQE #30998

Closed

github-actions bot added the SQL label Jan 13, 2021

fix comment typo

6bc38f0

LuciferYang reviewed Jan 13, 2021

viirya reviewed Jan 14, 2021

dongjoon-hyun closed this Mar 20, 2021
