[SPARK-33933][SQL][3.0] Materialize BroadcastQueryStage first to try to avoid broadcast timeout in AQE #31307

zhongyu09 · 2021-01-24T12:49:36Z

This PR is the same as #31269, to be merged into branch 3.0.

What changes were proposed in this pull request?

This PR is the same as #30998, but with a better UT.
In AdaptiveSparkPlanExec.getFinalPhysicalPlan, when new stages are generated, sort them by class type so that BroadcastQueryStage instances precede the others.
This partial fix only guarantees that materialization of BroadcastQueryStage starts before the others. Because the collect job for broadcasting is submitted from another thread, the issue is not completely solved.
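
The reordering described above can be sketched roughly as follows. This is a simplified model, not the actual patch: `QueryStage`, `BroadcastStage`, `ShuffleStage`, and `StageOrdering` are hypothetical stand-ins for Spark's query stage classes, and the sketch only shows the idea of ranking stages by class type before materializing them.

```scala
// Hypothetical stand-ins for Spark's query stage classes. The real classes
// wrap physical plans; these only carry an id for illustration.
sealed trait QueryStage { def id: Int }
final case class BroadcastStage(id: Int) extends QueryStage
final case class ShuffleStage(id: Int) extends QueryStage

object StageOrdering {
  // Broadcast stages get rank 0, every other stage gets rank 1.
  private def rank(stage: QueryStage): Int = stage match {
    case _: BroadcastStage => 0
    case _                 => 1
  }

  // sortBy is a stable sort, so stages of the same kind keep their
  // original relative order.
  def broadcastFirst(stages: Seq[QueryStage]): Seq[QueryStage] =
    stages.sortBy(rank)
}
```

With this model, `StageOrdering.broadcastFirst(Seq(ShuffleStage(1), BroadcastStage(2)))` returns the broadcast stage ahead of the shuffle stage, so its materialization would be started first.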

Why are the changes needed?

When AQE is enabled, getFinalPhysicalPlan traverses the physical plan bottom-up, creates query stages for the materialized parts via createQueryStages, and materializes the newly created query stages to submit map stages or broadcasts. When a ShuffleQueryStage is materialized before a BroadcastQueryStage, the map stage (job) and the broadcast job are submitted at almost the same time, but the map stage holds all the computing resources. If the map stage runs slowly (a lot of data to process and limited resources), the broadcast job cannot be started (and finished) before spark.sql.broadcastTimeout expires, which causes the whole job to fail (introduced in SPARK-31475).
Increasing spark.sql.broadcastTimeout as a workaround is neither sensible nor graceful, because the data to broadcast is very small.

The order in which materialize is called determines the order in which tasks are scheduled under normal circumstances, but the guarantee is not strict, because the broadcast job and the shuffle map job are submitted from different threads.

1. For the broadcast job, doPrepare() is called in the main thread, and the real materialization then starts in the "broadcast-exchange-0" thread pool, which calls getByteArrayRdd().collect() to submit the collect job.
2. For the shuffle map job, ShuffleExchangeExec.mapOutputStatisticsFuture() is called, which calls sparkContext.submitMapStage() directly in the main thread to submit the map stage.

Step 1 is triggered before step 2, so in normal cases the broadcast job is submitted first.
However, we cannot control how fast the two threads run, so the "broadcast-exchange-0" thread could run slightly slower than the main thread, resulting in the map stage being submitted first. There is therefore still a risk that the shuffle map job is scheduled before the broadcast job.
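
The remaining race can be illustrated with a small sketch that does not use Spark at all. Everything here is hypothetical: `SubmissionRace`, `submitOnce`, and the single-thread executor standing in for the "broadcast-exchange-0" pool are assumptions made for illustration. The broadcast "job" is handed to the pool first, but the shuffle "job" is submitted directly on the calling thread, so either one may be recorded first.

```scala
import java.util.concurrent.{ConcurrentLinkedQueue, Executors, TimeUnit}

object SubmissionRace {
  // Returns the order in which the two simulated jobs were actually submitted.
  def submitOnce(): List[String] = {
    val submitted = new ConcurrentLinkedQueue[String]()
    // Stand-in for the "broadcast-exchange-0" thread pool.
    val broadcastPool = Executors.newSingleThreadExecutor()
    try {
      // Step 1: materializing the broadcast stage only hands the collect job
      // to another thread; the submission happens when that thread runs.
      broadcastPool.execute(new Runnable {
        override def run(): Unit = submitted.add("broadcast")
      })
      // Step 2: the shuffle map stage is submitted on the calling thread.
      submitted.add("shuffle")
    } finally {
      // Let the queued task finish before reading the result.
      broadcastPool.shutdown()
      broadcastPool.awaitTermination(10, TimeUnit.SECONDS)
    }
    // The result contains both entries, in an order that is not guaranteed.
    submitted.toArray(new Array[String](0)).toList
  }
}
```

Both entries are always present in the result, but their order depends on thread scheduling, which is why starting the broadcast materialization first narrows the window without closing it.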

Because a complete fix is complex and might introduce major changes, it needs more time and will be followed up separately. This partial fix is better than doing nothing; it resolves most cases in SPARK-33933.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Added a UT.

AmplabJenkins · 2021-01-24T13:18:14Z

Can one of the admins verify this patch?

github-actions · 2021-05-05T00:05:07Z

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

Materialize BroadcastQueryStage first to try to avoid broadcast timeo…

a39c884

…ut in AQE

zhongyu09 mentioned this pull request Jan 25, 2021

[SPARK-33933][SQL] Materialize BroadcastQueryStage first to try to avoid broadcast timeout in AQE #31269

Closed

github-actions bot added the Stale label May 5, 2021

github-actions bot closed this May 6, 2021
