[SPARK-33933][SQL] Materialize BroadcastQueryStage first to avoid broadcast timeout in AQE #30962
Conversation
Hi, @cloud-fan @maryannxue @Ngone51 @JkSelf, could you please take a look?

Could you rebase this PR once more to trigger GitHub Actions, @zhongyu09?

ok to test
dongjoon-hyun left a comment
@zhongyu09, please open this PR against the master branch first.
To prevent a future regression, we need to fix the branches in order: master -> branch-3.1 -> branch-3.0.
Kubernetes integration test starting

Kubernetes integration test status success

Test build #133595 has finished for PR 30962 at commit
Closing this PR and creating another (#30998) based on master; sorry for the inconvenience.

Thank you, @zhongyu09.
What changes were proposed in this pull request?
In AdaptiveSparkPlanExec.getFinalPhysicalPlan, when newStages are generated, sort the new stages by class type so that BroadcastQueryStage instances precede the others.
This ensures the broadcast jobs are submitted before the map jobs, so the broadcasts do not sit waiting behind the map jobs in the scheduler and hit the broadcast timeout. A minimal sketch of the reordering idea is shown below.
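The following self-contained sketch models the reordering with stand-in types; BroadcastStage and ShuffleStage are hypothetical simplifications, not Spark's actual BroadcastQueryStageExec and ShuffleQueryStageExec classes:

```scala
// Stand-in model of the two query-stage kinds (illustrative only).
sealed trait QueryStage { def materialize(): Unit }

final case class BroadcastStage(id: Int) extends QueryStage {
  def materialize(): Unit = println(s"submit broadcast job for stage $id")
}

final case class ShuffleStage(id: Int) extends QueryStage {
  def materialize(): Unit = println(s"submit shuffle map job for stage $id")
}

object StageOrdering extends App {
  val newStages: Seq[QueryStage] = Seq(ShuffleStage(1), BroadcastStage(2), ShuffleStage(3))

  // Sort so broadcast stages come first: their small jobs are submitted
  // before the long-running map jobs occupy all executors.
  val ordered = newStages.sortBy {
    case _: BroadcastStage => 0
    case _: ShuffleStage   => 1
  }

  ordered.foreach(_.materialize())
  // Prints:
  //   submit broadcast job for stage 2
  //   submit shuffle map job for stage 1
  //   submit shuffle map job for stage 3
}
```

Because sortBy is stable, the relative order among the shuffle stages (and among the broadcast stages) is preserved; only broadcast stages are moved ahead.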
Why are the changes needed?
When AQE is enabled, getFinalPhysicalPlan traverses the physical plan bottom-up, creates query stages for the materializable parts via createQueryStages, and then materializes those newly created query stages, which submits the map stages or starts the broadcasts. When a ShuffleQueryStage is materialized before a BroadcastQueryStage, the map job and the broadcast job are submitted at almost the same time, but the map job holds all of the computing resources. If the map job runs slowly (when there is a lot of data to process and resources are limited), the broadcast job cannot start (and finish) before spark.sql.broadcastTimeout elapses, which fails the whole job (introduced in SPARK-31475).
The workaround of increasing spark.sql.broadcastTimeout is neither reasonable nor graceful, because the data to broadcast is very small.
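For context, a minimal session sketch showing the two configurations involved (the app name is arbitrary, and the timeout shown is the 300-second default):

```scala
import org.apache.spark.sql.SparkSession

// AQE enabled with the default broadcast timeout. Raising
// spark.sql.broadcastTimeout here (the rejected workaround) would
// only mask the scheduling issue rather than fix it.
val spark = SparkSession.builder()
  .appName("aqe-broadcast-timeout")
  .config("spark.sql.adaptive.enabled", "true")
  .config("spark.sql.broadcastTimeout", "300") // seconds
  .getOrCreate()
```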
Does this PR introduce any user-facing change?
No.
How was this patch tested?