Skip to content

Conversation

@zhongyu09
Copy link
Contributor

What changes were proposed in this pull request?

In AdaptiveSparkPlanExec.getFinalPhysicalPlan, when newStages are generated, sort the new stages by class type to make sure BroadcastQueryState precede others.
It can make sure the broadcast job are submitted before map jobs to avoid waiting for schedule and cause broadcast timeout.

Why are the changes needed?

When enable AQE, in getFinalPhysicalPlan, spark traversal the physical plan bottom up and create query stage for materialized part by createQueryStages and materialize those new created query stages to submit map stages or broadcasting. When ShuffleQueryStage are materializing before BroadcastQueryStage, the map job and broadcast job are submitted almost at the same time, but map job will hold all the computing resources. If the map job runs slow (when lots of data needs to process and the resource is limited), the broadcast job cannot be started(and finished) before spark.sql.broadcastTimeout, thus cause whole job failed (introduced in SPARK-31475).
The workaround to increase spark.sql.broadcastTimeout doesn't make sense and graceful, because the data to broadcast is very small.

Does this PR introduce any user-facing change?

NO

How was this patch tested?

  1. Add UT
  2. Test the code using dev environment in https://issues.apache.org/jira/browse/SPARK-33933

@zhongyu09 zhongyu09 changed the base branch from master to branch-3.0 December 29, 2020 09:47
@zhongyu09 zhongyu09 changed the title [WIP][SPARK-33933][SQL] materialize BroadcastQueryState first to avoid broadcast timeout in AQE [WIP][SPARK-33933][SQL] Materialize BroadcastQueryState first to avoid broadcast timeout in AQE Dec 29, 2020
@zhongyu09
Copy link
Contributor Author

zhongyu09 commented Dec 29, 2020

Hi, @cloud-fan @maryannxue @Ngone51 @JkSelf, could you please take a look?

@zhongyu09 zhongyu09 changed the title [WIP][SPARK-33933][SQL] Materialize BroadcastQueryState first to avoid broadcast timeout in AQE [SPARK-33933][SQL] Materialize BroadcastQueryState first to avoid broadcast timeout in AQE Dec 31, 2020
@dongjoon-hyun
Copy link
Member

Could you rebase this PR once more to trigger GitHub Action, @zhongyu09 ?

@dongjoon-hyun
Copy link
Member

ok to test

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zhongyu09 . Please open this PR to the master branch first.

To prevent a future regression, we need to fix the branch in an order: master -> branch-3.1 -> branch-3.0.

@SparkQA
Copy link

SparkQA commented Jan 2, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38184/

@SparkQA
Copy link

SparkQA commented Jan 2, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38184/

@SparkQA
Copy link

SparkQA commented Jan 2, 2021

Test build #133595 has finished for PR 30962 at commit e9ae1e8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zhongyu09 zhongyu09 closed this Jan 3, 2021
@zhongyu09 zhongyu09 deleted the aqe-broadcast branch January 3, 2021 12:39
@zhongyu09
Copy link
Contributor Author

Closing this PR and create another(#30998) based on master, sorry for inconvenience.

@dongjoon-hyun
Copy link
Member

Thank you, @zhongyu09 .

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants