[SPARK-33933][SQL] Materialize BroadcastQueryStage first to avoid broadcast timeout in AQE #30962
Conversation
Hi, @cloud-fan @maryannxue @Ngone51 @JkSelf, could you please take a look?

Could you rebase this PR once more to trigger GitHub Actions, @zhongyu09?

ok to test
dongjoon-hyun left a comment
@zhongyu09, please open this PR against the master branch first.
To prevent a future regression, we need to fix the branches in order: master -> branch-3.1 -> branch-3.0.
Kubernetes integration test starting

Kubernetes integration test status success

Test build #133595 has finished for PR 30962 at commit
Closing this PR and creating another (#30998) based on master; sorry for the inconvenience.

Thank you, @zhongyu09.
What changes were proposed in this pull request?
In AdaptiveSparkPlanExec.getFinalPhysicalPlan, when newStages are generated, sort the new stages by class type so that BroadcastQueryStage instances precede the others.
This ensures the broadcast jobs are submitted before the map jobs, so the broadcasts do not sit waiting behind the map jobs in the scheduler and hit the broadcast timeout. A minimal sketch of the reordering idea is shown below.
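The following self-contained sketch models the reordering with stand-in types; BroadcastStage and ShuffleStage are hypothetical simplifications, not Spark's actual BroadcastQueryStageExec and ShuffleQueryStageExec classes:

```scala
// Stand-in model of the two query-stage kinds (illustrative only).
sealed trait QueryStage { def materialize(): Unit }

final case class BroadcastStage(id: Int) extends QueryStage {
  def materialize(): Unit = println(s"submit broadcast job for stage $id")
}

final case class ShuffleStage(id: Int) extends QueryStage {
  def materialize(): Unit = println(s"submit shuffle map job for stage $id")
}

object StageOrdering extends App {
  val newStages: Seq[QueryStage] = Seq(ShuffleStage(1), BroadcastStage(2), ShuffleStage(3))

  // Sort so broadcast stages come first: their small jobs are submitted
  // before the long-running map jobs occupy all executors.
  val ordered = newStages.sortBy {
    case _: BroadcastStage => 0
    case _: ShuffleStage   => 1
  }

  ordered.foreach(_.materialize())
  // Prints:
  //   submit broadcast job for stage 2
  //   submit shuffle map job for stage 1
  //   submit shuffle map job for stage 3
}
```

Because sortBy is stable, the relative order among the shuffle stages (and among the broadcast stages) is preserved; only broadcast stages are moved ahead.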
Why are the changes needed?
When AQE is enabled, getFinalPhysicalPlan traverses the physical plan bottom-up, creates query stages for the materializable parts via createQueryStages, and then materializes those newly created query stages, which submits the map stages or starts the broadcasts. When a ShuffleQueryStage is materialized before a BroadcastQueryStage, the map job and the broadcast job are submitted at almost the same time, but the map job holds all of the computing resources. If the map job runs slowly (when there is a lot of data to process and resources are limited), the broadcast job cannot start (and finish) before spark.sql.broadcastTimeout elapses, which fails the whole job (introduced in SPARK-31475).
The workaround of increasing spark.sql.broadcastTimeout is neither reasonable nor graceful, because the data to broadcast is very small.
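For context, a minimal session sketch showing the two configurations involved (the app name is arbitrary, and the timeout shown is the 300-second default):

```scala
import org.apache.spark.sql.SparkSession

// AQE enabled with the default broadcast timeout. Raising
// spark.sql.broadcastTimeout here (the rejected workaround) would
// only mask the scheduling issue rather than fix it.
val spark = SparkSession.builder()
  .appName("aqe-broadcast-timeout")
  .config("spark.sql.adaptive.enabled", "true")
  .config("spark.sql.broadcastTimeout", "300") // seconds
  .getOrCreate()
```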
Does this PR introduce any user-facing change?
No.
How was this patch tested?