
Conversation

@zhongyu09
Contributor

This PR is the same as #30998 to merge to branch 3.0

What changes were proposed in this pull request?

In AdaptiveSparkPlanExec.getFinalPhysicalPlan, when newStages are generated, sort the new stages by class type so that BroadcastQueryStage instances precede the others.
This ensures the broadcast jobs are submitted before the map jobs, so they do not wait on job scheduling and hit the broadcast timeout.
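Here is a minimal, self-contained Scala sketch of the ordering idea only; the stand-in types below are hypothetical, while the real patch works on the query stage classes inside AdaptiveSparkPlanExec:

```scala
// Hypothetical stand-in types, not the real AQE classes.
sealed trait Stage
case class BroadcastStage(id: Int) extends Stage
case class ShuffleStage(id: Int) extends Stage

// Stages produced by one round of createQueryStages, in traversal order.
val newStages: Seq[Stage] = Seq(ShuffleStage(1), BroadcastStage(2), ShuffleStage(3))

// Sort by class type so broadcast stages come first and are materialized
// (and their jobs submitted) before any shuffle map stage.
val ordered = newStages.sortBy {
  case _: BroadcastStage => 0
  case _                 => 1
}
// ordered == Seq(BroadcastStage(2), ShuffleStage(1), ShuffleStage(3))
```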

Why are the changes needed?

When AQE is enabled, getFinalPhysicalPlan traverses the physical plan bottom-up, creates query stages for the materializable parts via createQueryStages, and then materializes those newly created query stages, which submits the map stages or starts the broadcasts. When a ShuffleQueryStage is materialized before a BroadcastQueryStage, the map job and the broadcast job are submitted almost at the same time, but the map job holds all of the computing resources. If the map job runs slowly (for example when there is a lot of data to process and resources are limited), the broadcast job cannot be started (and finished) within spark.sql.broadcastTimeout, so the whole job fails (introduced in SPARK-31475).
The workaround of increasing spark.sql.broadcastTimeout is neither reasonable nor graceful, because the data to broadcast is very small.
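For reference, the workaround mentioned above is just raising the timeout, e.g. as follows (the value here is arbitrary; the default is 300 seconds, and `spark` is assumed to be a SparkSession in scope):

```scala
// Raise the broadcast timeout (in seconds). This only hides the scheduling
// delay; the broadcast data itself is small, so it does not fix the root cause.
spark.conf.set("spark.sql.broadcastTimeout", "1200")
```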

Does this PR introduce any user-facing change?

NO

How was this patch tested?

  1. Added a UT.
  2. Tested the code in a dev environment; see https://issues.apache.org/jira/browse/SPARK-33933

@dongjoon-hyun
Member

The failure is relevant to AQE. Could you take a look at it, @zhongyu09 ?
(Screenshot of the test failure attached: Screen Shot 2021-01-07 at 4 19 54 PM)

@zhongyu09
Contributor Author

The failure is relevant to AQE. Could you take a look at it, @zhongyu09 ?
(Screenshot of the test failure attached: Screen Shot 2021-01-07 at 4 19 54 PM)

I will have a look.

@dongjoon-hyun
Member

Thanks!

@cloud-fan
Contributor

ok to test

@SparkQA

SparkQA commented Jan 8, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38409/

@SparkQA

SparkQA commented Jan 8, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38409/

@zhongyu09
Contributor Author

Hi @dongjoon-hyun
I ran the test tens of times and it failed twice.
As discussed with @viirya and @cloud-fan in #30998, the solution is not perfect. The order of the materialize calls guarantees the scheduling order of the jobs under normal circumstances, but the guarantee is not strict, because the broadcast job and the shuffle map job are submitted from different threads. So there is still a risk that the shuffle map job is scheduled before the broadcast job. I wonder whether we should remove the UT until we resolve the issue thoroughly.
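A self-contained toy illustration of that race (plain Scala futures, not Spark code): even if the broadcast stage's materialize() is called first, each job is handed to the scheduler from its own thread, so the later call can still reach the scheduler first.

```scala
import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global

// Simulated "submit to the scheduler": each call runs on its own pool thread,
// so the order in which the println lines appear is not guaranteed.
def submit(name: String): Future[Unit] = Future {
  println(s"$name job submitted")
}

submit("broadcast")    // called first...
submit("shuffle map")  // ...yet this one may reach the scheduler first
Thread.sleep(100)      // sketch only: give the pool threads time to run
```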

@SparkQA

SparkQA commented Jan 8, 2021

Test build #133820 has finished for PR 31084 at commit 03c9b09.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member

dongjoon-hyun commented Jan 8, 2021

Got it. I'm not sure about removing UTs. BTW, Jenkins passed at least.

WDYT, @cloud-fan and @HyukjinKwon ?

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-33933][SQL][3.0] Materialize BroadcastQueryStage first to avoid broadcast timeout in AQE [SPARK-33933][SQL][3.0][test-maven] Materialize BroadcastQueryStage first to avoid broadcast timeout in AQE Jan 8, 2021
@dongjoon-hyun
Member

Retest this please

@SparkQA

SparkQA commented Jan 8, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38446/

@SparkQA

SparkQA commented Jan 8, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38446/

@SparkQA

SparkQA commented Jan 9, 2021

Test build #133857 has finished for PR 31084 at commit 03c9b09.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member

The same test case failed in Jenkins Maven environment again. (https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/133857/testReport)

- SPARK-33933: AQE broadcast should not timeout with slow map tasks *** FAILED ***
  1693 was not greater than 2000 (AdaptiveQueryExecSuite.scala:926)
14:59:38.190 WARN org.apache.spark.sql.execution.adaptive.AdaptiveQueryExecSuite: 

@dongjoon-hyun
Member

I'm wondering if this test is stable in master and branch-3.1.

@zhongyu09
Contributor Author

I'm wondering if this test is stable in master and branch-3.1.

Yes, it is also not stable in master. I created a PR to remove the UT: #31099

@HyukjinKwon
Member

A partial fix is fine but let's make sure to mention/document what cases the partial fix does not cover.

@HyukjinKwon
Member

I reverted it for now for RC preparation. Let's make a PR clarifying which cases it doesn't cover and why this is a partial fix.

@zhongyu09
Contributor Author

I reverted it for now for RC preparation. Let's make a PR clarifying which cases it doesn't cover and why this is a partial fix.

For a partial fix, it is difficult to write a stable UT. I would rather provide a stable fix. I see two directions:

  1. Make sure the broadcast jobs are submitted before the shuffle map jobs: the materialize() call for non-broadcast query stages should wait until all of the broadcast jobs have been submitted (a rough sketch follows below).
  2. Exclude the scheduling time of the broadcast job when computing the timeout. This is very hard to measure. As a fallback, perhaps we could measure the time of the pure broadcast, i.e. minus the collect time, but that is also a big change, and it would affect non-AQE as well.

I prefer #1: it behaves more like non-AQE, it is this PR's original intention, and it has less impact on non-AQE.
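A self-contained toy sketch of direction 1 (hypothetical, not an actual patch): gate the materialization of non-broadcast stages on a latch that is released only once every broadcast job has been submitted.

```scala
import java.util.concurrent.CountDownLatch
import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global

val numBroadcastStages = 2                     // assumed number of broadcast stages
val broadcastsSubmitted = new CountDownLatch(numBroadcastStages)

def materializeBroadcast(id: Int): Future[Unit] = Future {
  println(s"broadcast job $id submitted")
  broadcastsSubmitted.countDown()              // mark this broadcast job as submitted
}

def materializeShuffleMap(id: Int): Future[Unit] = Future {
  broadcastsSubmitted.await()                  // wait until all broadcast jobs are submitted
  println(s"shuffle map job $id submitted")
}

(1 to numBroadcastStages).foreach(materializeBroadcast)
materializeShuffleMap(1)
Thread.sleep(200)                              // sketch only: let the pool threads finish
```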
