
Conversation

@zhongyu09
Contributor

What changes were proposed in this pull request?

In AdaptiveSparkPlanExec.getFinalPhysicalPlan, when newStages are generated, sort the new stages by class type so that BroadcastQueryStage instances precede the others.
This ensures the broadcast jobs are submitted before the map jobs, so the broadcast does not have to wait behind them in the scheduler and hit the broadcast timeout.
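
As a rough, self-contained illustration of the idea (toy types below, not the actual Spark classes), the reordering amounts to a stable sort that moves broadcast stages to the front while keeping the relative order of the rest:

```scala
// Toy sketch: BroadcastStage / ShuffleStage stand in for the real
// BroadcastQueryStageExec / ShuffleQueryStageExec classes.
sealed trait QueryStage
case class BroadcastStage(id: Int) extends QueryStage
case class ShuffleStage(id: Int) extends QueryStage

object ReorderNewStages extends App {
  val newStages: Seq[QueryStage] =
    Seq(ShuffleStage(1), BroadcastStage(2), ShuffleStage(3), BroadcastStage(4))

  // sortBy is stable, so broadcast stages move to the front and everything
  // else keeps its original relative order.
  val reordered = newStages.sortBy {
    case _: BroadcastStage => 0
    case _                 => 1
  }

  println(reordered)
  // List(BroadcastStage(2), BroadcastStage(4), ShuffleStage(1), ShuffleStage(3))
}
```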

Why are the changes needed?

When AQE is enabled, getFinalPhysicalPlan traverses the physical plan bottom up, creates query stages for the parts that can be materialized via createQueryStages, and then materializes those newly created query stages by submitting map stages or broadcast jobs. When a ShuffleQueryStage is materialized before a BroadcastQueryStage, the map job and the broadcast job are submitted almost at the same time, but the map job holds all the computing resources. If the map job runs slowly (lots of data to process and limited resources), the broadcast job cannot be started (and finished) before spark.sql.broadcastTimeout, which fails the whole query (introduced in SPARK-31475).
The workaround of increasing spark.sql.broadcastTimeout is neither reasonable nor graceful, because the data to broadcast is very small.

Does this PR introduce any user-facing change?

NO

How was this patch tested?

  1. Added a UT.
  2. Tested the code in a dev environment; see https://issues.apache.org/jira/browse/SPARK-33933

@HyukjinKwon
Member

cc @maryannxue and @cloud-fan FYI

@zhongyu09 zhongyu09 requested a review from cloud-fan January 4, 2021 08:12
@zhongyu09 zhongyu09 requested a review from viirya January 6, 2021 07:29
@zhongyu09
Contributor Author

Hi @cloud-fan @viirya, any other concerns? Can you approve the tests for this PR? (The old PR #30962 tests passed.)

@zhongyu09 zhongyu09 changed the title [SPARK-33933][SQL] Materialize BroadcastQueryState first to avoid broadcast timeout in AQE [SPARK-33933][SQL] Materialize BroadcastQueryStage first to avoid broadcast timeout in AQE Jan 6, 2021
@cloud-fan
Contributor

ok to test

Member

@viirya viirya left a comment

looks okay.

@SparkQA

SparkQA commented Jan 6, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38314/

@SparkQA

SparkQA commented Jan 6, 2021

Test build #133726 has finished for PR 30998 at commit 83e3e4e.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 6, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38314/

@SparkQA

SparkQA commented Jan 6, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38317/

@SparkQA

SparkQA commented Jan 6, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38317/

@SparkQA

SparkQA commented Jan 6, 2021

Test build #133729 has finished for PR 30998 at commit 7cbeb14.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

retest this please

@cloud-fan
Contributor

let's see if the new test is flaky or not.

@SparkQA

SparkQA commented Jan 6, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38336/

@SparkQA

SparkQA commented Jan 6, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38336/

@SparkQA

SparkQA commented Jan 6, 2021

Test build #133748 has finished for PR 30998 at commit 7cbeb14.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

thanks, merging to master/3.1!

@cloud-fan cloud-fan closed this in d36cdd5 Jan 7, 2021
cloud-fan pushed a commit that referenced this pull request Jan 7, 2021
…adcast timeout in AQE

### What changes were proposed in this pull request?
In AdaptiveSparkPlanExec.getFinalPhysicalPlan, when newStages are generated, sort the new stages by class type so that BroadcastQueryStage instances precede the others.
This ensures the broadcast jobs are submitted before the map jobs, so the broadcast does not have to wait behind them in the scheduler and hit the broadcast timeout.

### Why are the changes needed?
When AQE is enabled, getFinalPhysicalPlan traverses the physical plan bottom up, creates query stages for the parts that can be materialized via createQueryStages, and then materializes those newly created query stages by submitting map stages or broadcast jobs. When a ShuffleQueryStage is materialized before a BroadcastQueryStage, the map job and the broadcast job are submitted almost at the same time, but the map job holds all the computing resources. If the map job runs slowly (lots of data to process and limited resources), the broadcast job cannot be started (and finished) before spark.sql.broadcastTimeout, which fails the whole query (introduced in SPARK-31475).
The workaround of increasing spark.sql.broadcastTimeout is neither reasonable nor graceful, because the data to broadcast is very small.

### Does this PR introduce _any_ user-facing change?
NO

### How was this patch tested?
1. Add UT
2. Test the code using dev environment in https://issues.apache.org/jira/browse/SPARK-33933

Closes #30998 from zhongyu09/aqe-broadcast.

Authored-by: Yu Zhong <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit d36cdd5)
Signed-off-by: Wenchen Fan <[email protected]>
@zhongyu09
Contributor Author

Thanks @cloud-fan, why not merge to branch 3.0?

@cloud-fan
Contributor

@zhongyu09 can you open a backport PR for 3.0? There are many AQE code changes in this release.

@zhongyu09
Contributor Author

> @zhongyu09 can you open a backport PR for 3.0? There are many AQE code changes in this release.

Sure, just like the old one, #30962, right?

I'm just puzzled why we don't merge directly to branch 3.0.

@cloud-fan
Contributor

3.0 code base is very different from 3.1, and I'm afraid the test may fail. It's safer to make a PR to make sure all tests pass.

@zhongyu09
Contributor Author

> 3.0 code base is very different from 3.1, and I'm afraid the test may fail. It's safer to make a PR to make sure all tests pass.

#31084

}
}

test("SPARK-33933: AQE broadcast should not timeout with slow map tasks") {

Member

Hmm, so it is still flaky.

Contributor Author

I ran the test tens of times and it failed twice. As we discussed, the guarantee is not strict, since the broadcast job and the shuffle map job are submitted from different threads, so there is still a risk that the shuffle map job is scheduled before the broadcast job (see the sketch below). I wonder whether we should remove the UT until we thoroughly resolve the issue.

CC @cloud-fan @dongjoon-hyun
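
For context, a small self-contained sketch (plain Scala, not Spark code) of the kind of race described here: work handed to a separate thread pool first can still start after work submitted directly from the calling thread.

```scala
import java.util.concurrent.{Executors, TimeUnit}
import scala.collection.concurrent.TrieMap

object SubmissionRaceDemo extends App {
  val pool   = Executors.newSingleThreadExecutor() // stands in for "broadcast-exchange-0"
  val starts = TrieMap.empty[String, Long]

  // 1. "Broadcast" work: handed off to the pool first, in program order.
  pool.submit(new Runnable {
    def run(): Unit = starts.putIfAbsent("broadcast", System.nanoTime())
  })

  // 2. "Map stage" work: done directly on the main thread right after.
  starts.putIfAbsent("map", System.nanoTime())

  pool.shutdown()
  pool.awaitTermination(1, TimeUnit.SECONDS)

  // Usually "broadcast" comes first, but nothing forces the pool thread to run
  // before the main thread continues, which is why the ordering is best-effort.
  println(starts.toSeq.sortBy(_._2).map(_._1).mkString(" -> "))
}
```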

Contributor

I ran SPARK-33933: AQE broadcast should not timeout with slow map tasks locally 5 times and all runs failed as follows:

- SPARK-33933: AQE broadcast should not timeout with slow map tasks *** FAILED ***
  1751 was not greater than 2000 (AdaptiveQueryExecSuite.scala:1454)

Contributor

> Hmm, so it is still flaky.

yes

Contributor

The Spark conf was changed to local[2], so the runs are faster than before.

This shows the test is unreliable...

Checking the Spark job submission order should be easy to do and fast to run, and with retries it should be unlikely to fail. It's better to check the stage submission order directly, if we can figure out how to do it.

Contributor Author

Yes, I know; the failure reported by @LuciferYang is easy to solve.
But the problem is that the job submission order may still not be correct, like in the test in #31084.

Contributor Author

@cloud-fan You mean relying on "with retry it should be unlikely to fail" to handle the edge case?

Contributor Author

I am trying to create a new PR. Sorry for the inconvenience.

Contributor Author

We can get the stage submission time using a SparkListener. But after trying several times, the stage submission time is not stable, so the UT cannot always pass. I suggest removing the UT until we completely solve the issue: #31099
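
For reference, a rough sketch of that listener-based check (an assumed shape only, not the test code that was actually tried): the listener records a wall-clock timestamp for each submitted stage so the order of the broadcast and map stages can be compared.

```scala
import org.apache.spark.scheduler.{SparkListener, SparkListenerStageSubmitted}
import scala.collection.mutable

// Records a wall-clock timestamp per submitted stage id.
class StageSubmissionListener extends SparkListener {
  val submittedAt = mutable.LinkedHashMap.empty[Int, Long]

  override def onStageSubmitted(event: SparkListenerStageSubmitted): Unit =
    submittedAt.synchronized {
      submittedAt(event.stageInfo.stageId) = System.currentTimeMillis()
    }
}

// Usage sketch (assuming an existing SparkSession `spark`):
//   val listener = new StageSubmissionListener
//   spark.sparkContext.addSparkListener(listener)
//   ... run the AQE query under test ...
//   // then assert on the relative order of the broadcast stage vs the map stage
```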

@HyukjinKwon
Member

HyukjinKwon commented Jan 10, 2021

Guys, let me revert this one. It causes test failures too often, and it blocks RC preparation.

The flakiness is more obvious when you check the jobs here: https://amplab.cs.berkeley.edu/jenkins/, and it blocks me from checking the test results from PySpark or SparkR, at least for the RC.

@viirya
Member

viirya commented Jan 10, 2021

+1 for reverting it.

@zhongyu09
Contributor Author

+1

@zhongyu09
Contributor Author

Created another PR: #31167

@HyukjinKwon
Member

Thanks @zhongyu09

cloud-fan pushed a commit that referenced this pull request Jan 22, 2021
…oid broadcast timeout in AQE

### What changes were proposed in this pull request?
This PR is the same as #30998, but with a better UT.
In AdaptiveSparkPlanExec.getFinalPhysicalPlan, when newStages are generated, sort the new stages by class type so that BroadcastQueryStage instances precede the others.
This partial fix only guarantees that the materialization of BroadcastQueryStage starts before the others; because the collect job for broadcasting is submitted from another thread, the issue is not completely solved.

### Why are the changes needed?
When AQE is enabled, getFinalPhysicalPlan traverses the physical plan bottom up, creates query stages for the parts that can be materialized via createQueryStages, and then materializes those newly created query stages by submitting map stages or broadcast jobs. When a ShuffleQueryStage is materialized before a BroadcastQueryStage, the map stage (job) and the broadcast job are submitted almost at the same time, but the map stage holds all the computing resources. If the map stage runs slowly (lots of data to process and limited resources), the broadcast job cannot be started (and finished) before spark.sql.broadcastTimeout, which fails the whole query (introduced in SPARK-31475).
The workaround of increasing spark.sql.broadcastTimeout is neither reasonable nor graceful, because the data to broadcast is very small.

The order of calling materialize determines the order in which tasks are scheduled under normal circumstances, but the guarantee is not strict, since the broadcast job and the shuffle map job are submitted from different threads:
1. For the broadcast job, doPrepare() is called in the main thread, and the real materialization then starts in the "broadcast-exchange-0" thread pool, which calls getByteArrayRdd().collect() to submit the collect job.
2. For the shuffle map job, ShuffleExchangeExec.mapOutputStatisticsFuture() is called, which calls sparkContext.submitMapStage() directly in the main thread to submit the map stage.

Step 1 is triggered before step 2, so in normal cases the broadcast job is submitted first.
However, we cannot control how fast the two threads run, so the "broadcast-exchange-0" thread could run a little slower than the main thread, resulting in the map stage being submitted first. So there is still a risk that the shuffle map job is scheduled before the broadcast job.

Since completely fixing the issue is complex and might introduce major changes, we need more time to follow up. This partial fix is better than doing nothing; it resolves most of the cases in SPARK-33933.

### Does this PR introduce _any_ user-facing change?
NO

### How was this patch tested?
Add UT

Closes #31269 from zhongyu09/aqe-broadcast-partial-fix.

Authored-by: Yu Zhong <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
skestle pushed a commit to skestle/spark that referenced this pull request Feb 3, 2021
…oid broadcast timeout in AQE

domybest11 pushed a commit to domybest11/spark that referenced this pull request Jun 15, 2022
…oid broadcast timeout in AQE
