[KYUUBI #5136][Bug] Spark App may hang forever if FinalStageResourceManager killed all executors #5141
Conversation
```diff
-      adjustTargetNumExecutors = true,
+      adjustTargetNumExecutors = false,
       countFailures = false,
       force = false)
```
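For context, a minimal sketch of what the changed call amounts to, assuming `client` is the application's `ExecutorAllocationClient` and `executorsToKill` holds the executor ids selected by `FinalStageResourceManager` (both names are placeholders, not the exact code in this PR):

```scala
// Minimal sketch, not the exact Kyuubi implementation.
// Kill the selected executors without lowering the scheduler backend's target
// executor number, so ExecutorAllocationManager stays the single owner of that target.
client.killExecutors(
  executorsToKill,                  // ids chosen by FinalStageResourceManager (placeholder)
  adjustTargetNumExecutors = false, // do not touch the target executor number
  countFailures = false,            // deliberate kills are not task failures
  force = false)                    // do not kill executors that still run tasks
```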
Shall we call client.requestTotalExecutors to adjust the target executor number if targetExecutors < draTargetExecutors after killing executors?
We might face an issue similar to apache/spark#19048.
IMO, we just need to change adjustTargetNumExecutors from true to false and call client.requestTotalExecutors to ensure the final target executor number is as expected.
Agree, but building the parameters of client.requestTotalExecutors involves many details,
so I prefer to wait for ExecutorAllocationManager to call client.requestTotalExecutors itself.
As long as we do not call killExecutors with adjustTargetNumExecutors = true, ExecutorAllocationManager should be able to manage the target executor number correctly.
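For illustration only, a hedged sketch of the parameter building this refers to: on Spark 3.x the call is keyed by resource profile id, and `client` / `targetExecutors` are placeholders here. The exact signature varies across Spark versions, which is part of why waiting for ExecutorAllocationManager is simpler:

```scala
import org.apache.spark.resource.ResourceProfile

// Hedged sketch of a manual call; the locality-aware task counts and
// host-to-task maps below are deliberately left empty, which is exactly the
// bookkeeping ExecutorAllocationManager already maintains for us.
client.requestTotalExecutors(
  Map(ResourceProfile.DEFAULT_RESOURCE_PROFILE_ID -> targetExecutors),
  Map(ResourceProfile.DEFAULT_RESOURCE_PROFILE_ID -> 0),
  Map(ResourceProfile.DEFAULT_RESOURCE_PROFILE_ID -> Map.empty[String, Int]))
```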
```scala
if (draTargetExecutors <= targetExecutors) {
  // Ensure target executor number has been updated in cluster manager client
  executorAllocationClient.requestExecutors(0)
```
I'm not sure I understand this correctly. It seems we should do nothing if the DRA target executor number is smaller than our target executor number; if that happens, the executors are still pending to be killed.
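Restating the guard from the diff with explanatory comments (a sketch, not the merged code) to make the intent under discussion explicit:

```scala
// Sketch: if DRA's own target is already at or below the number of executors
// we keep alive, ask for 0 additional executors. The count does not change,
// but per the diff's comment the call is intended to push the current total
// target to the cluster manager client again, so a zeroed target cannot linger there.
if (draTargetExecutors <= targetExecutors) {
  executorAllocationClient.requestExecutors(0)
}
```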
Codecov Report

```
@@            Coverage Diff            @@
##            master    #5141    +/-  ##
=========================================
  Coverage     0.00%    0.00%
=========================================
  Files          566      566
  Lines        31590    31654    +64
  Branches      4120     4124     +4
=========================================
- Misses       31590    31654    +64
```

see 20 files with indirect coverage changes
```diff
       spark: '3.3'
       spark-archive: '-Dspark.archive.mirror=https://archive.apache.org/dist/spark/spark-3.1.3 -Dspark.archive.name=spark-3.1.3-bin-hadoop3.2.tgz -Pzookeeper-3.6'
-      exclude-tags: '-Dmaven.plugin.scalatest.exclude.tags=org.scalatest.tags.Slow,org.apache.kyuubi.tags.DeltaTest,org.apache.kyuubi.tags.IcebergTest'
+      exclude-tags: '-Dmaven.plugin.scalatest.exclude.tags=org.scalatest.tags.Slow,org.apache.kyuubi.tags.DeltaTest,org.apache.kyuubi.tags.IcebergTest,org.apache.kyuubi.tags.SparkLocalClusterTest'
```
A task serialized by the Spark 3.3 driver cannot be deserialized by a Spark 3.1.3 executor.
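A hedged sketch of how a suite can opt into the new exclusion, assuming `org.apache.kyuubi.tags.SparkLocalClusterTest` is a ScalaTest tag annotation like the existing Delta/Iceberg tags; the suite name and base class below are made up for illustration:

```scala
import org.apache.kyuubi.tags.SparkLocalClusterTest
import org.scalatest.funsuite.AnyFunSuite

// Tagged suites are skipped by the cross-version job above via
// -Dmaven.plugin.scalatest.exclude.tags=...,org.apache.kyuubi.tags.SparkLocalClusterTest
@SparkLocalClusterTest
class LocalClusterOnlyExampleSuite extends AnyFunSuite {
  test("only meaningful on a local Spark cluster") {
    assert(true)
  }
}
```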
cc @ulysses-you @pan3793 Ready for review.
```scala
eventually(timeout(Span(10, Minutes))) {
  sql(
    "CREATE TABLE final_stage AS SELECT id, count(*) as num FROM (SELECT 0 id) GROUP BY id")
}
```
Shall we call getAdjustedTargetExecutors at the end to make sure the target executor number is 1?
done
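A hedged sketch of what the added check might look like; where `getAdjustedTargetExecutors` lives and exactly what it returns are assumptions based on this conversation, not the merged test:

```scala
// Assumes the suite mixes in org.scalatest.concurrent.Eventually and imports
// org.scalatest.time.{Minutes, Span}, like the existing eventually block above.
eventually(timeout(Span(10, Minutes))) {
  // After the final write stage, DRA should have settled back to one executor.
  assert(getAdjustedTargetExecutors() === 1) // helper name taken from the review comment
}
```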
Thanks, merged to master
Why are the changes needed?
In rare cases, a Spark stage hangs forever when spark.sql.finalWriteStage.eagerlyKillExecutors.enabled is true.
The bug occurs when two conditions are met at the same time:
The target executor number in YarnAllocator is set to 0, so no more executors will be launched.
ExecutorAllocationManager will then not sync the target executor number to YarnAllocator.
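For readers reproducing the issue, a minimal, hedged setup that enables the feature involved; the app name and the dynamic-allocation flag are placeholders, only the `spark.sql.finalWriteStage.eagerlyKillExecutors.enabled` key comes from this PR:

```scala
import org.apache.spark.sql.SparkSession

// Sketch of a session where the final-stage eager-kill path can be triggered.
val spark = SparkSession.builder()
  .appName("final-stage-eager-kill-demo")            // placeholder name
  .config("spark.dynamicAllocation.enabled", "true") // DRA must be on for ExecutorAllocationManager to exist
  .config("spark.sql.finalWriteStage.eagerlyKillExecutors.enabled", "true")
  .getOrCreate()
```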
How was this patch tested?
FinalStageResourceManagerSuite