Skip to content

Conversation

@zhouyifan279
Copy link
Contributor

@zhouyifan279 zhouyifan279 commented Aug 8, 2023

Why are the changes needed?

In minor cases, Spark Stage hangs forever when spark.sql.finalWriteStage.eagerlyKillExecutors.enabled is true.

The bug occurs if two conditions are met in the same time:

  1. All executors are either removed because of idle time out or killed by FinalStageResourceManager.
    Target executor num in YarnAllocator will be set to 0 and no more executor will be launched.
  2. Target executor num in ExecutorAllocationManager equals to the executor num needed by final stage.
    Then ExecutorAllocationManager will not sync target executor num to YarnAllocator.

How was this patch tested?

  • Add a new test suite FinalStageResourceManagerSuite

@zhouyifan279 zhouyifan279 changed the title [KYUUBI #5136][Bug] Some Spark App may hang forever when FinalStageRe… [WIP][KYUUBI #5136][Bug] Some Spark App may hang forever when FinalStageRe… Aug 8, 2023
adjustTargetNumExecutors = true,
adjustTargetNumExecutors = false,
countFailures = false,
force = false)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall we call client.requestTotalExecutors to adjust the target executor if targetExecutors < draTargetExecutors after kill executors ?

We might face a issue similar with apache/spark#19048

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO, we just need to tune adjustTargetNumExecutors from true to false and call client.requestTotalExecutors to ensure the final target executor is expected.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree. But build parameters of client.requestTotalExecutors involves many details.
So I prefer to wait ExecutorAllocationManager to call client.requestTotalExecutors.

As long as we did not call killExecutors with adjustTargetNumExecutors = true, ExecutorAllocationManager should be able to manage target executor num correctly.

@zhouyifan279 zhouyifan279 changed the title [WIP][KYUUBI #5136][Bug] Some Spark App may hang forever when FinalStageRe… [WIP][KYUUBI #5136][Bug] Spark Application may hang forever after FinalStageResourceManager kills all the executors Aug 8, 2023

if (draTargetExecutors <= targetExecutors) {
// Ensure target executor number has been updated in cluster manager client
executorAllocationClient.requestExecutors(0)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I get it correctly. It seems we should do nothing if the dra target executor is smaller than our target executor. If it happens, the executors are pending to kill.

@codecov-commenter
Copy link

codecov-commenter commented Aug 8, 2023

Codecov Report

Merging #5141 (ea8f247) into master (9a001c8) will not change coverage.
Report is 13 commits behind head on master.
The diff coverage is n/a.

❗ Current head ea8f247 differs from pull request most recent head c4403ee. Consider uploading reports for the commit c4403ee to get more accurate results

@@          Coverage Diff           @@
##           master   #5141   +/-   ##
======================================
  Coverage    0.00%   0.00%           
======================================
  Files         566     566           
  Lines       31590   31654   +64     
  Branches     4120    4124    +4     
======================================
- Misses      31590   31654   +64     

see 20 files with indirect coverage changes

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@zhouyifan279 zhouyifan279 changed the title [WIP][KYUUBI #5136][Bug] Spark Application may hang forever after FinalStageResourceManager kills all the executors [WIP][KYUUBI #5136][Bug] Spark App may hang forever if FinalStageResourceManager killed all executors Aug 9, 2023
@zhouyifan279 zhouyifan279 changed the title [WIP][KYUUBI #5136][Bug] Spark App may hang forever if FinalStageResourceManager killed all executors [KYUUBI #5136][Bug] Spark App may hang forever if FinalStageResourceManager killed all executors Aug 9, 2023
@github-actions github-actions bot added the kind:infra license, community building, project builds, asf infra related, etc. label Aug 15, 2023
spark: '3.3'
spark-archive: '-Dspark.archive.mirror=https://archive.apache.org/dist/spark/spark-3.1.3 -Dspark.archive.name=spark-3.1.3-bin-hadoop3.2.tgz -Pzookeeper-3.6'
exclude-tags: '-Dmaven.plugin.scalatest.exclude.tags=org.scalatest.tags.Slow,org.apache.kyuubi.tags.DeltaTest,org.apache.kyuubi.tags.IcebergTest'
exclude-tags: '-Dmaven.plugin.scalatest.exclude.tags=org.scalatest.tags.Slow,org.apache.kyuubi.tags.DeltaTest,org.apache.kyuubi.tags.IcebergTest,org.apache.kyuubi.tags.SparkLocalClusterTest'
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Task serialized by Spark 3.3 Driver can not be deserialized by Spark 3.1.3 Executor

@zhouyifan279
Copy link
Contributor Author

cc @ulysses-you @pan3793 Ready for review.

eventually(timeout(Span(10, Minutes))) {
sql(
"CREATE TABLE final_stage AS SELECT id, count(*) as num FROM (SELECT 0 id) GROUP BY id")
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shall we getAdjustedTargetExecutors at the end to make sure the target executor number is 1 ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@pan3793 pan3793 added this to the v1.8.0 milestone Aug 16, 2023
@pan3793 pan3793 closed this in d513f1f Aug 16, 2023
@pan3793
Copy link
Member

pan3793 commented Aug 16, 2023

Thanks, merged to master

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

kind:build kind:infra license, community building, project builds, asf infra related, etc. module:extensions module:spark

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants