[SPARK-30359][CORE] Don't clear executorsPendingToRemove at the beginning of CoarseGrainedSchedulerBackend.reset #27017
Conversation
Test build #115812 has finished for PR 27017 at commit

Test build #115816 has finished for PR 27017 at commit

Test build #115837 has finished for PR 27017 at commit
core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala
```scala
    assert(sched.taskSetsFailed.contains(taskSet.id))
  }

  test("SPARK-30359: Don't clear executorsPendingToRemove in CoarseGrainedSchedulerBackend.reset")
```
nit: don't clean executorsPendingToRemove at the beginning of 'reset'. We do clear it eventually.
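For readers following along, here is a self-contained toy model of the point above (plain Scala, not Spark code; `ResetDemo` and its fields are illustrative stand-ins that mirror the names in the PR): the map is no longer cleared at the start of `reset()`, but each entry is still removed as `removeExecutor()` processes that executor, so it does end up empty eventually.

```scala
import scala.collection.mutable

// Toy model of the post-PR behavior; not the real CoarseGrainedSchedulerBackend.
object ResetDemo {
  val executorDataMap = mutable.HashMap("exec0" -> "data", "exec1" -> "data")
  val executorsPendingToRemove = mutable.HashMap("exec0" -> true)

  def removeExecutor(execId: String): Unit = {
    executorDataMap -= execId
    executorsPendingToRemove -= execId  // per-executor cleanup still happens here
  }

  def reset(): Unit = {
    // executorsPendingToRemove.clear()  // <- the call this PR removes
    executorDataMap.keys.toSeq.foreach(removeExecutor)
  }

  def main(args: Array[String]): Unit = {
    reset()
    assert(executorsPendingToRemove.isEmpty)  // cleared eventually, not up front
  }
}
```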
Test build #115842 has finished for PR 27017 at commit
jiangxb1987 left a comment
LGTM, only nits
```scala
      // use local-cluster mode in order to get CoarseGrainedSchedulerBackend
      .setMaster("local-cluster[2, 1, 2048]")
      // allow to set up at most two executors
      .set("spark.cores.max", "2")
```
why do we still need this config?
In order to create at most 2 executors at the beginning... though, this may not be necessary.
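For context on that exchange, my reading of what the two settings do together (an annotated sketch, not part of the patch):

```scala
import org.apache.spark.SparkConf

// local-cluster[N, C, M] starts N workers with C cores and M MiB of memory
// each, backing the app with a real CoarseGrainedSchedulerBackend;
// spark.cores.max then caps the app at 2 cores total, i.e. exactly the two
// single-core executors the test wants.
val conf = new SparkConf()
  .setMaster("local-cluster[2, 1, 2048]")
  .set("spark.cores.max", "2")
```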
```scala
    backend.reset()

    eventually(timeout(10.seconds), interval(100.milliseconds)) {
      // executorsPendingToRemove should still be empty after reset()
```
nit: still -> eventually
```scala
      assert(manager.invokePrivate(numFailures())(index0) === 0)
      assert(manager.invokePrivate(numFailures())(index1) === 1)
    }
    sc.stop()
```
This is not necessary because LocalSparkContext would stop it after each test case.
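For context: LocalSparkContext is the suite mixin being referred to; its cleanup is roughly the following (a sketch of its shape, not the exact trait from Spark's test code):

```scala
import org.apache.spark.SparkContext
import org.scalatest.{BeforeAndAfterEach, Suite}

// Approximate shape of the mixin: whatever context a test assigns to `sc`
// is stopped automatically after the test, so an explicit sc.stop() inside
// the test body is redundant.
trait LocalSparkContext extends BeforeAndAfterEach { self: Suite =>
  @transient var sc: SparkContext = _

  override def afterEach(): Unit = {
    try {
      if (sc != null) {
        sc.stop()
        sc = null
      }
    } finally {
      super.afterEach()
    }
  }
}
```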
Jenkins retest this please
```diff
-  test("SPARK-30359: Don't clear executorsPendingToRemove in CoarseGrainedSchedulerBackend.reset")
-  {
+  test("SPARK-30359: don't clean executorsPendingToRemove at the beginning of 'reset'") {
```
nit: you still need to mention CoarseGrainedSchedulerBackend.reset
Test build #115876 has finished for PR 27017 at commit

Test build #115877 has finished for PR 27017 at commit

Test build #115879 has finished for PR 27017 at commit
```scala
    // task0 on exec0 should not count failures
    backend.executorsPendingToRemove(exec0) = true
    // task1 on exec1 should count failures
```
what makes exec1 different from exec0, so that its failures are counted?
Here, executorsPendingToRemove(exec0) = true while executorsPendingToRemove(exec1) = false. false means the executor's crash may be related to bad tasks running on it, so those tasks should have their failures counted. true means the executor was killed by the driver on purpose, which has nothing to do with the tasks running on it.
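To make that rule concrete, a self-contained toy model (plain Scala; the names mirror the scheduler, but `PendingToRemoveDemo` and `onExecutorLost` are illustrative, not Spark APIs): a true entry means "killed by the driver", so the lost executor's tasks are not blamed; a false or missing entry means the crash may have been caused by its tasks, so their failure counts go up.

```scala
import scala.collection.mutable

object PendingToRemoveDemo {
  val executorsPendingToRemove = mutable.HashMap[String, Boolean]()
  val numFailures = mutable.HashMap[Int, Int]().withDefaultValue(0)

  // Roughly what happens when an executor is removed: if the driver killed
  // it on purpose (entry == true), its tasks' failures are not counted.
  def onExecutorLost(execId: String, taskIndex: Int): Unit = {
    val killedByDriver = executorsPendingToRemove.remove(execId).getOrElse(false)
    if (!killedByDriver) {
      numFailures(taskIndex) += 1  // crash may have been the task's fault
    }
  }

  def main(args: Array[String]): Unit = {
    executorsPendingToRemove("exec0") = true  // as in the test: driver-killed
    onExecutorLost("exec0", taskIndex = 0)
    onExecutorLost("exec1", taskIndex = 1)    // no entry: blame the task
    assert(numFailures(0) == 0 && numFailures(1) == 1)
  }
}
```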
jiangxb1987 left a comment
LGTM
Test build #116080 has finished for PR 27017 at commit

retest this please

Test build #116084 has finished for PR 27017 at commit
thanks, merging to master!
What changes were proposed in this pull request?

Remove `executorsPendingToRemove.clear()` from `CoarseGrainedSchedulerBackend.reset()`.

Why are the changes needed?

Clearing `executorsPendingToRemove` before removing the executors causes all tasks running on those "pending to remove" executors to count their failures, but that's wrong for the case of `executorsPendingToRemove(execId) = true`. Besides, `executorsPendingToRemove` will be cleaned up within `removeExecutor()` at the end, just the same as `executorsPendingLossReason`.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Added a new test in `TaskSetManagerSuite`.
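Stitching together the hunks quoted in the review above, the added test has roughly this shape (a reconstruction, not the verbatim patch; `sc`, `manager`, `numFailures`, `exec0`/`exec1`, and the task indices are assumed to come from the surrounding code in `TaskSetManagerSuite`, and the app name is a placeholder):

```scala
test("SPARK-30359: don't clean executorsPendingToRemove at the beginning of " +
    "CoarseGrainedSchedulerBackend.reset") {
  val conf = new SparkConf()
    // use local-cluster mode in order to get CoarseGrainedSchedulerBackend
    .setMaster("local-cluster[2, 1, 2048]")
    // allow to set up at most two executors
    .set("spark.cores.max", "2")
    .setAppName("test")  // placeholder app name
  sc = new SparkContext(conf)
  val backend = sc.schedulerBackend.asInstanceOf[CoarseGrainedSchedulerBackend]

  // ... submit a task set so that task0 runs on exec0 and task1 on exec1 ...

  // task0 on exec0 should not count failures
  backend.executorsPendingToRemove(exec0) = true
  // task1 on exec1 should count failures

  backend.reset()

  eventually(timeout(10.seconds), interval(100.milliseconds)) {
    // only the task on the executor that was not killed by the driver
    // accumulates a failure after reset() removes both executors
    assert(manager.invokePrivate(numFailures())(index0) === 0)
    assert(manager.invokePrivate(numFailures())(index1) === 1)
  }
}
```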