[SPARK-31485][CORE][FOLLOW-UP] Also refer blacklisting in error message if barrier scheduling fail #28476
cc people involved in the last PR: @cloud-fan @tgravescs @mridulm @holdenk @jiangxb1987
s" tasks got resource offers. This could happen if delay scheduling or " +
s"blacklisting is enabled, as barrier execution currently does not work " +
s"gracefully with them. We highly recommend you to disable delay scheduling " +
s"by setting spark.locality.wait=0 or disable blacklisting by setting " +
I'm not sure we should recommend disabling blacklisting. If an executor or host was blacklisted, the assumption is that something was wrong with it, so the job would probably fail anyway. Maybe we should just say it may have been blacklisted, like before.
Thanks for the review.
I realized that blacklisting actually has no effect on a barrier task set. As you may know, blacklisting only takes effect once a task has failed. But a barrier task set is marked as zombie as soon as any task fails, and we don't update the blacklist for a zombie task set, see:
spark/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala
Lines 885 to 894 in 8b48629
if (tasks(index).isBarrier) {
  isZombie = true
}
sched.dagScheduler.taskEnded(tasks(index), reason, null, accumUpdates, metricPeaks, info)
if (!isZombie && reason.countTowardsTaskFailures) {
  assert (null != failureReason)
  taskSetBlacklistHelperOpt.foreach(_.updateBlacklistForFailedTask(
    info.host, info.executorId, index, failureReason))
So I think blacklisting won't actually cause only part of the tasks to be launched. I will close this PR.
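The argument above can be checked with a small stand-alone model. This is only a sketch of the control flow in the quoted snippet, not Spark code: `TaskSetModel`, `handleFailedTask`, and `blacklistUpdates` are hypothetical names, and the call to the DAG scheduler is left out.

```scala
import scala.collection.mutable.ListBuffer

// Hypothetical, simplified model of the failure path quoted above.
// Only the isBarrier / isZombie / countTowardsTaskFailures logic mirrors the snippet.
final class TaskSetModel(isBarrier: Boolean) {
  var isZombie: Boolean = false

  // (host, executorId, taskIndex) entries that would be passed to the blacklist helper.
  val blacklistUpdates: ListBuffer[(String, String, Int)] = ListBuffer.empty

  def handleFailedTask(
      index: Int,
      host: String,
      executorId: String,
      countTowardsTaskFailures: Boolean = true): Unit = {
    // A barrier task set is marked as zombie on its first failed task.
    if (isBarrier) {
      isZombie = true
    }
    // Blacklist bookkeeping is skipped for a zombie task set.
    if (!isZombie && countTowardsTaskFailures) {
      blacklistUpdates += ((host, executorId, index))
    }
  }
}

object ZombieDemo {
  def main(args: Array[String]): Unit = {
    val barrier = new TaskSetModel(isBarrier = true)
    barrier.handleFailedTask(0, "host-1", "exec-1")

    val regular = new TaskSetModel(isBarrier = false)
    regular.handleFailedTask(0, "host-1", "exec-1")

    println(s"barrier: zombie=${barrier.isZombie}, updates=${barrier.blacklistUpdates.size}")
    println(s"regular: zombie=${regular.isZombie}, updates=${regular.blacklistUpdates.size}")
  }
}
```

In this model the barrier task set never records a blacklist entry, because it is already a zombie by the time the check runs, while the regular task set records one.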
+1
Test build #122406 has finished for PR 28476 at commit
s"blacklisting is enabled, as barrier execution currently does not work " +
s"gracefully with them. We highly recommend you to disable delay scheduling " +
s"by setting spark.locality.wait=0 or disable blacklisting by setting " +
s"spark.blacklist.enabled=false as a workaround if you see this error frequently."
You might also want to check whether blacklisting is actually enabled here.
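One way to read this suggestion is to mention the blacklisting workaround only when blacklisting is turned on. The sketch below is hypothetical and not part of the PR: it uses a plain `Map` in place of `SparkConf`, assumes an unset `spark.blacklist.enabled` means disabled, and the `BarrierErrorMessage` object and its methods are made-up names.

```scala
object BarrierErrorMessage {
  // Hypothetical helper: `conf` is a plain Map standing in for SparkConf.
  // Assumption: an unset spark.blacklist.enabled is treated as disabled.
  def blacklistEnabled(conf: Map[String, String]): Boolean =
    conf.get("spark.blacklist.enabled").exists(_.trim.toLowerCase == "true")

  // Builds the recommendation part of the message; the blacklisting
  // workaround is appended only when blacklisting is enabled.
  def hint(conf: Map[String, String]): String = {
    val delayPart = "disable delay scheduling by setting spark.locality.wait=0"
    val blacklistPart =
      if (blacklistEnabled(conf)) {
        " or disable blacklisting by setting spark.blacklist.enabled=false"
      } else {
        ""
      }
    s"We highly recommend you to $delayPart$blacklistPart " +
      "as a workaround if you see this error frequently."
  }
}
```

With an empty configuration the hint mentions only `spark.locality.wait=0`; with `spark.blacklist.enabled` set to `true` it also mentions `spark.blacklist.enabled=false`.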
What changes were proposed in this pull request?
This PR mentions blacklisting as one of the possible failure reasons in the error message shown when barrier scheduling fails.
Why are the changes needed?
It's possible that only some of the tasks in a barrier task set get scheduled because of task-set-level blacklisting:
spark/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala
Lines 420 to 424 in 263f04d
The original error message (before #28257) was actually more or less correct. We should follow it here as well.
Does this PR introduce any user-facing change?
Yes. Users will now also see blacklisting (besides delay scheduling) listed as a possible reason for the failure.
How was this patch tested?
Pass Jenkins.