
Conversation


@Ngone51 Ngone51 commented May 7, 2020

What changes were proposed in this pull request?

This PR mentions blacklisting as one of the possible failure reasons in the error message shown when barrier scheduling fails.

Why are the changes needed?

Partial task scheduling can happen in a barrier task set because of taskset-level blacklisting:

val offerBlacklisted = taskSetBlacklistHelperOpt.exists { blacklist =>
  blacklist.isNodeBlacklistedForTaskSet(host) ||
    blacklist.isExecutorBlacklistedForTaskSet(execId)
}
if (!isZombie && !offerBlacklisted) {
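
The check above can be illustrated with a minimal sketch (hypothetical names, not Spark's actual scheduler code): when some offers are rejected by taskset-level blacklisting, a barrier stage that needs one slot per task may be left with fewer usable slots than tasks, so only a subset could ever be launched.

```scala
// Minimal sketch, assuming a simplified Offer model: taskset-level
// blacklisting filters out some resource offers, which can leave a
// barrier stage with fewer usable slots than it has tasks.
object PartialBarrierLaunch {
  case class Offer(host: String, execId: String)

  // Mirrors the offerBlacklisted check: drop offers whose host or
  // executor is blacklisted for this task set.
  def usableOffers(offers: Seq[Offer],
                   blacklistedHosts: Set[String],
                   blacklistedExecs: Set[String]): Seq[Offer] =
    offers.filterNot { o =>
      blacklistedHosts.contains(o.host) || blacklistedExecs.contains(o.execId)
    }

  def main(args: Array[String]): Unit = {
    val offers = Seq(Offer("host1", "exec1"), Offer("host2", "exec2"), Offer("host3", "exec3"))
    val usable = usableOffers(offers, blacklistedHosts = Set("host2"), blacklistedExecs = Set.empty)
    val numBarrierTasks = 3
    // Only 2 of the 3 required slots remain, so the barrier stage cannot fully launch.
    println(s"usable=${usable.size}, required=$numBarrierTasks, canLaunch=${usable.size >= numBarrierTasks}")
  }
}
```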

The original error message (before #28257) was actually correct in this regard. We should follow it here as well.

Does this PR introduce any user-facing change?

Yes. Users will also see blacklisting (besides delay scheduling) listed as a possible reason for the failure.

How was this patch tested?

Pass Jenkins.


Ngone51 commented May 7, 2020

cc people involved in last PR: @cloud-fan @tgravescs @mridulm @holdenk @jiangxb1987

s" tasks got resource offers. This could happen if delay scheduling or " +
s"blacklisting is enabled, as barrier execution currently does not work " +
s"gracefully with them. We highly recommend you to disable delay scheduling " +
s"by setting spark.locality.wait=0 or disable blacklisting by setting " +
Contributor


I'm not sure we should recommend disabling blacklisting. If something was blacklisted, the assumption is that something was wrong with it, so the job would probably fail anyway. Maybe we should just say it may have been blacklisted, like before.


@Ngone51 Ngone51 May 8, 2020


Thanks for the review.

I realized that blacklisting actually does not work for a barrier task set. As you may know, blacklisting only takes effect when there's a failed task. But a barrier task set is marked as zombie once any of its tasks fails, and we don't update the blacklist for a zombie task set, see:

if (tasks(index).isBarrier) {
  isZombie = true
}
sched.dagScheduler.taskEnded(tasks(index), reason, null, accumUpdates, metricPeaks, info)
if (!isZombie && reason.countTowardsTaskFailures) {
  assert(null != failureReason)
  taskSetBlacklistHelperOpt.foreach(_.updateBlacklistForFailedTask(
    info.host, info.executorId, index, failureReason))

So I think blacklisting actually can't cause partial task launching. I will close this PR.
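
The control flow quoted above can be sketched as follows (a simplified stand-in, not the real TaskSetManager): for a barrier task, `isZombie` is set to `true` before the blacklist-update check runs, so the update is never reached.

```scala
// Sketch of the ordering in the quoted snippet, under simplified
// assumptions: a failed barrier task turns the task set into a zombie
// first, and the blacklist is only updated for non-zombie task sets,
// so the blacklist update is unreachable for barrier tasks.
object BarrierZombieSketch {
  var isZombie = false
  var blacklistUpdated = false

  def handleFailedTask(isBarrier: Boolean, countsTowardFailures: Boolean): Unit = {
    if (isBarrier) {
      isZombie = true // any failed barrier task makes the whole task set a zombie
    }
    if (!isZombie && countsTowardFailures) {
      blacklistUpdated = true // never reached when isBarrier is true
    }
  }

  def main(args: Array[String]): Unit = {
    handleFailedTask(isBarrier = true, countsTowardFailures = true)
    println(s"isZombie=$isZombie, blacklistUpdated=$blacklistUpdated")
  }
}
```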

Contributor


+1

@SparkQA

SparkQA commented May 7, 2020

Test build #122406 has finished for PR 28476 at commit 9df2bf0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@Ngone51 Ngone51 closed this May 8, 2020
s"blacklisting is enabled, as barrier execution currently does not work " +
s"gracefully with them. We highly recommend you to disable delay scheduling " +
s"by setting spark.locality.wait=0 or disable blacklisting by setting " +
s"spark.blacklist.enabled=false as a workaround if you see this error frequently."
Contributor


You might also want to check whether blacklisting is actually enabled here.

