[SPARK-31837][CORE] Shift to the new highest locality level if there is when recomputeLocality #28656
Conversation
|
ping @bmarcott @cloud-fan @tgravescs also cc @jiangxb1987 @dongjoon-hyun since this is the root cause of the flakiness in #28584 |
if (currentLocalityIndex > previousLocalityIndex) {
  // SPARK-31837: there's new higher level locality, so shift to
  // the highest locality level in terms of better data locality
  currentLocalityIndex = 0
My only concern with this change is, for general cases, whether the task set would take more time on delay scheduling and thus reduce the throughput of the cluster. But considering the improvement we gain (increased cluster throughput) from the new version of delay scheduling, I think it should not be a big problem.
I have concerns with this. This is going to reset it back to highest level all the time and then you will have to work your way back through the levels, waiting at each level. This could really delay things a lot if you are down to the ANY level and you have to wait 3 seconds between each one. The problem you are describing is only on startup when you have no executors, correct? Perhaps we can look at something more specific for that.
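For context on the 3-second waits mentioned here: the per-level delay-scheduling waits are driven by the spark.locality.wait settings. Below is a minimal sketch of looking them up using only the public SparkConf API; the helper name localityWait is hypothetical, not Spark's internal getLocalityWait.

```scala
import org.apache.spark.SparkConf

// Per-level wait falls back to the global spark.locality.wait (3s by default).
def localityWait(conf: SparkConf, level: String): String =
  conf.get(s"spark.locality.wait.$level", conf.get("spark.locality.wait", "3s"))

val conf = new SparkConf()
localityWait(conf, "node")  // "3s" unless spark.locality.wait or spark.locality.wait.node is set
```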
localityWaits = myLocalityLevels.map(getLocalityWait)
currentLocalityIndex = getLocalityIndex(previousLocalityLevel)
if (currentLocalityIndex > previousLocalityIndex) {
  // SPARK-31837: there's new higher level locality, so shift to
level locality -> locality level
|
Can you describe the problem in more detail? Is it that these tasks are coming in with preferred locality, but by the time you get an executor it has already waited the timeout for each of process, node, and rack, so you get to ANY? |
|
This reminds me of #27207 (comment). It's not a big problem, as we only violate delay scheduling in the first resource offer. But still good to fix it. |
|
@tgravescs
And then, when recomputeLocality runs, the TaskSetManager sticks to the previous locality level, see spark/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala, Lines 1107 to 1114 in 8b48629.
As a result, the TaskSetManager can only schedule tasks at the ANY level. (I'll update the PR description with a more detailed explanation later.) |
|
Right, when you have no tasks yet it gets set to ANY, but this wasn't changed; it was like that before (before the locality changes too), and when an executor was added it never reset it back to the highest, it just called recomputeLocality again, which would have left it at ANY as well. The issue I see with resetting it on every recomputeLocality is when you have dynamic allocation and you are adding a bunch of executors. You basically revert back to something similar to the old behavior, where you could really be delaying instead of using the executors you already have. You are delaying tasks that could be running more than we should and wasting the executor resources you have. Let me look into this a bit more. |
Newly allocated executors mostly don't have any local data that would trigger the level reset. To me, this is mostly about the first resource offer to a task set manager. |
|
That isn't true if you are on something like YARN and HDFS, where data can be on the node. Ideally, I think we reset it when an executor is added if there aren't any yet, or if all executors were used and we added one; I just need time to look through the code. |
|
Test build #123195 has finished for PR 28656 at commit
|
|
Test build #123193 has finished for PR 28656 at commit
|
tgravescs left a comment
So another thought here is to only reset this when the locality levels actually changed. Meaning, if we already have the levels of node and any, we don't need to reset it because we already waited the time for the node wait. We only need to set it back to index 0 when we were at just ANY and we have now added in node, i.e. if the new computed levels are more local than the current ones. Even that may be a bit aggressive, but I think it is better than resetting it any time we are at a less local state.
This gave me a few ideas, though, and I thought of a few corner cases we should try to handle better. I can file a follow-on JIRA for that though.
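A minimal sketch of the condition suggested here, i.e. only resetting when recomputation added a level that is more local than the one the task set is currently waiting at. The shouldReset helper is hypothetical, not code from this PR.

```scala
import org.apache.spark.scheduler.TaskLocality._

def shouldReset(
    previousLevels: Seq[TaskLocality],
    newLevels: Seq[TaskLocality],
    currentLevel: TaskLocality): Boolean =
  // Did recomputation introduce a level more local than where we currently wait?
  newLevels.diff(previousLevels).exists(_ < currentLevel)

shouldReset(Seq(ANY), Seq(NODE_LOCAL, ANY), ANY)                          // true: NODE was added
shouldReset(Seq(NODE_LOCAL, ANY), Seq(NODE_LOCAL, RACK_LOCAL, ANY), ANY)  // true: RACK was added
shouldReset(Seq(NODE_LOCAL, ANY), Seq(NODE_LOCAL, ANY), ANY)              // false: nothing changed
```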
myLocalityLevels = computeValidLocalityLevels()
localityWaits = myLocalityLevels.map(getLocalityWait)
currentLocalityIndex = getLocalityIndex(previousLocalityLevel)
if (currentLocalityIndex > previousLocalityIndex) {
I think we should change this to only happen when executorAdded. This is also called on lost and decommission, and it doesn't make sense to go to a "lower" level there. Note we may want to stay away from saying higher in the comment below. In the code values, lower is actually more strict, meaning process is the lowest value. So perhaps don't say higher or lower, but say more local or less local. Perhaps pass in a parameter from executorAdded.
I think we should change this to only happen when executorAdded. this is also called on lost and decommission and it doesn't make sense to go to "lower" level.
I think it's impossible to go to a "lower", i.e. more local, level in the case of lost and decommission. Lost and decommission remove executors, so the locality levels can only become less local compared to the previous locality levels. It also means lost and decommission will not add new, more local levels.
so perhaps don't say higher or lower but say more local or less local.
Yea, good point!
Sure. |
I think the current condition( |
|
Test build #123225 has finished for PR 28656 at commit
|
|
Ah, sorry, I missed that, it is covered. This looks good to me then. I can file another JIRA to investigate possibly handling this better. |
localityWaits = myLocalityLevels.map(getLocalityWait)
currentLocalityIndex = getLocalityIndex(previousLocalityLevel)
if (currentLocalityIndex > previousLocalityIndex) {
  // SPARK-31837: there's new higher locality level, so shift to
how about "If the new level is more local, shift to the most local level"?
|
can we also remove the hack mentioned in #27207 (comment)? |
|
@Ngone51 Let's make a few corrections to the description:
I do see how this handles the race condition on startup, although a more specific handling of that case may be preferable (as @tgravescs also mentioned). Deferring to @tgravescs and @cloud-fan to make the call here. |
|
I filed a follow-up JIRA, https://issues.apache.org/jira/browse/SPARK-31856?filter=-2, to look at handling the executor-added case better; this fix works for now. |
// level in terms of better data locality. For example, say the previous locality
// levels are [PROCESS, NODE, ANY] and current level is ANY. After recompute, the
// locality levels are [PROCESS, NODE, RACK, ANY]. Then, we'll shift to RACK level.
currentLocalityIndex = getLocalityIndex(myLocalityLevels.diff(previousMyLocalityLevels).head)
Hi all, there's a defect in the previous implementation (always resetting currentLocalityIndex to 0). Think about such a case: say we have locality levels [PROCESS, NODE, ANY] and the current locality level is ANY. After recompute, we might have locality levels [PROCESS, NODE, RACK, ANY]. In this case, I think we'd better shift to the RACK level instead of the PROCESS level, since the TaskSetManager has already been delayed for a while on the known levels (PROCESS, NODE). With this update, I think it could also ease our concern about the possible perf regression introduced by aggressive locality level resetting. @bmarcott @tgravescs @cloud-fan
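A quick standalone illustration of the diff-based shift in the example above, using the public TaskLocality values; this is a snippet for illustration, not the TaskSetManager code itself.

```scala
import org.apache.spark.scheduler.TaskLocality._

val previousLevels = Seq(PROCESS_LOCAL, NODE_LOCAL, ANY)
val newLevels      = Seq(PROCESS_LOCAL, NODE_LOCAL, RACK_LOCAL, ANY)

// The most local level that recomputation actually added:
newLevels.diff(previousLevels).head   // RACK_LOCAL, not PROCESS_LOCAL
```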
Yes, this is one of the cases I was referring to. Ideally, you would never run into this case, because a host is on a rack so you would always have it. Unfortunately, Spark defaults the rack to None, so you can. I was going to improve upon it in the JIRA I filed. We can certainly handle some here if you want
We can certainly handle some here if you want
What do you mean by "handle some here"? I read your JIRA and didn't find a specific solution that could be added to this PR. Could you please elaborate?
I just mean there are a bunch of corner cases, and I don't think resetting this on every executor added is ideal. I did not list out all of them. On YARN it actually defaults to default-rack rather than None, so it actually won't have this issue, because every node has a rack. I agree that the code you have here is an improvement to handle the rack case.
|
Test build #123273 has finished for PR 28656 at commit
|
|
retest this please. |
|
Test build #123290 has finished for PR 28656 at commit
|
|
retest this please |
|
Test build #123325 has finished for PR 28656 at commit
|
|
retest this please |
|
Test build #123346 has finished for PR 28656 at commit
|
|
thanks, merging to master! |
|
thanks all! |
What changes were proposed in this pull request?
This PR proposes to shift to the new most local locality level if any new, more local locality levels are added during TaskSetManager.recomputeLocality.
Why are the changes needed?
There's a race condition between resourceOffers and submitTasks. If submitTasks happens before resourceOffers, especially when there are no executors added to TaskSchedulerImpl at all, the TaskSetManager's myLocalityLevels will only have the ANY locality level, see:
spark/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala
Line 218 in 8b48629
And then, resourceOffers is called with new executors added to TaskSchedulerImpl. recomputeLocality will then be called because of executorAdded. During recomputeLocality, the TaskSetManager's myLocalityLevels might have PROCESS_LOCAL, NODE_LOCAL, ANY (because at this time we can find alive executors from TaskSchedulerImpl). But the TaskSetManager will stick to the previous locality level, which is ANY, see:
spark/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala
Lines 1107 to 1114 in 8b48629
As a result, in the first round of resourceOffers, the new version of delay scheduling won't take effect and the TaskSetManager can only schedule tasks at the ANY level.
Please note that the problem also exists in the old version of delay scheduling, but the impact is minor there because we always reset the locality level after successfully launching a task, which is no longer the case in the new version of delay scheduling.
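To tie the pieces together, here is a simplified, self-contained model of the fixed recomputeLocality behaviour, paraphrasing the diff hunks discussed above; variable and method names mirror TaskSetManager, but the bodies are illustrative simplifications rather than the real code.

```scala
import org.apache.spark.scheduler.TaskLocality._

var myLocalityLevels: Array[TaskLocality] = Array(ANY)  // submitTasks ran before any executor registered
var currentLocalityIndex: Int = 0                       // so the task set waits at ANY

// Simplified stand-in for TaskSetManager.getLocalityIndex.
def getLocalityIndex(locality: TaskLocality): Int =
  myLocalityLevels.indexWhere(locality <= _)

def recomputeLocality(recomputedLevels: Array[TaskLocality]): Unit = {
  val previousLocalityIndex = currentLocalityIndex
  val previousLocalityLevel = myLocalityLevels(currentLocalityIndex)
  val previousMyLocalityLevels = myLocalityLevels
  myLocalityLevels = recomputedLevels
  currentLocalityIndex = getLocalityIndex(previousLocalityLevel)
  if (currentLocalityIndex > previousLocalityIndex) {
    // SPARK-31837: shift to the most local level that recomputation newly added.
    currentLocalityIndex = getLocalityIndex(myLocalityLevels.diff(previousMyLocalityLevels).head)
  }
}

// executorAdded fires once alive executors are registered with TaskSchedulerImpl:
recomputeLocality(Array(PROCESS_LOCAL, NODE_LOCAL, ANY))
myLocalityLevels(currentLocalityIndex)  // PROCESS_LOCAL: delay scheduling applies again
```

Without the if branch, the last expression would stay at ANY, which is exactly the first-resource-offer problem described above.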
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Updated and added new test.