[SPARK-31395][CORE] reverse preferred location to make schedule more even #28168
Conversation
Can one of the admins verify this patch?
@ChenjunZou Can you explain why and how it schedules more evenly? I can't follow why. Also, please keep the GitHub PR template (https://github.com/apache/spark/blob/master/.github/PULL_REQUEST_TEMPLATE).
Thanks @HyukjinKwon for the reminder. TaskSetManager will first seek the free executors on xxx.02, xxx.04 and xxx.05, then the executors on xxx.93, because they are at the process_locality level.
Ngone51 left a comment
Hi @ChenjunZou, you can disable spark.shuffle.reduceLocality.enabled if you don't want locality-preferred scheduling.
Or are you thinking of certain workloads where throughput is affected by delay scheduling? If so, you may be interested in PR #27207.
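A minimal sketch of how that existing setting could be applied, assuming a plain SparkContext application (the app name and master below are placeholders for illustration, not anything from this PR):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object NoReduceLocalityExample {
  def main(args: Array[String]): Unit = {
    // Sketch only: turn off reduce-side locality preference via the existing
    // spark.shuffle.reduceLocality.enabled setting (default: true).
    val conf = new SparkConf()
      .setAppName("no-reduce-locality-example")             // placeholder app name
      .setMaster("yarn")                                     // placeholder master
      .set("spark.shuffle.reduceLocality.enabled", "false")  // skip locality preference for reduce tasks
    val sc = new SparkContext(conf)

    // ... run the job as usual ...

    sc.stop()
  }
}
```

The same flag can also be passed at submit time, e.g. `--conf spark.shuffle.reduceLocality.enabled=false`.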
Thanks @Ngone51 for reminding me of that; I will watch that PR.
Closing this. It seems you just don't want to take locality into account, which is already possible via the configuration.
@HyukjinKwon Because my locality skew happens in the map stage, that configuration does not cover my case.
@zsxwing |
If that's the case, we should fix the configurations to take the locality into account, rather than reversing the hosts. @ChenjunZou, please clarify why and how reversing the hosts can resolve your problem. From what you said, reversing will just move where the hot spot happens.
@HyukjinKwon When Spark schedules tasks, it prefers the hosts at the front of the preferred-location list. Besides, I agree with adding a configuration for this.
The HDFS write pipeline varies.
So are you saying you have 3 replicas on three nodes and the Spark job is only being executed on one specific node because of the locality? Then how does reversing the hosts help? Ideally you shouldn't use your driver node as a cluster node; in production you had better use YARN cluster mode, for reasons like this. You're arguing that in one specific case the driver and an executor sit on one specific node together, and the workload is heavy on that node. What if the last node also has both the driver and an executor? Reversing hosts doesn't solve anything.
This PR is not something important. It just aims to reduce the possibility that a single node becomes an unnecessary hot spot.
Thanks for your explanation @HyukjinKwon.
What does this mean?
Actually I use cluster mode. If I used client mode, the HDFS blocks written by the client would be distributed evenly.
That is unrelated. The problem is:
If there's a cluster and the blocks are being written to a single specific node only, that seems like an issue in HDFS then.
At runtime it is something like this: xxx.93 hosts the driver, and the executors on xxx.93 are scheduled more preferentially.
A client that is on an HDFS data node will write one copy to itself. That is common behavior. I am glad we have some common ground here :)
It is not limited to reversing; shuffling the list would be OK too.
No, I think you said you faced this issue when you run the applications in YARN cluster mode, where the driver runs on a different node. This is what I initially meant.
What cluster mode do you use? What I am saying is: how can reversing the hosts solve the problem? On the last node, xxx.102, an executor can be colocated with the driver too.
You still don't understand what I mean.
I agree. xxx.93 is what you called "a different node".
I agree.
@ChenjunZou How can you make sure that the front nodes in the reversed list are available to serve tasks? What if the other nodes are busy? And if the location lists were different, would you still prefer "reverse" here? IMO, the state of scheduling is quite complex and undetermined at runtime, so I don't think such a "reverse" could solve the problem. And I do think you should try #27207 first, as it really eases the problem you mentioned here.
All I want is to make the schedule more even with minor effort.
It schedules to xxx.93.
What changes were proposed in this pull request?
Let the scheduler read the preferred locations in reverse order.
For instance, given these block locations:
[xxx.93, xxx.100, xxx.02]
[xxx.93, xxx.102, xxx.04]
[xxx.93, xxx.66, xxx.05]
currently the executors on xxx.93 will be scheduled first, then executors in the other locations.
After the modification, the scheduling result is more even (a sketch of the idea follows below).
The effect is more obvious in small clusters.
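A minimal sketch of the idea, for illustration only (this is not the actual patch; the helper name is hypothetical, and the host strings are the ones from the example above):

```scala
// Hypothetical helper: reverse each block's preferred-location list before the
// scheduler consumes it, so the shared head host (xxx.93 above) is tried last.
def reversedPreferredLocations(blockLocations: Seq[Seq[String]]): Seq[Seq[String]] =
  blockLocations.map(_.reverse)

// The locations from the PR description:
val locs = Seq(
  Seq("xxx.93", "xxx.100", "xxx.02"),
  Seq("xxx.93", "xxx.102", "xxx.04"),
  Seq("xxx.93", "xxx.66", "xxx.05")
)

// First-choice hosts become xxx.02, xxx.04 and xxx.05 instead of xxx.93 for all three.
reversedPreferredLocations(locs).foreach(println)
```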
Why are the changes needed?
Because such a hot spot is unnecessary.
Does this PR introduce any user-facing change?
No
How was this patch tested?
Manually tested.