
Conversation

@ChenjunZou

@ChenjunZou ChenjunZou commented Apr 9, 2020

What changes were proposed in this pull request?

Let the scheduler read the preferred locations in reverse order.
For instance, given block locations
[xxx.93 xxx.100 xxx.02]
[xxx.93 xxx.102 xxx.04]
[xxx.93 xxx.66 xxx.05]
currently the executors on xxx.93 are scheduled first, and only then the executors in the other locations.
After this modification, the scheduling result is more even.

The effect is more obvious in small clusters.
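A minimal sketch of the idea (illustrative only, not the actual patch; the hosts are the example addresses above):

```scala
// Illustrative only: shows the effect of reversing each block's
// preferred-location list before the scheduler consumes it.
val blockLocations: Seq[Seq[String]] = Seq(
  Seq("xxx.93", "xxx.100", "xxx.02"),
  Seq("xxx.93", "xxx.102", "xxx.04"),
  Seq("xxx.93", "xxx.66", "xxx.05")
)

// Before: every list starts with xxx.93, so that node is always tried first.
// After reversing: the first candidates are xxx.02, xxx.04 and xxx.05.
val reversed = blockLocations.map(_.reverse)
reversed.foreach(locs => println(locs.mkString(", ")))
```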

Why are the changes needed?

Because such a hot spot is unnecessary.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Manually tested.

@ChenjunZou ChenjunZou changed the title [SPARK-31395][CORE]reverse preferred location to make schedule more even [SPARK-31395][CORE]reverse preferred location to make schedule more evener Apr 9, 2020
@ChenjunZou ChenjunZou changed the title [SPARK-31395][CORE]reverse preferred location to make schedule more evener [SPARK-31395][CORE]reverse preferred location to make schedule more even Apr 9, 2020
@AmplabJenkins

Can one of the admins verify this patch?

@HyukjinKwon
Member

@ChenjunZou Can you explain why and how it schedules evenly? I can't follow why. Also, please keep the GitHub PR template (https://github.com/apache/spark/blob/master/.github/PULL_REQUEST_TEMPLATE).

@ChenjunZou
Author

ChenjunZou commented Apr 9, 2020

@ChenjunZou Can you explain why and how it schedules evenly? I can't follow why. Also, please keep the GitHub PR template (https://github.com/apache/spark/blob/master/.github/PULL_REQUEST_TEMPLATE).

Thanks @HyukjinKwon for the reminder.

After the change, TaskSetManager will first look for free executors on xxx.02, xxx.04 and xxx.05, and then for executors on xxx.93, because they are all at the PROCESS_LOCAL locality level.
After that, it schedules at higher (less local) locality levels.

Member

@Ngone51 Ngone51 left a comment


Hi @ChenjunZou, you can disable spark.shuffle.reduceLocality.enabled if you don't want locality-preferred scheduling.

Or perhaps you have certain workloads where throughput is affected by delay scheduling? If so, you may be interested in PR #27207.
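For reference, a minimal sketch of how those two settings could be applied from application code; the values are only illustrative, not a recommendation for this particular workload:

```scala
import org.apache.spark.SparkConf

// Illustrative values only; tune per workload.
val conf = new SparkConf()
  // Ignore reduce-side locality preferences when scheduling reduce tasks.
  .set("spark.shuffle.reduceLocality.enabled", "false")
  // Do not wait for a more local slot before falling back to a less local one.
  .set("spark.locality.wait", "0s")
```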

@ChenjunZou
Author

ChenjunZou commented Apr 9, 2020

Hi @ChenjunZou, you can disable spark.shuffle.reduceLocality.enabled if you don't want locality-preferred scheduling.

Or perhaps you have certain workloads where throughput is affected by delay scheduling? If so, you may be interested in PR #27207.

Thanks @Ngone51 for reminding me of that; I will watch that PR.

@HyukjinKwon
Member

Closing this - it seems you just don't want to take locality into account, which is already possible via the configurations spark.shuffle.reduceLocality.enabled or spark.locality.wait.

@HyukjinKwon HyukjinKwon closed this Apr 9, 2020
@ChenjunZou
Author

Closing this - it seems you just don't want to take locality into account, which is already possible via the configurations spark.shuffle.reduceLocality.enabled or spark.locality.wait.

@HyukjinKwon
I don't think spark.shuffle.reduceLocality.enabled or spark.locality.wait is enough to solve my scenario.

First, my locality skew happens in the map stage.
Second, I have already set spark.locality.wait to 0,
and there is still a hot spot.

@ChenjunZou
Author

@zsxwing, can you have a look?

@HyukjinKwon
Member

If that's the case, we should fix the configurations to take the locality into account, rather than reversing the hosts. @ChenjunZou, please clarify why and how reversing the hosts can resolve your problem.

From what you said, reversing will just move the hot spot to a different node.

@ChenjunZou
Author

ChenjunZou commented Apr 9, 2020

If that's the case, we should fix the configurations to take the locality into account, rather than reversing the hosts. @ChenjunZou, please clarify why and how reversing the hosts can resolve your problem.

From what you said, reversing will just move the hot spot to a different node.

@HyukjinKwon
The root cause is that the model (or other data) is written from a single node (xxx.93, for instance).
Because of the HDFS write pipeline, the client first writes to xxx.93, plus two other nodes,
so the preferred locations look like this:
[xxx.93 xxx.100 xxx.02]
[xxx.93 xxx.102 xxx.04]
[xxx.93 xxx.66 xxx.05]

When Spark schedules tasks,
the executors on xxx.93 are always preferred by the scheduler; other executors rarely get tasks unless the executors on xxx.93 are all busy. This single hot spot should be avoided.

Besides, I agree with adding configurations.

@ChenjunZou
Author

From what you said, reversing will just move the hot spot to a different node.

The write pipeline varies:
the other replicas could go to any two nodes,
so the hot spot will be alleviated.

@HyukjinKwon
Member

HyukjinKwon commented Apr 9, 2020

So are you saying you have 3 replicas on three nodes, and the Spark job is only being executed on one specific node because of the locality? Then how does reversing hosts help?

Ideally, you shouldn't use your driver node as part of the cluster. In production you should use YARN cluster mode for that reason, for example.

You're arguing that in one specific case the driver and an executor exist on one specific node together, and the workload is heavy on that specific node. What if the last node has both the driver and an executor? Reversing hosts doesn't solve anything.

@ChenjunZou
Author

This PR is not something critical.

It just aims to reduce the possibility of a single node becoming an unnecessary hot spot.

@ChenjunZou
Author

Thanks for your explanation @HyukjinKwon.
I have two more questions.

Ideally, you shouldn't use your driver node as part of the cluster.

What does this mean?

In production you should use cluster mode for that reason.

Actually, I do use cluster mode. If I used client mode, the client's HDFS block writes would be distributed evenly,
and the single-node hot spot would not happen.

What if the last node has both the driver and an executor?

That is unrelated. The problem is:
the first location of every block is always the same node, xxx.93, which makes the scheduler try executors on that node first.

@HyukjinKwon
Member

HyukjinKwon commented Apr 9, 2020

If there's a cluster and the blocks are being written to only a single specific node, that seems like an issue in HDFS then.

@ChenjunZou
Author

At runtime it looks something like this:

xxx.93 (driver)
Starting task 0.0 in stage 0.0 (TID 0, xxx.93,executor
Starting task 1.0 in stage 0.0 (TID 1, xxx.93, executor
Starting task 2.0 in stage 0.0 (TID 1, xxx.93, executor
Starting task 3.0 in stage 0.0 (TID 1, xxx.93, executor
Starting task 4.0 in stage 0.0 (TID 1, xxx.93, executor
Starting task 5.0 in stage 0.0 (TID 1, xxx.63, executor
Starting task 6.0 in stage 0.0 (TID 1, xxx.102, executor

The executors on xxx.93 are scheduled preferentially.

@ChenjunZou
Author

If there's a cluster and the blocks are being written to only a single specific node, that seems like an issue in HDFS then.

A client that sits on one of the HDFS datanodes will write one replica to itself; that is standard behavior.

I am glad we have reached some common understanding here :)

@ChenjunZou
Author

Reversing hosts doesn't solve anything.

It is not limited to reversing; shuffling is OK too.

@HyukjinKwon
Member

HyukjinKwon commented Apr 9, 2020

No~ I think you said you faced this issue when you run the application in YARN cluster mode, where the driver runs on a different node. This is what I initially meant.

@HyukjinKwon
Member

What cluster mode do you use? If xxx.93 (driver) is the driver node, and the problem is that the data is copied to that node, you should separate the driver out of the HDFS cluster, or use YARN cluster mode, to distribute the data evenly in production.

What I am saying is: how can reversing hosts solve the problem? The last node (xxx.102, executor) could be the driver too.

@ChenjunZou
Author

You still don't understand what I mean. ~
I use YARN cluster mode; xxx.93 is a node the RM picked randomly to run the driver on,
and the executors on node xxx.93 are preferred when tasks are scheduled,
because I write some data in the driver's code and then do some RDD computation on it.

@ChenjunZou
Author

No~ I think you said you faced this issue when you run the application in YARN cluster mode, where the driver runs on a different node. This is what I initially meant.

I agree. xxx.93 is what you referred to as "a different node".

@ChenjunZou
Author

ChenjunZou commented Apr 9, 2020

The last node (xxx.102, executor) could be the driver too.

I agree.
Node xxx.102 could also become a hot spot if the driver ran on it.

@Ngone51
Member

Ngone51 commented Apr 9, 2020

@ChenjunZou How can you make sure that the front nodes in the reversed list are available to serve tasks? What if all of them are busy except xxx.93?

And what if the location lists are

[xxx.100 xxx.02 xxx.93]
[xxx.102 xxx.04 xxx.93]
[xxx.66  xxx.05 xxx.93]

then, do you still prefer "reverse" here?

IMO, the scheduling status is quite complex and undetermined at runtime, so I don't think such a "reverse" could solve the problem.

And I do think you should try #27207 first, as it really eases the problem you mentioned here.

@ChenjunZou
Author

@Ngone51

All I want is to make the scheduling more even with minor effort.
I also said it does not have to be a reverse; a shuffle is fine too.
Reversing is enough because the HDFS write pipeline already gives enough randomness (the other replica nodes vary).

What if all of them are busy except xxx.93?

Then it schedules to xxx.93.
This is about fairness: if the writing process creates a single hot spot, the scheduler could add some randomness for scheduling fairness.
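A minimal sketch of that shuffling alternative (illustrative only; randomizedLocations is a hypothetical helper, not part of this patch):

```scala
import scala.util.Random

// Sketch of "shuffle instead of reverse": randomize the candidate hosts so
// that no single replica holder is always tried first.
def randomizedLocations(locs: Seq[String]): Seq[String] =
  Random.shuffle(locs)

// Example with the block locations discussed above.
val blocks = Seq(
  Seq("xxx.93", "xxx.100", "xxx.02"),
  Seq("xxx.93", "xxx.102", "xxx.04"),
  Seq("xxx.93", "xxx.66", "xxx.05")
)
blocks.map(randomizedLocations).foreach(locs => println(locs.mkString(", ")))
```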
