
Conversation

@ChenjunZou

@ChenjunZou ChenjunZou commented Apr 9, 2020

What changes were proposed in this pull request?

Let the scheduler read the preferred locations in reverse order.
For instance, given block locations
[xxx.93 xxx.100 xxx.02]
[xxx.93 xxx.102 xxx.04]
[xxx.93 xxx.66 xxx.05]
currently the executors on xxx.93 are scheduled first, and only then the executors in the other locations.
After this modification, the scheduling result is more even.

The effect is more obvious in small clusters.
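A minimal sketch of the idea (illustrative only, not the actual patch; the hosts are the example addresses above):

```scala
// Illustrative only: shows the effect of reversing each block's
// preferred-location list before the scheduler consumes it.
val blockLocations: Seq[Seq[String]] = Seq(
  Seq("xxx.93", "xxx.100", "xxx.02"),
  Seq("xxx.93", "xxx.102", "xxx.04"),
  Seq("xxx.93", "xxx.66", "xxx.05")
)

// Before: every list starts with xxx.93, so that node is always tried first.
// After reversing: the first candidates are xxx.02, xxx.04 and xxx.05.
val reversed = blockLocations.map(_.reverse)
reversed.foreach(locs => println(locs.mkString(", ")))
```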

Why are the changes needed?

Because such a hot spot is unnecessary.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Manually tested.

@ChenjunZou ChenjunZou changed the title [SPARK-31395][CORE]reverse preferred location to make schedule more even [SPARK-31395][CORE]reverse preferred location to make schedule more evener Apr 9, 2020
@ChenjunZou ChenjunZou changed the title [SPARK-31395][CORE]reverse preferred location to make schedule more evener [SPARK-31395][CORE]reverse preferred location to make schedule more even Apr 9, 2020
@AmplabJenkins

Can one of the admins verify this patch?

@HyukjinKwon
Member

@ChenjunZou Can you explain why and how it schedules evenly? I can't follow why. Also, please keep the GitHub PR template (https://github.com/apache/spark/blob/master/.github/PULL_REQUEST_TEMPLATE).

@ChenjunZou
Author

ChenjunZou commented Apr 9, 2020

@ChenjunZou Can you explain why and how it schedules evenly? I can't follow why. Also, please keep the GitHub PR template (https://github.com/apache/spark/blob/master/.github/PULL_REQUEST_TEMPLATE).

Thanks @HyukjinKwon for the reminder.

After the change, TaskSetManager will first look for free executors on xxx.02, xxx.04 and xxx.05, and then for executors on xxx.93, because they are all at the PROCESS_LOCAL locality level.
After that, it schedules at higher (less local) locality levels.

Member

@Ngone51 Ngone51 left a comment


Hi @ChenjunZou, you can disable spark.shuffle.reduceLocality.enabled if you don't want locality-preferred scheduling.

Or perhaps you have certain workloads where throughput is affected by delay scheduling? If so, you may be interested in PR #27207.
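For reference, a minimal sketch of how those two settings could be applied from application code; the values are only illustrative, not a recommendation for this particular workload:

```scala
import org.apache.spark.SparkConf

// Illustrative values only; tune per workload.
val conf = new SparkConf()
  // Ignore reduce-side locality preferences when scheduling reduce tasks.
  .set("spark.shuffle.reduceLocality.enabled", "false")
  // Do not wait for a more local slot before falling back to a less local one.
  .set("spark.locality.wait", "0s")
```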

@ChenjunZou
Author

ChenjunZou commented Apr 9, 2020

Hi @ChenjunZou, you can disable spark.shuffle.reduceLocality.enabled if you don't want locality-preferred scheduling.

Or perhaps you have certain workloads where throughput is affected by delay scheduling? If so, you may be interested in PR #27207.

Thanks @Ngone51 for reminding me of that; I will watch that PR.

@HyukjinKwon
Member

Closing this - it seems you just don't want to take locality into account, which is already possible via the configurations spark.shuffle.reduceLocality.enabled or spark.locality.wait.

@HyukjinKwon HyukjinKwon closed this Apr 9, 2020
@ChenjunZou
Author

Closing this - it seems you just don't want to take locality into account, which is already possible via the configurations spark.shuffle.reduceLocality.enabled or spark.locality.wait.

@HyukjinKwon
I don't think spark.shuffle.reduceLocality.enabled or spark.locality.wait is enough to solve my scenario.

First, my locality skew happens in the map stage.
Second, I have already set spark.locality.wait to 0,
and there is still a hot spot.

@ChenjunZou
Author

@zsxwing, can you have a look?

@HyukjinKwon
Member

If that's the case, we should fix the configurations to take the locality into account, rather than reversing the hosts. @ChenjunZou, please clarify why and how reversing the hosts can resolve your problem.

From what you said, reversing will just move the hot spot to a different node.

@ChenjunZou
Author

ChenjunZou commented Apr 9, 2020

If that's the case, we should fix the configurations to take the locality into account, rather than reversing the hosts. @ChenjunZou, please clarify why and how reversing the hosts can resolve your problem.

From what you said, reversing will just move the hot spot to a different node.

@HyukjinKwon
The root cause is that the model (or other data) is written from a single node (xxx.93, for instance).
Because of the HDFS write pipeline, the client first writes to xxx.93, plus two other nodes,
so the preferred locations look like this:
[xxx.93 xxx.100 xxx.02]
[xxx.93 xxx.102 xxx.04]
[xxx.93 xxx.66 xxx.05]

When Spark schedules tasks,
the executors on xxx.93 are always preferred by the scheduler; other executors rarely get tasks unless the executors on xxx.93 are all busy. This single hot spot should be avoided.

Besides, I agree with adding configurations.

@ChenjunZou
Author

From what you said, reversing will just move the hot spot to a different node.

The write pipeline varies:
the other replicas could go to any two nodes,
so the hot spot will be alleviated.

@HyukjinKwon
Member

HyukjinKwon commented Apr 9, 2020

So are you saying you have 3 replicas on three nodes, and the Spark job is only being executed on one specific node because of the locality? Then how does reversing hosts help?

Ideally, you shouldn't use your driver node as part of the cluster. In production you should use YARN cluster mode for that reason, for example.

You're arguing that in one specific case the driver and an executor exist on one specific node together, and the workload is heavy on that specific node. What if the last node has both the driver and an executor? Reversing hosts doesn't solve anything.

@ChenjunZou
Author

This PR is not something critical.

It just aims to reduce the possibility of a single node becoming an unnecessary hot spot.

@ChenjunZou
Author

Thanks for your explanation @HyukjinKwon.
I have two more questions.

Ideally, you shouldn't use your driver node as part of the cluster.

What does this mean?

In production you should use cluster mode for that reason.

Actually, I do use cluster mode. If I used client mode, the client's HDFS block writes would be distributed evenly,
and the single-node hot spot would not happen.

What if the last node has both the driver and an executor?

That is unrelated. The problem is:
the first location of every block is always the same node, xxx.93, which makes the scheduler try executors on that node first.

@HyukjinKwon
Member

HyukjinKwon commented Apr 9, 2020

If there's a cluster and the blocks are being written to only a single specific node, that seems like an issue in HDFS then.

@ChenjunZou
Author

At runtime it looks something like this:

xxx.93 (driver)
Starting task 0.0 in stage 0.0 (TID 0, xxx.93,executor
Starting task 1.0 in stage 0.0 (TID 1, xxx.93, executor
Starting task 2.0 in stage 0.0 (TID 1, xxx.93, executor
Starting task 3.0 in stage 0.0 (TID 1, xxx.93, executor
Starting task 4.0 in stage 0.0 (TID 1, xxx.93, executor
Starting task 5.0 in stage 0.0 (TID 1, xxx.63, executor
Starting task 6.0 in stage 0.0 (TID 1, xxx.102, executor

The executors on xxx.93 are scheduled preferentially.

@ChenjunZou
Author

If there's a cluster and the blocks are being written to only a single specific node, that seems like an issue in HDFS then.

A client that sits on one of the HDFS datanodes will write one replica to itself; that is standard behavior.

I am glad we have reached some common understanding here :)

@ChenjunZou
Author

Reversing hosts doesn't solve anything.

It is not limited to reversing; shuffling is OK too.

@HyukjinKwon
Member

HyukjinKwon commented Apr 9, 2020

No~ I think you said you faced this issue when you run the application in YARN cluster mode, where the driver runs on a different node. This is what I initially meant.

@HyukjinKwon
Member

What cluster mode do you use? If xxx.93 (driver) is the driver node, and the problem is that the data is copied to that node, you should separate the driver out of the HDFS cluster, or use YARN cluster mode, to distribute the data evenly in production.

What I am saying is: how can reversing hosts solve the problem? The last node (xxx.102, executor) could be the driver too.

@ChenjunZou
Author

You still don't understand what I mean. ~
I use YARN cluster mode; xxx.93 is a node the RM picked randomly to run the driver on,
and the executors on node xxx.93 are preferred when tasks are scheduled,
because I write some data in the driver's code and then do some RDD computation on it.

@ChenjunZou
Author

No~ I think you said you faced this issue when you run the application in YARN cluster mode, where the driver runs on a different node. This is what I initially meant.

I agree. xxx.93 is what you referred to as "a different node".

@ChenjunZou
Author

ChenjunZou commented Apr 9, 2020

The last node (xxx.102, executor) could be the driver too.

I agree.
Node xxx.102 could also become a hot spot if the driver ran on it.

@Ngone51
Member

Ngone51 commented Apr 9, 2020

@ChenjunZou How can you make sure that the front nodes in the reversed list are available to serve tasks? What if all of them are busy except xxx.93?

And what if the location lists are

[xxx.100 xxx.02 xxx.93]
[xxx.102 xxx.04 xxx.93]
[xxx.66  xxx.05 xxx.93]

then, do you still prefer "reverse" here?

IMO, the scheduling status is quite complex and undetermined at runtime, so I don't think such a "reverse" could solve the problem.

And I do think you should try #27207 first, as it really eases the problem you mentioned here.

@ChenjunZou
Author

@Ngone51

All I want is to make the scheduling more even with minor effort.
I also said it does not have to be a reverse; a shuffle is fine too.
Reversing is enough because the HDFS write pipeline already gives enough randomness (the other replica nodes vary).

What if all of them are busy except xxx.93?

Then it schedules to xxx.93.
This is about fairness: if the writing process creates a single hot spot, the scheduler could add some randomness for scheduling fairness.
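A minimal sketch of that shuffling alternative (illustrative only; randomizedLocations is a hypothetical helper, not part of this patch):

```scala
import scala.util.Random

// Sketch of "shuffle instead of reverse": randomize the candidate hosts so
// that no single replica holder is always tried first.
def randomizedLocations(locs: Seq[String]): Seq[String] =
  Random.shuffle(locs)

// Example with the block locations discussed above.
val blocks = Seq(
  Seq("xxx.93", "xxx.100", "xxx.02"),
  Seq("xxx.93", "xxx.102", "xxx.04"),
  Seq("xxx.93", "xxx.66", "xxx.05")
)
blocks.map(randomizedLocations).foreach(locs => println(locs.mkString(", ")))
```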
