SPARK-3526 Add section about data locality to the tuning guide #2519
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Closed
@@ -247,6 +247,39 @@ Spark prints the serialized size of each task on the master, so you can look at
decide whether your tasks are too large; in general tasks larger than about 20 KB are probably
worth optimizing.

## Data Locality
One of the most important principles of distributed computing is data locality. If data and the
code that operates on it are together, then computation tends to be fast. But if code and data are
separated, one must move to the other. Typically it is faster to ship serialized code from place to
place than a chunk of data, because code size is much smaller than data. Spark builds its scheduling
around this general principle of data locality.
Data locality is how close data is to the code processing it. There are several levels of
locality based on the data's current location. In order from closest to farthest (see the sketch
after this list for one way to observe these levels per task):
- `PROCESS_LOCAL` data is in the same JVM as the running code. This is the best locality possible.
- `NODE_LOCAL` data is on the same node. Examples might be in HDFS on the same node, or in
  another executor on the same node. This is a little slower than `PROCESS_LOCAL` because the data
  has to travel between processes.
- `NO_PREF` data is accessed equally quickly from anywhere and has no locality preference.
- `RACK_LOCAL` data is on the same rack of servers, but on a different server, so it needs to be
  sent over the network, typically through a single switch.
- `ANY` data is elsewhere on the network and not in the same rack.
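These levels are also reported for each task in the Spark UI. As a rough illustrative sketch (the `LocalityLoggingListener` class below is made up for this example; it relies only on Spark's developer-API `SparkListener` hook and the `taskLocality` field of `TaskInfo`), you could log the level each task actually ran at:

```scala
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

// Hypothetical helper, not part of Spark or of the proposed guide text:
// logs the locality level each completed task ran at, so you can see how
// often the scheduler had to fall back from PROCESS_LOCAL.
class LocalityLoggingListener extends SparkListener {
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    val info = taskEnd.taskInfo
    println(s"stage ${taskEnd.stageId} task ${info.taskId}: locality ${info.taskLocality}")
  }
}

// Register on an existing SparkContext (assumed to be named `sc`):
// sc.addSparkListener(new LocalityLoggingListener())
```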
Spark prefers to schedule all tasks at the best locality level, but this is not always possible. In
situations where there is no unprocessed data on any idle executor, Spark switches to lower locality
levels. There are two options: a) wait until a busy CPU frees up to start a task on data on the same
server, or b) immediately start a new task in a farther-away place that requires moving data there.
What Spark typically does is wait a bit in the hope that a busy CPU frees up. Once that timeout
expires, it starts moving the data from far away to the free CPU. The wait timeout for fallback
between each level can be configured individually via `spark.locality.wait.process`,
`spark.locality.wait.node`, and `spark.locality.wait.rack`, or all together via `spark.locality.wait`.
You should increase these settings if your tasks are long and see poor locality, but the defaults
usually work well.

Contributor comment: Here I would link to the configuration page instead of enumerating the configs
here. We try not to have two copies of things like this in the docs, or else people could forget to
update this.
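For illustration only (this block is not part of the proposed section), one way the locality waits might be set when building an application, assuming a Spark version that accepts time strings such as `"10s"` (older releases take plain millisecond values):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Sketch: raise the locality waits for a job whose tasks are long-running and
// showing poor locality. The values are examples, not recommendations.
val conf = new SparkConf()
  .setAppName("locality-wait-example")
  .set("spark.locality.wait", "10s")       // fallback timeout shared by all levels
  .set("spark.locality.wait.node", "30s")  // wait longer before leaving NODE_LOCAL

val sc = new SparkContext(conf)
```

The same properties can also be passed with `--conf` on `spark-submit` or set in `spark-defaults.conf`, which fits the review comment above about pointing readers to the configuration page rather than repeating the settings here.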
|
|
# Summary
This has been a short guide to point out the main concerns you should know about when tuning a
Review comment: It might be good to say something more like "Data locality can have a major impact on the performance of Spark jobs."