
Commit 652b781

ash211 authored and pwendell committed
SPARK-3526 Add section about data locality to the tuning guide
cc kayousterhout I have a few outstanding questions from compiling this documentation:

- What's the difference between NO_PREF and ANY? I understand the implications of the ordering but don't know what an example of each would be
- Why is NO_PREF ahead of RACK_LOCAL? I would think it'd be better to schedule rack-local tasks ahead of no preference if you could only do one or the other. Is the idea to wait longer and hope for the rack-local tasks to turn into node-local or better?
- Will there be a datacenter-local locality level in the future? Apache Cassandra for example has this level

Author: Andrew Ash <[email protected]>

Closes #2519 from ash211/SPARK-3526 and squashes the following commits:

44cff28 [Andrew Ash] Link to spark.locality parameters rather than copying the list
6d5d966 [Andrew Ash] Stay focused on Spark, no astronaut architecture mumbo-jumbo
20e0e31 [Andrew Ash] SPARK-3526 Add section about data locality to the tuning guide
1 parent 36bdb5b commit 652b781

File tree

1 file changed

+33 −0 lines changed


docs/tuning.md

Lines changed: 33 additions & 0 deletions
@@ -233,6 +233,39 @@ Spark prints the serialized size of each task on the master, so you can look at
decide whether your tasks are too large; in general tasks larger than about 20 KB are probably
worth optimizing.
## Data Locality

Data locality can have a major impact on the performance of Spark jobs. If data and the code that
operates on it are together, computation tends to be fast. But if code and data are separated,
one must move to the other. Typically it is faster to ship serialized code from place to place than
a chunk of data, because code size is much smaller than data. Spark builds its scheduling around
this general principle of data locality.

Data locality is how close data is to the code processing it. There are several levels of
locality based on the data's current location. In order from closest to farthest:
- `PROCESS_LOCAL` data is in the same JVM as the running code. This is the best locality
  possible
- `NODE_LOCAL` data is on the same node. Examples might be in HDFS on the same node, or in
  another executor on the same node. This is a little slower than `PROCESS_LOCAL` because the data
  has to travel between processes
- `NO_PREF` data is accessed equally quickly from anywhere and has no locality preference
- `RACK_LOCAL` data is on the same rack of servers, but on a different server, so it needs to be
  sent over the network, typically through a single switch
- `ANY` data is elsewhere on the network and not in the same rack
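
As a brief illustration of how these levels arise in practice (a sketch added for this writeup, not part of the committed guide; the HDFS path is hypothetical): scanning a file in HDFS typically yields `NODE_LOCAL` tasks, while reusing a cached RDD lets later jobs run `PROCESS_LOCAL` against the in-memory partitions.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object LocalityLevels {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("locality-example"))

    // Each task prefers the node(s) holding its HDFS block, so tasks
    // typically launch NODE_LOCAL when executors run on those nodes.
    val logs = sc.textFile("hdfs:///data/logs") // hypothetical path

    // Persist in executor JVM memory; once materialized, later jobs over
    // `errors` can be scheduled PROCESS_LOCAL against the cached partitions.
    val errors = logs.filter(_.contains("ERROR")).persist(StorageLevel.MEMORY_ONLY)

    errors.count() // first pass reads from HDFS and fills the cache
    errors.count() // second pass can run mostly PROCESS_LOCAL

    sc.stop()
  }
}
```

The "Locality Level" column on a stage's page in the Spark UI shows the level each task actually ran at, which is how you can check this for your own jobs.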

Spark prefers to schedule all tasks at the best locality level, but this is not always possible. In
situations where there is no unprocessed data on any idle executor, Spark switches to lower locality
levels. There are two options: a) wait until a busy CPU frees up to start a task on data on the same
server, or b) immediately start a new task in a farther-away place that requires moving data there.

What Spark typically does is wait a bit in the hopes that a busy CPU frees up. Once that timeout
expires, it starts moving the data from far away to the free CPU. The wait timeout for fallback
between each level can be configured individually or all together in one parameter; see the
`spark.locality` parameters on the [configuration page](configuration.html#scheduling) for details.
You should increase these settings if your tasks are long and you see poor locality, but the
default usually works well.
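
As a hedged sketch of what that configuration looks like (the values are illustrative assumptions, not recommendations), the waits can be set on a `SparkConf`: `spark.locality.wait` is the shared fallback timeout, and the per-level variants override it. Recent Spark versions accept duration suffixes like `10s`; versions contemporary with this commit expect milliseconds (e.g. `10000`).

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Illustrative values only; appropriate settings depend on the workload.
val conf = new SparkConf()
  .setAppName("locality-tuning")
  // How long to wait for a free slot at the current locality level
  // before falling back one level; shared default for all levels.
  .set("spark.locality.wait", "10s")
  // Per-level overrides for node- and rack-local fallback.
  .set("spark.locality.wait.node", "10s")
  .set("spark.locality.wait.rack", "5s")

val sc = new SparkContext(conf)
```

Raising these waits trades idle CPU time for better placement, which tends to pay off when tasks are long relative to the wait.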

# Summary

This has been a short guide to point out the main concerns you should know about when tuning a
