SPARK-3526 Add section about data locality to the tuning guide #2519

ash211 · 2014-09-24T08:56:01Z

cc @kayousterhout

I have a few outstanding questions from compiling this documentation:

- What's the difference between NO_PREF and ANY? I understand the implications of the ordering but don't know what an example of each would be.
- Why is NO_PREF ahead of RACK_LOCAL? I would think it'd be better to schedule rack-local tasks ahead of no preference if you could only do one or the other. Is the idea to wait longer and hope for the rack-local tasks to turn into node-local or better?
- Will there be a datacenter-local locality level in the future? Apache Cassandra, for example, has this level.
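
For readers following the ordering question, here is a minimal sketch of the order being discussed, written as a plain Python enum rather than Spark's own Scala code. NO_PREF, RACK_LOCAL, and ANY are named in this thread; PROCESS_LOCAL and NODE_LOCAL are the two tighter levels in Spark's locality ordering. The `Locality` class and the `is_allowed` helper are hypothetical names used only for illustration, not Spark's API.

```python
from enum import IntEnum


class Locality(IntEnum):
    # Illustrative model only: a lower value means closer to the data.
    PROCESS_LOCAL = 0
    NODE_LOCAL = 1
    NO_PREF = 2
    RACK_LOCAL = 3
    ANY = 4


def is_allowed(max_level, task_level):
    """Hypothetical helper: a task may run if its level is no looser than max_level."""
    return task_level <= max_level


# With the scheduler capped at NO_PREF, rack-local tasks are not yet eligible.
allowed = [lvl.name for lvl in Locality if is_allowed(Locality.NO_PREF, lvl)]
print(allowed)
```

In this model a scheduler capped at NO_PREF launches no-preference tasks before it considers rack-local ones, which is the behavior the second question asks about.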

SparkQA · 2014-09-24T08:59:24Z

QA tests have started for PR 2519 at commit 20e0e31.

This patch merges cleanly.

SparkQA · 2014-09-24T10:06:35Z

QA tests have finished for PR 2519 at commit 20e0e31.

This patch passes unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2014-09-24T10:06:38Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/20746/

rnowling · 2014-09-24T19:35:23Z

Ok, I realized that you had the same questions I did about the ordering so I removed my old comment. :)

pwendell · 2014-09-25T04:09:40Z

docs/tuning.md

Here I would link to the configuration page instead of enumerating the configs here. We try not to have two copies of things like this in the docs or else people could forget to update this.

SparkQA · 2014-09-26T04:14:28Z

QA tests have started for PR 2519 at commit 44cff28.

This patch merges cleanly.

ash211 · 2014-09-26T04:14:50Z

My recent commits address @pwendell's comments, but I'd like to include an answer to my first two bullet points from the summary before merging:

- What's the difference between NO_PREF and ANY? I understand the implications of the ordering but don't know what an example of each would be.
- Why is NO_PREF ahead of RACK_LOCAL? I would think it'd be better to schedule rack-local tasks ahead of no preference if you could only do one or the other. Is the idea to wait longer and hope for the rack-local tasks to turn into node-local or better?

SparkQA · 2014-09-26T05:23:34Z

QA tests have finished for PR 2519 at commit 44cff28.

This patch passes unit tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2014-09-26T05:23:38Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/20842/

ash211 · 2014-11-14T08:40:08Z

In the absence of feedback about the above questions and in an effort to clarify this at least somewhat in the docs, I think we should merge this docs-only PR as-is for the Spark 1.2.0 release. We can always extend the docs later with clarifications if needed.

@pwendell would you please merge?

pwendell · 2014-12-10T23:00:49Z

Hey @ash211, I'm going to pull this in, thanks for working on it. One thing I do wonder is if there are more actionable take-aways from this for users. In my experience the defaults are usually just fine, and it's not super clear to me when users would need to tune this.

cc kayousterhout

I have a few outstanding questions from compiling this documentation:
- What's the difference between NO_PREF and ANY? I understand the implications of the ordering but don't know what an example of each would be
- Why is NO_PREF ahead of RACK_LOCAL? I would think it'd be better to schedule rack-local tasks ahead of no preference if you could only do one or the other. Is the idea to wait longer and hope for the rack-local tasks to turn into node-local or better?
- Will there be a datacenter-local locality level in the future? Apache Cassandra for example has this level

Author: Andrew Ash <[email protected]>

Closes #2519 from ash211/SPARK-3526 and squashes the following commits:

44cff28 [Andrew Ash] Link to spark.locality parameters rather than copying the list
6d5d966 [Andrew Ash] Stay focused on Spark, no astronaut architecture mumbo-jumbo
20e0e31 [Andrew Ash] SPARK-3526 Add section about data locality to the tuning guide

(cherry picked from commit 652b781)
Signed-off-by: Patrick Wendell <[email protected]>

ash211 · 2014-12-11T23:45:32Z

Agreed that there's probably not a ton that's immediately tunable. But someone looking to "make it faster" could read this section, realize that they have bad locality, and move their HDFS and Spark workers closer together as a result.

I view this page as part prescriptive and part informative, and this section is definitely more on the informative side.
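
As a rough illustration of why poor locality costs time, the sketch below models the fallback idea behind the `spark.locality` wait settings: the scheduler holds out for a tighter level for a while, then loosens. This is a simplified toy model, not Spark's scheduler code. The function name, the single shared wait value, and the 3-second figure are assumptions for illustration, and NO_PREF is left out to keep the model small.

```python
# Levels ordered from tightest to loosest (NO_PREF omitted for simplicity).
LEVELS = ["PROCESS_LOCAL", "NODE_LOCAL", "RACK_LOCAL", "ANY"]


def current_level(idle_seconds, wait_per_level=3.0):
    """Hypothetical helper: loosest level permitted after idling this long.

    Assumes one shared wait per level; each full wait that passes without
    launching a task loosens the allowed locality by one step.
    """
    steps = int(idle_seconds // wait_per_level)
    return LEVELS[min(steps, len(LEVELS) - 1)]


for idle in (0.0, 3.5, 7.0, 100.0):
    print(idle, current_level(idle))
```

Under these assumptions, a job whose data is never process-local or node-local spends several seconds idling before it is allowed to run tasks elsewhere, which is the kind of symptom that moving HDFS and Spark workers closer together would address.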

SPARK-3526 Add section about data locality to the tuning guide

20e0e31

pwendell reviewed Sep 25, 2014

ash211 added 2 commits September 25, 2014 21:01

Stay focused on Spark, no astronaut architecture mumbo-jumbo

6d5d966

Link to spark.locality parameters rather than copying the list

44cff28

asfgit closed this in 652b781 Dec 10, 2014

5 participants