@ash211
Contributor

@ash211 ash211 commented Sep 24, 2014

cc @kayousterhout

I have a few outstanding questions from compiling this documentation:

  • What's the difference between NO_PREF and ANY? I understand the implications of the ordering but don't know what an example of each would be
  • Why is NO_PREF ahead of RACK_LOCAL? I would think it'd be better to schedule rack-local tasks ahead of no preference if you could only do one or the other. Is the idea to wait longer and hope for the rack-local tasks to turn into node-local or better?
  • Will there be a datacenter-local locality level in the future? Apache Cassandra for example has this level
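
For readers following the ordering questions above, here is a minimal sketch of the level ordering under discussion. It is a hypothetical Python model, not Spark's code: the level names follow the thread (PROCESS_LOCAL and NODE_LOCAL are assumed to be the two levels ahead of NO_PREF), and `is_allowed` and `allowed_levels` are illustrative helpers that assume a task may run at any level no looser than the current constraint.

```python
from enum import IntEnum


class TaskLocality(IntEnum):
    """Locality levels, tightest first (smaller value = closer to the data)."""
    PROCESS_LOCAL = 0
    NODE_LOCAL = 1
    NO_PREF = 2      # sits ahead of RACK_LOCAL, as the second question notes
    RACK_LOCAL = 3
    ANY = 4


def is_allowed(constraint: TaskLocality, condition: TaskLocality) -> bool:
    """Assumed rule: a level is acceptable if it is no looser than the constraint."""
    return condition <= constraint


def allowed_levels(constraint: TaskLocality) -> list:
    """Names of all levels a task could be scheduled at under the constraint."""
    return [level.name for level in TaskLocality if is_allowed(constraint, level)]


if __name__ == "__main__":
    # With the constraint at NO_PREF, RACK_LOCAL and ANY placements are excluded.
    print(allowed_levels(TaskLocality.NO_PREF))
```

Under this model, a scheduler whose constraint is still NO_PREF would not yet accept a RACK_LOCAL placement, which is the behavior the second question asks about.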

@SparkQA

SparkQA commented Sep 24, 2014

QA tests have started for PR 2519 at commit 20e0e31.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Sep 24, 2014

QA tests have finished for PR 2519 at commit 20e0e31.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 24, 2014

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/20746/

@rnowling
Contributor

Ok, I realized that you had the same questions I did about the ordering so I removed my old comment. :)

Contributor

I would link to the configuration page instead of enumerating the configs here. We try not to keep two copies of things like this in the docs; otherwise people could forget to update one of them.

@SparkQA

SparkQA commented Sep 26, 2014

QA tests have started for PR 2519 at commit 44cff28.

  • This patch merges cleanly.

@ash211
Contributor Author

ash211 commented Sep 26, 2014

My recent commits address @pwendell's comments, but I'd like to include an answer to the first two bullet points from my summary before merging:

  • What's the difference between NO_PREF and ANY? I understand the implications of the ordering but don't know what an example of each would be
  • Why is NO_PREF ahead of RACK_LOCAL? I would think it'd be better to schedule rack-local tasks ahead of no preference if you could only do one or the other. Is the idea to wait longer and hope for the rack-local tasks to turn into node-local or better?

@SparkQA

SparkQA commented Sep 26, 2014

QA tests have finished for PR 2519 at commit 44cff28.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/20842/

@ash211
Contributor Author

ash211 commented Nov 14, 2014

In the absence of feedback about the above questions and in an effort to clarify this at least somewhat in the docs, I think we should merge this docs-only PR as-is for the Spark 1.2.0 release. We can always extend the docs later with clarifications if needed.

@pwendell would you please merge?

@pwendell
Contributor

Hey @ash211, I'm going to pull this in; thanks for working on it. One thing I do wonder is whether there are more actionable takeaways from this for users. In my experience the defaults are usually just fine, and it's not super clear to me when users would need to tune this.

asfgit pushed a commit that referenced this pull request Dec 10, 2014
cc kayousterhout

I have a few outstanding questions from compiling this documentation:
- What's the difference between NO_PREF and ANY?  I understand the implications of the ordering but don't know what an example of each would be
- Why is NO_PREF ahead of RACK_LOCAL?  I would think it'd be better to schedule rack-local tasks ahead of no preference if you could only do one or the other.  Is the idea to wait longer and hope for the rack-local tasks to turn into node-local or better?
- Will there be a datacenter-local locality level in the future?  Apache Cassandra for example has this level

Author: Andrew Ash <[email protected]>

Closes #2519 from ash211/SPARK-3526 and squashes the following commits:

44cff28 [Andrew Ash] Link to spark.locality parameters rather than copying the list
6d5d966 [Andrew Ash] Stay focused on Spark, no astronaut architecture mumbo-jumbo
20e0e31 [Andrew Ash] SPARK-3526 Add section about data locality to the tuning guide

(cherry picked from commit 652b781)
Signed-off-by: Patrick Wendell <[email protected]>

@asfgit asfgit closed this in 652b781 Dec 10, 2014

@ash211
Contributor Author

ash211 commented Dec 11, 2014

Agreed that there's probably not a ton that's immediately tunable. But someone looking to "make it faster" could read this section, realize that they have bad locality, and move their HDFS and Spark workers closer together as a result.

I view this page as part prescriptive and part informative, and this section is definitely more on the informative side.
