SPARK-3526 Add section about data locality to the tuning guide #2519
|
QA tests have started for PR 2519 at commit
|
|
QA tests have finished for PR 2519 at commit
|
|
Test PASSed. |
|
Ok, I realized that you had the same questions I did about the ordering so I removed my old comment. :) |
Here I would link to the configuration page instead of enumerating the configs. We try not to have two copies of things like this in the docs; otherwise people could forget to update one of them.
|
QA tests have started for PR 2519 at commit
|
|
My recent commits address @pwendell's comments, but I'd like to include an answer to my first two bullet points from the summary before merging:
|
|
QA tests have finished for PR 2519 at commit
|
|
Test PASSed. |
|
In the absence of feedback about the above questions and in an effort to clarify this at least somewhat in the docs, I think we should merge this docs-only PR as-is for the Spark 1.2.0 release. We can always extend the docs later with clarifications if needed. @pwendell would you please merge? |
|
Hey @ash211, I'm going to pull this in; thanks for working on it. One thing I do wonder is whether there are more actionable takeaways from this for users. In my experience the defaults are usually just fine, and it's not clear to me when users would need to tune this. |
cc kayousterhout

I have a few outstanding questions from compiling this documentation:
- What's the difference between NO_PREF and ANY? I understand the implications of the ordering but don't know what an example of each would be
- Why is NO_PREF ahead of RACK_LOCAL? I would think it'd be better to schedule rack-local tasks ahead of no preference if you could only do one or the other. Is the idea to wait longer and hope for the rack-local tasks to turn into node-local or better?
- Will there be a datacenter-local locality level in the future? Apache Cassandra for example has this level

Author: Andrew Ash <[email protected]>

Closes #2519 from ash211/SPARK-3526 and squashes the following commits:

44cff28 [Andrew Ash] Link to spark.locality parameters rather than copying the list
6d5d966 [Andrew Ash] Stay focused on Spark, no astronaut architecture mumbo-jumbo
20e0e31 [Andrew Ash] SPARK-3526 Add section about data locality to the tuning guide

(cherry picked from commit 652b781)
Signed-off-by: Patrick Wendell <[email protected]>
|
Agreed that there's probably not a ton that's immediately tunable. But someone looking to "make it faster" could read this section, realize that they have bad locality, and move their HDFS and Spark workers closer together as a result. I view this page as part prescriptive and part informative, and this section is definitely more on the informative side. |
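For readers following the ordering questions in this thread, here is a minimal Python sketch of the locality levels in the order under discussion (NO_PREF ahead of RACK_LOCAL, ANY last). It is only a toy model, not Spark's scheduler: the `Locality` enum, the `allowed_levels` helper, and the single `wait_per_level_s` timeout are hypothetical names and simplifications introduced for illustration. The idea it shows is that the longer a task has waited, the less local a placement it is allowed to accept.

```python
from enum import IntEnum


class Locality(IntEnum):
    """Locality levels in the order discussed in this thread, closest first."""
    PROCESS_LOCAL = 0
    NODE_LOCAL = 1
    NO_PREF = 2
    RACK_LOCAL = 3
    ANY = 4


def allowed_levels(waited_s, wait_per_level_s=3.0):
    """Hypothetical fallback model, not Spark's actual logic.

    Each full wait period that elapses unlocks the next, less local level.
    The 3.0 second default here is an illustrative value only.
    """
    steps = min(int(waited_s // wait_per_level_s), len(Locality) - 1)
    return [level for level in Locality if level <= steps]


# A task that has not waited yet may only run at the closest level;
# after a long enough wait, every level including ANY is acceptable.
print([level.name for level in allowed_levels(0.0)])
print([level.name for level in allowed_levels(100.0)])
```

In this toy model a task with bad locality keeps falling through to later levels, which is the symptom that would suggest moving HDFS and Spark workers closer together.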
cc @kayousterhout
I have a few outstanding questions from compiling this documentation: