SPARK-2978. Transformation with MR shuffle semantics #2274
Conversation
QA tests have started for PR 2274 at commit
QA tests have finished for PR 2274 at commit
QA tests have started for PR 2274 at commit
QA tests have finished for PR 2274 at commit
python/pyspark/rdd.py
Outdated
How about rearranging the parameters to follow the function name? Something like:
repartitionAndSortWithinPartition(self, numPartitions=None, partitionFunc=portable_hash,
                                  ascending=True, keyfunc=lambda x: x)
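For illustration only, here is a sketch of a Python method stub with the parameters in the suggested order; this is not code from the PR, and `portable_hash` is assumed to be importable from `pyspark.rdd`, where PySpark's default hash-based partition function lives:

```python
# Sketch of the suggested parameter order only; not the code from this PR.
# portable_hash is assumed to be PySpark's default hash-based partition function.
from pyspark.rdd import portable_hash

def repartitionAndSortWithinPartition(self, numPartitions=None,
                                      partitionFunc=portable_hash,
                                      ascending=True, keyfunc=lambda x: x):
    """Repartition the RDD by partitionFunc into numPartitions partitions,
    then sort the records within each partition by keyfunc."""
```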
a1ef807 to 423650a
Updated patch removes Python version, adds Java version, and adds some additional doc.
Just a nit, it should probably be called repartitionAndSortWithinPartition*s*. Also, this name is pretty long. Another one I'd reconsider is

Finally, I think it should be a policy to add all these APIs to Python, and implement them there too. Basically there are two options -- if you're doing this to support a slightly easier transition from MR jobs, but you don't want to do it in Python, you could just have it as a document, or an example, or maybe even a third-party package that takes a Hadoop JobConf and runs it on Spark. But if you want it in Spark, we need to put it in each language. The reason is to allow people to easily read code in one supported language and run it in others -- it's always disappointing when some operators turn out to be missing in yours.
The reason to add this is that it is a smaller API that we can support (both source and binary compatibility) in the long run, before finalizing ShuffledRDD (since that one has been in flux and has changed in multiple past releases). Perhaps we can mark this new API as DeveloperApi but commit to maintaining it. What do you think? The naming is long, but I'm worried repartitionWithSort in a way implies the data are sorted globally.
You can put this on the previous line ...
Ah, I see. Then we can add it, but in that case I'd also add it in Python.
15b2f90 to 1340d75
Updated patch adds Python back in and adds the 's' at the end.
Thanks, Sandy. Can you add a unit test in Java to make sure the thing is callable from Java?
python/pyspark/tests.py
Outdated
Were these removed by accident during merging?
Yup, my bad
f249f74 to c04b447
QA tests have started for PR 2274 at commit
QA tests have finished for PR 2274 at commit
QA tests have started for PR 2274 at commit
QA tests have finished for PR 2274 at commit
LGTM, thanks.
Thanks Sandy! I've merged this.
I didn't add this to the transformations list in the docs because it's kind of obscure, but would be happy to do so if others think it would be helpful.
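For readers following the docs discussion, a minimal usage sketch (assuming a live SparkContext named `sc` and the final method name `repartitionAndSortWithinPartitions`); note that records end up sorted within each partition, not globally:

```python
# Assumes an existing SparkContext named `sc`.
pairs = sc.parallelize([(0, 5), (3, 8), (2, 6), (0, 8), (3, 8), (1, 3)])

# Two partitions, keyed by key % 2; each partition is sorted by key.
rdd = pairs.repartitionAndSortWithinPartitions(2, partitionFunc=lambda k: k % 2)

print(rdd.glom().collect())
# [[(0, 5), (0, 8), (2, 6)], [(1, 3), (3, 8), (3, 8)]]
```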