SPARK-1597: Add a version of reduceByKey that takes the Partitioner as a... #550
…s a second argument

Most of our shuffle methods can take a Partitioner or a number of partitions as a second argument, but for some reason reduceByKey takes the Partitioner as a first argument: http://spark.apache.org/docs/0.9.1/api/core/#org.apache.spark.rdd.PairRDDFunctions.
Deprecated that version and added one where the Partitioner is the second argument.
|
We'll need to specify the parameter types for the function passed to reduceByKey. @mateiz IMHO we should leave the method as it is, since this change would make the code ugly. |
|
Can one of the admins verify this patch? |
|
Ah, wow, I never knew that. So if one takes a Partitioner first and one takes a function, the types are inferred, but if both take a function first, they're not? In that case we might want to change our other methods too, like cogroup and groupByKey, to take a Partitioner first. Wouldn't this problem also affect them? |
|
@mateiz I think this only applies to anonymous functions, so it doesn't affect either cogroup or groupByKey. |
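The inference issue discussed above can be sketched without Spark. This is only an illustration under stated assumptions: `PairOps`, `Partitioner` and `HashPartitioner` below are local stand-ins, not Spark's classes, and the partitioning arguments are ignored. When only one two-argument overload can accept a function literal in a given position, the compiler picks that overload first and then types the lambda against it. Once two overloads both take the function first, the thread reports that the compiler of the time needed explicit parameter types; newer compilers may be more lenient.

```scala
// Stand-in types for illustration only; these are not Spark's classes.
trait Partitioner { def numPartitions: Int }
class HashPartitioner(val numPartitions: Int) extends Partitioner

// Hypothetical local analogue of PairRDDFunctions; partitioning arguments are ignored.
class PairOps[K, V](data: Seq[(K, V)]) {
  def reduceByKey(func: (V, V) => V): Map[K, V] =
    data.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2).reduce(func) }

  def reduceByKey(func: (V, V) => V, numPartitions: Int): Map[K, V] =
    reduceByKey(func)

  // Existing shape: Partitioner first.
  def reduceByKey(partitioner: Partitioner, func: (V, V) => V): Map[K, V] =
    reduceByKey(func)

  // Proposed shape: Partitioner second, so two overloads now take a function first.
  def reduceByKey(func: (V, V) => V, partitioner: Partitioner): Map[K, V] =
    reduceByKey(func)
}

object OverloadDemo {
  def main(args: Array[String]): Unit = {
    val ops = new PairOps(Seq("a" -> 1, "b" -> 2, "a" -> 3))
    val part = new HashPartitioner(2)

    // Only one two-argument overload accepts a function in second position,
    // so the lambda's parameter types come from the expected type.
    val viaFirst = ops.reduceByKey(part, _ + _)

    // Explicit parameter types resolve the overload regardless of compiler version.
    val viaSecond = ops.reduceByKey((a: Int, b: Int) => a + b, part)

    assert(viaFirst == Map("a" -> 4, "b" -> 2))
    assert(viaSecond == viaFirst)
  }
}
```

The second call shows the verbosity the comment above objects to: the annotated lambda is what callers would have to write if the short `_ + _` form stopped compiling.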
This line is over 100 chars wide
@rxin will fix this as soon as a decision is made on whether we want to do this or not.
|
I never even realized we had a version of reduceByKey where the first argument is not the closure ... |
|
I have one solution to this, although it is technically an API change, so just throwing it out there for discussion. We can remove all the numPartitions: Int arguments, and add an implicit conversion from int to HashPartitioner. |
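A minimal sketch of what this suggestion could look like, again with local stand-in types rather than Spark's classes. The conversion name `intToHashPartitioner` and the `(Map, Int)` return shape are assumptions made only so the effect is observable; nothing here is taken from the Spark codebase.

```scala
import scala.language.implicitConversions

// Stand-in types for illustration only; these are not Spark's classes.
trait Partitioner { def numPartitions: Int }
class HashPartitioner(val numPartitions: Int) extends Partitioner

object Partitioner {
  // Hypothetical conversion; defining it in the companion puts it in implicit
  // scope wherever a Partitioner is expected.
  implicit def intToHashPartitioner(n: Int): Partitioner = new HashPartitioner(n)
}

// Hypothetical local analogue with a single two-argument method (no Int overload).
class PairOps[K, V](data: Seq[(K, V)]) {
  // Returns the reduced values together with the partition count that was requested.
  def reduceByKey(func: (V, V) => V, partitioner: Partitioner): (Map[K, V], Int) = {
    val reduced = data.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2).reduce(func) }
    (reduced, partitioner.numPartitions)
  }
}

object ImplicitDemo {
  def main(args: Array[String]): Unit = {
    val ops = new PairOps(Seq("a" -> 1, "b" -> 2, "a" -> 3))

    // The Int literal is converted to a HashPartitioner by the implicit above.
    val (result, parts) = ops.reduceByKey(_ + _, 4)

    assert(result == Map("a" -> 4, "b" -> 2))
    assert(parts == 4)
  }
}
```

With a single method there is no overload to resolve, so `_ + _` is typed directly. The trade-off is that nothing at the call site shows a `HashPartitioner` being created.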
|
@rxin +1 |
|
I'd rather not add the implicit conversion from int to partitioner; it would be very hard to discover on its own. Instead, maybe we can just leave this API as is. It's strange, but there's a good reason for it. |
|
QA tests have started for PR 550. This patch merges cleanly. |
|
QA results for PR 550: |
|
It sounds like the conclusion here is to close this issue then. |
This commit exists to close the following pull requests on Github: Closes apache#1328 (close requested by 'pwendell') Closes apache#2314 (close requested by 'pwendell') Closes apache#997 (close requested by 'pwendell') Closes apache#550 (close requested by 'pwendell') Closes apache#1506 (close requested by 'pwendell') Closes apache#2423 (close requested by 'mengxr') Closes apache#554 (close requested by 'joshrosen')
### What changes were proposed in this pull request?
Due to a quirk in the parser, in some cases, IDENTIFIER(<funcStr>)(<arg>) is not properly recognized as a function invocation. The change is to remove the explicit IDENTIFIER-clause rule in the function invocation grammar and instead recognize IDENTIFIER(<arg>) within visitFunctionCall.
### Why are the changes needed?
Function invocation support for IDENTIFIER is incomplete otherwise
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Added new testcases to identifier-clause.sql
### Was this patch authored or co-authored using generative AI tooling?
No
Closes apache#42888 from srielau/SPARK-45132.
Lead-authored-by: srielau <[email protected]>
Co-authored-by: Wenchen Fan <[email protected]>
Co-authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit f0b2e6d)
Signed-off-by: Wenchen Fan <[email protected]>
* fix
---------
Signed-off-by: Wenchen Fan <[email protected]>
Co-authored-by: srielau <[email protected]>
Co-authored-by: Wenchen Fan <[email protected]>
Co-authored-by: Wenchen Fan <[email protected]>