Added validation check for parallelizing a seq #329
Conversation
Can one of the admins verify this patch?
Thanks for reporting this! Could you add unit tests for both fixes? Also, would you mind creating a JIRA for this and explaining what the exception was in the old case? It just makes it easier to track fixes for this.
Hi @bijaybisht, thanks for fixing this :) I had once noticed this issue, but in the end decided not to change it. My reason is that code like the following relies on being able to parallelize a Seq into more partitions than it has elements (zipPartitions requires both RDDs to have the same number of partitions):
// Using coalesce to ensure we have exactly 4 partitions
val x = sc.textFile("input", 4).coalesce(4)
val y = sc.parallelize(1 to 3, 4)
val z = x.zipPartitions(y) { (i, j) =>
...
}
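The snippet above works precisely because `sc.parallelize(1 to 3, 4)` yields 4 partitions (one of them empty), matching the 4 partitions of `x`. A simplified, pure-Scala sketch of the position-based slicing (illustrative only, not Spark's exact `ParallelCollectionRDD` code) shows where the empty partitions come from:

```scala
// Simplified sketch of slicing a Seq into numSlices partitions:
// partition i covers indices [i*len/n, (i+1)*len/n), so when
// numSlices > seq.size some slices necessarily come out empty.
def slice[T](seq: Seq[T], numSlices: Int): Seq[Seq[T]] = {
  val len = seq.length
  (0 until numSlices).map { i =>
    val start = (i * len) / numSlices
    val end = ((i + 1) * len) / numSlices
    seq.slice(start, end)
  }
}

// slice(1 to 3, 4) produces four slices, exactly one of them empty.
```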
I agree with @liancheng that it is best not to change the existing semantics. At the least, I've been using parallelize to just launch a bunch of tasks without passing it a Seq that is big enough. What we should make sure is that Spark doesn't crash in this case (parallelizing an iterable whose size < num partitions).
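The "launch a bunch of tasks" pattern mentioned above might look like the following sketch, assuming an existing `SparkContext` named `sc`; `doWork` is a hypothetical per-task function, not part of any Spark API:

```scala
// Fan out numTasks independent tasks, using the Seq only as a way
// to get one element (and thus one task) per partition.
val numTasks = 8
sc.parallelize(1 to numTasks, numTasks).foreach { i =>
  doWork(i) // hypothetical side-effecting work function
}
```

Here the Seq is exactly as large as the partition count, but nothing stops a caller from requesting more partitions than elements, which is why the semantics matter.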
Ah I see - so the reported "bug" here was not that there were any failures/exceptions but just that you'd have extra empty partitions?
We should allow the empty partitions, this was intentional behavior. It's used quite a bit in our unit tests for example. Users who want fewer partitions can always check the size of the Seq themselves.
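The caller-side check suggested above could be sketched as a small helper (illustrative only; the clamping policy and the name `clampSlices` are this sketch's, not Spark's):

```scala
// Cap the requested slice count at the Seq's size, keeping at least
// one partition, so no partition ends up empty.
def clampSlices(seqSize: Int, requested: Int): Int =
  math.min(requested, math.max(1, seqSize))

// A caller would then write something like:
//   sc.parallelize(seq, clampSlices(seq.size, 4))
```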
@bijaybisht mind closing this? I think this is a check we don't want to have in the API.
Sure, I'll close this. I presume that the change for NumericRange, which results in more balanced partitions (and which is part of this fix), is also something that is not required.
This fixes a bug wherein a Seq can be converted into an RDD with more partitions than the number of elements it has.
It also fixes a bug in the handling of NumericRange.
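For the NumericRange part, one way to split a range into near-equal slices can be sketched as follows (illustrative only, not the exact patch): distribute the remainder across the first `len % n` slices instead of letting one slice absorb it.

```scala
// Split a Range into numSlices contiguous sub-ranges whose sizes
// differ by at most one element.
def sliceRange(r: Range, numSlices: Int): Seq[Range] = {
  val len = r.length
  var start = 0
  (0 until numSlices).map { i =>
    val size = len / numSlices + (if (i < len % numSlices) 1 else 0)
    val s = r.slice(start, start + size)
    start += size
    s
  }
}

// sliceRange(1 to 10, 3) gives slices of sizes 4, 3, 3.
```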