
Conversation

@bijaybisht
Contributor

This fixes a bug where a Seq can be converted into an RDD with more partitions than it has elements.

It also fixes a bug in the handling of NumericRange.
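
For reference, a minimal sketch of the behavior in question (assumes an existing SparkContext named sc; the element and slice counts are illustrative):

// A 3-element Seq parallelized into 10 slices: before this change, the
// resulting RDD reports 10 partitions, 7 of which are empty.
val rdd = sc.parallelize(Seq(1, 2, 3), 10)
println(rdd.partitions.length)                               // 10
println(rdd.glom().collect().map(_.length).mkString(", "))   // e.g. 0, 0, 0, 1, 0, 0, 1, 0, 0, 1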

@AmplabJenkins

Can one of the admins verify this patch?

@pwendell
Contributor

pwendell commented Apr 5, 2014

Thanks for reporting this! Could you add unit tests for both fixes? Also, would you mind creating a JIRA for this and explaining what the exception was in the old case? It just makes it easier to track fixes for this.

@liancheng
Contributor

Hi @bijaybisht, thanks for fixing this :) I noticed this issue once too, but ultimately decided not to change it. My reasons are:

  1. Although not specified in the ScalaDoc, the numSlices parameter of SparkContext.parallelize specifies the exact number of partitions in the resulting RDD. This PR actually changes the semantics of the numSlices parameter.
  2. An RDD can have more partitions than elements. For example, RDD.filter may produce empty partitions.
  3. For APIs like RDD.zipPartitions, the partition count is significant, and this change may break existing code. For example:
// Using coalesce to ensure we have exactly 4 partitions
val x = sc.textFile("input", 4).coalesce(4) 
val y = sc.parallelize(1 to 3, 4)
val z = x.zipPartitions(y) { (i, j) =>
  ...
}

(x.zipPartitions(y) requires x and y to have exactly the same number of partitions; a rough sketch of the failure mode follows.)
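
For concreteness, a hedged sketch of that failure mode (sc and the inputs are assumed; the exact error text is approximate):

// Both RDDs are created with 4 slices. If parallelize silently capped y at
// 3 partitions because it only has 3 elements, the zipPartitions call below
// would fail at runtime with an error about unequal numbers of partitions.
val x = sc.parallelize(Seq("a", "b", "c", "d"), 4)
val y = sc.parallelize(1 to 3, 4)
val z = x.zipPartitions(y) { (i, j) => Iterator(i.size + j.size) }
println(z.collect().toSeq)   // e.g. Seq(1, 2, 2, 2) with the current semantics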

@rxin
Contributor

rxin commented Apr 5, 2014

I agree with @liancheng that it is best not to change the existing semantics. At the very least, I've been using parallelize just to launch a bunch of tasks, without passing it a Seq that is big enough. What we should make sure of is that Spark doesn't crash in this case (parallelizing an iterable whose size is smaller than the number of partitions).
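
A minimal sketch of that pattern (the element and slice counts are made up; sc is assumed):

// Use parallelize purely to fan out work: 2 elements, 8 requested slices.
// Six partitions are empty, but each of the 8 partitions still schedules a
// task, which is the behavior being relied on here.
sc.parallelize(Seq(1, 2), 8)
  .foreachPartition(iter => println(s"task saw ${iter.size} element(s)"))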

@pwendell
Contributor

pwendell commented Apr 5, 2014

Ah, I see: so the reported "bug" here was not that there were any failures or exceptions, just that you'd end up with extra empty partitions?

@mateiz
Contributor

mateiz commented Apr 6, 2014

We should allow the empty partitions, this was intentional behavior. It's used quite a bit in our unit tests for example. Users who want fewer partitions can always check the size of the Seq themselves.
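
A minimal sketch of that caller-side check (parallelizeCapped, seq, and desiredSlices are hypothetical names; sc is assumed):

// Cap the requested slice count at the Seq's size, but never below 1, so the
// caller opts out of empty partitions without changing parallelize itself.
def parallelizeCapped[T: scala.reflect.ClassTag](seq: Seq[T], desiredSlices: Int) =
  sc.parallelize(seq, math.max(1, math.min(desiredSlices, seq.size)))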

@pwendell
Contributor

pwendell commented Apr 7, 2014

@bijaybisht mind closing this? I think this is a check we don't want to have in the API.

@bijaybisht
Contributor Author

Sure, I'll close this. I presume the NumericRange change, which produces more balanced partitions (and is part of this fix), is also not required.
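
For what it's worth, a hedged illustration of the NumericRange question (sc is assumed; the exact slice sizes depend on how ParallelCollectionRDD.slice splits a NumericRange):

// 1L to 10L is a NumericRange[Long]; with 4 slices, a balanced split would
// give partition sizes like 3, 3, 2, 2, while a chunk-of-ceiling split gives
// 3, 3, 3, 1.
val r = sc.parallelize(1L to 10L, 4)
println(r.glom().collect().map(_.length).mkString(", "))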

@bijaybisht bijaybisht closed this Apr 7, 2014