
Conversation

@bijaybisht
Contributor

This fixes a bug where a Seq can be converted into an RDD with more partitions than it has elements.

It also fixes a bug in the handling of NumericRange.
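
For reference, a minimal sketch of the behavior in question (assumes an existing SparkContext named sc; the element and slice counts are illustrative):

// A 3-element Seq parallelized into 10 slices: before this change, the
// resulting RDD reports 10 partitions, 7 of which are empty.
val rdd = sc.parallelize(Seq(1, 2, 3), 10)
println(rdd.partitions.length)                               // 10
println(rdd.glom().collect().map(_.length).mkString(", "))   // e.g. 0, 0, 0, 1, 0, 0, 1, 0, 0, 1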

@AmplabJenkins

Can one of the admins verify this patch?

@pwendell
Contributor

pwendell commented Apr 5, 2014

Thanks for reporting this! Could you add unit tests for both fixes? Also, would you mind creating a JIRA for this and explaining what the exception was in the old case? It just makes it easier to track fixes for this.

@liancheng
Contributor

Hi @bijaybisht, thanks for fixing this :) I noticed this issue once too, but ultimately decided not to change it. My reasons are:

  1. Although not specified in the ScalaDoc, the numSlices parameter of SparkContext.parallelize specifies the exact number of partitions in the resulting RDD. This PR actually changes the semantics of the numSlices parameter.
  2. An RDD can have more partitions than elements. For example, RDD.filter may produce empty partitions.
  3. For APIs like RDD.zipPartitions, the partition count is significant, and this change may break existing code. For example:
// Using coalesce to ensure we have exactly 4 partitions
val x = sc.textFile("input", 4).coalesce(4) 
val y = sc.parallelize(1 to 3, 4)
val z = x.zipPartitions(y) { (i, j) =>
  ...
}

(x.zipPartitions(y) requires x and y to have exactly the same number of partitions; a rough sketch of the failure mode follows.)
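
For concreteness, a hedged sketch of that failure mode (sc and the inputs are assumed; the exact error text is approximate):

// Both RDDs are created with 4 slices. If parallelize silently capped y at
// 3 partitions because it only has 3 elements, the zipPartitions call below
// would fail at runtime with an error about unequal numbers of partitions.
val x = sc.parallelize(Seq("a", "b", "c", "d"), 4)
val y = sc.parallelize(1 to 3, 4)
val z = x.zipPartitions(y) { (i, j) => Iterator(i.size + j.size) }
println(z.collect().toSeq)   // e.g. Seq(1, 2, 2, 2) with the current semantics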

@rxin
Contributor

rxin commented Apr 5, 2014

I agree with @liancheng that it is best not to change the existing semantics. At the very least, I've been using parallelize just to launch a bunch of tasks, without passing it a Seq that is big enough. What we should make sure of is that Spark doesn't crash in this case (parallelizing an iterable whose size is smaller than the number of partitions).
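
A minimal sketch of that pattern (the element and slice counts are made up; sc is assumed):

// Use parallelize purely to fan out work: 2 elements, 8 requested slices.
// Six partitions are empty, but each of the 8 partitions still schedules a
// task, which is the behavior being relied on here.
sc.parallelize(Seq(1, 2), 8)
  .foreachPartition(iter => println(s"task saw ${iter.size} element(s)"))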

@pwendell
Contributor

pwendell commented Apr 5, 2014

Ah, I see: so the reported "bug" here was not that there were any failures or exceptions, just that you'd end up with extra empty partitions?

@mateiz
Contributor

mateiz commented Apr 6, 2014

We should allow the empty partitions, this was intentional behavior. It's used quite a bit in our unit tests for example. Users who want fewer partitions can always check the size of the Seq themselves.
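
A minimal sketch of that caller-side check (parallelizeCapped, seq, and desiredSlices are hypothetical names; sc is assumed):

// Cap the requested slice count at the Seq's size, but never below 1, so the
// caller opts out of empty partitions without changing parallelize itself.
def parallelizeCapped[T: scala.reflect.ClassTag](seq: Seq[T], desiredSlices: Int) =
  sc.parallelize(seq, math.max(1, math.min(desiredSlices, seq.size)))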

@pwendell
Contributor

pwendell commented Apr 7, 2014

@bijaybisht mind closing this? I think this is a check we don't want to have in the API.

@bijaybisht
Contributor Author

Sure, I'll close this. I presume the NumericRange change, which produces more balanced partitions (and is part of this fix), is also not required.
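
For what it's worth, a hedged illustration of the NumericRange question (sc is assumed; the exact slice sizes depend on how ParallelCollectionRDD.slice splits a NumericRange):

// 1L to 10L is a NumericRange[Long]; with 4 slices, a balanced split would
// give partition sizes like 3, 3, 2, 2, while a chunk-of-ceiling split gives
// 3, 3, 3, 1.
val r = sc.parallelize(1L to 10L, 4)
println(r.glom().collect().map(_.length).mkString(", "))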

@bijaybisht bijaybisht closed this Apr 7, 2014