Commit d4aed26

Davies Liu authored and JoshRosen committed
[SPARK-4304] [PySpark] Fix sort on empty RDD (1.0 branch)
This PR fixes sortBy()/sortByKey() on an empty RDD. It should be backported into 1.0.

Author: Davies Liu <[email protected]>

Closes #3163 from davies/fix_sort_1.0 and squashes the following commits:

9be984f [Davies Liu] fix sort on empty RDD
1 parent 18c8c38 commit d4aed26
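
For context, a minimal reproduction sketch of the behavior this commit addresses. It is not part of the commit; it assumes a local SparkContext and an illustrative app name, and the expected output reflects the fixed behavior described in the commit message:

from pyspark import SparkContext

# Minimal sketch, not from the commit: sorting an empty pair RDD.
# Before this fix, sortByKey() could fail while sampling the (empty) RDD to
# choose range-partition boundaries; with the fix it returns the RDD as-is.
sc = SparkContext("local[2]", "empty-sort-check")        # illustrative master/app name
empty_pairs = sc.parallelize(zip([], []))                # an empty (key, value) RDD
print(empty_pairs.sortByKey(numPartitions=2).collect())  # expected: []
sc.stop()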

2 files changed (+5, -0 lines)

python/pyspark/rdd.py

Lines changed: 2 additions & 0 deletions
@@ -496,6 +496,8 @@ def sortByKey(self, ascending=True, numPartitions=None, keyfunc = lambda x: x):
         # number of (key, value) pairs falling into them
         if numPartitions > 1:
             rddSize = self.count()
+            if not rddSize:
+                return self
             maxSampleSize = numPartitions * 20.0  # constant from Spark's RangePartitioner
             fraction = min(maxSampleSize / max(rddSize, 1), 1.0)
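
Why the early return helps: a simplified, standalone illustration (an assumption based on the diff context above, not the patched PySpark code). With zero elements, the sample drawn to pick range-partition boundaries is empty, so there is nothing to index into when selecting numPartitions - 1 boundary keys:

# Simplified sketch (assumption, not the commit's code): selecting range-partition
# boundaries from an empty sample fails, which is the kind of error the
# "if not rddSize: return self" guard short-circuits before sampling.
numPartitions = 2
samples = []  # what sampling an empty RDD would yield
try:
    bounds = [samples[len(samples) * (i + 1) // numPartitions]
              for i in range(numPartitions - 1)]
except IndexError:
    bounds = None  # empty sample: no boundaries can be chosen
print(bounds)  # prints: None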

python/pyspark/tests.py

Lines changed: 3 additions & 0 deletions
@@ -198,6 +198,9 @@ def test_deleting_input_files(self):
         os.unlink(tempFile.name)
         self.assertRaises(Exception, lambda: filtered_data.count())
 
+    def test_sort_on_empty_rdd(self):
+        self.assertEqual([], self.sc.parallelize(zip([], [])).sortByKey().collect())
+
     def test_itemgetter(self):
         rdd = self.sc.parallelize([range(10)])
         from operator import itemgetter
