Commit 201c301

Author: Erik Selin

Make partitionBy use a tweaked version of hash as its default partition function,
since the Python hash function does not consistently assign the same value to None across Python processes.

Parent: d666053

File tree

1 file changed: +4 −1 lines changed


python/pyspark/rdd.py

Lines changed: 4 additions & 1 deletion
@@ -913,7 +913,7 @@ def rightOuterJoin(self, other, numPartitions=None):
         return python_right_outer_join(self, other, numPartitions)
 
     # TODO: add option to control map-side combining
-    def partitionBy(self, numPartitions, partitionFunc=hash):
+    def partitionBy(self, numPartitions, partitionFunc=None):
         """
         Return a copy of the RDD partitioned using the specified partitioner.
@@ -924,6 +924,9 @@ def partitionBy(self, numPartitions, partitionFunc=hash):
         """
         if numPartitions is None:
             numPartitions = self.ctx.defaultParallelism
+
+        if partitionFunc is None:
+            partitionFunc = lambda x: 0 if x is None else hash(x)
         # Transferring O(n) objects to Java is too expensive. Instead, we'll
         # form the hash buckets in Python, transferring O(numPartitions) objects
         # to Java. Each object is a (splitNumber, [objects]) pair.
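A minimal standalone sketch of the behavior this commit relies on: the new default partition function pins None to a fixed hash of 0, so a None key lands in the same partition no matter which Python worker process computes it. The `assign_partition` helper and the bucket count are hypothetical, added here only to illustrate how `partitionBy` maps a key to a bucket (hash, then modulo).

```python
# Tweaked default partition function from the commit: map None to a fixed
# value instead of calling hash(None), whose result may differ between
# separate Python worker processes.
partition_func = lambda x: 0 if x is None else hash(x)

def assign_partition(key, num_partitions):
    # Hypothetical helper: bucket a key the way partitionBy does,
    # by hashing and taking the result modulo the partition count.
    return partition_func(key) % num_partitions

# None deterministically lands in partition 0, in every process.
print(assign_partition(None, 8))  # -> 0
```

Without the tweak, two workers hashing the same None key could disagree on its partition, scattering records that should be co-located (e.g. during a join) across different buckets.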
