Commit 9d2565f

add comments and link for hash collision correction
1 parent d306492 commit 9d2565f

1 file changed: +4 -2 lines changed

python/pyspark/rdd.py

Lines changed: 4 additions & 2 deletions
@@ -2027,8 +2027,10 @@ def countApproxDistinct(self, relativeSD=0.05):
         c = hashRDD._to_java_object_rdd().countApproxDistinct(relativeSD)
         # range of hash is [0, sys.maxint]
         if c > sys.maxint / 30:
-            # correction for hash collision in Python
-            c = -sys.maxint * log(1 - float(c) / sys.maxint)
+            # correction for hash collision in Python,
+            # hash collision probability is 1 - exp(-X), so X = - log(1 - p)
+            # see http://preshing.com/20110504/hash-collision-probabilities/
+            c = - sys.maxint * log(1 - float(c) / sys.maxint)
         return int(c)
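
The correction in this hunk can be checked in isolation. If n distinct items are hashed uniformly into a range of size N, the expected number of distinct hash values is roughly N * (1 - exp(-n / N)); solving for n gives n = -N * log(1 - c / N), where c is the observed count. The following is a minimal sketch under those assumptions, not the PySpark implementation: `correct_for_hash_collisions` and `simulate_distinct_hashes` are hypothetical helper names, a small `hash_range` stands in for `sys.maxint`, and `math.log1p` replaces the `log(1 - x)` in the diff for better accuracy at small ratios.

```python
import math
import random


def correct_for_hash_collisions(observed, hash_range):
    """Estimate the number of distinct inputs from the number of distinct hashes.

    Hypothetical helper: inverts c = N * (1 - exp(-n / N)) to get
    n = -N * log(1 - c / N), assuming hashes are uniform over hash_range.
    """
    if not 0 <= observed < hash_range:
        raise ValueError("observed must be in [0, hash_range)")
    # log1p(-x) computes log(1 - x) accurately even when x is tiny.
    return -hash_range * math.log1p(-observed / hash_range)


def simulate_distinct_hashes(n_items, hash_range, seed=0):
    """Hash n_items distinct inputs by drawing uniform values; count distinct results."""
    rng = random.Random(seed)
    return len({rng.randrange(hash_range) for _ in range(n_items)})


if __name__ == "__main__":
    hash_range = 10_000  # stand-in for sys.maxint, kept small so collisions are common
    n_items = 5_000
    observed = simulate_distinct_hashes(n_items, hash_range)
    corrected = correct_for_hash_collisions(observed, hash_range)
    print("true:", n_items, "observed:", observed, "corrected:", round(corrected))
```

With these inputs the observed count falls well below the true count, and the corrected value should land close to it. At the `sys.maxint / 30` threshold used in the diff, the same formula raises the estimate by about 1.7%, which is presumably why smaller counts are left uncorrected.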