Commit e2c901b
[SPARK-2871] [PySpark] add countApproxDistinct() API
RDD.countApproxDistinct(relativeSD=0.05):
:: Experimental ::
Return approximate number of distinct elements in the RDD.
The algorithm used is based on streamlib's implementation of
"HyperLogLog in Practice: Algorithmic Engineering of a State
of The Art Cardinality Estimation Algorithm", available
<a href="http://dx.doi.org/10.1145/2452376.2452456">here</a>.
This support all the types of objects, which is supported by
Pyrolite, nearly all builtin types.
param relativeSD Relative accuracy. Smaller values create
counters that require more space.
It must be greater than 0.000017.
>>> n = sc.parallelize(range(1000)).map(str).countApproxDistinct()
>>> 950 < n < 1050
True
>>> n = sc.parallelize([i % 20 for i in range(1000)]).countApproxDistinct()
>>> 18 < n < 22
True
Author: Davies Liu <[email protected]>
Closes apache#2142 from davies/countApproxDistinct and squashes the following commits:
e20da47 [Davies Liu] remove the correction in Python
c38c4e4 [Davies Liu] fix doc tests
2ab157c [Davies Liu] fix doc tests
9d2565f [Davies Liu] add commments and link for hash collision correction
d306492 [Davies Liu] change range of hash of tuple to [0, maxint]
ded624f [Davies Liu] calculate hash in Python
4cba98f [Davies Liu] add more tests
a85a8c6 [Davies Liu] Merge branch 'master' into countApproxDistinct
e97e342 [Davies Liu] add countApproxDistinct()1 parent 81b9d5b commit e2c901b
3 files changed
+51
-6
lines changed| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
993 | 993 | | |
994 | 994 | | |
995 | 995 | | |
996 | | - | |
| 996 | + | |
997 | 997 | | |
998 | 998 | | |
999 | 999 | | |
| |||
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
62 | 62 | | |
63 | 63 | | |
64 | 64 | | |
65 | | - | |
| 65 | + | |
66 | 66 | | |
67 | 67 | | |
68 | 68 | | |
| |||
72 | 72 | | |
73 | 73 | | |
74 | 74 | | |
75 | | - | |
| 75 | + | |
76 | 76 | | |
77 | 77 | | |
78 | 78 | | |
| |||
1942 | 1942 | | |
1943 | 1943 | | |
1944 | 1944 | | |
1945 | | - | |
| 1945 | + | |
1946 | 1946 | | |
1947 | 1947 | | |
1948 | 1948 | | |
| |||
1977 | 1977 | | |
1978 | 1978 | | |
1979 | 1979 | | |
1980 | | - | |
| 1980 | + | |
1981 | 1981 | | |
1982 | 1982 | | |
1983 | 1983 | | |
| |||
1993 | 1993 | | |
1994 | 1994 | | |
1995 | 1995 | | |
1996 | | - | |
| 1996 | + | |
1997 | 1997 | | |
1998 | 1998 | | |
1999 | 1999 | | |
2000 | 2000 | | |
| 2001 | + | |
| 2002 | + | |
| 2003 | + | |
| 2004 | + | |
| 2005 | + | |
| 2006 | + | |
| 2007 | + | |
| 2008 | + | |
| 2009 | + | |
| 2010 | + | |
| 2011 | + | |
| 2012 | + | |
| 2013 | + | |
| 2014 | + | |
| 2015 | + | |
| 2016 | + | |
| 2017 | + | |
| 2018 | + | |
| 2019 | + | |
| 2020 | + | |
| 2021 | + | |
| 2022 | + | |
| 2023 | + | |
| 2024 | + | |
| 2025 | + | |
| 2026 | + | |
| 2027 | + | |
| 2028 | + | |
| 2029 | + | |
2001 | 2030 | | |
2002 | 2031 | | |
2003 | 2032 | | |
| |||
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
404 | 404 | | |
405 | 405 | | |
406 | 406 | | |
| 407 | + | |
| 408 | + | |
| 409 | + | |
| 410 | + | |
| 411 | + | |
| 412 | + | |
| 413 | + | |
| 414 | + | |
| 415 | + | |
| 416 | + | |
| 417 | + | |
| 418 | + | |
| 419 | + | |
| 420 | + | |
| 421 | + | |
| 422 | + | |
407 | 423 | | |
408 | 424 | | |
409 | 425 | | |
| |||
0 commit comments