[SPARK-12935][SQL] DataFrame API for Count-Min Sketch #10911
liancheng wants to merge 8 commits into apache:master from
Weird, I didn't make these empty comment line changes. Reverting them.
Test build #50055 has finished for PR 10911 at commit
It'd be good to refactor this so we don't need to assign the variables. One way is to move the serialization/deserialization code out of readFrom into a separate function.
Test build #50061 has finished for PR 10911 at commit
cc @JoshRosen are the Python tests broken?
how about colType == StringType || colType.isInstanceOf[IntegralType]?
Actually after thinking about it - let's avoid doing that and list the explicit types. It is plausible in the future we introduce an int96 or int128 data type, and I bet we won't remember this is one place we need to update it.
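One way to read the concern above, as a minimal sketch: a check based on the type hierarchy silently accepts any integral type added later, while an explicit list rejects it until someone opts it in. The `ColType` enum, its `INT128` member, and both helper methods are hypothetical stand-ins written for this illustration; they are not Spark SQL classes.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical stand-in for a column type check; not Spark code.
final class ColumnTypeCheck {
    enum ColType {
        STRING(false), BYTE(true), SHORT(true), INT(true), LONG(true),
        INT128(true),   // imagined future integral type
        DOUBLE(false);

        final boolean integral;

        ColType(boolean integral) {
            this.integral = integral;
        }
    }

    // Explicit list: a new type is rejected until it is added here on purpose.
    static final Set<ColType> SUPPORTED =
        EnumSet.of(ColType.STRING, ColType.BYTE, ColType.SHORT, ColType.INT, ColType.LONG);

    static boolean supportedExplicit(ColType t) {
        return SUPPORTED.contains(t);
    }

    // Hierarchy-style check: every integral type passes, including future ones.
    static boolean supportedByHierarchy(ColType t) {
        return t == ColType.STRING || t.integral;
    }
}
```

With these definitions, `supportedByHierarchy(ColType.INT128)` is true while `supportedExplicit(ColType.INT128)` is false, which is the difference the comment is pointing at.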
This comment has been moved to CountMinSketch.Version as @rxin suggested in #10920 (comment)
Test build #50117 has finished for PR 10911 at commit
Josh is looking into the PySpark test failure.
  <version>1.5.6</version>
  <type>jar</type>
</dependency>
<dependency>
use scala.binary.version?
Actually this is always hard coded as _2.10 to make publishing easier.
@rxin told me this. I'm not quite sure about the details though :)
    return sketch;
  }

  private void readFrom0(InputStream in) throws IOException {
this name is quite weird...
this is actually a common naming style in java - to have the private version named xxx0
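For readers unfamiliar with the convention, here is a minimal sketch of the pattern being described: a public static `readFrom` creates the instance and delegates to a private `readFrom0` that fills in the fields. The `Counter` class and its one-field wire format are hypothetical and only illustrate the naming style; they are not the sketch implementation from this PR.

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Hypothetical class used only to show the readFrom / readFrom0 split.
final class Counter {
    private long total;

    void add(long n) {
        total += n;
    }

    long total() {
        return total;
    }

    // Writes the single field in big-endian form.
    void writeTo(OutputStream out) throws IOException {
        DataOutputStream dos = new DataOutputStream(out);
        dos.writeLong(total);
        dos.flush();
    }

    // Public entry point: builds an empty instance, then delegates.
    static Counter readFrom(InputStream in) throws IOException {
        Counter c = new Counter();
        c.readFrom0(in);
        return c;
    }

    // Private worker; the "0" suffix marks it as the internal variant.
    private void readFrom0(InputStream in) throws IOException {
        DataInputStream dis = new DataInputStream(in);
        this.total = dis.readLong();
    }
}
```

The split keeps the public method a thin wrapper, so the instance-level deserialization logic lives in one private place.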
I just realized that this is now in a Javadoc block. Should reformat this using HTML tags. Same thing applies to the bloom filter format description.
Test build #50126 has finished for PR 10911 at commit
Test build #50146 has finished for PR 10911 at commit
I'm going to merge this. Thanks.
This PR integrates Count-Min Sketch from spark-sketch into DataFrame. This version resorts to
RDD.aggregate for building the sketch. A more performant UDAF version can be built in future follow-up PRs.
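
To make the aggregate-based approach concrete, here is a rough, Spark-free sketch of the same shape as `aggregate(zeroValue)(seqOp, combOp)`: each partition is folded into its own sketch, and the per-partition sketches are then merged. `MiniCountMinSketch`, its hash mixing, and the `aggregate` helper are all assumptions made for this illustration; this is a simplified stand-in, not the spark-sketch implementation or the code in this PR.

```java
import java.util.List;

// Simplified Count-Min Sketch for illustration only.
final class MiniCountMinSketch {
    private final int depth;
    private final int width;
    private final long[][] table;

    MiniCountMinSketch(int depth, int width) {
        this.depth = depth;
        this.width = width;
        this.table = new long[depth][width];
    }

    // Row-specific bucket index; the mixing constants are an arbitrary choice.
    private int bucket(long item, int row) {
        long h = (item + 1) * 0x9E3779B97F4A7C15L + row * 0xC2B2AE3D27D4EB4FL;
        h ^= (h >>> 32);
        return (int) Math.floorMod(h, (long) width);
    }

    void add(long item) {
        for (int row = 0; row < depth; row++) {
            table[row][bucket(item, row)]++;
        }
    }

    // The minimum over rows; it can overestimate but never underestimates.
    long estimateCount(long item) {
        long min = Long.MAX_VALUE;
        for (int row = 0; row < depth; row++) {
            min = Math.min(min, table[row][bucket(item, row)]);
        }
        return min;
    }

    // Cell-wise addition; both sketches must have the same shape.
    MiniCountMinSketch mergeInPlace(MiniCountMinSketch other) {
        if (other.depth != depth || other.width != width) {
            throw new IllegalArgumentException("incompatible sketch shapes");
        }
        for (int row = 0; row < depth; row++) {
            for (int col = 0; col < width; col++) {
                table[row][col] += other.table[row][col];
            }
        }
        return this;
    }

    // Stand-in for the aggregate call: one local sketch per partition (seqOp),
    // merged into a single result (combOp).
    static MiniCountMinSketch aggregate(List<long[]> partitions, int depth, int width) {
        MiniCountMinSketch result = new MiniCountMinSketch(depth, width);
        for (long[] partition : partitions) {
            MiniCountMinSketch local = new MiniCountMinSketch(depth, width);
            for (long item : partition) {
                local.add(item);
            }
            result.mergeInPlace(local);
        }
        return result;
    }
}
```

Because merging is plain cell-wise addition, the merged result matches a sketch built over all items in a single pass, which is what makes the structure a natural fit for a per-partition fold followed by a combine step.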