[SPARK-6761][SQL] Approximate quantile for DataFrame #6042
viirya wants to merge 20 commits into apache:master
Test build #32337 has finished for PR 6042 at commit
|
|
Thanks - I was thinking about having this as a UDAF after we have the new UDAF interface merged. Let's revisit this after 1.4 is stable enough. |
|
Test build #37769 timed out for PR 6042 at commit |
|
@viirya it would be great to have this as an aggregate function. Can you look into the feasibility of that? |
|
@rxin ok. I will update this later. |
Conflicts: sql/core/src/main/scala/org/apache/spark/sql/execution/stat/StatFunctions.scala
|
ping @thunterdb |
|
retest this please. |
|
test this please |
|
@viirya sorry I missed your email, I will look at your PR today. |
|
Test build #51182 has finished for PR 6042 at commit
|
|
Test build #51185 has finished for PR 6042 at commit
|
|
Test build #51187 has finished for PR 6042 at commit
|
|
retest this please. |
|
Test build #51538 has finished for PR 6042 at commit
|
[SPARK-6761][ML][SQL] Approximate quantiles - take 2
|
@thunterdb Your pull request looks good. Thanks. I've merged it. I will take another look later. |
|
Test build #51624 has finished for PR 6042 at commit
|
|
Test build #51628 has finished for PR 6042 at commit
|
|
Test build #51632 has finished for PR 6042 at commit
|
|
@viirya I had an offline discussion with @thunterdb today. I'm going to add some comments inline and then merge this PR. @thunterdb will send another PR to address my comment because he already knew the context. After that, could you add approximate quantiles to |
/**
 * Calculate the approximate quantile of numerical column of a DataFrame.
 * @param col the name of the column
 * @param quantile the quantile number
epsilon is not documented. It might be better to call it relerr or relativeError because epsilon doesn't carry any information.
|
@mengxr ok. thank you. |
} else {
  // We rely on the fact that they are ordered to efficiently interleave them.
  val thisSampled = sampled.toList
  val otherSampled = other.sampled.toList
I wonder how much speedup we can get by merging the two lists manually compared to (thisSampled ++ otherSampled).sorted. Did you run some tests?
I agree that the current implementation is too complicated, and that probably just merging/sorting the two arrays directly is more efficient for the size considered.
When running some performance testing, the cost of the algorithm was dominated by the cost of accessing the content of Rows. Only 4% of the running time was spent on insertion+merging, so this cost was negligible at this point.
I am going to do as you suggest. If it happens to be a bottleneck when we use UDAFs later, directly manipulating ArrayBuffers would be more efficient than pattern-matching on lists anyway. Rerunning the synthetic benchmark with the suggested changes did not yield runtime changes.
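For readers following the exchange above, here is a minimal sketch of the two options being compared: a manual linear-time merge of two sorted sequences, and the simpler concatenate-then-sort. It is not the code from this PR. The object and method names (`MergeSketch`, `mergeSorted`, `concatSorted`) are hypothetical, and plain `Double` values stand in for the summary entries the real implementation stores.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical illustration only; plain Doubles stand in for the sampled summary entries.
object MergeSketch {
  // Manual merge: walks both inputs once, assuming each is already sorted ascending.
  def mergeSorted(a: IndexedSeq[Double], b: IndexedSeq[Double]): IndexedSeq[Double] = {
    val out = new ArrayBuffer[Double](a.length + b.length)
    var i = 0
    var j = 0
    while (i < a.length && j < b.length) {
      if (a(i) <= b(j)) { out += a(i); i += 1 }
      else { out += b(j); j += 1 }
    }
    // Copy whatever remains in the input that was not exhausted.
    while (i < a.length) { out += a(i); i += 1 }
    while (j < b.length) { out += b(j); j += 1 }
    out.toIndexedSeq
  }

  // Simpler alternative from the review: concatenate, then sort.
  def concatSorted(a: IndexedSeq[Double], b: IndexedSeq[Double]): IndexedSeq[Double] =
    (a ++ b).sorted
}
```

Both functions return the same result for sorted inputs. The manual merge is O(n + m) and the sort is O((n + m) log(n + m)), which is consistent with the observation above that the difference is negligible for small summaries.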
|
Merged into master. @thunterdb Please send another PR to address my comments. Thanks! |
|
@mengxr thanks for the review, will do in another PR. |
JIRA: https://issues.apache.org/jira/browse/SPARK-6761
Compute approximate quantile based on the paper Greenwald, Michael and Khanna, Sanjeev, "Space-efficient Online Computation of Quantile Summaries," SIGMOD '01.
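As a rough illustration of what "approximate" means here, the guarantee for this kind of summary is usually stated in terms of ranks: for N elements, a query for quantile p with relative error eps should return a value whose rank lies between floor((p - eps) * N) and ceil((p + eps) * N). The sketch below checks that condition against the full data set. It is a brute-force checker, not the algorithm from the paper, and the names `RankErrorSketch` and `withinRankError` are hypothetical.

```scala
// Hypothetical checker for the rank-error bound; it scans the data and is not the streaming algorithm.
object RankErrorSketch {
  // Returns true if `candidate` occurs in `data` with a 1-based rank inside
  // [floor((p - eps) * N), ceil((p + eps) * N)].
  def withinRankError(data: Seq[Double], candidate: Double, p: Double, eps: Double): Boolean = {
    val n = data.length
    val lo = math.floor((p - eps) * n).toLong
    val hi = math.ceil((p + eps) * n).toLong
    // With duplicates, the candidate occupies a range of ranks [minRank, maxRank].
    val minRank = data.count(_ < candidate) + 1
    val maxRank = data.count(_ <= candidate)
    // maxRank < minRank means the candidate does not occur in the data at all.
    maxRank >= minRank && minRank <= hi && maxRank >= lo
  }
}
```

For example, with the eight values 1.0 through 8.0, p = 0.5 and eps = 0.25, the allowed rank band is 2 to 6, so 4.0 is an acceptable answer and 7.0 is not.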