[SPARK-6761][SQL] Approximate quantile for DataFrame by viirya · Pull Request #6042 · apache/spark

viirya · 2015-05-10T11:22:55Z

JIRA: https://issues.apache.org/jira/browse/SPARK-6761

Compute approximate quantile based on the paper Greenwald, Michael and Khanna, Sanjeev, "Space-efficient Online Computation of Quantile Summaries," SIGMOD '01.

SparkQA · 2015-05-10T13:23:52Z

Test build #32337 has finished for PR 6042 at commit 1086537.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

rxin · 2015-05-11T21:58:08Z

Thanks - I was thinking about having this as a UDAF after we have the new UDAF interface merged. Let's revisit this after 1.4 is stable enough.

SparkQA · 2015-07-19T10:56:36Z

Test build #37769 timed out for PR 6042 at commit 1086537 after a configured wait of 150m.

rxin · 2015-08-21T19:59:37Z

@viirya it would be great to have this as an aggregate function. Can you look into the feasibility of that?

viirya · 2015-08-22T01:10:04Z

@rxin ok. I will update this later.

viirya · 2015-08-26T13:37:03Z

@rxin I opened a new PR at #8459 for the aggregation function.


viirya · 2016-02-09T09:25:40Z

ping @thunterdb

viirya · 2016-02-09T11:51:08Z

retest this please.

mengxr · 2016-02-09T23:12:55Z

test this please

thunterdb · 2016-02-11T16:59:21Z

@viirya sorry I missed your email, I will look at your PR today.

SparkQA · 2016-02-12T10:37:08Z

Test build #51182 has finished for PR 6042 at commit 437aaea.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-02-12T12:57:02Z

Test build #51185 has finished for PR 6042 at commit ac4bc97.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-02-12T14:42:05Z

Test build #51187 has finished for PR 6042 at commit d320fd2.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

viirya · 2016-02-19T07:48:55Z

retest this please.

SparkQA · 2016-02-19T09:18:11Z

Test build #51538 has finished for PR 6042 at commit daaa196.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

[SPARK-6761][ML][SQL] Approximate quantiles - take 2

viirya · 2016-02-21T10:17:43Z

@thunterdb Your pull request looks good. Thanks. I've merged it. I will take another look later.

SparkQA · 2016-02-21T10:22:49Z

Test build #51624 has finished for PR 6042 at commit 9314f1e.

This patch fails RAT tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class QuantileSummaries(
- case class Stats(value: Double, g: Int, delta: Int)

SparkQA · 2016-02-21T13:17:08Z

Test build #51628 has finished for PR 6042 at commit 47cde05.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-02-21T15:14:54Z

Test build #51632 has finished for PR 6042 at commit a36891b.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

mengxr · 2016-02-23T07:21:16Z

@viirya I had an offline discussion with @thunterdb today. I'm going to add some comments inline and then merge this PR. @thunterdb will send another PR to address my comments because he already knows the context. After that, could you add approximate quantiles to describe()? Thanks!

mengxr · 2016-02-23T07:21:26Z

sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala

+  /**
+   * Calculate the approximate quantile of numerical column of a DataFrame.
+   * @param col the name of the column
+   * @param quantile the quantile number


epsilon is not documented. It might be better to call it relerr or relativeError because epsilon doesn't carry any information.
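
To make the meaning of this parameter concrete: in the Greenwald-Khanna setting, a relative error e on N elements means that the rank of the returned element may differ from the target rank p * N by roughly e * N. The sketch below checks that condition by brute force. `RelativeErrorSketch` and `withinError` are hypothetical names, the floor/ceil rounding of the bounds is an assumption, and none of this is the PR's implementation.

```scala
object RelativeErrorSketch {
  // Returns true when `candidate` is an acceptable answer for quantile `p`
  // of `data` under the given relative error (brute-force check).
  // Assumption: ranks are 1-based and the bounds are rounded outward.
  def withinError(
      data: Seq[Double],
      p: Double,
      relativeError: Double,
      candidate: Double): Boolean = {
    val n = data.length
    // With ties, a value occupies a range of ranks [lowRank, highRank].
    val lowRank = data.count(_ < candidate) + 1
    val highRank = data.count(_ <= candidate)
    val lowerBound = math.floor((p - relativeError) * n)
    val upperBound = math.ceil((p + relativeError) * n)
    // highRank < lowRank means the candidate does not occur in the data.
    highRank >= lowRank && highRank >= lowerBound && lowRank <= upperBound
  }
}
```

For example, with the values 1 to 100, p = 0.5 and a relative error of 0.25, any element whose rank lies between 25 and 75 is accepted by this check.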

viirya · 2016-02-23T07:24:45Z

@mengxr ok. thank you.

mengxr · 2016-02-23T07:28:23Z

sql/core/src/main/scala/org/apache/spark/sql/execution/stat/StatFunctions.scala

+      } else {
+        // We rely on the fact that they are ordered to efficiently interleave them.
+        val thisSampled = sampled.toList
+        val otherSampled = other.sampled.toList


I wonder how much speedup we can get by merging the two lists manually compared to (thisSampled ++ otherSampled).sorted. Did you run some tests?

thunterdb · 2016-02-23

I agree that the current implementation is too complicated, and that simply merging and sorting the two arrays directly is probably more efficient for the sizes considered.

In my performance testing, the cost of the algorithm was dominated by accessing the contents of Rows. Only 4% of the running time was spent on insertion and merging, so that cost is negligible at this point.

I am going to do as you suggest. If merging turns out to be a bottleneck when we use UDAFs later, directly manipulating ArrayBuffers would be more efficient than pattern matching on lists anyway. Rerunning the synthetic benchmark with the suggested change did not alter the runtime.
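
For reference, the two merging strategies compared in this thread can be sketched as follows. This is a minimal illustration, not the code from the PR: `Stats` is redeclared locally with the fields listed in the SparkQA report above, `MergeSketch` and its method names are hypothetical, and only the ordering step is shown (the rank bookkeeping that a real summary merge needs is omitted).

```scala
import scala.collection.mutable.ArrayBuffer

// Local stand-in with the same fields as the Stats class reported by SparkQA.
final case class Stats(value: Double, g: Int, delta: Int)

object MergeSketch {
  // Simple approach: concatenate, then sort by value.
  // sortBy is stable, so on ties the entries of `a` stay ahead of those of `b`.
  def mergeBySorting(a: Seq[Stats], b: Seq[Stats]): Seq[Stats] =
    (a ++ b).sortBy(_.value)

  // Manual approach: interleave two inputs that are ALREADY sorted by value.
  def mergeManually(a: IndexedSeq[Stats], b: IndexedSeq[Stats]): Seq[Stats] = {
    val out = ArrayBuffer.empty[Stats]
    var i = 0
    var j = 0
    while (i < a.length && j < b.length) {
      // `<=` keeps entries from `a` first on ties, matching the stable sort.
      if (a(i).value <= b(j).value) { out += a(i); i += 1 }
      else { out += b(j); j += 1 }
    }
    // Append whatever remains of either input.
    while (i < a.length) { out += a(i); i += 1 }
    while (j < b.length) { out += b(j); j += 1 }
    out.toSeq
  }
}
```

The manual merge is linear in the combined length while the sort-based version is O(n log n), but as noted above the difference did not show up in the benchmark at these sizes.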

mengxr · 2016-02-23T07:31:33Z

Merged into master. @thunterdb Please send another PR to address my comments. Thanks!

thunterdb · 2016-02-23T16:30:14Z

@mengxr thanks for the review, will do in another PR.

Add support for calculating approximate quantile.

1086537

viirya mentioned this pull request May 11, 2015

[SPARK-5832][Mllib] Add Affinity Propagation clustering algorithm #4622

Closed

viirya mentioned this pull request Aug 26, 2015

[SPARK-6761][SQL] Aggregation function for approximate quantile #8459

Closed

viirya closed this Aug 26, 2015

viirya reopened this Feb 9, 2016

Merge remote-tracking branch 'upstream/master' into approximate_quantile

437aaea

Conflicts: sql/core/src/main/scala/org/apache/spark/sql/execution/stat/StatFunctions.scala

thunterdb added 3 commits February 11, 2016 09:19

adding more tests

e1e6d94

adding tests

1ce0bfc

started checking merging

21e81ef

Fix scala style.

ac4bc97

Fix scala style.

d320fd2

thunterdb added 6 commits February 17, 2016 14:51

branched off to work on a simpler batch merging code

253f488

insert tests

e48badd

Merge remote-tracking branch 'upstream/master' into spark-6761b

699808a

tentative batch algorithm

2cba6c1

finally batch sampling is working

cbb1bb5

cleanups

773b20f

thunterdb mentioned this pull request Feb 18, 2016

[SPARK-6761][ML][SQL] Approximate quantiles viirya/spark-1#3

Closed

Merge remote-tracking branch 'upstream/master' into approximate_quantile

daaa196

thunterdb added 2 commits February 19, 2016 09:55

merge with branch

5176054

fix import order

d607fda

thunterdb mentioned this pull request Feb 19, 2016

[SPARK-6761][ML][SQL] Approximate quantiles - take 2 viirya/spark-1#4

Merged

Merge pull request #4 from thunterdb/spark-6761b

9314f1e

[SPARK-6761][ML][SQL] Approximate quantiles - take 2

Add Apache license headers.

47cde05

Fix scala style.

a36891b

mengxr reviewed Feb 23, 2016

asfgit closed this in 4fd1993 Feb 23, 2016

thunterdb mentioned this pull request Feb 23, 2016

[SPARK-6761][SQL][ML] Fixes to API and documentation of approximate quantiles #11325

Closed

viirya deleted the approximate_quantile branch December 27, 2023 18:33
