Skip to content

[SPARK-6761][SQL] Approximate quantile for DataFrame#6042

Closed
viirya wants to merge 20 commits intoapache:masterfrom
viirya:approximate_quantile
Closed

[SPARK-6761][SQL] Approximate quantile for DataFrame#6042
viirya wants to merge 20 commits intoapache:masterfrom
viirya:approximate_quantile

Conversation

@viirya
Copy link
Member

@viirya viirya commented May 10, 2015

JIRA: https://issues.apache.org/jira/browse/SPARK-6761

Compute approximate quantile based on the paper Greenwald, Michael and Khanna, Sanjeev, "Space-efficient Online Computation of Quantile Summaries," SIGMOD '01.

@SparkQA
Copy link

SparkQA commented May 10, 2015

Test build #32337 has finished for PR 6042 at commit 1086537.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin
Copy link
Contributor

rxin commented May 11, 2015

Thanks - I was thinking about having this as a UDAF after we have the new UDAF interface merged. Let's revisit this after 1.4 is stable enough.

@SparkQA
Copy link

SparkQA commented Jul 19, 2015

Test build #37769 timed out for PR 6042 at commit 1086537 after a configured wait of 150m.

@rxin
Copy link
Contributor

rxin commented Aug 21, 2015

@viirya it would be great to have this as an aggregate function. Can you look into the feasibility of that?

@viirya
Copy link
Member Author

viirya commented Aug 22, 2015

@rxin ok. I will update this later.

@viirya
Copy link
Member Author

viirya commented Aug 26, 2015

@rxin I opened a new PR at #8459 for the aggregation function.

@viirya viirya closed this Aug 26, 2015
@viirya viirya reopened this Feb 9, 2016
Conflicts:
	sql/core/src/main/scala/org/apache/spark/sql/execution/stat/StatFunctions.scala
@viirya
Copy link
Member Author

viirya commented Feb 9, 2016

ping @thunterdb

@viirya
Copy link
Member Author

viirya commented Feb 9, 2016

retest this please.

@mengxr
Copy link
Contributor

mengxr commented Feb 9, 2016

test this please

@thunterdb
Copy link
Contributor

@viirya sorry I missed your email, I will look at your PR today.

@SparkQA
Copy link

SparkQA commented Feb 12, 2016

Test build #51182 has finished for PR 6042 at commit 437aaea.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Feb 12, 2016

Test build #51185 has finished for PR 6042 at commit ac4bc97.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Feb 12, 2016

Test build #51187 has finished for PR 6042 at commit d320fd2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Copy link
Member Author

viirya commented Feb 19, 2016

retest this please.

@SparkQA
Copy link

SparkQA commented Feb 19, 2016

Test build #51538 has finished for PR 6042 at commit daaa196.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

[SPARK-6761][ML][SQL] Approximate quantiles - take 2
@viirya
Copy link
Member Author

viirya commented Feb 21, 2016

@thunterdb Your pull request looks good. Thanks. I've merged it. I will take another look later.

@SparkQA
Copy link

SparkQA commented Feb 21, 2016

Test build #51624 has finished for PR 6042 at commit 9314f1e.

  • This patch fails RAT tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class QuantileSummaries(
    • case class Stats(value: Double, g: Int, delta: Int)

@SparkQA
Copy link

SparkQA commented Feb 21, 2016

Test build #51628 has finished for PR 6042 at commit 47cde05.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Feb 21, 2016

Test build #51632 has finished for PR 6042 at commit a36891b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mengxr
Copy link
Contributor

mengxr commented Feb 23, 2016

@viirya I had an offline discussion with @thunterdb today. I'm going to add some comments inline and then merge this PR. @thunterdb will send another PR to address my comment because he already knew the context. After that, could you add approximate quantiles to describe()? Thanks!

/**
* Calculate the approximate quantile of numerical column of a DataFrame.
* @param col the name of the column
* @param quantile the quantile number
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

epsilon is not documented. It might be better to call it relerr or relativeError because epsilon doesn't carry any information.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@viirya
Copy link
Member Author

viirya commented Feb 23, 2016

@mengxr ok. thank you.

} else {
// We rely on the fact that they are ordered to efficiently interleave them.
val thisSampled = sampled.toList
val otherSampled = other.sampled.toList
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder how much speedup we can get by merging the two lists manually compared to (thisSampled ++ otherSampled).sorted. Did you run some tests?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that the current implementation is too complicated, and that probably just merging/sorting the two arrays directly is more efficient for the size considered.

When running some performance testing, the cost of the algorithm was dominated by the cost of accessing the content of Rows. Only 4% of the running time was spent on insertion+merging, so this cost was negligible at this point.

I am going to do as you suggest. If it happens to be a bottleneck when we use UDAFs later, directly manipulating ArrayBuffers would be more efficient than pattern-matching on lists anyway. Rerunning the synthetic benchmark with the suggested changes did not yield runtime changes.

@mengxr
Copy link
Contributor

mengxr commented Feb 23, 2016

Merged into master. @thunterdb Please send another PR to address my comments. Thanks!

@asfgit asfgit closed this in 4fd1993 Feb 23, 2016
@thunterdb
Copy link
Contributor

@mengxr thanks for the review, will do in another PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants