[SPARK-35558] Optimizes for multi-quantile retrieval #32700

alkispoly-db · 2021-05-29T00:37:09Z

What changes were proposed in this pull request?

Optimizes the retrieval of approximate quantiles for an array of percentiles.

Adds an overload for QuantileSummaries.query that accepts an array of percentiles and optimizes the computation to do a single pass over the sketch and avoid redundant computation.
Modifies the ApproximatePercentiles operator to call into the new method.

All formatting changes are the result of running ./dev/scalafmt.

Why are the changes needed?

The existing implementation makes a separate call for each input percentile, which results in redundant computation.
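
A minimal sketch of the single-pass idea, for illustration only; it is not the code in this PR. It assumes a simplified summary made of hypothetical `Sample(value, weight)` entries sorted by value, and it uses exact cumulative ranks instead of the error bounds that the real QuantileSummaries tracks. The name `MultiQuantileSketch.query` is also made up for this example. The point it tries to show is that sorting the requested percentiles lets the scan position and running rank carry over from one percentile to the next, so the summary is walked once.

```scala
// Hypothetical, simplified stand-in for a quantile summary: samples are
// sorted by value and each one records how many input rows it represents.
final case class Sample(value: Double, weight: Long)

object MultiQuantileSketch {
  // Answers every percentile in one forward scan over `sampled`.
  // Results come back in the caller's original percentile order.
  def query(sampled: IndexedSeq[Sample], percentiles: Seq[Double]): Option[Seq[Double]] = {
    percentiles.foreach { p =>
      require(p >= 0.0 && p <= 1.0, "percentile must be in the range [0.0, 1.0]")
    }
    if (sampled.isEmpty) {
      None
    } else {
      val total = sampled.map(_.weight).sum
      val result = Array.fill(percentiles.length)(0.0)

      // Scan state shared by all percentiles: current sample and its rank.
      var index = 0
      var cumulative = sampled(0).weight

      // Visit percentiles in ascending order, remembering where each one
      // belongs in the output, so the scan never has to move backwards.
      percentiles.zipWithIndex.sortBy(_._1).foreach { case (p, pos) =>
        val targetRank = math.max(1L, math.ceil(p * total).toLong)
        while (cumulative < targetRank && index < sampled.length - 1) {
          index += 1
          cumulative += sampled(index).weight
        }
        result(pos) = sampled(index).value
      }
      Some(result.toSeq)
    }
  }
}
```

With four unit-weight samples 10.0, 20.0, 30.0, 40.0 and percentiles `Seq(0.75, 0.0, 1.0, 0.5)`, the call returns 30.0, 10.0, 40.0, 20.0 in that order. A per-percentile version would restart the scan from the first sample on every call, which is the repeated work described above.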

Does this PR introduce any user-facing change?

No

How was this patch tested?

Added unit tests for the new method.

srowen · 2021-05-29T14:09:52Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/QuantileSummaries.scala

+          result(pos) = sampled.last.value
+        } else {
+          val (newIndex, newMinRank, approxQuantile) =
+            findApproxQuantile(index, minRank, targetError, percentile)


I don't need a benchmark or anything, but is this much faster if it calls this method repeatedly? I think it saves some common computation, from what I can see

alkispoly-db · 2021-06-01

If by "this method" you mean QuantileSummaries.query, then there is evidence from profiles that it becomes a bottleneck as the percentile list grows, and the redundant computation seems to be the root cause.

srowen · 2021-05-29T14:09:59Z

Jenkins test this please

SparkQA · 2021-05-29T15:25:16Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43602/

SparkQA · 2021-05-29T15:57:58Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43602/

SparkQA · 2021-05-29T23:05:08Z

Test build #139081 has finished for PR 32700 at commit 8d33ba9.

This patch fails from timeout after a configured wait of 500m.
This patch merges cleanly.
This patch adds no public classes.

srowen · 2021-05-31T14:00:27Z

Jenkins retest this please

SparkQA · 2021-05-31T14:48:52Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43638/

SparkQA · 2021-05-31T15:22:06Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43638/

SparkQA · 2021-05-31T22:23:53Z

Test build #139118 has finished for PR 32700 at commit 8d33ba9.

This patch fails from timeout after a configured wait of 500m.
This patch merges cleanly.
This patch adds no public classes.

srowen · 2021-06-01T01:01:56Z

Not sure if it's definitely related, but it looks like this results in tests that hang forever:
[info] *** Test still running after 16 minutes, 2 seconds: suite name: AdaptiveQueryExecSuite, test name: SPARK-33933: Materialize BroadcastQueryStage first in AQE.

Not 100% sure how it's connected, but, doesn't seem to be happening on other PRs?

srowen · 2021-06-01T13:27:05Z

Could be related to #32725

alkispoly-db · 2021-06-01T23:30:22Z

> Could be related to #32725

I can wait until that PR is closed and retest.

srowen · 2021-06-03T13:06:30Z

Jenkins retest this please

SparkQA · 2021-06-03T14:00:14Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43817/

SparkQA · 2021-06-03T14:34:08Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43817/

SparkQA · 2021-06-03T17:09:17Z

Test build #139293 has finished for PR 32700 at commit cb7df06.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

srowen · 2021-06-05T19:25:47Z

Merged to master.

alkispoly-db added 2 commits May 28, 2021 16:37

Optimizes for multi-quantile retrieval

385d442

Fixes compilation errors

93403ea

github-actions bot added the SQL label May 29, 2021

alkispoly-db added 2 commits May 28, 2021 18:01

Undid some formatting changes

96ca632

Fixed tests for Null values

8d33ba9

srowen reviewed May 29, 2021

Use toSeq to convert from ArrayBuffer to Seq

cb7df06

alkispoly-db requested a review from srowen June 2, 2021 21:17

srowen closed this in 6f8c620 Jun 5, 2021
