Skip to content

Conversation

@alkispoly-db
Copy link
Contributor

What changes were proposed in this pull request?

Optimizes the retrieval of approximate quantiles for an array of percentiles.

  • Adds an overload for QuantileSummaries.query that accepts an array of percentiles and optimizes the computation to do a single pass over the sketch and avoid redundant computation.
  • Modifies the ApproximatePercentiles operator to call into the new method.

All formatting changes are the result of running ./dev/scalafmt

Why are the changes needed?

The existing implementation does repeated calls per input percentile resulting in redundant computation.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Added unit tests for the new method.

@github-actions github-actions bot added the SQL label May 29, 2021
result(pos) = sampled.last.value
} else {
val (newIndex, newMinRank, approxQuantile) =
findApproxQuantile(index, minRank, targetError, percentile)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't need a benchmark or anything, but is this much faster if it calls this method repeatedly? I think it saves some common computation, from what I can see

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If by this method you mean QuantileSummaries.query then there is evidence from profiles that this method becomes a bottleneck as the percentile list grows, and in particular the redundant computation seems to be the root cause.

@srowen
Copy link
Member

srowen commented May 29, 2021

Jenkins test this please

@SparkQA
Copy link

SparkQA commented May 29, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43602/

@SparkQA
Copy link

SparkQA commented May 29, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43602/

@SparkQA
Copy link

SparkQA commented May 29, 2021

Test build #139081 has finished for PR 32700 at commit 8d33ba9.

  • This patch fails from timeout after a configured wait of 500m.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen
Copy link
Member

srowen commented May 31, 2021

Jenkins retest this please

@SparkQA
Copy link

SparkQA commented May 31, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43638/

@SparkQA
Copy link

SparkQA commented May 31, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43638/

@SparkQA
Copy link

SparkQA commented May 31, 2021

Test build #139118 has finished for PR 32700 at commit 8d33ba9.

  • This patch fails from timeout after a configured wait of 500m.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen
Copy link
Member

srowen commented Jun 1, 2021

Not sure if it's definitely related, but it looks like this results in tests that hang forever:
[info] *** Test still running after 16 minutes, 2 seconds: suite name: AdaptiveQueryExecSuite, test name: SPARK-33933: Materialize BroadcastQueryStage first in AQE.

Not 100% sure how it's connected, but, doesn't seem to be happening on other PRs?

@srowen
Copy link
Member

srowen commented Jun 1, 2021

Could be related to #32725

@alkispoly-db
Copy link
Contributor Author

Could be related to #32725

I can wait until that PR is closed and retest.

@alkispoly-db alkispoly-db requested a review from srowen June 2, 2021 21:17
@srowen
Copy link
Member

srowen commented Jun 3, 2021

Jenkins retest this please

@SparkQA
Copy link

SparkQA commented Jun 3, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43817/

@SparkQA
Copy link

SparkQA commented Jun 3, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43817/

@SparkQA
Copy link

SparkQA commented Jun 3, 2021

Test build #139293 has finished for PR 32700 at commit cb7df06.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen srowen closed this in 6f8c620 Jun 5, 2021
@srowen
Copy link
Member

srowen commented Jun 5, 2021

Merged to master.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants