[SPARK-37165][SQL] Add REPEATABLE in TABLESAMPLE to specify seed #34442

huaxingao · 2021-10-29T16:03:45Z

What changes were proposed in this pull request?

Add a REPEATABLE clause to the TABLESAMPLE SQL syntax so that users can specify a seed.

Why are the changes needed?

Current syntax for TABLESAMPLE:

TABLESAMPLE(x PERCENT)
TABLESAMPLE(BUCKET x OUT OF y)

Dataset.sample has a parameter for specifying the seed, so SQL should have a way to specify a seed too.

  def sample(fraction: Double, seed: Long): Dataset[T] = {
    sample(withReplacement = false, fraction = fraction, seed = seed)
  }

Most DBMSs, e.g. DB2, use REPEATABLE to let the user specify a seed; we follow the same approach.

Does this PR introduce any user-facing change?

Yes, this adds new SQL syntax:

TABLESAMPLE(x PERCENT) [REPEATABLE (seed)]
TABLESAMPLE(BUCKET x OUT OF y) [REPEATABLE (seed)]
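For illustration, here is a minimal sketch of how the new clause might be used from Scala. It assumes a Spark build that includes this change and the spark-sql dependency on the classpath. The view name `sample_demo`, the object name, and the seed values are hypothetical and chosen only for the example. It checks that running the same query twice with the same seed returns the same rows.

```scala
import org.apache.spark.sql.SparkSession

object TableSampleRepeatableExample {
  def main(args: Array[String]): Unit = {
    // Assumption: a local session on a Spark build that contains this change.
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("tablesample-repeatable-example")
      .getOrCreate()

    try {
      // Hypothetical demo view holding the values 0 to 20 in column c.
      spark.range(0, 21).toDF("c").createOrReplaceTempView("sample_demo")

      // Runs a query and returns the sampled values in sorted order.
      def run(query: String): Seq[Long] =
        spark.sql(query).collect().map(_.getLong(0)).sorted.toSeq

      // Percent-based sampling with an explicit seed.
      val percentQuery =
        "SELECT c FROM sample_demo TABLESAMPLE (20 PERCENT) REPEATABLE (12345)"
      assert(run(percentQuery) == run(percentQuery),
        "same seed should give the same percent-based sample")

      // Bucket-based sampling with an explicit seed.
      val bucketQuery =
        "SELECT c FROM sample_demo TABLESAMPLE (BUCKET 4 OUT OF 10) REPEATABLE (6789)"
      assert(run(bucketQuery) == run(bucketQuery),
        "same seed should give the same bucket-based sample")
    } finally {
      spark.stop()
    }
  }
}
```

The sketch compares only repeated runs of an identical query over identical data. It makes no claim about which rows a given seed selects.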

How was this patch tested?

New unit tests.

SparkQA · 2021-10-29T17:04:00Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49230/

SparkQA · 2021-10-29T17:48:36Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49230/

huaxingao · 2021-10-29T20:40:30Z

cc @viirya

SparkQA · 2021-10-29T21:10:49Z

Test build #144761 has finished for PR 34442 at commit 5709a89.

This patch fails SparkR unit tests.
This patch merges cleanly.
This patch adds no public classes.

huaxingao · 2021-10-29T21:15:17Z

retest this please

SparkQA · 2021-10-29T22:00:18Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49238/

SparkQA · 2021-10-29T22:59:53Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49238/

viirya · 2021-10-30T00:29:37Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala

   * are defined as a number between 0 and 100.
   * - TABLESAMPLE(BUCKET x OUT OF y): Sample the table down to a 'x' divided by 'y' fraction.
   */
  private def withSample(ctx: SampleContext, query: LogicalPlan): LogicalPlan = withOrigin(ctx) {


We should also update the method doc.

viirya · 2021-10-30T00:31:30Z

sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala

+  test("TABLE SAMPLE") {
+    withTable("test") {
+      sql("CREATE TABLE test(c int) USING PARQUET")
+      for( i <- 0 to 20) {


nit: for (i <- 0 to 20)

viirya

Looks okay. Just a few minor comments.

SparkQA · 2021-10-30T02:11:21Z

Test build #144769 has finished for PR 34442 at commit 5709a89.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2021-10-30T04:20:43Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49243/

SparkQA · 2021-10-30T04:59:34Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49243/

SparkQA · 2021-10-30T09:07:49Z

Test build #144775 has finished for PR 34442 at commit 01a7e0f.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

huaxingao · 2021-10-30T16:29:08Z

Merged to master. Thanks a lot for reviewing! @viirya

huaxingao · 2021-11-01T01:54:35Z

FYI @cloud-fan

cloud-fan · 2021-11-01T04:31:29Z

late LGTM

RossKen · 2023-04-06T13:26:45Z

Was this merged in the end? I can only see the PR as closed (not merged) and I can't see the functionality in the spark SQL docs or the codebase - but aware I may be missing something!

huaxingao · 2023-04-06T15:33:27Z

@RossKen This PR was merged.

huaxingao added 2 commits October 29, 2021 08:26

Add REPEATABLE in TABLESAMPLE to specify seed

0270b2a

add more test

5709a89

github-actions bot added DOCS SQL labels Oct 29, 2021

huaxingao mentioned this pull request Oct 29, 2021

[SPARK-37038][SQL][WIP] DSV2 Sample Push Down #34311

Closed

viirya reviewed Oct 30, 2021


address comments

01a7e0f

viirya approved these changes Oct 30, 2021


huaxingao closed this in b0548c6 Oct 30, 2021

huaxingao deleted the sample_syntax branch October 30, 2021 16:29

5 participants
