Skip to content

Conversation

@huaxingao
Copy link
Contributor

@huaxingao huaxingao commented Oct 29, 2021

What changes were proposed in this pull request?

Add REPEATABLE in SQL syntax TABLESAMPLE so user can specify seed.

Why are the changes needed?

Current syntax for TABLESAMPLE:

  • TABLESAMPLE(x PERCENT)
  • TABLESAMPLE(BUCKET x OUT OF y)

Dataset.sample has a param to specify seed, so we should allow SQL has a way to specify seed too.

  def sample(fraction: Double, seed: Long): Dataset[T] = {
    sample(withReplacement = false, fraction = fraction, seed = seed)
  }

Most of the DBMS uses REPEATABLE to let user specify seed, e.g. DB2, we will follow the same way.

Screen Shot 2021-10-29 at 8 46 04 AM

Does this PR introduce any user-facing change?

Yes
new SQL syntax

  • TABLESAMPLE(x PERCENT) [REPEATABLE (seed)]
  • TABLESAMPLE(BUCKET x OUT OF y) [REPEATABLE (seed)]

How was this patch tested?

new UT

@SparkQA
Copy link

SparkQA commented Oct 29, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49230/

@SparkQA
Copy link

SparkQA commented Oct 29, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49230/

@huaxingao
Copy link
Contributor Author

cc @viirya

@SparkQA
Copy link

SparkQA commented Oct 29, 2021

Test build #144761 has finished for PR 34442 at commit 5709a89.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@huaxingao
Copy link
Contributor Author

retest this please

@SparkQA
Copy link

SparkQA commented Oct 29, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49238/

@SparkQA
Copy link

SparkQA commented Oct 29, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49238/

* are defined as a number between 0 and 100.
* - TABLESAMPLE(BUCKET x OUT OF y): Sample the table down to a 'x' divided by 'y' fraction.
*/
private def withSample(ctx: SampleContext, query: LogicalPlan): LogicalPlan = withOrigin(ctx) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should also update method doc.

test("TABLE SAMPLE") {
withTable("test") {
sql("CREATE TABLE test(c int) USING PARQUET")
for( i <- 0 to 20) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: for (i <- 0 to 20)

Copy link
Member

@viirya viirya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks okay. Just a few minor comments.

@SparkQA
Copy link

SparkQA commented Oct 30, 2021

Test build #144769 has finished for PR 34442 at commit 5709a89.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Oct 30, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49243/

@SparkQA
Copy link

SparkQA commented Oct 30, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49243/

@SparkQA
Copy link

SparkQA commented Oct 30, 2021

Test build #144775 has finished for PR 34442 at commit 01a7e0f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@huaxingao huaxingao closed this in b0548c6 Oct 30, 2021
@huaxingao
Copy link
Contributor Author

Merged to mater. Thanks a lot for reviewing! @viirya

@huaxingao huaxingao deleted the sample_syntax branch October 30, 2021 16:29
@huaxingao
Copy link
Contributor Author

FYI @cloud-fan

@cloud-fan
Copy link
Contributor

late LGTM

@RossKen
Copy link

RossKen commented Apr 6, 2023

Was this merged in the end? I can only see the PR as closed (not merged) and I can't see the functionality in the spark SQL docs or the codebase - but aware I may be missing something!

@huaxingao
Copy link
Contributor Author

@RossKen This PR was merged.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants