@liancheng
Contributor

@liancheng liancheng commented Apr 27, 2016

What changes were proposed in this pull request?

Currently, we can only create persisted partitioned and/or bucketed data source tables using the Dataset API but not using SQL DDL. This PR implements the following syntax to add partitioning and bucketing support to the SQL DDL:

CREATE TABLE <table-name>
USING <provider> [OPTIONS (<key1> <value1>, <key2> <value2>, ...)]
[PARTITIONED BY (col1, col2, ...)]
[CLUSTERED BY (col1, col2, ...) [SORTED BY (col1, col2, ...)] INTO <n> BUCKETS]
AS SELECT ...
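
As an illustration only (not part of the patch), a concrete statement in this syntax might look like the sketch below. The table names `events` and `events_bucketed`, the column names, the bucket count, and the use of `SparkSession` with a local master are all assumptions made for the example; whether a persisted bucketed table can be created without Hive support depends on the Spark version.

```scala
import org.apache.spark.sql.SparkSession

object CtasPartitionedBucketedExample {
  def main(args: Array[String]): Unit = {
    // Hypothetical local session; the names below are illustrative only.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("ctas-partitioned-bucketed")
      .getOrCreate()
    import spark.implicits._

    // Small in-memory source registered as a temporary view for the SELECT.
    Seq((1L, 100L, 2016), (2L, 200L, 2016), (1L, 300L, 2015))
      .toDF("user_id", "ts", "event_year")
      .createOrReplaceTempView("events")

    // CTAS with partitioning and bucketing, following the grammar above.
    spark.sql(
      """CREATE TABLE events_bucketed
        |USING parquet
        |PARTITIONED BY (event_year)
        |CLUSTERED BY (user_id) SORTED BY (ts) INTO 8 BUCKETS
        |AS SELECT user_id, ts, event_year FROM events
        |""".stripMargin)

    spark.stop()
  }
}
```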

How was this patch tested?

Test cases are added in MetastoreDataSourcesSuite to check the newly added syntax.

@liancheng liancheng changed the title: Add PARTITION BY and BUCKET BY clause for "CREATE TABLE ... USING ..." syntax -> Add PARTITION BY and BUCKET BY clause for data source CTAS syntax, Apr 27, 2016
@SparkQA

SparkQA commented Apr 27, 2016

Test build #57127 has finished for PR 12734 at commit af973d6.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Apr 27, 2016

Test build #57129 has finished for PR 12734 at commit a193faf.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yhuai
Contributor

yhuai commented Apr 27, 2016

For DataFrameWriter, can we do sortBy without using bucketBy?

UPDATE: DataFrameWriter's sortBy does require bucketBy
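
A minimal sketch of the Dataset API behaviour described in this comment, for comparison with the SQL clauses. The DataFrame contents and the table names `sorted_only` and `bucketed_sorted` are hypothetical, and the exact exception type raised for `sortBy` without `bucketBy` is deliberately not assumed, since it may differ between Spark versions.

```scala
import org.apache.spark.sql.SparkSession
import scala.util.Try

object SortByRequiresBucketByExample {
  def main(args: Array[String]): Unit = {
    // Hypothetical local session and data, for illustration only.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("sortby-requires-bucketby")
      .getOrCreate()
    import spark.implicits._

    val df = Seq((1L, 100L), (2L, 200L)).toDF("user_id", "ts")

    // sortBy on its own: per the comment above, this is expected to be rejected.
    // Try captures whatever exception the running Spark version throws.
    val sortOnly = Try(df.write.sortBy("ts").saveAsTable("sorted_only"))
    println(s"sortBy without bucketBy succeeded: ${sortOnly.isSuccess}")

    // sortBy combined with bucketBy: the Dataset API counterpart of
    // CLUSTERED BY (user_id) SORTED BY (ts) INTO 4 BUCKETS.
    df.write
      .bucketBy(4, "user_id")
      .sortBy("ts")
      .saveAsTable("bucketed_sorted")

    spark.stop()
  }
}
```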

      table, provider, temp, partitionColumnNames, bucketSpec, mode, options, query)
    } else {
-     val struct = Option(ctx.colTypeList).map(createStructType)
+     val struct = Option(ctx.colTypeList()).map(createStructType)
Contributor

@yhuai yhuai Apr 27, 2016

If the command is not a CTAS statement, it seems we should throw an exception if users specify any of the PARTITIONED BY, SORTED BY, or CLUSTERED BY clauses?

Contributor

One thing that is not very related to this PR: I always find the keyword CLUSTERED BY very confusing, because there is also a CLUSTER BY keyword (which is DISTRIBUTE BY + SORT BY). But we do not need to change it right now.

Contributor

I am going to add the check for this else branch and add some tests.

Contributor

Oh, sorry. PARTITIONED BY and CLUSTERED BY are both associated with the CREATE TABLE USING AS SELECT rule. So, for CREATE TABLE USING, if PARTITIONED BY or CLUSTERED BY is provided, we already throw an exception.

@jodersky
Member

Does this PR fix a ticket? If so, it would be useful to change the title to include the [SPARK-] prefix so that the JIRA status gets updated.

@yhuai
Contributor

yhuai commented Apr 27, 2016

Yea. https://issues.apache.org/jira/browse/SPARK-14954 is the jira.

@jodersky
Member

Could you change the title to [SPARK-14954] (current title)?

@yhuai
Contributor

yhuai commented Apr 27, 2016

oh, I cannot change it. @liancheng will change the title after he gets up :)

@yhuai
Contributor

yhuai commented Apr 27, 2016

@liancheng The last commit adds a new test.

@yhuai
Contributor

yhuai commented Apr 27, 2016

Changes look good to me.

@SparkQA

SparkQA commented Apr 27, 2016

Test build #57161 has finished for PR 12734 at commit 442265e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yhuai
Contributor

yhuai commented Apr 27, 2016

I fixed the title while merging to master.

@asfgit asfgit closed this in 24bea00 Apr 27, 2016
@liancheng liancheng changed the title: Add PARTITION BY and BUCKET BY clause for data source CTAS syntax -> [SPARK-14346][SQL] Add PARTITIONED BY and BUCKETED BY clause for data source CTAS syntax, Apr 28, 2016
@liancheng liancheng changed the title: [SPARK-14346][SQL] Add PARTITIONED BY and BUCKETED BY clause for data source CTAS syntax -> [SPARK-14346][SQL] Add PARTITIONED BY and CLUSTERED BY clause for data source CTAS syntax, Apr 28, 2016
@liancheng liancheng deleted the spark-14954 branch April 28, 2016 05:00
@liancheng
Contributor Author

@jodersky Oh sorry, I pasted the JIRA ticket summary into the PR title but forgot to add the tags. Updated!

@liancheng liancheng changed the title: [SPARK-14346][SQL] Add PARTITIONED BY and CLUSTERED BY clause for data source CTAS syntax -> [SPARK-14954][SQL] Add PARTITIONED BY and CLUSTERED BY clause for data source CTAS syntax, Apr 28, 2016