[SPARK-14954][SQL] Add PARTITIONED BY and CLUSTERED BY clause for data source CTAS syntax #12734

liancheng · 2016-04-27T13:34:25Z

What changes were proposed in this pull request?

Currently, we can only create persisted partitioned and/or bucketed data source tables using the Dataset API but not using SQL DDL. This PR implements the following syntax to add partitioning and bucketing support to the SQL DDL:

CREATE TABLE <table-name>
USING <provider> [OPTIONS (<key1> <value1>, <key2> <value2>, ...)]
[PARTITIONED BY (col1, col2, ...)]
[CLUSTERED BY (col1, col2, ...) [SORTED BY (col1, col2, ...)] INTO <n> BUCKETS]
AS SELECT ...
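As a rough illustration of how the clauses above fit together, here is a minimal sketch that assembles such a statement as a string. The helper `build_ctas`, its parameter names, and the single-quoting of option values are assumptions made for this example; none of them are part of Spark or of this PR. The only rules it encodes are the ones visible in the syntax: SORTED BY may appear only inside CLUSTERED BY, and CLUSTERED BY needs a bucket count.

```python
from typing import Dict, Optional, Sequence


def build_ctas(
    table: str,
    provider: str,
    query: str,
    options: Optional[Dict[str, str]] = None,
    partitioned_by: Sequence[str] = (),
    clustered_by: Sequence[str] = (),
    sorted_by: Sequence[str] = (),
    num_buckets: Optional[int] = None,
) -> str:
    """Hypothetical helper (not a Spark API): build a data source CTAS statement."""
    # SORTED BY is nested inside the CLUSTERED BY clause in the proposed syntax.
    if sorted_by and not clustered_by:
        raise ValueError("SORTED BY is only valid inside a CLUSTERED BY clause")
    # CLUSTERED BY always ends with INTO <n> BUCKETS.
    if clustered_by and (num_buckets is None or num_buckets <= 0):
        raise ValueError("CLUSTERED BY requires a positive bucket count")

    lines = [f"CREATE TABLE {table}"]

    using = f"USING {provider}"
    if options:
        # Assumption: option values are rendered as single-quoted literals,
        # with no escaping of embedded quotes.
        pairs = ", ".join(f"{key} '{value}'" for key, value in options.items())
        using += f" OPTIONS ({pairs})"
    lines.append(using)

    if partitioned_by:
        lines.append(f"PARTITIONED BY ({', '.join(partitioned_by)})")

    if clustered_by:
        clause = f"CLUSTERED BY ({', '.join(clustered_by)})"
        if sorted_by:
            clause += f" SORTED BY ({', '.join(sorted_by)})"
        clause += f" INTO {num_buckets} BUCKETS"
        lines.append(clause)

    lines.append(f"AS {query}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Table and column names here are made up for the example.
    print(build_ctas(
        "events_by_day", "parquet", "SELECT * FROM events",
        partitioned_by=["day"], clustered_by=["user_id"],
        sorted_by=["ts"], num_buckets=8,
    ))
```

The resulting string could then be handed to whatever entry point runs SQL in a given Spark session; the sketch itself does not touch Spark.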

How was this patch tested?

Test cases are added in MetastoreDataSourcesSuite to check the newly added syntax.


SparkQA · 2016-04-27T14:46:25Z

Test build #57127 has finished for PR 12734 at commit af973d6.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-04-27T16:00:51Z

Test build #57129 has finished for PR 12734 at commit a193faf.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

yhuai · 2016-04-27T17:36:02Z

For DataFrameWriter, can we do sortBy without using bucketBy?

UPDATE: DataFrameWriter's sortBy does require bucketBy

yhuai · 2016-04-27T17:38:24Z

sql/core/src/main/scala/org/apache/spark/sql/execution/SparkSqlParser.scala

+        table, provider, temp, partitionColumnNames, bucketSpec, mode, options, query)
    } else {
-      val struct = Option(ctx.colTypeList).map(createStructType)
+      val struct = Option(ctx.colTypeList()).map(createStructType)


If the command is not a CTAS statement, it seems we should throw an exception if users specify any of the PARTITIONED BY, SORTED BY, or CLUSTERED BY clauses?

One thing that is not very related to this PR: I always find the keyword CLUSTERED BY very confusing, because there is also a CLUSTER BY keyword (which is DISTRIBUTE BY + SORT BY). But we do not need to change it right now.

I am going to add the check for this else branch and add some tests.

Oh, sorry. PARTITIONED BY and CLUSTERED BY are both associated with the CREATE TABLE USING AS SELECT rule. So, for CREATE TABLE USING, if PARTITIONED BY or CLUSTERED BY is provided, we already throw an exception.

jodersky · 2016-04-27T18:40:35Z

Does this pr fix a ticket? In that case it would be useful to change the title to include the [SPARK-] prefix so that the JIRA status gets updated

yhuai · 2016-04-27T18:43:36Z

Yea. https://issues.apache.org/jira/browse/SPARK-14954 is the jira.

jodersky · 2016-04-27T18:47:02Z

Could you change the title to [SPARK-14954] (current title)?

yhuai · 2016-04-27T18:50:31Z

oh, I cannot change it. @liancheng will change the title after he gets up :)

yhuai · 2016-04-27T19:17:16Z

@liancheng The last commit adds a new test.

yhuai · 2016-04-27T19:17:29Z

Changes look good to me.

SparkQA · 2016-04-27T20:52:20Z

Test build #57161 has finished for PR 12734 at commit 442265e.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

yhuai · 2016-04-27T20:56:56Z

I fixed the title while merging to master.

liancheng · 2016-04-28T05:01:19Z

@jodersky Oh sorry, pasted the JIRA ticket summary to the PR title but forgot to add the tags. Updated!

liancheng added 2 commits April 27, 2016 21:06

Add PARTITION BY and BUCKET BY clause for "CREATE TABLE ... USING ..." syntax

f51300c

Moves test case to MetastoreDataSourcesSuite

af973d6

Also checks for metastore table properties

liancheng changed the title ~~Add PARTITION BY and BUCKET BY clause for "CREATE TABLE ... USING ..." syntax~~ Add PARTITION BY and BUCKET BY clause for data source CTAS syntax Apr 27, 2016

Reverts unnecessary parser changes

a193faf

liancheng force-pushed the spark-14954 branch from f6851bd to a193faf Compare April 27, 2016 14:19

yhuai reviewed Apr 27, 2016

Merge remote-tracking branch 'upstream/master' into spark-14954

3df4d47

Update tests

442265e

asfgit closed this in 24bea00 Apr 27, 2016

xwu0226 mentioned this pull request Apr 28, 2016

[SPARK-14346][SQL] Show Create Table (Native) #12579

Closed

liancheng changed the title ~~Add PARTITION BY and BUCKET BY clause for data source CTAS syntax~~ [SPARK-14346][SQL] Add PARTITIONED BY and BUCKETED BY clause for data source CTAS syntax Apr 28, 2016

liancheng changed the title ~~[SPARK-14346][SQL] Add PARTITIONED BY and BUCKETED BY clause for data source CTAS syntax~~ [SPARK-14346][SQL] Add PARTITIONED BY and CLUSTERED BY clause for data source CTAS syntax Apr 28, 2016

liancheng deleted the spark-14954 branch April 28, 2016 05:00

liancheng changed the title ~~[SPARK-14346][SQL] Add PARTITIONED BY and CLUSTERED BY clause for data source CTAS syntax~~ [SPARK-14954][SQL] Add PARTITIONED BY and CLUSTERED BY clause for data source CTAS syntax Apr 28, 2016
