[SPARK-14954][SQL] Add PARTITIONED BY and CLUSTERED BY clause for data source CTAS syntax #12734
Also checks for metastore table properties
Test build #57127 has finished for PR 12734 at commit
Test build #57129 has finished for PR 12734 at commit
For UPDATE: DataFrameWriter's sortBy does require bucketBy
      table, provider, temp, partitionColumnNames, bucketSpec, mode, options, query)
  } else {
    val struct = Option(ctx.colTypeList).map(createStructType)
    val struct = Option(ctx.colTypeList()).map(createStructType)
If the command is not a CTAS statement, it seems we should throw an exception if users specify any of the PARTITIONED BY, SORTED BY, or CLUSTERED BY clauses?
One thing that is not very related to this pr: I always find the keyword CLUSTERED BY very confusing, because there is also a CLUSTER BY keyword (which is DISTRIBUTE BY + SORT BY). But we do not need to change it right now.
I am going to add the check for this else branch and add some tests.
oh, sorry. PARTITIONED BY and CLUSTERED BY are both associated with the CREATE TABLE USING AS SELECT rule. So, for CREATE TABLE USING, if PARTITIONED BY or CLUSTERED BY is provided, we already throw an exception.
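To make the rule discussed in this thread concrete, here is a minimal standalone sketch in plain Scala. It is not the code in this PR: as noted above, Spark rejects these clauses on non-CTAS statements through the grammar rule itself. `ClauseCheckSketch`, `Clauses`, and `validate` are hypothetical names introduced only for illustration, and the error messages are made up. The sketch also models the note above that sorting requires bucketing.

```scala
// Hypothetical model of the clause checks discussed in this thread.
// None of these names exist in Spark; this is an illustration only.
object ClauseCheckSketch {
  final case class Clauses(
      partitionedBy: Seq[String] = Nil,
      clusteredBy: Seq[String] = Nil,
      sortedBy: Seq[String] = Nil,
      numBuckets: Option[Int] = None)

  // Returns Left(message) when the clause combination is rejected,
  // Right(clauses) when it is accepted.
  def validate(isCtas: Boolean, c: Clauses): Either[String, Clauses] = {
    val hasLayoutClause =
      c.partitionedBy.nonEmpty || c.clusteredBy.nonEmpty || c.sortedBy.nonEmpty
    if (!isCtas && hasLayoutClause) {
      // Layout clauses are only meaningful together with AS SELECT.
      Left("PARTITIONED BY, CLUSTERED BY and SORTED BY require AS SELECT")
    } else if (c.sortedBy.nonEmpty && c.clusteredBy.isEmpty) {
      // Mirrors the DataFrameWriter behaviour: sortBy needs bucketBy.
      Left("SORTED BY requires CLUSTERED BY")
    } else if (c.clusteredBy.nonEmpty && c.numBuckets.forall(_ <= 0)) {
      // Bucketing needs a positive bucket count (missing counts as invalid).
      Left("CLUSTERED BY requires a positive number of buckets")
    } else {
      Right(c)
    }
  }
}
```

For example, `ClauseCheckSketch.validate(isCtas = false, ClauseCheckSketch.Clauses(partitionedBy = Seq("a")))` is rejected, while the same clauses with `isCtas = true` are accepted.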
Does this pr fix a ticket? In that case it would be useful to change the title to include the [SPARK-] prefix so that the JIRA status gets updated
Yea. https://issues.apache.org/jira/browse/SPARK-14954 is the jira.
Could you change the title to
oh, I cannot change it. @liancheng will change the title after he gets up :)
@liancheng The last commit adds a new test.
Changes look good to me.
Test build #57161 has finished for PR 12734 at commit
I fixed the title while merging to master.
@jodersky Oh sorry, pasted the JIRA ticket summary to the PR title but forgot to add the tags. Updated!
What changes were proposed in this pull request?
Currently, we can only create persisted partitioned and/or bucketed data source tables using the Dataset API but not using SQL DDL. This PR implements the following syntax to add partitioning and bucketing support to the SQL DDL:
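As a rough illustration of the shape such a statement takes, the sketch below renders a data source CTAS statement from its parts. It assumes the clause order PARTITIONED BY, then CLUSTERED BY ... SORTED BY ... INTO n BUCKETS, all placed before AS SELECT. `CtasSketch`, `BucketSketch`, and `renderCtas` are hypothetical helpers written for this example and are not part of Spark; the table, provider, and column names are placeholders.

```scala
// Hypothetical helper, not part of Spark: builds the text of a
// data source CTAS statement with optional partitioning and bucketing.
object CtasSketch {
  final case class BucketSketch(
      numBuckets: Int,
      bucketCols: Seq[String],
      sortCols: Seq[String] = Nil)

  def renderCtas(
      table: String,
      provider: String,
      partitionCols: Seq[String],
      bucket: Option[BucketSketch],
      query: String): String = {
    // PARTITIONED BY is emitted only when partition columns are given.
    val partitionClause: Seq[String] =
      if (partitionCols.isEmpty) Nil
      else Seq(s"PARTITIONED BY (${partitionCols.mkString(", ")})")

    // SORTED BY is nested inside the bucketing clause, so it can
    // never appear without CLUSTERED BY.
    val bucketClause: Seq[String] = bucket.toSeq.map { b =>
      val sorted =
        if (b.sortCols.isEmpty) ""
        else s" SORTED BY (${b.sortCols.mkString(", ")})"
      s"CLUSTERED BY (${b.bucketCols.mkString(", ")})$sorted INTO ${b.numBuckets} BUCKETS"
    }

    (Seq(s"CREATE TABLE $table", s"USING $provider") ++
      partitionClause ++ bucketClause ++ Seq(s"AS $query")).mkString("\n")
  }
}
```

With table `t`, provider `parquet`, partition column `a`, and eight buckets on `b` sorted by `c`, this yields a statement that corresponds roughly to `df.write.partitionBy("a").bucketBy(8, "b").sortBy("c").saveAsTable("t")` in the Dataset API.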
How was this patch tested?
Test cases are added in MetastoreDataSourcesSuite to check the newly added syntax.