Conversation

@xwu0226 (Contributor) commented Apr 21, 2016

This is a rebased version of #12132 and #12406

What changes were proposed in this pull request?

Allow users to issue the "SHOW CREATE TABLE" command natively in Spark SQL.
-- For tables created by Hive, this command displays the DDL in Hive syntax. If the DDL includes a CLUSTERED BY, SKEWED BY, or STORED BY clause, a warning message states that this DDL is not yet supported by Spark SQL's native DDL.

-- For tables created by data source DDL, such as "CREATE TABLE ... USING ... OPTIONS (...)", the command shows the DDL in that same syntax.

-- For tables created by the DataFrame API, such as "df.write.partitionBy(...).saveAsTable(...)", the command currently displays DDL in the "CREATE TABLE ... USING ... OPTIONS (...)" syntax. However, this syntax loses the partitioning information. It is proposed to display the DDL in the DataFrame API format instead, such as <DataFrame>.write.partitionBy("a").bucketBy("c").format("parquet").saveAsTable("T1")
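
A minimal sketch of the intended behavior (the table names and the Spark 2.x shell entry point are assumptions, not from this PR):

```scala
// Hypothetical tables, for illustration only (Spark 2.x shell).
// Hive-format table: SHOW CREATE TABLE echoes Hive DDL syntax.
spark.sql("CREATE TABLE hive_t1 (c1 INT) STORED AS TEXTFILE")
spark.sql("SHOW CREATE TABLE hive_t1").show(truncate = false)

// Data source table: SHOW CREATE TABLE echoes CREATE TABLE ... USING ... syntax.
spark.sql("CREATE TABLE ds_t1 (c1 INT) USING parquet")
spark.sql("SHOW CREATE TABLE ds_t1").show(truncate = false)
```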

How was this patch tested?

Unit tests are created.

@xwu0226 xwu0226 force-pushed the show_create_table_3 branch from ca44d67 to bd0d8f5 Compare April 21, 2016 21:34
@xwu0226 (Contributor, Author) commented Apr 21, 2016

@yhuai @andrewor14 Thanks!

@liancheng (Contributor)

test this please

SparkQA commented Apr 25, 2016

Test build #56899 has finished for PR 12579 at commit 13e9775.

  • This patch fails Spark unit tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@xwu0226 (Contributor, Author) commented Apr 25, 2016

@liancheng Thanks for triggering the test! I am looking into the test failure.

@xwu0226 xwu0226 force-pushed the show_create_table_3 branch from 13e9775 to 1b08feb Compare April 25, 2016 22:25
@gatorsmile (Member)

retest this please

@xwu0226 xwu0226 force-pushed the show_create_table_3 branch from 1b08feb to 9e39b5c Compare April 27, 2016 06:43
@xwu0226 (Contributor, Author) commented Apr 28, 2016

@yhuai @liancheng, I see PR #12734 takes care of the PARTITIONED BY and CLUSTERED BY (with SORTED BY) clauses for the CTAS syntax, but not for the non-CTAS syntax. Now I need to change my PR to adapt to this change, which means the generated DDL will be something like create table t1 (c1 int, ...) using .. options (..) partitioned by (..) clustered by (...) sorted by (...) in ... buckets (a hypothetical example is sketched after this comment). There may not be a "select clause" following it, since we do not have the original query. But such a generated statement will not run, because #12734 does not support it. Can we add a fake select clause with a warning message?

Also, the DataFrameWriter.saveAsTable case is like CTAS. Can we then generate the DDL as regular CTAS syntax? This would change my current implementation in this PR.
Please advise, thanks a lot!
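
For illustration only, the DDL shape being discussed might look like this (all identifiers and values are invented; as noted above, such a statement does not parse without a trailing query):

```scala
// Hypothetical generated DDL for a partitioned, bucketed data source table.
val generatedDdl =
  """CREATE TABLE t1 (c1 INT, c2 STRING, c3 STRING)
    |USING parquet
    |OPTIONS (path '/tmp/t1')
    |PARTITIONED BY (c3)
    |CLUSTERED BY (c1) SORTED BY (c1) INTO 8 BUCKETS""".stripMargin
// Missing: the AS SELECT ... clause that #12734 expects, which is
// exactly the problem raised in this comment.
```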

@AmplabJenkins

Can one of the admins verify this patch?

@srowen (Member) commented May 10, 2016

@xwu0226 I think this is superseded by #12781?

@xwu0226 (Contributor, Author) commented May 10, 2016

@srowen Yes, for data source tables. This PR also includes the work for Hive-syntax DDL. I see #12781 mentions that there will be a follow-up PR taking care of the Hive-syntax DDL, so I am wondering whether I should continue with this PR. I can close this one if there is no need. Thanks!

@liancheng (Contributor) commented May 11, 2016

Hey @xwu0226, sorry that I didn't explain why I opened another PR for the same issue; I was in a code rush for 2.0...

So one of the considerations for all the native DDL commands is that we don't want these DDL commands to rely on Hive anymore. This is because we'd like to remove the Hive dependency from Spark SQL core and gradually make Hive a separate data source in the future. This means we shouldn't add new code in places like HiveClientImpl. These new DDL commands should be implemented on top of interfaces like CatalogTable.
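
For illustration, a rough sketch of building DDL on top of CatalogTable rather than Hive APIs; this assumes a Spark version where CatalogTable.schema is a StructType, and it handles only the columns and the provider:

```scala
import org.apache.spark.sql.catalyst.catalog.CatalogTable

// Build a minimal CREATE TABLE statement from catalog metadata alone,
// without touching HiveClientImpl.
def showCreateDataSourceTable(table: CatalogTable): String = {
  val columns = table.schema.fields
    .map(f => s"${f.name} ${f.dataType.sql}")
    .mkString(", ")
  val provider = table.provider.getOrElse("parquet")
  s"CREATE TABLE ${table.identifier.quotedString} ($columns) USING $provider"
}
```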

One apparent problem with this approach is that the current Spark SQL interfaces don't capture all of Hive's semantics. For example, some table metadata, like the skew spec, is not covered by CatalogTable yet. Our general strategies are:

  1. For easy ones, like "owner" and "compressed" in [SPARK-14127][SQL] Native "DESC [EXTENDED | FORMATTED] <table>" DDL command #12844, we may just add them to the interface and leverage them.
  2. For features that are not supported in Spark SQL, for example the skew spec, we can simply ignore them for now, since Spark can't handle them anyway.

There will be a follow-up to #12781 to add support for Hive tables. After an offline discussion with @yhuai, we decided to add a flag in CatalogTable to indicate whether the underlying external catalog provided unrecognized metadata that was not translated and included in CatalogTable. In this way, when SHOW CREATE TABLE is applied to a table containing such metadata, the flag is set to true, and we can simply refuse to output anything by checking it. This makes sense because even if we added things like the skew spec to the result of SHOW CREATE TABLE, Spark couldn't handle the generated DDL statement.
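
A sketch of that guard (Spark 2.0 eventually modeled the flag as CatalogTable.unsupportedFeatures: Seq[String]; the wording and exception type here are assumptions to keep the snippet self-contained):

```scala
import org.apache.spark.sql.catalyst.catalog.CatalogTable

// Refuse to generate DDL when the external catalog reported metadata that
// CatalogTable could not represent (e.g. a skew spec). Spark's own command
// throws AnalysisException; a plain exception keeps this sketch compilable
// outside Spark's sql package.
def assertShowCreateTableSupported(table: CatalogTable): Unit = {
  if (table.unsupportedFeatures.nonEmpty) {
    throw new UnsupportedOperationException(
      s"SHOW CREATE TABLE cannot handle table ${table.identifier}, which uses " +
      s"unsupported feature(s): ${table.unsupportedFeatures.mkString(", ")}")
  }
}
```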

@xwu0226 (Contributor, Author) commented May 11, 2016

@liancheng Thank you for the detailed explanation! Yeah, if the goal is to make sure Spark SQL can handle the generated DDL, then we need to skip some Hive features for now. I will close this PR.

@xwu0226 xwu0226 closed this May 11, 2016