[SPARK-14346][SQL][WIP] Show create table #12132
Conversation
Please change the title to

ok to test

Test build #54891 has finished for PR 12132 at commit
Force-pushed 850edbe to b30b0c9
@andrewor14, thank you for triggering the test. I just pushed the fix for the Scala style issues. Thanks again!

Test build #54898 has finished for PR 12132 at commit

Test build #54903 has finished for PR 12132 at commit

Test build #54920 has finished for PR 12132 at commit
Please do not forget to remove all these lines of `show(false)` in the test cases.
Please move these test cases from `black_list` to `white_list` in
Thanks!

@gatorsmile Thanks, Xiao! I will do that.
Force-pushed f8feff9 to 9ab863f
This patch upgrades Chill to 0.8.0 and Kryo to 3.0.3. While we'll likely need to bump these dependencies again before Spark 2.0 (due to SPARK-14221 / twitter/chill#252), I wanted to get the bulk of the Kryo 2 -> Kryo 3 migration done now in order to figure out whether there are any unexpected surprises.

Author: Josh Rosen <[email protected]>

Closes apache#12076 from JoshRosen/kryo3.
…stens to create BlockManagerId

## What changes were proposed in this pull request?

Here is why SPARK-14437 happens: BlockManagerId is created using NettyBlockTransferService.hostName, which comes from `customHostname`, and `Executor` sets `customHostname` to the hostname detected by the driver. However, the driver may not be able to detect the correct address in some complicated networks (Netty's Channel.remoteAddress doesn't always return a connectable address). In such a case, `BlockManagerId` will be created using a wrong hostname.

To fix this issue, this PR uses the `hostname` provided by `SparkEnv.create` to create `NettyBlockTransferService` and sets `NettyBlockTransferService.hostname` to it directly. A bonus of this approach is that NettyBlockTransferService won't bind to `0.0.0.0`, which is much safer.

## How was this patch tested?

Manually checked the bound address using local-cluster.

Author: Shixiong Zhu <[email protected]>

Closes apache#12240 from zsxwing/SPARK-14437.
## What changes were proposed in this pull request?
This patch adds support for better handling of exceptions inside catch blocks when the code within the block itself throws an exception. For instance, here is the code in a catch block before this change in `WriterContainer.scala`:
```scala
logError("Aborting task.", cause)
// call failure callbacks first, so we could have a chance to cleanup the writer.
TaskContext.get().asInstanceOf[TaskContextImpl].markTaskFailed(cause)
if (currentWriter != null) {
currentWriter.close()
}
abortTask()
throw new SparkException("Task failed while writing rows.", cause)
```
If `markTaskFailed` or `currentWriter.close` throws an exception, we currently lose the original cause. This PR fixes the problem by implementing a utility function `Utils.tryWithSafeCatch` that suppresses (`Throwable.addSuppressed`) the exceptions thrown within the catch block and rethrows the original exception.
## How was this patch tested?
No new functionality added
Author: Sameer Agarwal <[email protected]>
Closes apache#12234 from sameeragarwal/fix-exception.
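The suppression pattern described above can be sketched as follows; this is a minimal illustration of the idea behind `Utils.tryWithSafeCatch` (the name comes from the description above; the exact signature in the patch is assumed, not copied):

```scala
object SafeCatchSketch {
  // Run cleanup code from inside a catch block. If the cleanup itself
  // throws, attach that exception to the original failure via
  // Throwable.addSuppressed so the original cause is not lost, then
  // rethrow the original failure.
  def tryWithSafeCatch(original: Throwable)(cleanup: => Unit): Nothing = {
    try {
      cleanup
    } catch {
      case t: Throwable if t ne original => original.addSuppressed(t)
    }
    throw original
  }
}
```

With such a helper, the `WriterContainer` catch block above would wrap `markTaskFailed`, `currentWriter.close()`, and `abortTask()` in `tryWithSafeCatch(cause) { ... }`, so a failure in any of them is recorded as suppressed on `cause` rather than replacing it.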
Test build #55405 has finished for PR 12132 at commit
## What changes were proposed in this pull request?

Cleanups to documentation. No changes to code.

* GBT docs: move Scala doc for private object GradientBoostedTrees to public docs for GBTClassifier, GBTRegressor
* GLM regParam: needs doc saying it is for L2 only
* TrainValidationSplitModel: add .. versionadded:: 2.0.0
* Rename "_transformer_params_from_java" to "_transfer_params_from_java"
* LogReg Summary classes: "probability" col should not say "calibrated"
* LR summaries: coefficientStandardErrors -> document that intercept stderr comes last; same for t-values and p-values
* approxCountDistinct: document the meaning of the "rsd" argument
* LDA: note which params are for online LDA only

## How was this patch tested?

Doc build.

Author: Joseph K. Bradley <[email protected]>

Closes apache#12266 from jkbradley/ml-doc-cleanups.
## What changes were proposed in this pull request?

Based on our tests, gzip decompression is very slow (< 100MB/s), making queries decompression-bound. Snappy can decompress at ~500MB/s on a single core. This patch changes the default compression codec for Parquet output from gzip to snappy, and also introduces a ParquetOptions class to be more consistent with other data sources (e.g. CSV, JSON).

## How was this patch tested?

Should be covered by existing unit tests.

Author: Reynold Xin <[email protected]>

Closes apache#12256 from rxin/SPARK-14482.
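For illustration (a sketch, not code from the patch, assuming the 2.0-era `SparkSession` API): the codec can still be chosen explicitly, either per write through the data source option that `ParquetOptions` parses, or session-wide through the existing SQL config key:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local").appName("parquet-codec").getOrCreate()
val df = spark.range(10).toDF("id")

// Per-write override of the Parquet codec (the new default is snappy).
df.write.option("compression", "gzip").parquet("/tmp/out-gzip")

// Session-wide default via the existing SQL configuration key.
spark.conf.set("spark.sql.parquet.compression.codec", "snappy")
df.write.parquet("/tmp/out-snappy")
```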
## What changes were proposed in this pull request?

When we first introduced Aggregators, we required the user of Aggregators to (implicitly) specify the encoders. It actually makes more sense to have the encoders specified by the implementation of the Aggregator, since each implementation knows best how to encode its own data type. Note that this simplifies the Java API, because Java users no longer need to explicitly specify encoders for aggregators.

## How was this patch tested?

Updated unit tests.

Author: Reynold Xin <[email protected]>

Closes apache#12231 from rxin/SPARK-14451.
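A sketch of what an Aggregator implementation looks like after this change (assuming the 2.0-era `org.apache.spark.sql.expressions.Aggregator` API): the encoders are members of the implementation, so callers no longer supply them.

```scala
import org.apache.spark.sql.{Encoder, Encoders}
import org.apache.spark.sql.expressions.Aggregator

// A simple sum aggregator. bufferEncoder and outputEncoder are defined
// by the implementation itself, which knows best how to encode its types.
object LongSum extends Aggregator[Long, Long, Long] {
  def zero: Long = 0L
  def reduce(b: Long, a: Long): Long = b + a
  def merge(b1: Long, b2: Long): Long = b1 + b2
  def finish(reduction: Long): Long = reduction
  def bufferEncoder: Encoder[Long] = Encoders.scalaLong
  def outputEncoder: Encoder[Long] = Encoders.scalaLong
}
```

It can then be applied with `ds.select(LongSum.toColumn)` without any explicit encoder arguments.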
…FileSourceStrategy

## What changes were proposed in this pull request?

When pruning partitions or pushing down predicates, case-sensitivity is not respected. In order to make this work case-insensitively, this PR updates the AttributeReference inside the predicate to use the name from the schema.

## How was this patch tested?

Added regression tests for case-insensitivity.

Author: Davies Liu <[email protected]>

Closes apache#12371 from davies/case_insensi.
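The gist of the fix can be illustrated outside Catalyst as a case-insensitive lookup that canonicalizes a predicate's column name to the casing the schema uses (hypothetical helper, not the PR's code):

```scala
// Resolve a user-supplied column name against schema field names
// case-insensitively, returning the schema's own casing.
def canonicalName(schemaFields: Seq[String], name: String): Option[String] =
  schemaFields.find(_.equalsIgnoreCase(name))

assert(canonicalName(Seq("Year", "Month"), "year").contains("Year"))
```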
…ed imports

## What changes were proposed in this pull request?

The old `HadoopFsRelation` API includes `buildInternalScan()`, which uses `SqlNewHadoopRDD` in `ParquetRelation`. Because the old API is now removed, `SqlNewHadoopRDD` is not used anymore. So, this PR removes `SqlNewHadoopRDD` and several unused imports. This was discussed in apache#12326.

## How was this patch tested?

Several related existing unit tests and `sbt scalastyle`.

Author: hyukjinkwon <[email protected]>

Closes apache#12354 from HyukjinKwon/SPARK-14596.
## What changes were proposed in this pull request?

The PyDoc Makefile used "=" rather than "?=" for setting env variables, so it overwrote the user's values. This ignored the environment variables we set for linting, allowing warnings through. This PR also fixes the warnings that had been introduced.

## How was this patch tested?

Manual local export & make.

Author: Holden Karau <[email protected]>

Closes apache#12336 from holdenk/SPARK-14573-fix-pydoc-makefile.
…rmations
## What changes were proposed in this pull request?
This PR removes extra anonymous closure within functional transformations.
For example,
```scala
.map(item => {
...
})
```
which can be written more simply as below:
```scala
.map { item =>
...
}
```
## How was this patch tested?
Related unit tests and `sbt scalastyle`.
Author: hyukjinkwon <[email protected]>
Closes apache#12382 from HyukjinKwon/minor-extra-closers.
#### What changes were proposed in this pull request?

**HQL Syntax**: [Create View](https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-Create/Drop/AlterView)

```SQL
CREATE VIEW [IF NOT EXISTS] [db_name.]view_name
  [(column_name [COMMENT column_comment], ...)]
  [COMMENT view_comment]
  [TBLPROPERTIES (property_name = property_value, ...)]
AS SELECT ...;
```

This PR adds support for the `[COMMENT view_comment]` clause.

#### How was this patch tested?

Modified the existing test cases to verify correctness.

Author: gatorsmile <[email protected]>
Author: xiaoli <[email protected]>
Author: Xiao Li <[email protected]>

Closes apache#12288 from gatorsmile/addCommentInCreateView.
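As a usage illustration (a sketch; the view and table names are made up, and a SparkSession `spark` is assumed):

```scala
// Exercise the newly supported COMMENT clause on a view.
spark.sql("""
  CREATE VIEW IF NOT EXISTS sales_summary
  COMMENT 'Aggregated sales per region'
  AS SELECT region, sum(amount) AS total FROM sales GROUP BY region
""")
```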
## What changes were proposed in this pull request?

The configuration docs are updated to reflect the changes introduced with [SPARK-12384](https://issues.apache.org/jira/browse/SPARK-12384). This allows the user to specify initial heap memory settings through the extraJavaOptions for executor, driver, and AM.

## How was this patch tested?

The changes are tested in [SPARK-12384](https://issues.apache.org/jira/browse/SPARK-12384). This is just documenting the changes made.

Author: Dhruve Ashar <[email protected]>

Closes apache#12333 from dhruve/doc/SPARK-14572.
#### What changes were proposed in this pull request?

This PR provides native DDL support for the following three ALTER VIEW commands, based on the Hive DDL document: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL

##### 1. ALTER VIEW RENAME

**Syntax:**
```SQL
ALTER VIEW view_name RENAME TO new_view_name
```
- changes the name of a view to a different name
- renaming a view via ALTER TABLE is not allowed

##### 2. ALTER VIEW SET TBLPROPERTIES

**Syntax:**
```SQL
ALTER VIEW view_name SET TBLPROPERTIES ('comment' = new_comment);
```
- adds metadata to a view
- setting a view's properties via ALTER TABLE is not allowed
- setting an existing property key to the same value is ignored
- setting an existing key to a different value overwrites it

##### 3. ALTER VIEW UNSET TBLPROPERTIES

**Syntax:**
```SQL
ALTER VIEW view_name UNSET TBLPROPERTIES [IF EXISTS] ('comment', 'key')
```
- removes metadata from a view
- unsetting a view's properties via ALTER TABLE is not allowed
- unsetting a non-existent key raises an exception

#### How was this patch tested?

Added test cases to verify that it works properly.

Author: gatorsmile <[email protected]>
Author: xiaoli <[email protected]>
Author: Xiao Li <[email protected]>

Closes apache#12324 from gatorsmile/alterView.
@yhuai Another question. Currently, for the tables that have

Oh, I see. How about we also show how to create the table in the DataFrame API for now?
## What changes were proposed in this pull request?

I was trying to understand the accumulator and metrics update source code, and these two classes don't really need to be case classes. It would also be more consistent with other UI classes if they are not case classes. This is part of my bigger effort to simplify accumulators and task metrics.

## How was this patch tested?

This is a straightforward refactoring without behavior change.

Author: Reynold Xin <[email protected]>

Closes apache#12386 from rxin/SPARK-14625.
…t methods should have explicit return types

## What changes were proposed in this pull request?

Currently many public abstract methods (in abstract classes as well as traits) don't declare return types explicitly, such as in [o.a.s.streaming.dstream.InputDStream](https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/dstream/InputDStream.scala#L110):

```scala
def start() // should be: def start(): Unit
def stop()  // should be: def stop(): Unit
```

These methods exist in core, sql, and streaming; this PR fixes them.

## How was this patch tested?

N/A

## Which piece of scala style rule led to the changes?

The rule was added separately in apache#12396.

Author: Liwei Lin <[email protected]>

Closes apache#12389 from lw-lin/public-abstract-methods.
So we have 3 cases:
…d mllib-local into one place

## What changes were proposed in this pull request?

Move the json4s and breeze dependency declarations into the parent.

## How was this patch tested?

Should be no functional change, but Jenkins tests will verify that.

Author: Sean Owen <[email protected]>

Closes apache#12390 from srowen/SPARK-14612.
## What changes were proposed in this pull request?

When there are multiple attempts for a stage, we currently only reset internal accumulator values if all the tasks are resubmitted. It would make more sense to reset the accumulator values for each stage attempt. This will allow us to eventually get rid of the internal flag in the Accumulator class. This is part of my bigger effort to simplify accumulators and task metrics.

## How was this patch tested?

Covered by existing tests.

Author: Reynold Xin <[email protected]>

Closes apache#12378 from rxin/SPARK-14619.
## What changes were proposed in this pull request?

This patch removes some of the deprecated APIs in TaskMetrics. This is part of my bigger effort to simplify accumulators and task metrics.

## How was this patch tested?

N/A - only removals.

Author: Reynold Xin <[email protected]>

Closes apache#12375 from rxin/SPARK-14617.
…s a REPL line object

## What changes were proposed in this pull request?

When we clean a closure, if its outermost parent is not a closure, we won't clone and clean it, as cloning the user's objects is dangerous. However, if it's a REPL line object, which may carry a lot of unnecessary references (like the Hadoop conf, Spark conf, etc.), we should clean it, as it's not a user object. This PR improves the check for user objects to exclude REPL line objects.

## How was this patch tested?

Existing tests.

Author: Wenchen Fan <[email protected]>

Closes apache#12327 from cloud-fan/closure.
…nal Tables

#### What changes were proposed in this pull request?

This PR adds a test to ensure that dropping partitions of an external table will not delete data. cc yhuai andrewor14

#### How was this patch tested?

N/A

Author: gatorsmile <[email protected]>

This patch had conflicts when merged, resolved by Committer: Andrew Or <[email protected]>

Closes apache#12350 from gatorsmile/testDropPartition.
## What changes were proposed in this pull request?

JIRA: https://issues.apache.org/jira/browse/SPARK-14592

This patch adds native support for the DDL command `CREATE TABLE LIKE`. The SQL syntax is:

    CREATE TABLE table_name LIKE existing_table
    CREATE TABLE IF NOT EXISTS table_name LIKE existing_table

## How was this patch tested?

`HiveDDLCommandSuite`. `HiveQuerySuite` already tests `CREATE TABLE LIKE`.

Author: Liang-Chi Hsieh <[email protected]>

This patch had conflicts when merged, resolved by Committer: Andrew Or <[email protected]>

Closes apache#12362 from viirya/create-table-like.
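A usage sketch of the new command through the SQL API (assuming a SparkSession `spark` and an existing table `src`):

```scala
// Create an empty table with the same definition as src.
spark.sql("CREATE TABLE src_copy LIKE src")

// Idempotent variant.
spark.sql("CREATE TABLE IF NOT EXISTS src_copy LIKE src")
```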
Added binary toggle param to CountVectorizer feature transformer in PySpark. Created a unit test for using CountVectorizer with the binary toggle on.

Author: Bryan Cutler <[email protected]>

Closes apache#12308 from BryanCutler/binary-param-python-CountVectorizer-SPARK-13967.
## What changes were proposed in this pull request?

In Spark 1.4, we negated some metrics from RegressionEvaluator since CrossValidator always maximized metrics. This was fixed in 1.5, but the docs were not updated. This PR updates the docs.

## How was this patch tested?

No tests.

Author: Joseph K. Bradley <[email protected]>

Closes apache#12377 from jkbradley/regeval-doc.
…HashingTF in ML & MLlib

## What changes were proposed in this pull request?

This fix adds a binary toggle Param to PySpark HashingTF in ML & MLlib. If this toggle is set, then all non-zero counts will be set to 1. Note: this fix (SPARK-14238) extends SPARK-13963, where the Scala implementation was done.

## How was this patch tested?

This fix adds two tests to cover the code changes: one for HashingTF in PySpark's ML and one for HashingTF in PySpark's MLlib.

Author: Yong Tang <[email protected]>

Closes apache#12079 from yongtang/SPARK-14238.
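The Scala side of this toggle (from SPARK-13963, which this change mirrors in PySpark) can be sketched as follows; with `binary` set, all non-zero term counts are reported as 1.0:

```scala
import org.apache.spark.ml.feature.HashingTF

// setBinary(true) collapses term frequencies to 0/1, which suits
// models expecting Bernoulli rather than multinomial counts.
val hashingTF = new HashingTF()
  .setInputCol("words")
  .setOutputCol("features")
  .setBinary(true)
```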
## What changes were proposed in this pull request?
Currently, `LikeSimplification` handles the following four rules.
- 'a%' => expr.StartsWith("a")
- '%b' => expr.EndsWith("b")
- '%a%' => expr.Contains("a")
- 'a' => EqualTo("a")
This PR adds the following rule.
- 'a%b' => expr.Length() >= 2 && expr.StartsWith("a") && expr.EndsWith("b")
Here, 2 is statically calculated from "a".size + "b".size.
**Before**
```
scala> sql("select a from (select explode(array('abc','adc')) a) T where a like 'a%c'").explain()
== Physical Plan ==
WholeStageCodegen
: +- Filter a#5 LIKE a%c
: +- INPUT
+- Generate explode([abc,adc]), false, false, [a#5]
+- Scan OneRowRelation[]
```
**After**
```
scala> sql("select a from (select explode(array('abc','adc')) a) T where a like 'a%c'").explain()
== Physical Plan ==
WholeStageCodegen
: +- Filter ((length(a#5) >= 2) && (StartsWith(a#5, a) && EndsWith(a#5, c)))
: +- INPUT
+- Generate explode([abc,adc]), false, false, [a#5]
+- Scan OneRowRelation[]
```
## How was this patch tested?
Passes the Jenkins tests (including the new test case).
Author: Dongjoon Hyun <[email protected]>
Closes apache#12312 from dongjoon-hyun/SPARK-14545.
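A plain-Scala illustration of the semantics behind the new rule and its statically computed length guard (this is not the Catalyst code, just the reasoning it encodes):

```scala
// 'prefix%suffix' matches input iff input is long enough to contain
// both pieces without overlap, starts with prefix, and ends with suffix.
def likeStartEnd(input: String, prefix: String, suffix: String): Boolean =
  input.length >= prefix.length + suffix.length &&
    input.startsWith(prefix) && input.endsWith(suffix)

assert(likeStartEnd("abc", "a", "c"))  // 'a%c' matches "abc"
assert(likeStartEnd("ac", "a", "c"))   // '%' may match the empty string
assert(!likeStartEnd("a", "a", "a"))   // without the length guard, "a" would wrongly match 'a%a'
```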
Somehow my merge brought in a lot of other stuff in this PR. I need to close this PR and resubmit another one.

Test build #55887 has finished for PR 12132 at commit

A new PR has been created: #12406. All the comments are addressed there. Thanks!
## What changes were proposed in this pull request?

- Add `SHOW CREATE TABLE tableIdentifier` in the `SqlBase.g4` file.
- Add a `RunnableCommand` class for `ShowCreateTableCommand` that invokes `SessionCatalog.generateTableDDL`.
- Add a `generateTableDDL` method in `SessionCatalog` and override it in `HiveSessionCatalog`. `HiveSessionCatalog.generateTableDDL` creates DDL for metastore tables; for temp tables in Spark it may not be quite practical, so for now we throw an exception in `SessionCatalog.generateTableDDL`. If the community thinks it is also necessary, we can implement this later.
- `HiveSessionCatalog.generateTableDDL` branches out to generate Hive-syntax DDL for Hive tables, while it is also supposed to branch out to generate DataSource-syntax DDL for datasource tables.

If the table is created as `df.write.partitionBy("a").saveAsTable("t1")`, this datasource table will record the partitioning columns in the tblproperties of the metastore table. But when the datasource-syntax DDL is generated from this table, we lose the partitioning information in the DDL; similarly for bucketing and skew information, etc.

## How was this patch tested?

Unit tests with self-checks; ran the Hive, SQL, and Catalyst sbt tests.
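A minimal sketch of the command shape the description outlines, using the names it introduces (`ShowCreateTableCommand`, `generateTableDDL`); the signatures follow the later 2.x command API and are assumptions, and the actual code in the resubmitted PR may differ:

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.catalyst.TableIdentifier
import org.apache.spark.sql.execution.command.RunnableCommand

// Sketch: ask the session catalog to render DDL for the table and
// return it as a single-row, single-column result.
case class ShowCreateTableCommand(table: TableIdentifier) extends RunnableCommand {
  override def run(sparkSession: SparkSession): Seq[Row] = {
    // generateTableDDL is the method this PR proposes to add to
    // SessionCatalog (overridden in HiveSessionCatalog for metastore tables).
    val ddl = sparkSession.sessionState.catalog.generateTableDDL(table)
    Seq(Row(ddl))
  }
}
```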