[SPARK-14346][SQL][WIP] Show create table #12132
Conversation
Please change the title to

ok to test

Test build #54891 has finished for PR 12132 at commit
Force-pushed 850edbe to b30b0c9
@andrewor14, thank you for triggering the test. I just pushed the fix for the Scala style issues. Thanks again!

Test build #54898 has finished for PR 12132 at commit

Test build #54903 has finished for PR 12132 at commit

Test build #54920 has finished for PR 12132 at commit
Please do not forget to remove all these lines of `show(false)` in the test cases.
Please move these test cases from `black_list` to `white_list` in
Thanks!

@gatorsmile Thanks, Xiao! I will do that.
Force-pushed f8feff9 to 9ab863f
This patch upgrades Chill to 0.8.0 and Kryo to 3.0.3. While we'll likely need to bump these dependencies again before Spark 2.0 (due to SPARK-14221 / twitter/chill#252), I wanted to get the bulk of the Kryo 2 -> Kryo 3 migration done now in order to figure out whether there are any unexpected surprises.

Author: Josh Rosen <[email protected]>

Closes apache#12076 from JoshRosen/kryo3.
…stens to create BlockManagerId

## What changes were proposed in this pull request?

Here is why SPARK-14437 happens: BlockManagerId is created using NettyBlockTransferService.hostName, which comes from `customHostname`, and `Executor` sets `customHostname` to the hostname detected by the driver. However, the driver may not be able to detect the correct address in some complicated networks (Netty's Channel.remoteAddress doesn't always return a connectable address). In such a case, `BlockManagerId` will be created using a wrong hostname.

To fix this issue, this PR uses the `hostname` provided by `SparkEnv.create` to create `NettyBlockTransferService` and sets `NettyBlockTransferService.hostname` to it directly. A bonus of this approach is that NettyBlockTransferService won't bind to `0.0.0.0`, which is much safer.

## How was this patch tested?

Manually checked the bound address using local-cluster.

Author: Shixiong Zhu <[email protected]>

Closes apache#12240 from zsxwing/SPARK-14437.
## What changes were proposed in this pull request?
This patch adds support for better handling of exceptions inside catch blocks when the code within the block itself throws an exception. For instance, here is the code in a catch block before this change in `WriterContainer.scala`:
```scala
logError("Aborting task.", cause)
// call failure callbacks first, so we could have a chance to cleanup the writer.
TaskContext.get().asInstanceOf[TaskContextImpl].markTaskFailed(cause)
if (currentWriter != null) {
currentWriter.close()
}
abortTask()
throw new SparkException("Task failed while writing rows.", cause)
```
If `markTaskFailed` or `currentWriter.close` throws an exception, we currently lose the original cause. This PR fixes the problem by implementing a utility function `Utils.tryWithSafeCatch` that suppresses (`Throwable.addSuppressed`) the exceptions thrown within the catch block and rethrows the original exception.
## How was this patch tested?
No new functionality added
Author: Sameer Agarwal <[email protected]>
Closes apache#12234 from sameeragarwal/fix-exception.
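The suppression pattern described above can be sketched as follows; this is a minimal illustration of the idea behind `Utils.tryWithSafeCatch` (the name comes from the description above; the exact signature in the patch is assumed, not copied):

```scala
object SafeCatchSketch {
  // Run cleanup code from inside a catch block. If the cleanup itself
  // throws, attach that exception to the original failure via
  // Throwable.addSuppressed so the original cause is not lost, then
  // rethrow the original failure.
  def tryWithSafeCatch(original: Throwable)(cleanup: => Unit): Nothing = {
    try {
      cleanup
    } catch {
      case t: Throwable if t ne original => original.addSuppressed(t)
    }
    throw original
  }
}
```

With such a helper, the `WriterContainer` catch block above would wrap `markTaskFailed`, `currentWriter.close()`, and `abortTask()` in `tryWithSafeCatch(cause) { ... }`, so a failure in any of them is recorded as suppressed on `cause` rather than replacing it.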
Test build #55405 has finished for PR 12132 at commit
## What changes were proposed in this pull request?

Cleanups to documentation. No changes to code.

* GBT docs: move Scala doc for private object GradientBoostedTrees to public docs for GBTClassifier, GBTRegressor
* GLM regParam: needs doc saying it is for L2 only
* TrainValidationSplitModel: add .. versionadded:: 2.0.0
* Rename "_transformer_params_from_java" to "_transfer_params_from_java"
* LogReg Summary classes: "probability" col should not say "calibrated"
* LR summaries: coefficientStandardErrors -> document that intercept stderr comes last; same for t-values and p-values
* approxCountDistinct: document the meaning of the "rsd" argument
* LDA: note which params are for online LDA only

## How was this patch tested?

Doc build.

Author: Joseph K. Bradley <[email protected]>

Closes apache#12266 from jkbradley/ml-doc-cleanups.
## What changes were proposed in this pull request?

Based on our tests, gzip decompression is very slow (< 100MB/s), making queries decompression-bound. Snappy can decompress at ~500MB/s on a single core. This patch changes the default compression codec for Parquet output from gzip to snappy, and also introduces a ParquetOptions class to be more consistent with other data sources (e.g. CSV, JSON).

## How was this patch tested?

Should be covered by existing unit tests.

Author: Reynold Xin <[email protected]>

Closes apache#12256 from rxin/SPARK-14482.
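For illustration (a sketch, not code from the patch, assuming the 2.0-era `SparkSession` API): the codec can still be chosen explicitly, either per write through the data source option that `ParquetOptions` parses, or session-wide through the existing SQL config key:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local").appName("parquet-codec").getOrCreate()
val df = spark.range(10).toDF("id")

// Per-write override of the Parquet codec (the new default is snappy).
df.write.option("compression", "gzip").parquet("/tmp/out-gzip")

// Session-wide default via the existing SQL configuration key.
spark.conf.set("spark.sql.parquet.compression.codec", "snappy")
df.write.parquet("/tmp/out-snappy")
```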
## What changes were proposed in this pull request?

When we first introduced Aggregators, we required the user of Aggregators to (implicitly) specify the encoders. It actually makes more sense to have the encoders specified by the implementation of the Aggregator, since each implementation knows best how to encode its own data type. Note that this simplifies the Java API, because Java users no longer need to explicitly specify encoders for aggregators.

## How was this patch tested?

Updated unit tests.

Author: Reynold Xin <[email protected]>

Closes apache#12231 from rxin/SPARK-14451.
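A sketch of what an Aggregator implementation looks like after this change (assuming the 2.0-era `org.apache.spark.sql.expressions.Aggregator` API): the encoders are members of the implementation, so callers no longer supply them.

```scala
import org.apache.spark.sql.{Encoder, Encoders}
import org.apache.spark.sql.expressions.Aggregator

// A simple sum aggregator. bufferEncoder and outputEncoder are defined
// by the implementation itself, which knows best how to encode its types.
object LongSum extends Aggregator[Long, Long, Long] {
  def zero: Long = 0L
  def reduce(b: Long, a: Long): Long = b + a
  def merge(b1: Long, b2: Long): Long = b1 + b2
  def finish(reduction: Long): Long = reduction
  def bufferEncoder: Encoder[Long] = Encoders.scalaLong
  def outputEncoder: Encoder[Long] = Encoders.scalaLong
}
```

It can then be applied with `ds.select(LongSum.toColumn)` without any explicit encoder arguments.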
…FileSourceStrategy

## What changes were proposed in this pull request?

When pruning partitions or pushing down predicates, case-sensitivity is not respected. In order to make this work case-insensitively, this PR updates the AttributeReference inside the predicate to use the name from the schema.

## How was this patch tested?

Added regression tests for case-insensitivity.

Author: Davies Liu <[email protected]>

Closes apache#12371 from davies/case_insensi.
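The gist of the fix can be illustrated outside Catalyst as a case-insensitive lookup that canonicalizes a predicate's column name to the casing the schema uses (hypothetical helper, not the PR's code):

```scala
// Resolve a user-supplied column name against schema field names
// case-insensitively, returning the schema's own casing.
def canonicalName(schemaFields: Seq[String], name: String): Option[String] =
  schemaFields.find(_.equalsIgnoreCase(name))

assert(canonicalName(Seq("Year", "Month"), "year").contains("Year"))
```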
…ed imports

## What changes were proposed in this pull request?

The old `HadoopFsRelation` API includes `buildInternalScan()`, which uses `SqlNewHadoopRDD` in `ParquetRelation`. Because the old API is now removed, `SqlNewHadoopRDD` is not used anymore. So, this PR removes `SqlNewHadoopRDD` and several unused imports. This was discussed in apache#12326.

## How was this patch tested?

Several related existing unit tests and `sbt scalastyle`.

Author: hyukjinkwon <[email protected]>

Closes apache#12354 from HyukjinKwon/SPARK-14596.
## What changes were proposed in this pull request?

The PyDoc Makefile used "=" rather than "?=" for setting env variables, so it overwrote the user's values. This ignored the environment variables we set for linting, allowing warnings through. This PR also fixes the warnings that had been introduced.

## How was this patch tested?

Manual local export & make.

Author: Holden Karau <[email protected]>

Closes apache#12336 from holdenk/SPARK-14573-fix-pydoc-makefile.
…rmations
## What changes were proposed in this pull request?
This PR removes extra anonymous closure within functional transformations.
For example,
```scala
.map(item => {
...
})
```
which can be written more simply as below:
```scala
.map { item =>
...
}
```
## How was this patch tested?
Related unit tests and `sbt scalastyle`.
Author: hyukjinkwon <[email protected]>
Closes apache#12382 from HyukjinKwon/minor-extra-closers.
#### What changes were proposed in this pull request?

**HQL Syntax**: [Create View](https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-Create/Drop/AlterView)

```SQL
CREATE VIEW [IF NOT EXISTS] [db_name.]view_name
  [(column_name [COMMENT column_comment], ...)]
  [COMMENT view_comment]
  [TBLPROPERTIES (property_name = property_value, ...)]
AS SELECT ...;
```

This PR adds support for the `[COMMENT view_comment]` clause.

#### How was this patch tested?

Modified the existing test cases to verify correctness.

Author: gatorsmile <[email protected]>
Author: xiaoli <[email protected]>
Author: Xiao Li <[email protected]>

Closes apache#12288 from gatorsmile/addCommentInCreateView.
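As a usage illustration (a sketch; the view and table names are made up, and a SparkSession `spark` is assumed):

```scala
// Exercise the newly supported COMMENT clause on a view.
spark.sql("""
  CREATE VIEW IF NOT EXISTS sales_summary
  COMMENT 'Aggregated sales per region'
  AS SELECT region, sum(amount) AS total FROM sales GROUP BY region
""")
```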
## What changes were proposed in this pull request?

The configuration docs are updated to reflect the changes introduced with [SPARK-12384](https://issues.apache.org/jira/browse/SPARK-12384). This allows the user to specify initial heap memory settings through the extraJavaOptions for executor, driver, and AM.

## How was this patch tested?

The changes are tested in [SPARK-12384](https://issues.apache.org/jira/browse/SPARK-12384). This is just documenting the changes made.

Author: Dhruve Ashar <[email protected]>

Closes apache#12333 from dhruve/doc/SPARK-14572.
#### What changes were proposed in this pull request?

This PR provides native DDL support for the following three ALTER VIEW commands, based on the Hive DDL document: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL

##### 1. ALTER VIEW RENAME

**Syntax:**
```SQL
ALTER VIEW view_name RENAME TO new_view_name
```
- changes the name of a view to a different name
- renaming a view via ALTER TABLE is not allowed

##### 2. ALTER VIEW SET TBLPROPERTIES

**Syntax:**
```SQL
ALTER VIEW view_name SET TBLPROPERTIES ('comment' = new_comment);
```
- adds metadata to a view
- setting a view's properties via ALTER TABLE is not allowed
- setting an existing property key to the same value is ignored
- setting an existing key to a different value overwrites it

##### 3. ALTER VIEW UNSET TBLPROPERTIES

**Syntax:**
```SQL
ALTER VIEW view_name UNSET TBLPROPERTIES [IF EXISTS] ('comment', 'key')
```
- removes metadata from a view
- unsetting a view's properties via ALTER TABLE is not allowed
- unsetting a non-existent key raises an exception

#### How was this patch tested?

Added test cases to verify that it works properly.

Author: gatorsmile <[email protected]>
Author: xiaoli <[email protected]>
Author: Xiao Li <[email protected]>

Closes apache#12324 from gatorsmile/alterView.
@yhuai Another question. Currently, for the tables that have

Oh, I see. How about we also show how to create the table in the DataFrame API for now?
## What changes were proposed in this pull request?

I was trying to understand the accumulator and metrics update source code, and these two classes don't really need to be case classes. It would also be more consistent with other UI classes if they are not case classes. This is part of my bigger effort to simplify accumulators and task metrics.

## How was this patch tested?

This is a straightforward refactoring without behavior change.

Author: Reynold Xin <[email protected]>

Closes apache#12386 from rxin/SPARK-14625.
…t methods should have explicit return types

## What changes were proposed in this pull request?

Currently many public abstract methods (in abstract classes as well as traits) don't declare return types explicitly, such as in [o.a.s.streaming.dstream.InputDStream](https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/dstream/InputDStream.scala#L110):

```scala
def start() // should be: def start(): Unit
def stop()  // should be: def stop(): Unit
```

These methods exist in core, sql, and streaming; this PR fixes them.

## How was this patch tested?

N/A

## Which piece of scala style rule led to the changes?

The rule was added separately in apache#12396.

Author: Liwei Lin <[email protected]>

Closes apache#12389 from lw-lin/public-abstract-methods.
So we have 3 cases:
…d mllib-local into one place

## What changes were proposed in this pull request?

Move the json4s and breeze dependency declarations into the parent.

## How was this patch tested?

Should be no functional change, but Jenkins tests will verify that.

Author: Sean Owen <[email protected]>

Closes apache#12390 from srowen/SPARK-14612.
## What changes were proposed in this pull request?

When there are multiple attempts for a stage, we currently only reset internal accumulator values if all the tasks are resubmitted. It would make more sense to reset the accumulator values for each stage attempt. This will allow us to eventually get rid of the internal flag in the Accumulator class. This is part of my bigger effort to simplify accumulators and task metrics.

## How was this patch tested?

Covered by existing tests.

Author: Reynold Xin <[email protected]>

Closes apache#12378 from rxin/SPARK-14619.
## What changes were proposed in this pull request?

This patch removes some of the deprecated APIs in TaskMetrics. This is part of my bigger effort to simplify accumulators and task metrics.

## How was this patch tested?

N/A - only removals.

Author: Reynold Xin <[email protected]>

Closes apache#12375 from rxin/SPARK-14617.
…s a REPL line object

## What changes were proposed in this pull request?

When we clean a closure, if its outermost parent is not a closure, we won't clone and clean it, as cloning the user's objects is dangerous. However, if it's a REPL line object, which may carry a lot of unnecessary references (like the Hadoop conf, Spark conf, etc.), we should clean it, as it's not a user object. This PR improves the check for user objects to exclude REPL line objects.

## How was this patch tested?

Existing tests.

Author: Wenchen Fan <[email protected]>

Closes apache#12327 from cloud-fan/closure.
…nal Tables

#### What changes were proposed in this pull request?

This PR adds a test to ensure that dropping partitions of an external table will not delete data. cc yhuai andrewor14

#### How was this patch tested?

N/A

Author: gatorsmile <[email protected]>

This patch had conflicts when merged, resolved by Committer: Andrew Or <[email protected]>

Closes apache#12350 from gatorsmile/testDropPartition.
## What changes were proposed in this pull request?

JIRA: https://issues.apache.org/jira/browse/SPARK-14592

This patch adds native support for the DDL command `CREATE TABLE LIKE`. The SQL syntax is:

    CREATE TABLE table_name LIKE existing_table
    CREATE TABLE IF NOT EXISTS table_name LIKE existing_table

## How was this patch tested?

`HiveDDLCommandSuite`. `HiveQuerySuite` already tests `CREATE TABLE LIKE`.

Author: Liang-Chi Hsieh <[email protected]>

This patch had conflicts when merged, resolved by Committer: Andrew Or <[email protected]>

Closes apache#12362 from viirya/create-table-like.
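A usage sketch of the new command through the SQL API (assuming a SparkSession `spark` and an existing table `src`):

```scala
// Create an empty table with the same definition as src.
spark.sql("CREATE TABLE src_copy LIKE src")

// Idempotent variant.
spark.sql("CREATE TABLE IF NOT EXISTS src_copy LIKE src")
```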
Added binary toggle param to CountVectorizer feature transformer in PySpark. Created a unit test for using CountVectorizer with the binary toggle on.

Author: Bryan Cutler <[email protected]>

Closes apache#12308 from BryanCutler/binary-param-python-CountVectorizer-SPARK-13967.
## What changes were proposed in this pull request?

In Spark 1.4, we negated some metrics from RegressionEvaluator since CrossValidator always maximized metrics. This was fixed in 1.5, but the docs were not updated. This PR updates the docs.

## How was this patch tested?

No tests.

Author: Joseph K. Bradley <[email protected]>

Closes apache#12377 from jkbradley/regeval-doc.
…HashingTF in ML & MLlib

## What changes were proposed in this pull request?

This fix adds a binary toggle Param to PySpark HashingTF in ML & MLlib. If this toggle is set, then all non-zero counts will be set to 1. Note: this fix (SPARK-14238) extends SPARK-13963, where the Scala implementation was done.

## How was this patch tested?

This fix adds two tests to cover the code changes: one for HashingTF in PySpark's ML and one for HashingTF in PySpark's MLlib.

Author: Yong Tang <[email protected]>

Closes apache#12079 from yongtang/SPARK-14238.
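The Scala side of this toggle (from SPARK-13963, which this change mirrors in PySpark) can be sketched as follows; with `binary` set, all non-zero term counts are reported as 1.0:

```scala
import org.apache.spark.ml.feature.HashingTF

// setBinary(true) collapses term frequencies to 0/1, which suits
// models expecting Bernoulli rather than multinomial counts.
val hashingTF = new HashingTF()
  .setInputCol("words")
  .setOutputCol("features")
  .setBinary(true)
```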
## What changes were proposed in this pull request?
Currently, `LikeSimplification` handles the following four rules.
- 'a%' => expr.StartsWith("a")
- '%b' => expr.EndsWith("b")
- '%a%' => expr.Contains("a")
- 'a' => EqualTo("a")
This PR adds the following rule.
- 'a%b' => expr.Length() >= 2 && expr.StartsWith("a") && expr.EndsWith("b")
Here, 2 is statically calculated from "a".size + "b".size.
**Before**
```
scala> sql("select a from (select explode(array('abc','adc')) a) T where a like 'a%c'").explain()
== Physical Plan ==
WholeStageCodegen
: +- Filter a#5 LIKE a%c
: +- INPUT
+- Generate explode([abc,adc]), false, false, [a#5]
+- Scan OneRowRelation[]
```
**After**
```
scala> sql("select a from (select explode(array('abc','adc')) a) T where a like 'a%c'").explain()
== Physical Plan ==
WholeStageCodegen
: +- Filter ((length(a#5) >= 2) && (StartsWith(a#5, a) && EndsWith(a#5, c)))
: +- INPUT
+- Generate explode([abc,adc]), false, false, [a#5]
+- Scan OneRowRelation[]
```
## How was this patch tested?
Passes the Jenkins tests (including the new test case).
Author: Dongjoon Hyun <[email protected]>
Closes apache#12312 from dongjoon-hyun/SPARK-14545.
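A plain-Scala illustration of the semantics behind the new rule and its statically computed length guard (this is not the Catalyst code, just the reasoning it encodes):

```scala
// 'prefix%suffix' matches input iff input is long enough to contain
// both pieces without overlap, starts with prefix, and ends with suffix.
def likeStartEnd(input: String, prefix: String, suffix: String): Boolean =
  input.length >= prefix.length + suffix.length &&
    input.startsWith(prefix) && input.endsWith(suffix)

assert(likeStartEnd("abc", "a", "c"))  // 'a%c' matches "abc"
assert(likeStartEnd("ac", "a", "c"))   // '%' may match the empty string
assert(!likeStartEnd("a", "a", "a"))   // without the length guard, "a" would wrongly match 'a%a'
```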
Somehow my merge brought in a lot of other stuff in this PR. I need to close this PR and resubmit another one.

Test build #55887 has finished for PR 12132 at commit

A new PR has been created: #12406. All the comments are addressed there. Thanks!
## What changes were proposed in this pull request?

- Add `SHOW CREATE TABLE tableIdentifier` in the `SqlBase.g4` file.
- Add a `RunnableCommand` class for `ShowCreateTableCommand` that invokes `SessionCatalog.generateTableDDL`.
- Add a `generateTableDDL` method in `SessionCatalog` and override it in `HiveSessionCatalog`. `HiveSessionCatalog.generateTableDDL` creates DDL for metastore tables; for temp tables in Spark it may not be quite practical, so for now we throw an exception in `SessionCatalog.generateTableDDL`. If the community thinks it is also necessary, we can implement this later.
- `HiveSessionCatalog.generateTableDDL` branches out to generate Hive-syntax DDL for Hive tables, while it is also supposed to branch out to generate DataSource-syntax DDL for datasource tables.

If the table is created as `df.write.partitionBy("a").saveAsTable("t1")`, this datasource table will record the partitioning columns in the tblproperties of the metastore table. But when the datasource-syntax DDL is generated from this table, we lose the partitioning information in the DDL; similarly for bucketing and skew information, etc.

## How was this patch tested?

Unit tests with self-checks; ran the Hive, SQL, and Catalyst sbt tests.
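A minimal sketch of the command shape the description outlines, using the names it introduces (`ShowCreateTableCommand`, `generateTableDDL`); the signatures follow the later 2.x command API and are assumptions, and the actual code in the resubmitted PR may differ:

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.catalyst.TableIdentifier
import org.apache.spark.sql.execution.command.RunnableCommand

// Sketch: ask the session catalog to render DDL for the table and
// return it as a single-row, single-column result.
case class ShowCreateTableCommand(table: TableIdentifier) extends RunnableCommand {
  override def run(sparkSession: SparkSession): Seq[Row] = {
    // generateTableDDL is the method this PR proposes to add to
    // SessionCatalog (overridden in HiveSessionCatalog for metastore tables).
    val ddl = sparkSession.sessionState.catalog.generateTableDDL(table)
    Seq(Row(ddl))
  }
}
```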