[SPARK-14346] [SQL] Show create table #12406
Conversation
Can one of the admins verify this patch?
@yhuai @andrewor14 I need to resubmit this PR and have closed the earlier one because somehow my merge brought in a lot of other PRs' stuff, which made the other PR un-reviewable.
…k assembly ## What changes were proposed in this pull request? Removing references to assembly jar in documentation. Adding an additional (previously undocumented) usage of spark-submit to run examples. ## How was this patch tested? Ran spark-submit usage to ensure formatting was fine. Ran examples using SparkSubmit. Author: Mark Grover <[email protected]> Closes apache#12365 from markgrover/spark-14601.
…dHashMap
## What changes were proposed in this pull request?
This patch speeds up group-by aggregates by around 3-5x by leveraging an in-memory `AggregateHashMap` (please see apache#12161), an append-only aggregate hash map that can act as a 'cache' for extremely fast key-value lookups while evaluating aggregates (and fall back to the `BytesToBytesMap` if a given key isn't found). Architecturally, it is backed by a power-of-2-sized array for index lookups and a columnar batch that stores the key-value pairs. The index lookups in the array rely on linear probing (with a small number of maximum tries) and use an inexpensive hash function which makes it really efficient for a majority of lookups. However, using linear probing and an inexpensive hash function also makes it less robust as compared to the `BytesToBytesMap` (especially for a large number of keys or even for certain distribution of keys) and requires us to fall back on the latter for correctness.
## How was this patch tested?
Java HotSpot(TM) 64-Bit Server VM 1.8.0_73-b02 on Mac OS X 10.11.4
Intel(R) Core(TM) i7-4960HQ CPU 2.60GHz
Aggregate w keys:             Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
-------------------------------------------------------------------------------------------
codegen = F                          2124 / 2204          9.9         101.3       1.0X
codegen = T hashmap = F              1198 / 1364         17.5          57.1       1.8X
codegen = T hashmap = T               369 /  600         56.8          17.6       5.8X
Author: Sameer Agarwal <[email protected]>
Closes apache#12345 from sameeragarwal/tungsten-aggregate-integration.
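The commit above describes an append-only, linear-probed fast path backed by a power-of-2-sized array, with a fall-back to a more robust map when a key cannot be placed within a bounded number of probes. A minimal Python sketch of that general technique (the names and sizes are illustrative; Spark's `AggregateHashMap` is code-generated and columnar, so this is not its actual implementation):
```python
# Toy sketch of a linear-probed aggregate fast path with a robust fallback.
class ProbingAggMap:
    def __init__(self, capacity_bits=4, max_probes=3):
        self.capacity = 1 << capacity_bits      # power-of-2 sized array
        self.slots = [None] * self.capacity     # (key, running sum) pairs
        self.max_probes = max_probes            # bounded linear probing
        self.fallback = {}                      # stands in for BytesToBytesMap

    def add(self, key, value):
        base = hash(key) & (self.capacity - 1)  # cheap masking instead of modulo
        for probe in range(self.max_probes):
            pos = (base + probe) & (self.capacity - 1)
            slot = self.slots[pos]
            if slot is None:                    # empty slot: insert
                self.slots[pos] = (key, value)
                return
            if slot[0] == key:                  # existing key: update aggregate
                self.slots[pos] = (key, slot[1] + value)
                return
        # Too many probes: fall back to the slower but robust map.
        self.fallback[key] = self.fallback.get(key, 0) + value
```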
…eAggregate ## What changes were proposed in this pull request? `ExpressionEncoder` is just a container for serialization and deserialization expressions; we can use these expressions to build `TypedAggregateExpression` directly, so that it can fit in `DeclarativeAggregate`, which is more efficient. One trick is that each buffer serializer expression will refer to the result object of the serialization function call. To avoid re-calculating this result object, we can serialize the buffer object to a single struct field, so that we can use a special `Expression` to evaluate the result object only once. ## How was this patch tested? existing tests Author: Wenchen Fan <[email protected]> Closes apache#12067 from cloud-fan/typed_udaf.
…t export/import ## What changes were proposed in this pull request? PySpark ml GBTClassifier, Regressor support export/import. ## How was this patch tested? Doc test. cc jkbradley Author: Yanbo Liang <[email protected]> Closes apache#12383 from yanboliang/spark-14374.
Closes apache#12408 Closes apache#12401
… in mllib-local
## What changes were proposed in this pull request?
This task copies the Vector and Matrix classes from the mllib package to the ml package in the mllib-local jar. The UDTs and the `since` annotation in the ml vector and matrix classes are removed for now; UDTs will be handled by SPARK-14487, and `since` will be replaced by `/* since 1.2.0 */`.
The BLAS implementation will be copied, and some of the test utilities will be copied as well.
Summary of changes:
1. In mllib-local/src/main/scala/org/apache/spark/**ml**/linalg/BLAS.scala
- Copied from mllib/src/main/scala/org/apache/spark/**mllib**/linalg/BLAS.scala
- logDebug("gemm: alpha is equal to 0 and beta is equal to 1. Returning C.") is removed in ml version.
2. In mllib-local/src/main/scala/org/apache/spark/**ml**/linalg/Matrices.scala
- Copied from mllib/src/main/scala/org/apache/spark/**mllib**/linalg/Matrices.scala
- `Since` was removed; we'll use a standard `/* Since */` Javadoc comment instead. This will be done in another PR.
- `UDT` related code was removed, and will use `SPARK-13944` apache#12259 to replace the annotation.
3. In mllib-local/src/main/scala/org/apache/spark/**ml**/linalg/Vectors.scala
- Copied from mllib/src/main/scala/org/apache/spark/**mllib**/linalg/Vectors.scala
- `Since` was removed.
- `UDT` related code was removed.
- In `def parseNumeric`, it was throwing `throw new SparkException(s"Cannot parse $other.")`, and now it's throwing `throw new IllegalArgumentException(s"Cannot parse $other.")`
4. In mllib/src/main/scala/org/apache/spark/**mllib**/linalg/Vectors.scala
- For consistency with ML version of vector, `def parseNumeric` is now throwing `throw new IllegalArgumentException(s"Cannot parse $other.")`
5. mllib/src/main/scala/org/apache/spark/**mllib**/util/NumericParser.scala is moved to mllib-local/src/main/scala/org/apache/spark/**ml**/util/NumericParser.scala
- All the `throw new SparkException` were replaced by `throw new IllegalArgumentException`
## How was this patch tested?
unit tests
Author: DB Tsai <[email protected]>
Closes apache#12317 from dbtsai/dbtsai-ml-vector.
* SHOW CREATE TABLE tableIdentifier
* }}}
*/
case class ShowCreateTableCommand(
Shouldn't we just pass in a TableIdentifier?
@hvanhovell Thanks! Yes, I should have. Will change it.
…Optimizer ## What changes were proposed in this pull request? Removed duplicated generation of `ids` in OnlineLDAOptimizer. ## How was this patch tested? tested with existing unit tests. Author: Pravin Gadakh <[email protected]> Closes apache#12176 from pravingadakh/SPARK-14370.
…Message ## What changes were proposed in this pull request? Round the memory bytes and convert the value back to Long, its original type. This change fixes the formatting issue in the exception message. ## How was this patch tested? Manual tests were done in a CDH cluster. Author: Peter Ableda <[email protected]> Closes apache#12392 from peterableda/SPARK-14633.
…:glm for more family and link functions
## What changes were proposed in this pull request?
Expose R-like summary statistics in SparkR::glm for more family and link functions. Note: Not all values in R [summary.glm](http://stat.ethz.ch/R-manual/R-patched/library/stats/html/summary.glm.html) are exposed, we only provide the most commonly used statistics in this PR. More statistics can be added in the followup work.
## How was this patch tested?
Unit tests.
SparkR Output:
```
Deviance Residuals:
(Note: These are approximate quantiles with relative error <= 0.01)
     Min        1Q    Median        3Q       Max
-0.95096  -0.16585  -0.00232   0.17410   0.72918

Coefficients:
                     Estimate  Std. Error  t value   Pr(>|t|)
(Intercept)           1.6765   0.23536      7.1231   4.4561e-11
Sepal_Length          0.34988  0.046301     7.5566   4.1873e-12
Species_versicolor   -0.98339  0.072075   -13.644    0
Species_virginica    -1.0075   0.093306   -10.798    0

(Dispersion parameter for gaussian family taken to be 0.08351462)

    Null deviance: 28.307  on 149  degrees of freedom
Residual deviance: 12.193  on 146  degrees of freedom
AIC: 59.22

Number of Fisher Scoring iterations: 1
```
R output:
```
Deviance Residuals:
     Min        1Q    Median        3Q       Max
-0.95096  -0.16522   0.00171   0.18416   0.72918

Coefficients:
                   Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)         1.67650    0.23536     7.123  4.46e-11 ***
Sepal.Length        0.34988    0.04630     7.557  4.19e-12 ***
Speciesversicolor  -0.98339    0.07207   -13.644   < 2e-16 ***
Speciesvirginica   -1.00751    0.09331   -10.798   < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for gaussian family taken to be 0.08351462)

    Null deviance: 28.307  on 149  degrees of freedom
Residual deviance: 12.193  on 146  degrees of freedom
AIC: 59.217

Number of Fisher Scoring iterations: 2
```
cc mengxr Author: Yanbo Liang <[email protected]> Closes apache#12393 from yanboliang/spark-13925.
…pwords ## What changes were proposed in this pull request? The default stopwords were a Java object. They are no longer. ## How was this patch tested? Unit test which failed before the fix Author: Joseph K. Bradley <[email protected]> Closes apache#12422 from jkbradley/pyspark-stopwords.
…set` method ## What changes were proposed in this pull request? Param setters in python previously accessed the _paramMap directly to update values. The `_set` method now implements type checking, so it should be used to update all parameters. This PR eliminates all direct accesses to `_paramMap` besides the one in the `_set` method to ensure type checking happens. Additional changes: * [SPARK-13068](apache#11663) missed adding type converters in evaluation.py so those are done here * An incorrect `toBoolean` type converter was used for StringIndexer `handleInvalid` param in previous PR. This is fixed here. ## How was this patch tested? Existing unit tests verify that parameters are still set properly. No new functionality is actually added in this PR. Author: sethah <[email protected]> Closes apache#11939 from sethah/SPARK-14104.
…ation ## What changes were proposed in this pull request? Change SubquerySuite to validate test results utilizing checkAnswer helper method ## How was this patch tested? Existing tests Author: Luciano Resende <[email protected]> Closes apache#12269 from lresende/SPARK-13419.
## What changes were proposed in this pull request? Before this PR, we create accumulators at driver side(and register them) and send them to executor side, then we create `TaskMetrics` with these accumulators at executor side. After this PR, we will create `TaskMetrics` at driver side and send it to executor side, so that we can create accumulators inside `TaskMetrics` directly, which is cleaner. ## How was this patch tested? existing tests. Author: Wenchen Fan <[email protected]> Closes apache#12472 from cloud-fan/acc.
## What changes were proposed in this pull request? https://issues.apache.org/jira/browse/SPARK-14600 This PR makes `Expand.output` have different attributes from the grouping attributes produced by the underlying `Project`, as they have different meaning, so that we can safely push down filter through `Expand` ## How was this patch tested? existing tests. Author: Wenchen Fan <[email protected]> Closes apache#12496 from cloud-fan/expand.
…xpressionToSQLSuite ## What changes were proposed in this pull request? Since [SPARK-12719: SQL Generation supports for generators](https://issues.apache.org/jira/browse/SPARK-12719) was resolved, this PR enables the related testcases: `explode()` and `json_tuple()`. ## How was this patch tested? Pass the Jenkins tests (with re-enabled test cases). Author: Dongjoon Hyun <[email protected]> Closes apache#12329 from dongjoon-hyun/minor_enable_testcases.
## What changes were proposed in this pull request?
This issue aims to expose Scala `bround` function in Python/R API.
`bround` function is implemented in SPARK-14614 by extending current `round` function.
We used the following semantics from Hive.
```java
public static double bround(double input, int scale) {
if (Double.isNaN(input) || Double.isInfinite(input)) {
return input;
}
return BigDecimal.valueOf(input).setScale(scale, RoundingMode.HALF_EVEN).doubleValue();
}
```
After this PR, `pyspark` and `sparkR` also support `bround` function.
**PySpark**
```python
>>> from pyspark.sql.functions import bround
>>> sqlContext.createDataFrame([(2.5,)], ['a']).select(bround('a', 0).alias('r')).collect()
[Row(r=2.0)]
```
**SparkR**
```r
> df = createDataFrame(sqlContext, data.frame(x = c(2.5, 3.5)))
> head(collect(select(df, bround(df$x, 0))))
bround(x, 0)
1 2
2 4
```
## How was this patch tested?
Pass the Jenkins tests (including new testcases).
Author: Dongjoon Hyun <[email protected]>
Closes apache#12509 from dongjoon-hyun/SPARK-14639.
…rn a function `MutableProjection` is not thread-safe and we won't use it in multiple threads. I think the reason that we return `() => MutableProjection` is not about thread safety, but to save the cost of generating code when we need the same but individual mutable projections. However, I only found one place that uses this [feature](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/Window.scala#L122-L123), and compared to the trouble it brings, I think we should generate `MutableProjection` directly instead of returning a function. Author: Wenchen Fan <[email protected]> Closes apache#7373 from cloud-fan/project.
## What changes were proposed in this pull request? The DAG visualization can cause an OOM when generating the DOT file. This happens because clusters are not correctly deduped by a contains check because they use the default equals implementation. This adds a working equals implementation. ## How was this patch tested? This adds a test suite that checks the new equals implementation. Author: Ryan Blue <[email protected]> Closes apache#12437 from rdblue/SPARK-14679-fix-ui-oom.
… of call FileSystem.get(conf) ## What changes were proposed in this pull request? - replaced `FileSystem.get(conf)` calls with `path.getFileSystem(conf)` ## How was this patch tested? N/A Author: Liwei Lin <[email protected]> Closes apache#12450 from lw-lin/fix-fs-get.
… HashingTF ## What changes were proposed in this pull request? Currently, the docs for TF-IDF only refer to using HashingTF with IDF. However, CountVectorizer can also be used. We should probably amend the user guide and examples to show this. ## How was this patch tested? unit tests and doc generation Author: Yuhao Yang <[email protected]> Closes apache#12454 from hhbyyh/tfdoc.
…page Updated the log page by replacing the current pagination with a javascript-based infinite scroll solution Author: Alex Bozarth <[email protected]> Closes apache#10910 from ajbozarth/spark8171.
## What changes were proposed in this pull request?
This patch provides a first cut of python APIs for structured streaming. This PR provides the new classes:
- ContinuousQuery
- Trigger
- ProcessingTime
in pyspark under `pyspark.sql.streaming`.
In addition, it contains the new methods added under:
- `DataFrameWriter`
a) `startStream`
b) `trigger`
c) `queryName`
- `DataFrameReader`
a) `stream`
- `DataFrame`
a) `isStreaming`
This PR doesn't contain all methods exposed for `ContinuousQuery`, for example:
- `exception`
- `sourceStatuses`
- `sinkStatus`
They may be added in a follow up.
This PR also contains some very minor doc fixes in the Scala side.
## How was this patch tested?
Python doc tests
TODO:
- [ ] verify Python docs look good
Author: Burak Yavuz <[email protected]>
Author: Burak Yavuz <[email protected]>
Closes apache#12320 from brkyvz/stream-python.
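A rough sketch of how the Python methods listed above might be chained; the paths, format, and trigger interval are placeholders, and the method names follow this PR's description (exact signatures may differ in later releases):
```python
# Illustrative only: names follow the list in the PR description above.
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="stream-sketch")
sqlContext = SQLContext(sc)

# DataFrameReader.stream: create a streaming DataFrame from a directory.
streaming_df = sqlContext.read.format("text").stream("/tmp/input")
print(streaming_df.isStreaming)  # DataFrame.isStreaming -> True

# DataFrameWriter.queryName / trigger / startStream return a ContinuousQuery.
query = (streaming_df.write
         .format("parquet")
         .option("checkpointLocation", "/tmp/checkpoint")
         .queryName("text_to_parquet")
         .trigger(processingTime="10 seconds")   # ProcessingTime trigger
         .startStream("/tmp/output"))
```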
## What changes were proposed in this pull request? Restore `ec2-scripts.md` as a redirect to amplab/spark-ec2 docs ## How was this patch tested? `jekyll build` and checked with the browser Author: Sean Owen <[email protected]> Closes apache#12534 from srowen/SPARK-14742.
## What changes were proposed in this pull request? This proposal removes the class `HttpServer`. With internal file/jar/class transmission moved to the RPC layer, there is currently no code using `HttpServer`, so this proposes to remove it. ## How was this patch tested? Unit tests were verified locally. Author: jerryshao <[email protected]> Closes apache#12526 from jerryshao/SPARK-14725.
… method ## What changes were proposed in this pull request? apache#11939 made Python param setters use the `_set` method. This PR fixes the setters that were missed. ## How was this patch tested? Existing tests. cc jkbradley sethah Author: Yanbo Liang <[email protected]> Closes apache#12531 from yanboliang/setters-omissive.
…ted sample std ## What changes were proposed in this pull request? Currently, MLlib's StandardScaler scales columns using the corrected standard deviation (sqrt of unbiased variance). This matches what R's scale package does. This PR documents this fact. ## How was this patch tested? doc only Author: Joseph K. Bradley <[email protected]> Closes apache#12519 from jkbradley/scaler-variance-doc.
…artitioned directory ## What changes were proposed in this pull request? Consider the directory structure `dir/col=X/some-files`. If we create a text-format streaming dataframe on `dir/col=X/`, it should not treat `col` as a partitioning column. Even though the streaming dataframe does not do so, the generated batch dataframes pick up `col` as a partitioning column, causing a mismatch between the streaming source schema and the generated df schema. This leads to a runtime failure: ``` 18:55:11.262 ERROR org.apache.spark.sql.execution.streaming.StreamExecution: Query query-0 terminated with error java.lang.AssertionError: assertion failed: Invalid batch: c#2 != c#7,type#8 ``` The reason is that the partition-inferring code has no notion of a base path above which it should not search for partitions. This PR makes sure that the batch DF is generated with the basePath set to the original path on which the file stream source is defined. ## How was this patch tested? New unit test Author: Tathagata Das <[email protected]> Closes apache#12517 from tdas/SPARK-14741.
…nState and Create a SparkSession class ## What changes were proposed in this pull request? This PR has two main changes. 1. Move Hive-specific methods from HiveContext to HiveSessionState, which help the work of removing HiveContext. 2. Create a SparkSession Class, which will later be the entry point of Spark SQL users. ## How was this patch tested? Existing tests This PR is trying to fix test failures of apache#12485. Author: Andrew Or <[email protected]> Author: Yin Huai <[email protected]> Closes apache#12522 from yhuai/spark-session.
## What changes were proposed in this pull request? apache#11663 adds type conversion functionality for parameters in Pyspark. This PR finds the ```Param```s that did not pass the corresponding ```TypeConverter``` argument and fixes them. After this PR, all params in pyspark/ml/ use ```TypeConverter```. ## How was this patch tested? Existing tests. cc jkbradley sethah Author: Yanbo Liang <[email protected]> Closes apache#12529 from yanboliang/typeConverter.
…action ## What changes were proposed in this pull request? This PR adds a special log for FileStreamSink for two purposes: - Versioning. A future Spark version should be able to read the metadata of an old FileStreamSink. - Compaction. As reading from many small files is usually pretty slow, we should compact small metadata files into big files. FileStreamSinkLog has a new log format instead of Java serialization format. It will write one log file for each batch. The first line of the log file is the version number, and there are multiple JSON lines following. Each JSON line is a JSON format of FileLog. FileStreamSinkLog will compact log files every "spark.sql.sink.file.log.compactLen" batches into a big file. When doing a compact, it will read all history logs and merge them with the new batch. During the compaction, it will also delete the files that are deleted (marked by FileLog.action). When the reader uses allLogs to list all files, this method only returns the visible files (drops the deleted files). ## How was this patch tested? FileStreamSinkLogSuite Author: Shixiong Zhu <[email protected]> Closes apache#12435 from zsxwing/sink-log.
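A small Python sketch of the log layout described above: a version number on the first line followed by one JSON record per line, with periodic compaction that drops deleted entries. The field names and file handling here are assumptions for illustration, not Spark's actual format:
```python
import json

def write_batch_log(path, version, entries):
    """Write one batch log: first line is the version, then one JSON line per entry."""
    with open(path, "w") as f:
        f.write("%d\n" % version)
        for entry in entries:                   # e.g. {"path": "...", "action": "add"}
            f.write(json.dumps(entry) + "\n")

def read_batch_log(path):
    """Read a batch log back into (version, list of entries)."""
    with open(path) as f:
        version = int(f.readline())
        return version, [json.loads(line) for line in f if line.strip()]

def compact(paths):
    """Merge all history logs, keeping only files whose latest action is not a delete."""
    latest = {}
    for p in paths:
        _, entries = read_batch_log(p)
        for entry in entries:
            latest[entry["path"]] = entry
    return [e for e in latest.values() if e["action"] != "delete"]
```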
…ments
## What changes were proposed in this pull request?
Expand the possible ways to interact with the contents of a `pyspark.sql.types.StructType` instance.
- Iterating a `StructType` will iterate its fields
- `[field.name for field in my_structtype]`
- Indexing with a string will return a field by name
- `my_structtype['my_field_name']`
- Indexing with an integer will return a field by position
- `my_structtype[0]`
- Indexing with a slice will return a new `StructType` with just the chosen fields:
- `my_structtype[1:3]`
- The length is the number of fields (should also provide "truthiness" for free)
- `len(my_structtype) == 2`
## How was this patch tested?
Extended the unit test coverage in the accompanying `tests.py`.
Author: Sheamus K. Parkes <[email protected]>
Closes apache#12251 from skparkes/pyspark-structtype-enhance.
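A short sketch of the access patterns listed above; the schema and field names are made up for illustration:
```python
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

schema = StructType([
    StructField("name", StringType()),
    StructField("age", IntegerType()),
    StructField("city", StringType()),
])

[f.name for f in schema]   # iteration over fields -> ['name', 'age', 'city']
schema["age"]              # index by name -> the 'age' StructField
schema[0]                  # index by position -> the 'name' StructField
schema[1:3]                # slice -> StructType with the 'age' and 'city' fields
len(schema)                # number of fields -> 3
```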
## What changes were proposed in this pull request? 3 testcases namely, ``` "count is partially aggregated" "count distinct is partially aggregated" "mixed aggregates are partially aggregated" ``` were failing when running PlannerSuite individually. The PR provides a fix for this. ## How was this patch tested? unit tests (If this patch involves UI changes, please attach a screenshot; otherwise, remove this) Author: Subhobrata Dey <[email protected]> Closes apache#12532 from sbcd90/plannersuitetestsfix.
@xwu0226 use `git rebase upstream/master`. Do not use `git merge upstream/master`. I had the same issue before: `git merge` will add others' commits to your PR, while `git rebase` will discard others' commits.
Thanks @wangmiao1981. This happened again: all the commits after my last commit got pulled into this PR. I need to close it and open a new one. Will submit a new PR.
What changes were proposed in this pull request?
Allow users to issue the "SHOW CREATE TABLE" command natively in SparkSQL.
-- For tables that are created by Hive, this command will display the DDL in Hive syntax. If the syntax includes a CLUSTERED BY, SKEWED BY, or STORED BY clause, there will be a warning message saying that this DDL is not yet supported in SparkSQL native DDL.
-- For tables that are created by data source DDL, such as "CREATE TABLE ... USING ... OPTIONS (...)", it will show the DDL in this syntax.
-- For tables that are created by the DataFrame API, such as "df.write.partitionBy(...).saveAsTable(...)", the command currently displays DDL with the syntax "CREATE TABLE ... USING ... OPTIONS (...)". However, this syntax loses the partitioning information. It is proposed to display the create-table statement in the DataFrame API format (will be done in another PR).
How was this patch tested?
Unit tests are created.
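A minimal sketch of issuing the new command from PySpark; the table definition and HiveContext setup below are illustrative assumptions, not part of this patch:
```python
from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext(appName="show-create-table-sketch")
sqlContext = HiveContext(sc)

# A hypothetical table, created through Hive DDL.
sqlContext.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")

# SHOW CREATE TABLE returns the DDL that recreates the table: Hive syntax for
# Hive-created tables, CREATE TABLE ... USING ... OPTIONS (...) for data source tables.
for row in sqlContext.sql("SHOW CREATE TABLE src").collect():
    print(row[0])
```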