
merge from apache master #1

Merged
shanyu merged 1948 commits into shanyu:master from apache:master on Sep 6, 2019

Conversation


@shanyu shanyu commented Sep 6, 2019

What changes were proposed in this pull request?

(Please fill in changes proposed in this fix)

How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

Please review http://spark.apache.org/contributing.html before opening a pull request.

HyukjinKwon and others added 30 commits August 5, 2019 23:36
…ExtractPythonUDFFromJoinCondition and move to 'Extract Python UDFs'

## What changes were proposed in this pull request?

This PR targets to rename `PullOutPythonUDFInJoinCondition` to `ExtractPythonUDFFromJoinCondition` and move to 'Extract Python UDFs' together with other Python UDF related rules.

Currently, the `PullOutPythonUDFInJoinCondition` rule sits alone, outside the other 'Extract Python UDFs' rules.

The name `ExtractPythonUDFFromJoinCondition` also matches the existing Python UDF extraction rule names.

## How was this patch tested?

Existing tests should cover.

Closes #25358 from HyukjinKwon/move-python-join-rule.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
## What changes were proposed in this pull request?
Remove the redundant and confusing `transformImpl` method in RF & GBT:
1. In `GBTClassifier` & `RandomForestClassifier`, the real `transform` methods are inherited from `ProbabilisticClassificationModel`, which can deal with multiple output columns.
The `transformImpl` method, which deals with only one column (`predictionCol`), effectively does nothing, which is quite confusing.

2. In `GBTRegressor` & `RandomForestRegressor`, `transformImpl` does exactly what the superclass `PredictionModel` does (except model broadcasting), so it can be removed.

## How was this patch tested?
existing suites

Closes #25256 from zhengruifeng/del_ensamble_transformImpl.

Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?

In the PR, I propose to use existing expressions `DayOfYear`, `WeekDay` and `DayOfWeek`, and support additional parameters of `extract()` for feature parity with PostgreSQL (https://www.postgresql.org/docs/11/functions-datetime.html#FUNCTIONS-DATETIME-EXTRACT):

1. `dow` - the day of the week as Sunday (0) to Saturday (6)
2. `isodow` - the day of the week as Monday (1) to Sunday (7)
3. `doy` - the day of the year (1 - 365/366)

Here are examples:
```sql
spark-sql> SELECT EXTRACT(DOW FROM TIMESTAMP '2001-02-16 20:38:40');
5
spark-sql> SELECT EXTRACT(ISODOW FROM TIMESTAMP '2001-02-18 20:38:40');
7
spark-sql> SELECT EXTRACT(DOY FROM TIMESTAMP '2001-02-16 20:38:40');
47
```

## How was this patch tested?

Updated `extract.sql`.

Closes #25367 from MaxGekk/extract-ext.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
## What changes were proposed in this pull request?

Adds the higher-order function `forall`, which tests whether a predicate holds for every element of an array.
The function is implemented in `org.apache.spark.sql.catalyst.expressions.ArrayForAll`.
The function is added to the function registry under the pretty name `forall`.
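
A minimal usage sketch (the query is mine, not from the PR), assuming a `SparkSession` named `spark`:

```scala
// `forall` returns true only if the lambda predicate holds for every array element.
spark.sql("SELECT forall(array(1, 2, 3), x -> x > 0) AS all_positive").show()   // true
spark.sql("SELECT forall(array(1, -2, 3), x -> x > 0) AS all_positive").show()  // false
```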

## How was this patch tested?

I've added appropriate unit tests for the new ArrayForAll expression in
`sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/HigherOrderFunctionsSuite.scala`.

Also added tests for the function in `sql/core/src/test/scala/org/apache/spark/sql/DataFrameFunctionsSuite.scala`.

Not sure who is best to ask about this PR so:
 HyukjinKwon rxin gatorsmile ueshin srowen hvanhovell gatorsmile

Closes #24761 from nvander1/feature/for_all.

Lead-authored-by: Nik Vanderhoof <nikolasrvanderhoof@gmail.com>
Co-authored-by: Nik <nikolasrvanderhoof@gmail.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
…say column signature function is by default

## What changes were proposed in this pull request?

Add a guideline for DataFrame functions that says:
```
This function APIs usually have methods with Column signature only because it can support not only Column but also other types such as a native string. The other variants currently exist for historical reasons.
```

## How was this patch tested?

N/A

Closes #25355 from WeichenXu123/update_functions_guide2.

Authored-by: WeichenXu <weichen.xu@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
## What changes were proposed in this pull request?

Implements the `DESCRIBE TABLE` logical and physical plans for data source v2 tables.

## How was this patch tested?

Added unit tests to `DataSourceV2SQLSuite`.

Closes #25040 from mccheah/describe-table-v2.

Authored-by: mcheah <mcheah@palantir.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
…ests

## What changes were proposed in this pull request?

This PR is a follow-up to #25074

## How was this patch tested?

Pass the Jenkins with the newly update test files.

Closes #25366 from beliefer/uncomment-boolean-test.

Authored-by: gengjiaan <gengjiaan@360.cn>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
## What changes were proposed in this pull request?

Sometimes when you explain a query, it gets stuck for a while. What's worse, it gets stuck again if you explain again.

This is caused by `FileSourceScanExec`:
1. In its `toString`, it needs to report the number of partitions it reads. This needs to query the hive metastore.
2. In its `outputOrdering`, it needs to get all the files. This needs to query the hive metastore.

This PR fixes by:
1. `toString` does not need to report the number of partitions it reads; we should report it via SQL metrics.
2. The `outputOrdering` is not very useful. We can only apply it if a) all the bucket columns are read and b) there is only one file in each bucket. This condition is really hard to meet, and even if we meet it, sorting an already sorted file is pretty fast and avoiding the sort is not that useful. I think it's worth giving up this optimization so that explain doesn't get stuck.

## How was this patch tested?

existing tests

Closes #25328 from cloud-fan/ui.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
## What changes were proposed in this pull request?

This PR adds a test (`SparkThriftServerProtocolVersionsSuite`) covering different versions of the Thrift protocol, because we use different logic to handle the `RowSet`:
https://github.com/apache/spark/blob/02c33694c8254f69cb36c71c0876194dccdbc014/sql/hive-thriftserver/v1.2.1/src/main/java/org/apache/hive/service/cli/RowSetFactory.java#L28-L40

When adding these test cases, three bugs were found:
[SPARK-26969](https://issues.apache.org/jira/browse/SPARK-26969): Using ODBC not able to see the data in table when datatype is decimal
[SPARK-28463](https://issues.apache.org/jira/browse/SPARK-28463): Thriftserver throws BigDecimal incompatible with HiveDecimal
[SPARK-28474](https://issues.apache.org/jira/browse/SPARK-28474): Lower JDBC client version(Hive 0.12) cannot read binary type

## How was this patch tested?

N/A

Closes #25228 from wangyum/SPARK-28472.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
…ive Query Execution

## What changes were proposed in this pull request?

Subqueries do not have their own execution id, thus when calling `AdaptiveSparkPlanExec.onUpdatePlan`, it will actually get the `QueryExecution` instance of the main query, which is wasteful and problematic. It could cause issues like stack overflows or deadlocks in some circumstances.

This PR fixes this issue by making `AdaptiveSparkPlanExec` compare the `QueryExecution` object retrieved by current execution ID against the `QueryExecution` object from which this plan is created, and only update the UI when the two instances are the same.

## How was this patch tested?

Manual tests on TPC-DS queries.

Closes #25316 from maryannxue/aqe-updateplan-fix.

Authored-by: maryannxue <maryannxue@apache.org>
Signed-off-by: herman <herman@databricks.com>
…flow

## What changes were proposed in this pull request?

The flag `spark.sql.decimalOperations.nullOnOverflow` is not honored by the `Cast` operator. This means that a cast which causes an overflow currently returns `null`.

The PR makes `Cast` respect that flag, i.e. when it is set to false and a decimal overflow occurs, an exception is thrown.
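
A hedged sketch of the resulting behavior, using the flag name from the description above (the exact exception type and message may differ):

```scala
// With nullOnOverflow disabled, an overflowing decimal cast should raise an error
// instead of silently producing NULL.
spark.conf.set("spark.sql.decimalOperations.nullOnOverflow", "false")
spark.sql("SELECT CAST(12345678901234567890 AS DECIMAL(5, 2))").show()
// expected: an exception rather than a NULL value
```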

## How was this patch tested?

Added UT

Closes #25253 from mgaido91/SPARK-28470.

Authored-by: Marco Gaido <marcogaido91@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
…talogs

## What changes were proposed in this pull request?

In `Catalogs.load`, the `pluginClassName` in the following code
```
String pluginClassName = conf.getConfString("spark.sql.catalog." + name, null);
```
is always null for built-in catalogs, e.g. even though there is a SQLConf entry `spark.sql.catalog.session`.

This is because of #18852: SQLConf.conf.getConfString(key, null) always returns null.

## How was this patch tested?

Applied the code changes of #24768 and tried loading the session catalog.

Closes #25094 from gengliangwang/fixCatalogLoad.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Burak Yavuz <brkyvz@gmail.com>
## What changes were proposed in this pull request?

If you build Spark distributions, you potentially end up with a `python/.eggs` directory in your working copy that is not currently ignored by Spark's `.gitignore` file. Since these are transient build artifacts, there is no reason to ever commit them to Git, so this directory should be added to the `.gitignore` list.

## How was this patch tested?

Verified the offending artifacts were no longer reported as untracked content by Git

Closes #25380 from rvesse/patch-1.

Authored-by: Rob Vesse <rvesse@dotnetrdf.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
…ivision internal

## What changes were proposed in this pull request?

This PR makes `spark.sql.function.preferIntegralDivision` an internal configuration because it is only used for PostgreSQL test cases.

More details:
#25158 (comment)

## How was this patch tested?

N/A

Closes #25376 from wangyum/SPARK-28395-2.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
## What changes were proposed in this pull request?

Fix a typo in the Javadoc.

## How was this patch tested?

N/A

Signed-off-by: Yishuang Lu <luystugmail.com>

Closes #25377 from lys0716/dev.

Authored-by: Yishuang Lu <luystu@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
…ema=True)`

## What changes were proposed in this pull request?

Add missing validation for `LongType` in `pyspark.sql.types._make_type_verifier`.

## How was this patch tested?

Doctests / unittests / manual tests.

Unpatched version:
```
In [23]: s.createDataFrame([{'x': 1 << 64}], StructType([StructField('x', LongType())])).collect()
Out[23]: [Row(x=None)]
```

Patched:
```
In [5]: s.createDataFrame([{'x': 1 << 64}], StructType([StructField('x', LongType())])).collect()
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-5-c1740fcadbf9> in <module>
----> 1 s.createDataFrame([{'x': 1 << 64}], StructType([StructField('x', LongType())])).collect()

/usr/local/lib/python3.5/site-packages/pyspark/sql/session.py in createDataFrame(self, data, schema, samplingRatio, verifySchema)
    689             rdd, schema = self._createFromRDD(data.map(prepare), schema, samplingRatio)
    690         else:
--> 691             rdd, schema = self._createFromLocal(map(prepare, data), schema)
    692         jrdd = self._jvm.SerDeUtil.toJavaArray(rdd._to_java_object_rdd())
    693         jdf = self._jsparkSession.applySchemaToPythonRDD(jrdd.rdd(), schema.json())

/usr/local/lib/python3.5/site-packages/pyspark/sql/session.py in _createFromLocal(self, data, schema)
    405         # make sure data could consumed multiple times
    406         if not isinstance(data, list):
--> 407             data = list(data)
    408
    409         if schema is None or isinstance(schema, (list, tuple)):

/usr/local/lib/python3.5/site-packages/pyspark/sql/session.py in prepare(obj)
    671
    672             def prepare(obj):
--> 673                 verify_func(obj)
    674                 return obj
    675         elif isinstance(schema, DataType):

/usr/local/lib/python3.5/site-packages/pyspark/sql/types.py in verify(obj)
   1427     def verify(obj):
   1428         if not verify_nullability(obj):
-> 1429             verify_value(obj)
   1430
   1431     return verify

/usr/local/lib/python3.5/site-packages/pyspark/sql/types.py in verify_struct(obj)
   1397             if isinstance(obj, dict):
   1398                 for f, verifier in verifiers:
-> 1399                     verifier(obj.get(f))
   1400             elif isinstance(obj, Row) and getattr(obj, "__from_dict__", False):
   1401                 # the order in obj could be different than dataType.fields

/usr/local/lib/python3.5/site-packages/pyspark/sql/types.py in verify(obj)
   1427     def verify(obj):
   1428         if not verify_nullability(obj):
-> 1429             verify_value(obj)
   1430
   1431     return verify

/usr/local/lib/python3.5/site-packages/pyspark/sql/types.py in verify_long(obj)
   1356             if obj < -9223372036854775808 or obj > 9223372036854775807:
   1357                 raise ValueError(
-> 1358                     new_msg("object of LongType out of range, got: %s" % obj))
   1359
   1360         verify_value = verify_long

ValueError: field x: object of LongType out of range, got: 18446744073709551616
```

Closes #25117 from simplylizz/master.

Authored-by: Anton Yanchenko <simplylizz@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
## What changes were proposed in this pull request?

This PR fixes the issue that the Hive 0.12 JDBC client cannot handle the binary type:
```sql
Connected to: Hive (version 3.0.0-SNAPSHOT)
Driver: Hive (version 0.12.0)
Transaction isolation: TRANSACTION_REPEATABLE_READ
Beeline version 0.12.0 by Apache Hive
0: jdbc:hive2://localhost:10000> SELECT cast('ABC' as binary);
Error: java.lang.ClassCastException: [B incompatible with java.lang.String (state=,code=0)
```

Server log:
```
19/08/07 10:10:04 WARN ThriftCLIService: Error fetching results:
java.lang.RuntimeException: java.lang.ClassCastException: [B incompatible with java.lang.String
	at org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:83)
	at org.apache.hive.service.cli.session.HiveSessionProxy.access$000(HiveSessionProxy.java:36)
	at org.apache.hive.service.cli.session.HiveSessionProxy$1.run(HiveSessionProxy.java:63)
	at java.security.AccessController.doPrivileged(AccessController.java:770)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1746)
	at org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:59)
	at com.sun.proxy.$Proxy26.fetchResults(Unknown Source)
	at org.apache.hive.service.cli.CLIService.fetchResults(CLIService.java:455)
	at org.apache.hive.service.cli.thrift.ThriftCLIService.FetchResults(ThriftCLIService.java:621)
	at org.apache.hive.service.cli.thrift.TCLIService$Processor$FetchResults.getResult(TCLIService.java:1553)
	at org.apache.hive.service.cli.thrift.TCLIService$Processor$FetchResults.getResult(TCLIService.java:1538)
	at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:38)
	at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
	at org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:53)
	at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:310)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:819)
Caused by: java.lang.ClassCastException: [B incompatible with java.lang.String
	at org.apache.hive.service.cli.ColumnValue.toTColumnValue(ColumnValue.java:198)
	at org.apache.hive.service.cli.RowBasedSet.addRow(RowBasedSet.java:60)
	at org.apache.hive.service.cli.RowBasedSet.addRow(RowBasedSet.java:32)
	at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.$anonfun$getNextRowSet$1(SparkExecuteStatementOperation.scala:151)
	at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$Lambda$1923.000000009113BFE0.apply(Unknown Source)
	at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.withSchedulerPool(SparkExecuteStatementOperation.scala:299)
	at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.getNextRowSet(SparkExecuteStatementOperation.scala:113)
	at org.apache.hive.service.cli.operation.OperationManager.getOperationNextRowSet(OperationManager.java:220)
	at org.apache.hive.service.cli.session.HiveSessionImpl.fetchResults(HiveSessionImpl.java:785)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:78)
	... 18 more
```

## How was this patch tested?

unit tests

Closes #25379 from wangyum/SPARK-28474.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
…TYPE

## What changes were proposed in this pull request?

This PR ports [HIVE-10646](https://issues.apache.org/jira/browse/HIVE-10646) to fix the issue that Hive 0.12's JDBC client cannot handle `NULL_TYPE`:
```sql
Connected to: Hive (version 3.0.0-SNAPSHOT)
Driver: Hive (version 0.12.0)
Transaction isolation: TRANSACTION_REPEATABLE_READ
Beeline version 0.12.0 by Apache Hive
0: jdbc:hive2://localhost:10000> select null;
org.apache.thrift.transport.TTransportException
	at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:132)
	at org.apache.thrift.transport.TTransport.readAll(TTransport.java:84)
	at org.apache.thrift.transport.TSaslTransport.readLength(TSaslTransport.java:346)
	at org.apache.thrift.transport.TSaslTransport.readFrame(TSaslTransport.java:423)
	at org.apache.thrift.transport.TSaslTransport.read(TSaslTransport.java:405)
```

Server log:
```
19/08/07 09:34:07 ERROR TThreadPoolServer: Error occurred during processing of message.
java.lang.NullPointerException
	at org.apache.hive.service.cli.thrift.TRow$TRowStandardScheme.write(TRow.java:388)
	at org.apache.hive.service.cli.thrift.TRow$TRowStandardScheme.write(TRow.java:338)
	at org.apache.hive.service.cli.thrift.TRow.write(TRow.java:288)
	at org.apache.hive.service.cli.thrift.TRowSet$TRowSetStandardScheme.write(TRowSet.java:605)
	at org.apache.hive.service.cli.thrift.TRowSet$TRowSetStandardScheme.write(TRowSet.java:525)
	at org.apache.hive.service.cli.thrift.TRowSet.write(TRowSet.java:455)
	at org.apache.hive.service.cli.thrift.TFetchResultsResp$TFetchResultsRespStandardScheme.write(TFetchResultsResp.java:550)
	at org.apache.hive.service.cli.thrift.TFetchResultsResp$TFetchResultsRespStandardScheme.write(TFetchResultsResp.java:486)
	at org.apache.hive.service.cli.thrift.TFetchResultsResp.write(TFetchResultsResp.java:412)
	at org.apache.hive.service.cli.thrift.TCLIService$FetchResults_result$FetchResults_resultStandardScheme.write(TCLIService.java:13192)
	at org.apache.hive.service.cli.thrift.TCLIService$FetchResults_result$FetchResults_resultStandardScheme.write(TCLIService.java:13156)
	at org.apache.hive.service.cli.thrift.TCLIService$FetchResults_result.write(TCLIService.java:13107)
	at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:58)
	at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
	at org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:53)
	at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:310)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:819)
```

## How was this patch tested?

unit tests

Closes #25378 from wangyum/SPARK-28644.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
## What changes were proposed in this pull request?

Plans after "Extract Python UDFs" are very flaky and error-prone to other rules.

For instance, if we add some rules, for instance, `PushDownPredicates` in `postHocOptimizationBatches`, the test in `BatchEvalPythonExecSuite` fails:

```scala
test("Python UDF refers to the attributes from more than one child") {
  val df = Seq(("Hello", 4)).toDF("a", "b")
  val df2 = Seq(("Hello", 4)).toDF("c", "d")
  val joinDF = df.crossJoin(df2).where("dummyPythonUDF(a, c) == dummyPythonUDF(d, c)")
  val qualifiedPlanNodes = joinDF.queryExecution.executedPlan.collect {
    case b: BatchEvalPythonExec => b
  }
  assert(qualifiedPlanNodes.size == 1)
}
```

```
Invalid PythonUDF dummyUDF(a#63, c#74), requires attributes from more than one child.
```

This is because Python UDF extraction optimization is rolled back as below:

```
=== Applying Rule org.apache.spark.sql.catalyst.optimizer.PushDownPredicates ===
!Filter (dummyUDF(a#7, c#18) = dummyUDF(d#19, c#18))   Join Cross, (dummyUDF(a#7, c#18) = dummyUDF(d#19, c#18))
!+- Join Cross                                         :- Project [_1#2 AS a#7, _2#3 AS b#8]
!   :- Project [_1#2 AS a#7, _2#3 AS b#8]              :  +- LocalRelation [_1#2, _2#3]
!   :  +- LocalRelation [_1#2, _2#3]                   +- Project [_1#13 AS c#18, _2#14 AS d#19]
!   +- Project [_1#13 AS c#18, _2#14 AS d#19]             +- LocalRelation [_1#13, _2#14]
!      +- LocalRelation [_1#13, _2#14]
```

It seems we should run the Python UDF rules last, even after the post-hoc rules.

Note that this actually rather follows the way in previous versions when those were in physical plans (see SPARK-24721 and SPARK-12981). Those optimization rules were supposed to be placed at the end.

Note that I intentionally didn't move `ExperimentalMethods` (`spark.experimental.extraStrategies`). This is an explicit experimental API and I wanted to keep it as a just-in-case workaround after this change for now.

## How was this patch tested?

Existing tests should cover.

Closes #25386 from HyukjinKwon/SPARK-28654.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
## What changes were proposed in this pull request?

[SPARK-17783](https://issues.apache.org/jira/browse/SPARK-17783) hid credentials in `CREATE` and `DESC FORMATTED/EXTENDED` of a PERSISTENT/TEMP table for JDBC, but `SHOW CREATE TABLE` still exposed the credentials:
```sql
spark-sql> show create table mysql_federated_sample;
CREATE TABLE `mysql_federated_sample` (`TBL_ID` BIGINT, `CREATE_TIME` INT, `DB_ID` BIGINT, `LAST_ACCESS_TIME` INT, `OWNER` STRING, `RETENTION` INT, `SD_ID` BIGINT, `TBL_NAME` STRING, `TBL_TYPE` STRING, `VIEW_EXPANDED_TEXT` STRING, `VIEW_ORIGINAL_TEXT` STRING, `IS_REWRITE_ENABLED` BOOLEAN)
USING org.apache.spark.sql.jdbc
OPTIONS (
  `url` 'jdbc:mysql://localhost/hive?user=root&password=mypasswd',
  `driver` 'com.mysql.jdbc.Driver',
  `dbtable` 'TBLS'
)
```

This PR fixes this issue.

## How was this patch tested?

unit tests and manual tests:
```sql
spark-sql> show create table  mysql_federated_sample;
CREATE TABLE `mysql_federated_sample` (`TBL_ID` BIGINT, `CREATE_TIME` INT, `DB_ID` BIGINT, `LAST_ACCESS_TIME` INT, `OWNER` STRING, `RETENTION` INT, `SD_ID` BIGINT, `TBL_NAME` STRING, `TBL_TYPE` STRING, `VIEW_EXPANDED_TEXT` STRING, `VIEW_ORIGINAL_TEXT` STRING, `IS_REWRITE_ENABLED` BOOLEAN)
USING org.apache.spark.sql.jdbc
OPTIONS (
  `url` '*********(redacted)',
  `driver` 'com.mysql.jdbc.Driver',
  `dbtable` 'TBLS'
)
```

Closes #25375 from wangyum/SPARK-28642.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
## What changes were proposed in this pull request?

This PR is to port AGGREGATES.sql from PostgreSQL regression tests. https://github.com/postgres/postgres/blob/REL_12_BETA2/src/test/regress/sql/aggregates.sql#L607-L997

The expected results can be found in the link: https://github.com/postgres/postgres/blob/REL_12_BETA2/src/test/regress/expected/aggregates.out#L1615-L2289

When porting the test cases, the following PostgreSQL-specific features that do not exist in Spark SQL were found:

[SPARK-27980](https://issues.apache.org/jira/browse/SPARK-27980): Ordered-Set Aggregate Functions
[SPARK-28661](https://issues.apache.org/jira/browse/SPARK-28661): Hypothetical-Set Aggregate Functions
[SPARK-28382](https://issues.apache.org/jira/browse/SPARK-28382): Array Functions: unnest
[SPARK-28663](https://issues.apache.org/jira/browse/SPARK-28663): Aggregate Functions for Statistics
[SPARK-28664](https://issues.apache.org/jira/browse/SPARK-28664): ORDER BY in aggregate function
[SPARK-28669](https://issues.apache.org/jira/browse/SPARK-28669): System Information Functions

## How was this patch tested?

N/A

Closes #25392 from wangyum/SPARK-28660.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
…paths

## What changes were proposed in this pull request?

Adds checks around:
 - The existence of transforms in the table schema (even in nested fields)
 - Duplications of transforms
 - Case sensitivity checks around column names
in the V2 table creation code paths.

## How was this patch tested?

Unit tests.

Closes #25305 from brkyvz/v2CreateTable.

Authored-by: Burak Yavuz <brkyvz@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
…RUNC/TRUNC

## What changes were proposed in this pull request?

I propose new levels of truncations for the `date_trunc()` and `trunc()` functions:
1. `MICROSECOND` and `MILLISECOND` truncate values of the `TIMESTAMP` type to microsecond and millisecond precision.
2. `DECADE`, `CENTURY` and `MILLENNIUM` truncate dates/timestamps to lowest date of current decade/century/millennium.

The `trunc()` function also now supports the `WEEK` and `QUARTER` levels.

The function is implemented similarly to `date_trunc` in PostgreSQL: https://www.postgresql.org/docs/11/functions-datetime.html#FUNCTIONS-DATETIME-TRUNC to maintain feature parity with it.

Here are examples of `TRUNC`:
```sql
spark-sql> SELECT TRUNC('2015-10-27', 'DECADE');
2010-01-01
spark-sql> set spark.sql.datetime.java8API.enabled=true;
spark.sql.datetime.java8API.enabled	true
spark-sql> SELECT TRUNC('1999-10-27', 'millennium');
1001-01-01
```
Examples of `DATE_TRUNC`:
```sql
spark-sql> SELECT DATE_TRUNC('CENTURY', '2015-03-05T09:32:05.123456');
2001-01-01T00:00:00Z
```

## How was this patch tested?

Added new tests to `DateTimeUtilsSuite`, `DateExpressionsSuite` and `DateFunctionsSuite`, and uncommented existing tests in `pgSQL/date.sql`.

Closes #25336 from MaxGekk/date_truct-ext.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
## What changes were proposed in this pull request?

Adds support for V2 catalogs and the V2SessionCatalog for V2 tables in saveAsTable.
If the table can be resolved through the V2SessionCatalog, we use the datasource v1 SaveMode, for backwards compatibility, to select the code path we're going to hit.

Depending on the SaveMode (a usage sketch follows the list below):
 - SaveMode.Append:
     a) If table exists: Use AppendData.byName
     b) If table doesn't exist, use CTAS (ignoreIfExists = false)
 - SaveMode.Overwrite: Use RTAS (orCreate = true)
 - SaveMode.Ignore: Use CTAS (ignoreIfExists = true)
 - SaveMode.ErrorIfExists: Use CTAS (ignoreIfExists = false)
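
A usage sketch only (the catalog, database, and table names are hypothetical, and `df` is assumed to be an existing DataFrame); the user-facing call stays the same while the resolution above happens underneath:

```scala
import org.apache.spark.sql.SaveMode

// Per the mapping above, Append becomes AppendData.byName if the table exists,
// or a CTAS (ignoreIfExists = false) if it does not.
df.write
  .mode(SaveMode.Append)
  .saveAsTable("my_catalog.db.events")  // hypothetical multi-part v2 table name
```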

## How was this patch tested?

Unit tests in DataSourceV2DataFrameSuite

Closes #25330 from brkyvz/saveAsTable.

Lead-authored-by: Burak Yavuz <brkyvz@gmail.com>
Co-authored-by: Burak Yavuz <burak@databricks.com>
Signed-off-by: Burak Yavuz <brkyvz@gmail.com>
## What changes were proposed in this pull request?

There are "unsafe" conversions in the Kafka connector.
`CaseInsensitiveStringMap` comes in which is then converted the following way:
```
...
options.asScala.toMap
...
```
The main problem with this is that in such a case it loses its case-insensitive nature
(a case-insensitive map converts the key to lower case when get/contains is called).

In this PR I'm using `CaseInsensitiveMap` to solve this problem.
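
A small sketch of the difference (the option key is illustrative; `CaseInsensitiveMap` is the catalyst utility named above):

```scala
import org.apache.spark.sql.catalyst.util.CaseInsensitiveMap

val raw = Map("kafka.Bootstrap.Servers" -> "host:9092")   // mixed-case key from the user
raw.get("kafka.bootstrap.servers")                        // None: a plain Map is case sensitive
CaseInsensitiveMap(raw).get("kafka.bootstrap.servers")    // Some("host:9092")
```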

## How was this patch tested?

Existing + additional unit tests.

Closes #24967 from gaborgsomogyi/SPARK-28163.

Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
…lable

## What changes were proposed in this pull request?

Right now, a batch DataFrame always changes the schema to nullable automatically (see this line: https://github.com/apache/spark/blob/325bc8e9c6187a96b33a033fbb0145dfca619135/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L399), but the streaming file source is missing this.

This PR updates the streaming file source schema to force it to be nullable. I also added a flag `spark.sql.streaming.fileSource.schema.forceNullable` to disable this change, since some users may rely on the old behavior.
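
For users who rely on the old behavior, a sketch of the escape hatch (flag name taken from the description above):

```scala
// Keep the previous streaming file source behavior (schema not forced to nullable).
spark.conf.set("spark.sql.streaming.fileSource.schema.forceNullable", "false")
```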

## How was this patch tested?

The new unit test.

Closes #25382 from zsxwing/SPARK-28651.

Authored-by: Shixiong Zhu <zsxwing@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
…on 3

## What changes were proposed in this pull request?

I audited and updated all dev scripts to support Python 3 (except `merge_spark_pr.py`, which was already updated).

## How was this patch tested?

Manual.

Closes #25289 from WeichenXu123/dev_py3.

Authored-by: WeichenXu <weichen.xu@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
## What changes were proposed in this pull request?

This PR is a follow-up to #24918

## How was this patch tested?

Pass the Jenkins with the newly update test files.

Closes #25393 from beliefer/enable-overlay-tests.

Authored-by: gengjiaan <gengjiaan@360.cn>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
## What changes were proposed in this pull request?

In this PR, we implement a complete process of GPU-aware resource scheduling in Standalone. The whole process looks like this: the Worker sets up isolated resources when it starts up and registers with the Master along with its resources. The Master then picks usable Workers according to the driver/executor's resource requirements and launches drivers/executors on them. The Worker launches the driver/executor after preparing a resources file, created under the driver/executor's working directory, with the specified resource addresses (told by the Master). When a driver/executor finishes, its resources can be recycled back to the Worker. Finally, if a Worker stops, it should always release its resources first.

For the case of Workers and Drivers in **client** mode running on the same host, we introduce a config option named `spark.resources.coordinate.enable` (default true) to indicate whether Spark should coordinate resources for the user. If `spark.resources.coordinate.enable=false`, the user is responsible for configuring different resources for Workers and Drivers when using a resourcesFile or discovery script. If true, Spark helps the user assign different resources to Workers and Drivers.
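
As a sketch, using the option name introduced above, disabling coordination when the user manages resource assignment manually might look like:

```scala
import org.apache.spark.SparkConf

// The user takes responsibility for giving Workers and Drivers disjoint resources
// via their resourcesFile or discovery script.
val conf = new SparkConf()
  .set("spark.resources.coordinate.enable", "false")
```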

The solution for Spark to coordinate resources among Workers and Drivers is:

Generally, use a shared file named *____allocated_resources____.json* to sync allocated
resources info among Workers and Drivers on the same host.

After a Worker or Driver finds all resources using the configured resourcesFile and/or
discovery script during launch, it should filter out available resources by excluding resources already allocated in *____allocated_resources____.json* and acquire resources from the available resources according to its own requirement. After that, it should write its allocated resources, along with its process id (pid), into *____allocated_resources____.json*. The pid (proposed by tgravescs) is used to check whether the allocated resources are still valid, in case a Worker or Driver crashes and doesn't release its resources properly. When a Worker or Driver finishes, it normally always cleans up its own allocated resources in *____allocated_resources____.json*.

Note that we always acquire a file lock before any access to the file *____allocated_resources____.json*
and release the lock afterwards.

Furthermore, we appended resource info to `WorkerSchedulerStateResponse` to work
around the master-change behaviour in HA mode.

## How was this patch tested?

Added unit tests in WorkerSuite, MasterSuite, SparkContextSuite.

Manually tested with client/cluster mode (e.g. multiple workers) in a single node Standalone.

Closes #25047 from Ngone51/SPARK-27371.

Authored-by: wuyi <ngone_5451@163.com>
Signed-off-by: Thomas Graves <tgraves@apache.org>
srowen and others added 24 commits September 4, 2019 13:11
… older releases

### What changes were proposed in this pull request?

Fall back to archive.apache.org in `build/mvn` to download Maven, in case the ASF mirrors no longer have an older release.

### Why are the changes needed?

If an older release's specified Maven doesn't exist in the mirrors, `build/mvn` will fail.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

Manually tested different paths and failures by commenting in/out parts of the script and modifying it directly.

Closes #25667 from srowen/SPARK-28963.

Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
…dd offHeapMemorySize

## What changes were proposed in this pull request?

If MEMORY_OFFHEAP_ENABLED is true, add MEMORY_OFFHEAP_SIZE to the resources requested for the executor, to ensure the instance has enough memory to use.

This PR adds a helper method `executorOffHeapMemorySizeAsMb` in `YarnSparkHadoopUtil`.
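
A configuration sketch of the scenario this targets (the size value is illustrative):

```scala
import org.apache.spark.SparkConf

// With off-heap memory enabled, the YARN container request should now also cover
// the off-heap region, not just the heap plus overhead.
val conf = new SparkConf()
  .set("spark.memory.offHeap.enabled", "true")
  .set("spark.memory.offHeap.size", "2g")
```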

## How was this patch tested?
Add 3 new test suites to test `YarnSparkHadoopUtil#executorOffHeapMemorySizeAsMb`.

Closes #25309 from LuciferYang/spark-28577.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Thomas Graves <tgraves@apache.org>
…n JDBC Tab UI

### What changes were proposed in this pull request?
Currently the Spark Thrift Server can't cancel a running SQL job. When we query through the Spark Thrift Server from Hue, run a SQL statement, and then click the cancel button, the cancellation doesn't take effect in the backend; in the Spark JDBC UI tab the SQL's status stays COMPILED and its duration keeps increasing, which may confuse people.

![image](https://user-images.githubusercontent.com/46485123/63869830-60338f00-c9eb-11e9-8776-cee965adcb0a.png)

### Why are the changes needed?

If the displayed status doesn't reflect the SQL statement's true status, it will confuse users.

### Does this PR introduce any user-facing change?

The Spark Thrift Server's UI tab will show the SQL status as CANCELED when we cancel a SQL statement.

### How was this patch tested?
Manually tested.

UI TAB Status
![image](https://user-images.githubusercontent.com/46485123/63915010-80a12f00-ca67-11e9-9342-830dfa9c719f.png)

![image](https://user-images.githubusercontent.com/46485123/63915084-a9292900-ca67-11e9-8e26-375bf8ce0963.png)

backend log
![image](https://user-images.githubusercontent.com/46485123/63914864-1092a900-ca67-11e9-93f2-08690ed9abf4.png)

Closes #25611 from AngersZhuuuu/SPARK-28901.

Authored-by: angerszhu <angers.zhu@gmail.com>
Signed-off-by: Xiao Li <gatorsmile@gmail.com>
### What changes were proposed in this pull request?

This patch fixes the bugs in test code itself, FsHistoryProviderSuite.

1. When creating a log file via `newLogFile`, the codec is ignored, leading to a wrong file name. (No one tends to write tests for test code, and the bug doesn't affect existing tests, so it was not easy to catch.)
2. When writing events to a log file via `writeFile`, the metadata (in case of the new format) gets written to the file regardless of its codec, and the content is overwritten by another stream, hence no information about the Spark version is available. This affects an existing test, hence we had a wrong expected value to work around the bug.

This patch also removes redundant parameter `isNewFormat` in `writeFile`, as according to review comment, Spark no longer supports old format.

### Why are the changes needed?

Explained in the section above why they're bugs, though they only reside in test code. (Please note that the bugs didn't come from the non-test side of the code.)

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

Modified existing UTs, as well as read event log file in console to see metadata is not overwritten by other contents.

Closes #25629 from HeartSaVioR/MINOR-FIX-FsHistoryProviderSuite.

Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
## What changes were proposed in this pull request?

This patch does pooling for both kafka consumers as well as fetched data. The overall benefits of the patch are following:

* Both pools support eviction of idle objects, which helps close invalid idle objects whose topic or partition is no longer assigned to any task.
* It also enables applying different policies per pool, which helps optimize pooling for each pool.
* We were concerned about multiple tasks pointing to the same topic partition as well as the same group id; the existing code can't handle this, hence excess seeks and fetches could happen. This patch properly handles that case.
* It also makes the code always safe to leverage the cache, hence there is no need to maintain the reuseCache parameter.

Moreover, pooling Kafka consumers is implemented based on Apache Commons Pool, which also gives a couple of benefits:

* We can get rid of synchronization of KafkaDataConsumer object while acquiring and returning InternalKafkaConsumer.
* We can extract the feature of object pool to outside of the class, so that the behaviors of the pool can be tested easily.
* We can get various statistics for the object pool, and also be able to enable JMX for the pool.

FetchedData instances are pooled by a custom pool implementation instead of leveraging Apache Commons Pool, because they have CacheKey as the first key and "desired offset" as the second key, where the "desired offset" changes over time; I haven't found any general pool implementation supporting this.

This patch brings additional dependency, Apache Commons Pool 2.6.0 into `spark-sql-kafka-0-10` module.
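
For readers unfamiliar with Apache Commons Pool, here is a tiny illustrative sketch of the pooling pattern it provides; `DummyConsumer` is a placeholder of mine, not Spark's consumer wrapper, and the real pool in this patch is more involved:

```scala
import org.apache.commons.pool2.{BasePooledObjectFactory, PooledObject}
import org.apache.commons.pool2.impl.{DefaultPooledObject, GenericObjectPool}

class DummyConsumer { def close(): Unit = () }

class DummyConsumerFactory extends BasePooledObjectFactory[DummyConsumer] {
  override def create(): DummyConsumer = new DummyConsumer
  override def wrap(c: DummyConsumer): PooledObject[DummyConsumer] = new DefaultPooledObject(c)
  override def destroyObject(p: PooledObject[DummyConsumer]): Unit = p.getObject.close()
}

val pool = new GenericObjectPool(new DummyConsumerFactory)
pool.setTimeBetweenEvictionRunsMillis(60 * 1000L)   // run the evictor periodically
pool.setMinEvictableIdleTimeMillis(5 * 60 * 1000L)  // evict consumers idle for 5 minutes

val consumer = pool.borrowObject()  // acquire (or create) a pooled consumer
pool.returnObject(consumer)         // hand it back for reuse instead of closing it
```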

## How was this patch tested?

Existing unit tests as well as new tests for object pool.

Also did some experiment regarding proving concurrent access of consumers for same topic partition.

* Made change on both sides (master and patch) to log when creating Kafka consumer or fetching records from Kafka is happening.
* branches
  * master: https://github.com/HeartSaVioR/spark/tree/SPARK-25151-master-ref-debugging
  * patch: https://github.com/HeartSaVioR/spark/tree/SPARK-25151-debugging
* Test query (doing self-join)
  * https://gist.github.com/HeartSaVioR/d831974c3f25c02846f4b15b8d232cc2
* Ran query from spark-shell, with using `local[*]` to maximize the chance to have concurrent access
* Collected the count of creating Kafka consumers via the command: `grep "creating new Kafka consumer" logfile | wc -l`
* Collected the count of fetch requests on Kafka via the command: `grep "fetching data from Kafka consumer" logfile | wc -l`

The topic and data distribution is as follows:

```
truck_speed_events_stream_spark_25151_v1:0:99440
truck_speed_events_stream_spark_25151_v1:1:99489
truck_speed_events_stream_spark_25151_v1:2:397759
truck_speed_events_stream_spark_25151_v1:3:198917
truck_speed_events_stream_spark_25151_v1:4:99484
truck_speed_events_stream_spark_25151_v1:5:497320
truck_speed_events_stream_spark_25151_v1:6:99430
truck_speed_events_stream_spark_25151_v1:7:397887
truck_speed_events_stream_spark_25151_v1:8:397813
truck_speed_events_stream_spark_25151_v1:9:0
```

The experiment only used the 4 smallest partitions (0, 1, 4, 6) so that the query would finish earlier.

The result of experiment is below:

branch | create Kafka consumer | fetch request
-- | -- | --
master | 1986 | 2837
patch | 8 | 1706

Closes #22138 from HeartSaVioR/SPARK-25151.

Lead-authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com>
Co-authored-by: Jungtaek Lim <kabhwan@gmail.com>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
### What changes were proposed in this pull request?
Document SHOW TBLPROPERTIES statement in SQL Reference Guide.

### Why are the changes needed?
Currently Spark lacks documentation on the supported SQL constructs, causing confusion among users who sometimes have to look at the code to understand the usage. This PR is aimed at addressing that issue.

### Does this PR introduce any user-facing change?
Yes.

**Before:**
There was no documentation for this.

**After.**
![image](https://user-images.githubusercontent.com/11567269/64281442-fdb92200-cf07-11e9-90ba-4699b6e93e23.png)
![Screen Shot 2019-09-04 at 11 32 11 AM](https://user-images.githubusercontent.com/11567269/64281484-188b9680-cf08-11e9-8e42-f130751ca495.png)

### How was this patch tested?
Tested using `jekyll build --serve`

Closes #25571 from dilipbiswal/ref-show-tblproperties.

Authored-by: Dilip Biswal <dbiswal@us.ibm.com>
Signed-off-by: Xiao Li <gatorsmile@gmail.com>
### What changes were proposed in this pull request?
Document SHOW FUNCTIONS statement in SQL Reference Guide.

### Why are the changes needed?
Currently Spark lacks documentation on the supported SQL constructs, causing confusion among users who sometimes have to look at the code to understand the usage. This PR is aimed at addressing that issue.

### Does this PR introduce any user-facing change?
Yes.

**Before:**
There was no documentation for this.

**After.**

![image](https://user-images.githubusercontent.com/11567269/64281840-e3cc0f00-cf08-11e9-9784-f01392276130.png)

<img width="589" alt="Screen Shot 2019-09-04 at 11 41 44 AM" src="https://user-images.githubusercontent.com/11567269/64281911-0fe79000-cf09-11e9-955f-21b44590707c.png">

<img width="572" alt="Screen Shot 2019-09-04 at 11 41 54 AM" src="https://user-images.githubusercontent.com/11567269/64281916-12e28080-cf09-11e9-9187-688c2c751559.png">

### How was this patch tested?
Tested using `jekyll build --serve`

Closes #25539 from dilipbiswal/ref-doc-show-functions.

Lead-authored-by: Dilip Biswal <dbiswal@us.ibm.com>
Co-authored-by: Xiao Li <gatorsmile@gmail.com>
Signed-off-by: Xiao Li <gatorsmile@gmail.com>
### What changes were proposed in this pull request?
This patch implements dynamic partition pruning by adding a dynamic-partition-pruning filter if there is a partitioned table and a filter on the dimension table (an example query shape is sketched after this list). The filter is then planned using a heuristic approach:
1. As a broadcast relation if it is a broadcast hash join. The broadcast relation will then be transformed into a reused broadcast exchange by the `ReuseExchange` rule; or
2. As a subquery duplicate if the estimated benefit of partition table scan being saved is greater than the estimated cost of the extra scan of the duplicated subquery; otherwise
3. As a bypassed condition (`true`).
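
An illustrative query shape that benefits from this (table and column names are hypothetical):

```scala
// The filter on the dimension table (d.year = 2019) can be propagated across the join
// to prune partitions of the partitioned fact table before it is scanned.
spark.sql("""
  SELECT f.store_id, SUM(f.amount) AS total
  FROM fact_sales f
  JOIN dim_date d ON f.date_id = d.date_id
  WHERE d.year = 2019
  GROUP BY f.store_id
""")
```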

### Why are the changes needed?
This is an important performance feature.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
Added UT
- Testing DPP by enabling / disabling the reuse broadcast results feature and / or the subquery duplication feature.
- Testing DPP with reused broadcast results.
- Testing the key iterators on different HashedRelation types.
- Testing the packing and unpacking of the broadcast keys in a LongType.

Closes #25600 from maryannxue/dpp.

Authored-by: maryannxue <maryannxue@apache.org>
Signed-off-by: Xiao Li <gatorsmile@gmail.com>
…ion-tests

### What changes were proposed in this pull request?

Per #25640 (comment) also bump K8S client version in integration-tests module.

### Why are the changes needed?

Harmonize the version as intended.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

Existing tests.

Closes #25664 from srowen/SPARK-28921.2.

Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
### What changes were proposed in this pull request?
Replaces some incorrect usages of `new Configuration()`, as it only loads the default configs defined in Hadoop.
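
A sketch of the kind of replacement this implies (exact call sites vary):

```scala
// A fresh Configuration only sees Hadoop's own defaults; the one Spark has already
// populated also carries settings such as spark.hadoop.* entries.
// val hadoopConf = new org.apache.hadoop.conf.Configuration()   // before
val hadoopConf = spark.sparkContext.hadoopConfiguration          // after
```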

### Why are the changes needed?
An unexpected config could be accessed instead of the expected one; see SPARK-28203 for an example.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
Existed tests.

Closes #25616 from advancedxy/remove_invalid_configuration.

Authored-by: Xianjin YE <advancedxy@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
### What changes were proposed in this pull request?

Rename `UnresolvedTable` to `V1Table` because it is not unresolved.

### Why are the changes needed?

The class name is inaccurate. This should be fixed before it is in a release.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Existing tests.

Closes #25683 from rdblue/SPARK-28979-rename-unresolved-table.

Authored-by: Ryan Blue <blue@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
…ence

### What changes were proposed in this pull request?
Document UNCACHE TABLE statement in SQL Reference
### Why are the changes needed?
To complete SQL Reference

### Does this PR introduce any user-facing change?
Yes.

After change:

![image](https://user-images.githubusercontent.com/13592258/64299133-e04a7f00-cf2c-11e9-8f39-9b288e46c995.png)

### How was this patch tested?
Tested using `jekyll build --serve`

Closes #25540 from huaxingao/spark-28830.

Lead-authored-by: Huaxin Gao <huaxing@us.ibm.com>
Co-authored-by: Xiao Li <gatorsmile@gmail.com>
Signed-off-by: Xiao Li <gatorsmile@gmail.com>
### What changes were proposed in this pull request?

Use `KeyLock` added in #25612 to simplify `MapOutputTracker.getStatuses`. It also has some improvement after the refactoring:
- `InterruptedException` is no longer swallowed.
- When a shuffle block is fetched, we don't need to wake up unrelated sleeping threads.

### Why are the changes needed?

`MapOutputTracker.getStatuses` is pretty hard to maintain right now because it has a special lock mechanism which we need to pay attention to whenever updating this method. Since we can use `KeyLock` to hide the locking complexity behind a dedicated lock class, it's better to refactor it to make it easier to understand and maintain.
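
An illustrative per-key lock sketch, not Spark's actual `KeyLock` class, just to show the idea being hidden behind it:

```scala
import java.util.concurrent.ConcurrentHashMap

// Threads asking for the same key serialize on one lock object, while different keys
// proceed independently.
class SimpleKeyLock[K] {
  private val locks = new ConcurrentHashMap[K, AnyRef]()
  def withLock[T](key: K)(body: => T): T = {
    val lock = locks.computeIfAbsent(key, _ => new AnyRef)
    lock.synchronized(body)
  }
}

// e.g. serialize fetches of map statuses per shuffle id:
// keyLock.withLock(shuffleId) { fetchOrLoadStatuses(shuffleId) }   // hypothetical helper
```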

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

Existing tests.

Closes #25680 from zsxwing/getStatuses.

Authored-by: Shixiong Zhu <zsxwing@gmail.com>
Signed-off-by: Shixiong Zhu <zsxwing@gmail.com>
… saveAsTable

### What changes were proposed in this pull request?

Adds the provider information to the table properties in saveAsTable.

### Why are the changes needed?

Otherwise, catalog implementations don't know what kind of Table definition to create.

### Does this PR introduce any user-facing change?

nope

### How was this patch tested?

Existing unit tests check the existence of the provider now.

Closes #25669 from brkyvz/provider.

Authored-by: Burak Yavuz <brkyvz@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
…batches

### What changes were proposed in this pull request?

Remove unnecessary physical projection added to ensure rows are `UnsafeRow` when the DSv2 scan is columnar. This is not needed because conversions are automatically added to convert from columnar operators to `UnsafeRow` when the next operator does not support columnar execution.

### Why are the changes needed?

Removes an extra projection and copy.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Existing tests.

Closes #25586 from rdblue/SPARK-28878-remove-dsv2-project-with-columnar.

Authored-by: Ryan Blue <blue@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?

Support generator in aggregate expressions.

In this PR, I check the aggregate logical plan; if its aggregateExpressions include a generator, I convert the aggregate plan into "normal aggregate plan + generator plan + projection plan", i.e.:
```
aggregate(with generator)
 |--child_plan
```
===>
```
project
  |--generator(resolved)
         |--aggregate
               |--child_plan
```

### Why are the changes needed?

We should support SQL like:
```
select explode(array(min(a), max(a))) from t
```

### Does this PR introduce any user-facing change?
No

### How was this patch tested?

Unit test added.

Closes #25512 from WeichenXu123/explode_bug.

Authored-by: WeichenXu <weichen.xu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
…rtition column can be numeric, date or timestamp type

### What changes were proposed in this pull request?

`DataFrameReader.jdbc()` accepts a partition column that is of numeric, date, or timestamp type, according to the implementation in `JDBCRelation.scala`. Update the scaladoc accordingly, to match the documentation in `sql-data-sources-jdbc.md` too.
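
A hedged sketch of a read with a timestamp partition column via the option-based JDBC API (the connection URL, table, and bounds are hypothetical):

```scala
val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://host/db")
  .option("dbtable", "events")
  .option("partitionColumn", "event_time")        // numeric, date, or timestamp column
  .option("lowerBound", "2019-01-01 00:00:00")
  .option("upperBound", "2019-12-31 23:59:59")
  .option("numPartitions", "8")
  .load()
```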

### Why are the changes needed?

scaladoc is incorrect.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

N/A

Closes #25687 from srowen/SPARK-28977.

Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
…part4.sql' into UDF test base

### What changes were proposed in this pull request?

This PR proposes to port `pgSQL/aggregates_part4.sql` into UDF test base.

<details><summary>Diff comparing to 'pgSQL/aggregates_part3.sql'</summary>
<p>

```diff
```

</p>
</details>

### Why are the changes needed?

To improve test coverage in UDFs.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Manually tested via:

```bash
 build/sbt "sql/test-only *SQLQueryTestSuite -- -z udf/pgSQL/udf-aggregates_part4.sql"
```

as guided in https://issues.apache.org/jira/browse/SPARK-27921

Closes #25677 from HyukjinKwon/SPARK-28971.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
…part3.sql' into UDF test base

### What changes were proposed in this pull request?

This PR proposes to port `pgSQL/aggregates_part3.sql` into UDF test base.

<details><summary>Diff comparing to 'pgSQL/aggregates_part3.sql'</summary>
<p>

```diff
diff --git a/sql/core/src/test/resources/sql-tests/results/pgSQL/aggregates_part3.sql.out b/sql/core/src/test/resources/sql-tests/results/udf/pgSQL/udf-aggregates_part3.sql.out
index f102383..eff33f2 100644
--- a/sql/core/src/test/resources/sql-tests/results/pgSQL/aggregates_part3.sql.out
+++ b/sql/core/src/test/resources/sql-tests/results/udf/pgSQL/udf-aggregates_part3.sql.out
@@ -3,7 +3,7 @@

 -- !query 0
-select max(min(unique1)) from tenk1
+select udf(max(min(unique1))) from tenk1
 -- !query 0 schema
 struct<>
 -- !query 0 output
@@ -12,11 +12,11 @@ It is not allowed to use an aggregate function in the argument of another aggreg

 -- !query 1
-select (select count(*)
-        from (values (1)) t0(inner_c))
+select udf((select udf(count(*))
+        from (values (1)) t0(inner_c))) as col
 from (values (2),(3)) t1(outer_c)
 -- !query 1 schema
-struct<scalarsubquery():bigint>
+struct<col:bigint>
 -- !query 1 output
 1
 1
```

</p>
</details>

### Why are the changes needed?

To improve test coverage in UDFs.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Manually tested via:

```bash
 build/sbt "sql/test-only *SQLQueryTestSuite -- -z udf/pgSQL/udf-aggregates_part3.sql"
```

as guided in https://issues.apache.org/jira/browse/SPARK-27921

Closes #25676 from HyukjinKwon/SPARK-28272.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?

Merge `V2WriteSupportCheck` and `V2StreamingScanSupportCheck` into one rule: `TableCapabilityCheck`.

### Why are the changes needed?

It's a little confusing to have 2 rules to check DS v2 table capability, where one rule says it checks writes and another says it checks streaming scans. We can clearly tell from the rule names that the batch scan check is missing.

It's better to have a centralized place for this check, with a name that clearly says it checks table capability.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

existing tests

Closes #25679 from cloud-fan/dsv2-check.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
…memory derby

## What changes were proposed in this pull request?

This PR disables schema verification and allows schema auto-creation in the Derby database, in case the config for the Metastore is set otherwise.

## How was this patch tested?
NA

Closes #25663 from bogdanghit/hive-schema.

Authored-by: Bogdan Ghit <bogdan.ghit@databricks.com>
Signed-off-by: Yuming Wang <wgyumg@gmail.com>
…s fail

### What changes were proposed in this pull request?

`ReplayListenerSuite` depends on a listener class to listen for replayed events. This class was implemented by extending `EventLoggingListener`. `EventLoggingListener` does not log executor metrics update events, but uses them to update internal state; on a stage completion event, it then logs stage executor metrics events using this internal state. As executor metrics update events do not get written to the event log, they do not get replayed. The internal state of the replay listener can therefore be different from the original listener, leading to different stage completion events being logged.

We reimplement the replay listener to simply buffer each and every event it receives. This makes it a simpler yet better tool for verifying the events that get sent through the ReplayListenerBus.
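
An illustrative sketch of such a buffering listener (not the suite's exact class):

```scala
import scala.collection.mutable.ArrayBuffer
import org.apache.spark.SparkFirehoseListener
import org.apache.spark.scheduler.SparkListenerEvent

// Buffer every event coming off the bus so the replayed stream can be compared
// against the original stream event by event.
class EventBufferingListener extends SparkFirehoseListener {
  val events = new ArrayBuffer[SparkListenerEvent]()
  override def onEvent(event: SparkListenerEvent): Unit = events += event
}
```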

### Why are the changes needed?

As explained above. Tests sometimes fail due to events being received by the `EventLoggingListener` that do not get logged (and thus do not get replayed) but influence other events that get logged.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Existing unit tests.

Closes #25673 from wypoon/SPARK-28770.

Authored-by: Wing Yew Poon <wypoon@cloudera.com>
Signed-off-by: Imran Rashid <irashid@cloudera.com>
### What changes were proposed in this pull request?

Add a listListeners() method to StreamingQueryManager that lists all StreamingQueryListeners that have been added to that manager.

### Why are the changes needed?

While it's best practice to keep handles on all listeners added, it's still nice to have an API to be able to list what listeners have been added to a StreamingQueryManager.
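
A minimal usage sketch of the new API, assuming an active `SparkSession` named `spark`:

```scala
// List every StreamingQueryListener currently registered on the manager.
spark.streams.listListeners().foreach(l => println(l.getClass.getName))
```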

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

Modified existing unit tests to use the new API instead of using reflection.

Closes #25518 from mukulmurthy/26046-listener.

Authored-by: Mukul Murthy <mukul.murthy@gmail.com>
Signed-off-by: Jose Torres <torres.joseph.f+github@gmail.com>
…eference

### What changes were proposed in this pull request?
Document DESCRIBE DATABASE statement in SQL Reference

### Why are the changes needed?

To complete the SQL Reference

### Does this PR introduce any user-facing change?
Yes

#### Before
There is no documentation for this command in sql reference

#### After
![Screen Shot 2019-09-05 at 12 59 32 PM](https://user-images.githubusercontent.com/7550280/64379235-53aec800-cfe3-11e9-8a51-ea55f0455c47.png)
![Screen Shot 2019-09-05 at 12 59 45 PM](https://user-images.githubusercontent.com/7550280/64379247-58737c00-cfe3-11e9-9a51-f12c5c5bc26a.png)

### How was this patch tested?
Used jekyll build and serve to verify

Closes #25528 from kevinyu98/sql-ref-describe.

Lead-authored-by: Kevin Yu <qyu@us.ibm.com>
Co-authored-by: Xiao Li <gatorsmile@gmail.com>
Signed-off-by: Xiao Li <gatorsmile@gmail.com>
@shanyu shanyu merged commit 624206e into shanyu:master Sep 6, 2019
shanyu pushed a commit that referenced this pull request Sep 6, 2019
…te temporary path in local staging directory

## What changes were proposed in this pull request?
The environment of my cluster is as follows:
```
OS: Linux version 2.6.32-220.7.1.el6.x86_64 (mockbuild@c6b18n3.bsys.dev.centos.org) (gcc version 4.4.6 20110731 (Red Hat 4.4.6-3) (GCC) ) #1 SMP Wed Mar 7 00:52:02 GMT 2012
Hadoop: 2.7.2
Spark: 2.3.0 or 3.0.0(master branch)
Hive: 1.2.1
```

My Spark job runs in yarn-client deploy mode.

If I execute the SQL `insert overwrite local directory '/home/test/call_center/' select * from call_center`, a HiveException will appear as follows:
`Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: java.io.IOException: Mkdirs failed to create file:/home/xitong/hive/stagingdir_hive_2019-02-19_17-31-00_678_1816816774691551856-1/-ext-10000/_temporary/0/_temporary/attempt_20190219173233_0002_m_000000_3 (exists=false, cwd=file:/data10/yarn/nm-local-dir/usercache/xitong/appcache/application_1543893582405_6126857/container_e124_1543893582405_6126857_01_000011)
at org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getHiveRecordWriter(HiveFileFormatUtils.java:249)`
Currently, Spark SQL generates a local temporary path in the local staging directory. The scheme of that temporary path starts with `file`, so the HiveException appears.
This PR changes the local temporary path to an HDFS temporary path, and uses a DistributedFileSystem instance to copy the data from the HDFS temporary path to the local directory (a sketch of the copy-back step follows).
If Spark runs in local deploy mode, `insert overwrite local directory` works fine.
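
A hedged sketch of the copy-back step using Hadoop's `FileSystem` API; the staging and destination paths are made up, and this is not the PR's actual code.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val conf = new Configuration()
val fs = FileSystem.get(conf)  // a DistributedFileSystem when fs.defaultFS points at HDFS
val hdfsStaging = new Path("/tmp/hive-staging/insert-overwrite")  // hypothetical HDFS staging path
val localTarget = new Path("file:///home/test/call_center/")      // local destination directory

// After the job writes its output under hdfsStaging, pull it down to the
// local directory instead of writing through a file:-scheme staging path.
fs.copyToLocalFile(false /* delSrc */, hdfsStaging, localTarget)
```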
## How was this patch tested?

Unit tests cannot cover yarn-client mode. The change was tested in my production environment.

Closes apache#23841 from beliefer/fix-bug-of-insert-overwrite-local-dir.

Authored-by: gengjiaan <gengjiaan@360.cn>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
shanyu pushed a commit that referenced this pull request Sep 6, 2019
## What changes were proposed in this pull request?

This PR supports `OpenJ9` in addition to `IBM JDK` and `OpenJDK` in Spark by handling `System.getProperty("java.vendor") = "Eclipse OpenJ9"`.

In `inferDefaultMemory()` and `getKrb5LoginModuleName()`, this PR uses the non-`IBM` code path (a sketch of the branching follows the jshell output below).

```
$ ~/jdk-11.0.2+9_openj9-0.12.1/bin/jshell
|  Welcome to JShell -- Version 11.0.2
|  For an introduction type: /help intro

jshell> System.out.println(System.getProperty("java.vendor"))
Eclipse OpenJ9

jshell> System.out.println(System.getProperty("java.vm.info"))
JRE 11 Linux amd64-64-Bit Compressed References 20190204_127 (JIT enabled, AOT enabled)
OpenJ9   - 90dd8cb40
OMR      - d2f4534b
JCL      - 289c70b6844 based on jdk-11.0.2+9

jshell> System.out.println(Class.forName("com.ibm.lang.management.OperatingSystemMXBean").getDeclaredMethod("getTotalPhysicalMemory"))
public abstract long com.ibm.lang.management.OperatingSystemMXBean.getTotalPhysicalMemory()

jshell> System.out.println(Class.forName("com.sun.management.OperatingSystemMXBean").getDeclaredMethod("getTotalPhysicalMemorySize"))
public abstract long com.sun.management.OperatingSystemMXBean.getTotalPhysicalMemorySize()

jshell> System.out.println(Class.forName("com.ibm.security.auth.module.Krb5LoginModule"))
|  Exception java.lang.ClassNotFoundException: com.ibm.security.auth.module.Krb5LoginModule
|        at Class.forNameImpl (Native Method)
|        at Class.forName (Class.java:339)
|        at (#1:1)

jshell> System.out.println(Class.forName("com.sun.security.auth.module.Krb5LoginModule"))
class com.sun.security.auth.module.Krb5LoginModule
```
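
Based on the jshell output above, a hedged sketch of the kind of vendor-based branching involved; this is illustrative, not the exact condition the PR uses.

```scala
// Treat OpenJ9 like OpenJDK for the Kerberos login module, and reserve the
// com.ibm module for the IBM JDK only.
val vendor = System.getProperty("java.vendor")
val isIbmJdk = vendor.contains("IBM") && !vendor.contains("OpenJ9")

val krb5LoginModule =
  if (isIbmJdk) "com.ibm.security.auth.module.Krb5LoginModule"
  else "com.sun.security.auth.module.Krb5LoginModule"
```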

## How was this patch tested?

Existing test suites
Manual testing with OpenJ9.

Closes apache#24308 from kiszk/SPARK-27397.

Authored-by: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
shanyu pushed a commit that referenced this pull request Sep 6, 2019
…comparison assertions

## What changes were proposed in this pull request?

This PR removes a few hardware-dependent assertions which can cause a failure on `aarch64`; a short illustration follows the outputs below.

**x86_64**
```
root@donotdel-openlab-allinone-l00242678:/home/ubuntu# uname -a
Linux donotdel-openlab-allinone-l00242678 4.4.0-154-generic #181-Ubuntu SMP Tue Jun 25 05:29:03 UTC
2019 x86_64 x86_64 x86_64 GNU/Linux

scala> import java.lang.Float.floatToRawIntBits
import java.lang.Float.floatToRawIntBits
scala> floatToRawIntBits(0.0f/0.0f)
res0: Int = -4194304
scala> floatToRawIntBits(Float.NaN)
res1: Int = 2143289344
```

**aarch64**
```
[root@arm-huangtianhua spark]# uname -a
Linux arm-huangtianhua 4.14.0-49.el7a.aarch64 #1 SMP Tue Apr 10 17:22:26 UTC 2018 aarch64 aarch64 aarch64 GNU/Linux

scala> import java.lang.Float.floatToRawIntBits
import java.lang.Float.floatToRawIntBits
scala> floatToRawIntBits(0.0f/0.0f)
res1: Int = 2143289344
scala> floatToRawIntBits(Float.NaN)
res2: Int = 2143289344
```
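
A short illustration of why the assertions were hardware-dependent, using only the JDK's own `Float` API; the exact assertions removed by this PR may differ.

```scala
import java.lang.Float.{floatToIntBits, floatToRawIntBits}

// floatToRawIntBits preserves whatever NaN bit pattern the hardware produced,
// which is why the two architectures above disagree; floatToIntBits
// canonicalizes every NaN to the same bits, so it is the platform-independent
// choice for comparisons.
val raw = floatToRawIntBits(0.0f / 0.0f)     // -4194304 on x86_64, 2143289344 on aarch64
val canonical = floatToIntBits(0.0f / 0.0f)  // 2143289344 on both
assert(canonical == floatToIntBits(Float.NaN))
```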

## How was this patch tested?

Passes Jenkins (this change removes the affected test coverage).

Closes apache#25186 from huangtianhua/special-test-case-for-aarch64.

Authored-by: huangtianhua <huangtianhua@huawei.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
shanyu pushed a commit that referenced this pull request Feb 16, 2020
### What changes were proposed in this pull request?
`org.apache.spark.sql.kafka010.KafkaDelegationTokenSuite` has been failing lately. A look at the logs shows only the following fact, without any details:
```
Caused by: sbt.ForkMain$ForkError: sun.security.krb5.KrbException: Server not found in Kerberos database (7) - Server not found in Kerberos database
```
Since the issue is intermittent and we are not able to reproduce it, we should add more debug information and wait for it to reoccur with the extended logs (a sketch of the usual way to enable such output follows).
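
A hedged sketch of how this kind of Kerberos debug output is usually enabled via JDK system properties; this is illustrative and not necessarily the PR's exact change.

```scala
// Set before any security classes load, e.g. at the very start of the test JVM.
object KafkaKrbDebug {
  def main(args: Array[String]): Unit = {
    System.setProperty("sun.security.krb5.debug", "true")
    System.setProperty("sun.security.jgss.debug", "true")
    // ... run the Kerberos-dependent test code here ...
  }
}
```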

### Why are the changes needed?
The failing test doesn't give enough debug information.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
I started the test manually and checked that additional debug messages like these show up:
```
>>> KrbApReq: APOptions are 00000000 00000000 00000000 00000000
>>> EType: sun.security.krb5.internal.crypto.Aes128CtsHmacSha1EType
Looking for keys for: kafka/localhost@EXAMPLE.COM
Added key: 17version: 0
Added key: 23version: 0
Added key: 16version: 0
Found unsupported keytype (3) for kafka/localhost@EXAMPLE.COM
>>> EType: sun.security.krb5.internal.crypto.Aes128CtsHmacSha1EType
Using builtin default etypes for permitted_enctypes
default etypes for permitted_enctypes: 17 16 23.
>>> EType: sun.security.krb5.internal.crypto.Aes128CtsHmacSha1EType
MemoryCache: add 1571936500/174770/16C565221B70AAB2BEFE31A83D13A2F4/client/localhost@EXAMPLE.COM to client/localhost@EXAMPLE.COM|kafka/localhost@EXAMPLE.COM
MemoryCache: Existing AuthList:
#3: 1571936493/200803/8CD70D280B0862C5DA1FF901ECAD39FE/client/localhost@EXAMPLE.COM
#2: 1571936499/985009/BAD33290D079DD4E3579A8686EC326B7/client/localhost@EXAMPLE.COM
#1: 1571936499/995208/B76B9D78A9BE283AC78340157107FD40/client/localhost@EXAMPLE.COM
```

Closes apache#26252 from gaborgsomogyi/SPARK-29580.

Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
shanyu pushed a commit that referenced this pull request Mar 4, 2020
### What changes were proposed in this pull request?
Currently the join operators are not well abstracted, since there is a lot of common logic among them. A trait can be created for easier pattern matching and other future convenience (a sketch follows the lists below). This is a follow-up PR based on the comment at
apache#27509 (comment) .

This PR refined from the following aspects:
1. Refined structure of all physical join operators
2. Add missing joinType field for CartesianProductExec operator
3. Refined codes related to Explain Formatted

The EXPLAIN FORMATTED changes are
1. Converge all join operator `verboseStringWithOperatorId` implementations to `BaseJoinExec`. The join condition is displayed, and the join keys are displayed if they are not empty.
2. `#1` will add the join condition to `BroadcastNestedLoopJoinExec`.
3. `#1` will **NOT** affect `CartesianProductExec`, `SortMergeJoin` and `HashJoin`s, since they already had their own override implementations before.
4. Converge all join operator `simpleStringWithNodeId` implementations to `BaseJoinExec`, which enhances the one-line description for `CartesianProductExec` by adding the `JoinType`.
5. Override `simpleStringWithNodeId` in `BroadcastNestedLoopJoinExec` to show `BuildSide`, which was only done for `HashJoin`s before.
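
As referenced above, a hedged sketch of the shared shape the refactoring gives the join operators; the real `BaseJoinExec` extends `SparkPlan` and carries more members than shown here.

```scala
import org.apache.spark.sql.catalyst.expressions.Expression
import org.apache.spark.sql.catalyst.plans.JoinType

// Common fields every physical join operator exposes, enabling uniform
// pattern matching and shared EXPLAIN FORMATTED output. Illustrative only.
trait BaseJoinLike {
  def joinType: JoinType
  def condition: Option[Expression]
  def leftKeys: Seq[Expression]
  def rightKeys: Seq[Expression]
}
```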

### Why are the changes needed?
Make the code consistent with other operators and ease future work on join operators.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Existing tests

Closes apache#27595 from Eric5553/RefineJoin.

Authored-by: Eric Wu <492960551@qq.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>