Conversation

@ulysses-you

What changes were proposed in this pull request?

(Please fill in changes proposed in this fix)

How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

Please review https://spark.apache.org/contributing.html before opening a pull request.

Ngone51 and others added 30 commits July 9, 2019 15:49
…kLauncher on Windows

## What changes were proposed in this pull request?

When using SparkLauncher to submit applications **concurrently** from multiple threads on **Windows**, some apps fail with "The process cannot access the file because it is being used by another process" and remain in the LOST state at the end. The issue can be reproduced with this [demo](https://issues.apache.org/jira/secure/attachment/12973920/Main.scala).

After digging into the code, I found that the Windows cmd `%RANDOM%` returns the same number if it is called again shortly (e.g., < 500ms) after the last call. As a result, SparkLauncher would get the same output file (spark-class-launcher-output-%RANDOM%.txt) for different apps. The following app then hits the issue when it tries to write to that file, which has already been opened for writing by another app.

We should make sure to generate a unique output file for SparkLauncher on Windows to avoid this issue.
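
A minimal sketch of the idea, assuming the unique name can be produced on the JVM side before invoking the launcher script; `File.createTempFile` guarantees a fresh name even under concurrent calls, unlike `%RANDOM%`:

```scala
import java.io.File

// Sketch only: let the JVM create a collision-free file instead of relying on
// cmd's %RANDOM%, which can repeat when invoked twice within ~500ms.
// The prefix and directory handling are illustrative, not Spark's actual code.
def uniqueLauncherOutputFile(dir: File): File =
  File.createTempFile("spark-class-launcher-output-", ".txt", dir)
```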

## How was this patch tested?

Tested manually on Windows.

Closes #25076 from Ngone51/SPARK-28302.

Authored-by: wuyi <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
## What changes were proposed in this pull request?

This PR is to port int2.sql from PostgreSQL regression tests. https://github.com/postgres/postgres/blob/REL_12_BETA2/src/test/regress/sql/int2.sql

The expected results can be found in the link: https://github.com/postgres/postgres/blob/REL_12_BETA2/src/test/regress/expected/int2.out

When porting the test cases, we found two PostgreSQL-specific features that do not exist in Spark SQL:
[SPARK-28023](https://issues.apache.org/jira/browse/SPARK-28023): Trim the string when cast string type to other types
[SPARK-28027](https://issues.apache.org/jira/browse/SPARK-28027): Add bitwise shift left/right operators

We also found a bug:
[SPARK-28024](https://issues.apache.org/jira/browse/SPARK-28024): Incorrect value when out of range

We also found three behavioral inconsistencies:
[SPARK-27923](https://issues.apache.org/jira/browse/SPARK-27923): Invalid input syntax for smallint throws exception at PostgreSQL
[SPARK-28028](https://issues.apache.org/jira/browse/SPARK-28028): Cast numeric to integral type need round
[SPARK-2659](https://issues.apache.org/jira/browse/SPARK-2659): HiveQL: Division operator should always perform fractional division. For example:
```sql
select 1/2;
```
returns `0.5` in Spark SQL (fractional division) but `0` in PostgreSQL (integer division over integer operands).

## How was this patch tested?

N/A

Closes #24853 from wangyum/SPARK-28029.

Authored-by: Yuming Wang <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
…edRowMatrix constructors

## What changes were proposed in this pull request?

In both cases, the input `DataFrame` schema must contain only the information required for the matrix object: a vector column in the case of `RowMatrix`, and long and vector columns for `IndexedRowMatrix`.
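
For context, a sketch of the manual conversion these constructors replace; the schema requirement is exactly what the pattern matches below encode (column order and names are assumptions):

```scala
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix, RowMatrix}
import org.apache.spark.sql.{DataFrame, Row}

// Sketch of the pre-existing, manual route from a DataFrame to the matrix types.
def toRowMatrix(df: DataFrame): RowMatrix =
  new RowMatrix(df.rdd.map { case Row(v: Vector) => v })

def toIndexedRowMatrix(df: DataFrame): IndexedRowMatrix =
  new IndexedRowMatrix(df.rdd.map { case Row(i: Long, v: Vector) => IndexedRow(i, v) })
```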

## How was this patch tested?

Unit tests that verify:
- `RowMatrix` and `IndexedRowMatrix` can be created from `DataFrame`s
- If the schema does not match expectations, we throw an `IllegalArgumentException`

Please review https://spark.apache.org/contributing.html before opening a pull request.

Closes #24953 from henrydavidge/row-matrix-df.

Authored-by: Henry D <[email protected]>
Signed-off-by: Sean Owen <[email protected]>
## What changes were proposed in this pull request?

This PR is to port int8.sql from PostgreSQL regression tests. https://github.com/postgres/postgres/blob/REL_12_BETA2/src/test/regress/sql/int8.sql

The expected results can be found in the link: https://github.com/postgres/postgres/blob/REL_12_BETA2/src/test/regress/expected/int8.out

When porting the test cases, we found two PostgreSQL-specific features that do not exist in Spark SQL:
[SPARK-28137](https://issues.apache.org/jira/browse/SPARK-28137): Missing Data Type Formatting Functions
[SPARK-28027](https://issues.apache.org/jira/browse/SPARK-28027): Missing some mathematical operators

We also found four behavioral inconsistencies:
[SPARK-26218](https://issues.apache.org/jira/browse/SPARK-26218): Throw exception on overflow for integers
[SPARK-27923](https://issues.apache.org/jira/browse/SPARK-27923): Spark SQL insert bad inputs to NULL
[SPARK-28028](https://issues.apache.org/jira/browse/SPARK-28028): Cast numeric to integral type need round
[SPARK-2659](https://issues.apache.org/jira/browse/SPARK-2659): HiveQL: Division operator should always perform fractional division, for example:
```sql
select 1/2;
```

## How was this patch tested?

N/A

Closes #24933 from wangyum/SPARK-28136.

Authored-by: Yuming Wang <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
…-3.2)

## What changes were proposed in this pull request?

Since [SPARK-23710](https://issues.apache.org/jira/browse/SPARK-23710), Hadoop 3.x can support Hive. This PR adds _build with `hadoop-3.2`_ to building-spark.md.

## How was this patch tested?

manual tests
```
cd docs
SKIP_API=1 jekyll build
```
![image](https://user-images.githubusercontent.com/5399861/60942057-cf5a0480-a313-11e9-9534-4765520e799f.png)

Closes #25063 from wangyum/SPARK-28267.

Authored-by: Yuming Wang <[email protected]>
Signed-off-by: Sean Owen <[email protected]>
…ration

## What changes were proposed in this pull request?

Up to now, Apache Spark has maintained the given event log directory under a **time**-based policy, `spark.history.fs.cleaner.maxAge`. However, there are two issues.
1. Some file systems have a limit on the maximum number of files in a single directory. For example, HDFS's `dfs.namenode.fs-limits.max-directory-items` is 1024 * 1024 by default.
https://hadoop.apache.org/docs/r3.2.0/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml
2. Spark is sometimes unable to clean up some old log files due to permission issues (mainly, security policy).

To handle both (1) and (2), this PR aims to support an additional policy configuration for the maximum number of files in the event log directory, `spark.history.fs.cleaner.maxNum`. Spark will try to keep the number of files in the event log directory according to this policy.
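
A simplified sketch of the count-based retention idea (not the actual `FsHistoryProvider` code), assuming a Hadoop `FileSystem` handle:

```scala
import org.apache.hadoop.fs.{FileSystem, Path}

// Keep at most `maxNum` newest files and try to delete the rest,
// tolerating files we lack permission to remove.
def enforceMaxNum(fs: FileSystem, logDir: Path, maxNum: Int): Unit = {
  val newestFirst = fs.listStatus(logDir).sortBy(-_.getModificationTime)
  newestFirst.drop(maxNum).foreach { status =>
    try fs.delete(status.getPath, true)
    catch { case _: java.io.IOException => () } // e.g. security policy; skip
  }
}
```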

## How was this patch tested?

Pass the Jenkins with a newly added test case.

Closes #25072 from dongjoon-hyun/SPARK-28294.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
…PECT) NULLS]?) syntax

## What changes were proposed in this pull request?
According to ANSI SQL:2011:
![image](https://user-images.githubusercontent.com/698621/60855327-d01c6900-a235-11e9-9a1b-d438615a4673.png)

Teradata, Oracle, and Redshift already support this grammar:

- Teradata - https://docs.teradata.com/reader/756LNiPSFdY~4JcCCcR5Cw/SUwCpTupqmlBJvi2mipOaA
- Oracle - https://docs.oracle.com/en/database/oracle/oracle-database/18/sqlrf/FIRST_VALUE.html#GUID-D454EC3F-370C-4C64-9B11-33FCB10D95EC
- Redshift - https://docs.aws.amazon.com/redshift/latest/dg/r_WF_first_value.html

- PostgreSQL does not implement this grammar:
https://www.postgresql.org/docs/devel/functions-window.html

  >The SQL standard defines a RESPECT NULLS or IGNORE NULLS option for lead, lag, first_value, last_value, and nth_value. This is not implemented in PostgreSQL: the behavior is always the same as the standard's default, namely RESPECT NULLS.
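
For comparison, Spark's DataFrame API already exposes the same semantics for `first`/`last` through an `ignoreNulls` flag; a sketch (column names `ts` and `v` are assumptions):

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, first, last}

// Illustrative only: the DataFrame-side flag that corresponds to IGNORE NULLS.
val w = Window.orderBy("ts")
  .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
val firstNonNull = first(col("v"), ignoreNulls = true).over(w)
val lastNonNull  = last(col("v"), ignoreNulls = true).over(w)
```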

## How was this patch tested?
UT.

Closes #25082 from lipzhu/SPARK-28310.

Authored-by: Zhu, Lipeng <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
…ontextFactory

## What changes were proposed in this pull request?

`SslContextFactory` is deprecated as of Jetty 9.4, and we are using `9.4.18.v20190429`. This PR aims to replace it with `SslContextFactory.Server`.
- https://www.eclipse.org/jetty/javadoc/9.4.19.v20190610/org/eclipse/jetty/util/ssl/SslContextFactory.html
- https://www.eclipse.org/jetty/javadoc/9.3.24.v20180605/org/eclipse/jetty/util/ssl/SslContextFactory.html

```
[WARNING] /Users/dhyun/APACHE/spark/core/src/main/scala/org/apache/spark/SSLOptions.scala:71:
constructor SslContextFactory in class SslContextFactory is deprecated:
see corresponding Javadoc for more information.
[WARNING]       val sslContextFactory = new SslContextFactory()
[WARNING]                               ^
```
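
The change itself is mechanical; a sketch of the replacement flagged above in `SSLOptions.scala`:

```scala
import org.eclipse.jetty.util.ssl.SslContextFactory

// Before (deprecated since Jetty 9.4):
//   val sslContextFactory = new SslContextFactory()
// After: the server-side subclass that replaces the deprecated constructor.
val sslContextFactory = new SslContextFactory.Server()
```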

## How was this patch tested?

Pass the Jenkins with the existing tests.

Closes #25067 from dongjoon-hyun/SPARK-28290.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
…sync commit

## What changes were proposed in this pull request?

`DirectKafkaStreamSuite.offset recovery from kafka` commits offsets to Kafka with the `Consumer.commitAsync` API (and then reads them back). Since this API is asynchronous, it may deliver notifications late (or not at all). The test assumes that if the data was sent and collected, then the offset must have been committed as well, which is not true.

In this PR I've made the following modifications:
* Wait for async offset commit before context stopped
* Added commit succeed log to see whether it arrived at all
* Using `ConcurrentHashMap` for committed offsets, because two threads access the variable (`JobGenerator` and `ScalaTest...`); the waiting pattern is sketched below
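
A sketch of the waiting pattern, with illustrative names rather than the suite's exact code: record commits from the async callback in a thread-safe map, then poll it until the expected offsets appear before stopping the context.

```scala
import java.util.{Map => JMap}
import java.util.concurrent.ConcurrentHashMap
import org.apache.kafka.clients.consumer.{OffsetAndMetadata, OffsetCommitCallback}
import org.apache.kafka.common.TopicPartition

// Thread-safe because both the commit callback thread and the test thread touch it.
val committed = new ConcurrentHashMap[TopicPartition, OffsetAndMetadata]()

val callback = new OffsetCommitCallback {
  override def onComplete(
      offsets: JMap[TopicPartition, OffsetAndMetadata], e: Exception): Unit = {
    if (e == null) {
      committed.putAll(offsets)              // record what actually got committed
      println(s"Commit succeeded: $offsets") // log arrival for debugging
    }
  }
}
// ... pass `callback` to commitAsync, then poll `committed` until the
// expected offsets appear before stopping the StreamingContext.
```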

## How was this patch tested?

Ran the existing unit test in a loop, plus Jenkins runs.

Closes #25100 from gaborgsomogyi/SPARK-28335.

Authored-by: Gabor Somogyi <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
…ndition

## What changes were proposed in this pull request?

There is a bug in `ExtractPythonUDFs` that produces wrong result attributes. It causes a failure when using `PythonUDF`s among multiple child plans, e.g., a join. An example is using `PythonUDF`s in a join condition.

```python
>>> left = spark.createDataFrame([Row(a=1, a1=1, a2=1), Row(a=2, a1=2, a2=2)])
>>> right = spark.createDataFrame([Row(b=1, b1=1, b2=1), Row(b=1, b1=3, b2=1)])
>>> f = udf(lambda a: a, IntegerType())
>>> df = left.join(right, [f("a") == f("b"), left.a1 == right.b1])
>>> df.collect()
19/07/10 12:20:49 ERROR Executor: Exception in task 5.0 in stage 0.0 (TID 5)
java.lang.ArrayIndexOutOfBoundsException: 1
        at org.apache.spark.sql.catalyst.expressions.GenericInternalRow.genericGet(rows.scala:201)
        at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getAs(rows.scala:35)
        at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.isNullAt(rows.scala:36)
        at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.isNullAt$(rows.scala:36)
        at org.apache.spark.sql.catalyst.expressions.GenericInternalRow.isNullAt(rows.scala:195)
        at org.apache.spark.sql.catalyst.expressions.JoinedRow.isNullAt(JoinedRow.scala:70)
        ...
```

## How was this patch tested?

Added test.

Closes #25091 from viirya/SPARK-28323.

Authored-by: Liang-Chi Hsieh <[email protected]>
Signed-off-by: Bryan Cutler <[email protected]>
…o get resources

## What changes were proposed in this pull request?

Add Python API support and `JavaSparkContext` support for `resources()`. The `JavaSparkContext` support is needed so the call translates properly into Python through Py4J.
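
For reference, a sketch of the Scala-side usage that the new APIs mirror, assuming an active `SparkContext` named `sc` and that `ResourceInformation` exposes an `addresses` field:

```scala
// Illustrative usage of the resources API from Scala; the PR adds the
// JavaSparkContext and Python (`sc.resources`) counterparts of this.
sc.resources.foreach { case (name, info) =>
  println(s"resource: $name, addresses: ${info.addresses.mkString(",")}")
}
```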

## How was this patch tested?

Unit tests added, and manually tested in local cluster mode and on YARN.

Closes #25087 from tgravescs/SPARK-28234-python.

Authored-by: Thomas Graves <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
… into UDF test base

## What changes were proposed in this pull request?

This PR adds some tests converted from `natural-join.sql` to test UDFs, following the combination guide in [SPARK-27921](https://issues.apache.org/jira/browse/SPARK-27921).

<details><summary>Diff results comparing to `natural-join.sql`</summary>
<p>

```diff
diff --git a/sql/core/src/test/resources/sql-tests/results/udf/udf-natural-join.sql.out b/sql/core/src/test/resources/sql-tests/results/udf/udf-natural-join.
sql.out
index 43f2f9a..53ef177 100644
--- a/sql/core/src/test/resources/sql-tests/results/udf/udf-natural-join.sql.out
+++ b/sql/core/src/test/resources/sql-tests/results/udf/udf-natural-join.sql.out
 -27,7 +27,7  struct<>

 -- !query 2
-SELECT * FROM nt1 natural join nt2 where k = "one"
+SELECT * FROM nt1 natural join nt2 where udf(k) = "one"
 -- !query 2 schema
 struct<k:string,v1:int,v2:int>
 -- !query 2 output
 -36,7 +36,7  one   1       5

 -- !query 3
-SELECT * FROM nt1 natural left join nt2 order by v1, v2
+SELECT * FROM nt1 natural left join nt2 where k <> udf("") order by v1, v2
 -- !query 3 schema
 struct<k:string,v1:int,v2:int>
```

</p>
</details>

## How was this patch tested?

Tested as guided in [SPARK-27921](https://issues.apache.org/jira/browse/SPARK-27921).

Closes #25088 from manuzhang/SPARK-27922.

Authored-by: manu.zhang <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
…DF test base

## What changes were proposed in this pull request?

This PR adds some tests converted from 'count.sql' to test UDFs.

<details><summary>Diff comparing to 'count.sql'</summary>
<p>

```diff
diff --git a/sql/core/src/test/resources/sql-tests/results/count.sql.out b/sql/core/src/test/resources/sql-tests/results/udf/udf-count.sql.out
index b8a86d4..9476937 100644
--- a/sql/core/src/test/resources/sql-tests/results/count.sql.out
+++ b/sql/core/src/test/resources/sql-tests/results/udf/udf-count.sql.out
 -14,42 +14,42  struct<>

 -- !query 1
 SELECT
-  count(*), count(1), count(null), count(a), count(b), count(a + b), count((a, b))
+  udf(count(*)), udf(count(1)), udf(count(null)), udf(count(a)), udf(count(b)), udf(count(a + b)), udf(count((a, b)))
 FROM testData
 -- !query 1 schema
-struct<count(1):bigint,count(1):bigint,count(NULL):bigint,count(a):bigint,count(b):bigint,count((a + b)):bigint,count(named_struct(a, a, b, b)):bigint>
+struct<udf(count(1)):string,udf(count(1)):string,udf(count(null)):string,udf(count(a)):string,udf(count(b)):string,udf(count((a + b))):string,udf(count(named_struct(a, a, b, b))):string>
 -- !query 1 output
 7	7	0	5	5	4	7

 -- !query 2
 SELECT
-  count(DISTINCT 1),
-  count(DISTINCT null),
-  count(DISTINCT a),
-  count(DISTINCT b),
-  count(DISTINCT (a + b)),
-  count(DISTINCT (a, b))
+  udf(count(DISTINCT 1)),
+  udf(count(DISTINCT null)),
+  udf(count(DISTINCT a)),
+  udf(count(DISTINCT b)),
+  udf(count(DISTINCT (a + b))),
+  udf(count(DISTINCT (a, b)))
 FROM testData
 -- !query 2 schema
-struct<count(DISTINCT 1):bigint,count(DISTINCT NULL):bigint,count(DISTINCT a):bigint,count(DISTINCT b):bigint,count(DISTINCT (a + b)):bigint,count(DISTINCT named_struct(a, a, b, b)):bigint>
+struct<udf(count(distinct 1)):string,udf(count(distinct null)):string,udf(count(distinct a)):string,udf(count(distinct b)):string,udf(count(distinct (a + b))):string,udf(count(distinct named_struct(a, a, b, b))):string>
 -- !query 2 output
 1	0	2	2	2	6

 -- !query 3
-SELECT count(a, b), count(b, a), count(testData.*) FROM testData
+SELECT udf(count(a, b)), udf(count(b, a)), udf(count(testData.*)) FROM testData
 -- !query 3 schema
-struct<count(a, b):bigint,count(b, a):bigint,count(a, b):bigint>
+struct<udf(count(a, b)):string,udf(count(b, a)):string,udf(count(a, b)):string>
 -- !query 3 output
 4	4	4

 -- !query 4
 SELECT
-  count(DISTINCT a, b), count(DISTINCT b, a), count(DISTINCT *), count(DISTINCT testData.*)
+  udf(count(DISTINCT a, b)), udf(count(DISTINCT b, a)), udf(count(DISTINCT *)), udf(count(DISTINCT testData.*))
 FROM testData
 -- !query 4 schema
-struct<count(DISTINCT a, b):bigint,count(DISTINCT b, a):bigint,count(DISTINCT a, b):bigint,count(DISTINCT a, b):bigint>
+struct<udf(count(distinct a, b)):string,udf(count(distinct b, a)):string,udf(count(distinct a, b)):string,udf(count(distinct a, b)):string>
 -- !query 4 output
 3	3	3	3

```

</p>
</details>

## How was this patch tested?

Tested as guided in SPARK-27921.

Closes #25089 from vinodkc/br_Fix_SPARK-28275.

Authored-by: Vinod KC <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
…part2.sql' into UDF test base

## What changes were proposed in this pull request?

This PR adds some tests converted from `pgSQL/aggregates_part2.sql` to test UDFs. Please see the contribution guide of this umbrella ticket - [SPARK-27921](https://issues.apache.org/jira/browse/SPARK-27921).

<details><summary>Diff comparing to 'pgSQL/aggregates_part2.sql'</summary>
<p>

```diff
diff --git a/sql/core/src/test/resources/sql-tests/results/pgSQL/aggregates_part2.sql.out b/sql/core/src/test/resources/sql-tests/results/udf/pgSQL/udf-aggregates_part2.sql.out
index 2606d2e..00c06f9 100644
--- a/sql/core/src/test/resources/sql-tests/results/pgSQL/aggregates_part2.sql.out
+++ b/sql/core/src/test/resources/sql-tests/results/udf/pgSQL/udf-aggregates_part2.sql.out
 -57,23 +57,23  true        false   true    false   true    true    true    true    true

 -- !query 3
-select min(unique1) from tenk1
+select min(udf(unique1)) from tenk1
 -- !query 3 schema
-struct<min(unique1):int>
+struct<min(udf(unique1)):string>
 -- !query 3 output
 0

 -- !query 4
-select max(unique1) from tenk1
+select udf(max(unique1)) from tenk1
 -- !query 4 schema
-struct<max(unique1):int>
+struct<udf(max(unique1)):string>
 -- !query 4 output
 9999

 -- !query 5
-select max(unique1) from tenk1 where unique1 < 42
+select max(unique1) from tenk1 where udf(unique1) < 42
 -- !query 5 schema
 struct<max(unique1):int>
 -- !query 5 output
 -81,7 +81,7  struct<max(unique1):int>

 -- !query 6
-select max(unique1) from tenk1 where unique1 > 42
+select max(unique1) from tenk1 where unique1 > udf(42)
 -- !query 6 schema
 struct<max(unique1):int>
 -- !query 6 output
 -89,7 +89,7  struct<max(unique1):int>

 -- !query 7
-select max(unique1) from tenk1 where unique1 > 42000
+select max(unique1) from tenk1 where udf(unique1) > 42000
 -- !query 7 schema
 struct<max(unique1):int>
 -- !query 7 output
 -97,7 +97,7  NULL

 -- !query 8
-select max(tenthous) from tenk1 where thousand = 33
+select max(tenthous) from tenk1 where udf(thousand) = 33
 -- !query 8 schema
 struct<max(tenthous):int>
 -- !query 8 output
 -105,7 +105,7  struct<max(tenthous):int>

 -- !query 9
-select min(tenthous) from tenk1 where thousand = 33
+select min(tenthous) from tenk1 where udf(thousand) = 33
 -- !query 9 schema
 struct<min(tenthous):int>
 -- !query 9 output
 -113,15 +113,15  struct<min(tenthous):int>

 -- !query 10
-select distinct max(unique2) from tenk1
+select distinct max(udf(unique2)) from tenk1
 -- !query 10 schema
-struct<max(unique2):int>
+struct<max(udf(unique2)):string>
 -- !query 10 output
 9999

 -- !query 11
-select max(unique2) from tenk1 order by 1
+select max(unique2) from tenk1 order by udf(1)
 -- !query 11 schema
 struct<max(unique2):int>
 -- !query 11 output
 -129,7 +129,7  struct<max(unique2):int>

 -- !query 12
-select max(unique2) from tenk1 order by max(unique2)
+select max(unique2) from tenk1 order by max(udf(unique2))
 -- !query 12 schema
 struct<max(unique2):int>
 -- !query 12 output
 -137,7 +137,7  struct<max(unique2):int>

 -- !query 13
-select max(unique2) from tenk1 order by max(unique2)+1
+select udf(max(udf(unique2))) from tenk1 order by udf(max(unique2))+1
 -- !query 13 schema
-struct<max(unique2):int>
+struct<udf(max(udf(unique2))):string>
 -- !query 13 output
 9999

 -- !query 14
-select t1.max_unique2, g from (select max(unique2) as max_unique2 FROM tenk1) t1 LATERAL VIEW explode(array(1,2,3)) t2 AS g order by g desc
+select t1.max_unique2, udf(g) from (select max(udf(unique2)) as max_unique2 FROM tenk1) t1 LATERAL VIEW explode(array(1,2,3)) t2 AS g order by g desc
 -- !query 14 schema
-struct<max_unique2:int,g:int>
+struct<max_unique2:string,udf(g):string>
 -- !query 14 output
 9999   3
 9999   2
 -155,8 +155,8  struct<max_unique2:int,g:int>

 -- !query 15
-select max(100) from tenk1
+select udf(max(100)) from tenk1
 -- !query 15 schema
-struct<max(100):int>
+struct<udf(max(100)):string>
 -- !query 15 output
 100
```

</p>
</details>

## How was this patch tested?

Tested as guided in [SPARK-27921](https://issues.apache.org/jira/browse/SPARK-27921).

Closes #25086 from imback82/udf_test.

Authored-by: Terry Kim <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
…UDF test base

## What changes were proposed in this pull request?

This PR adds some tests converted from `having.sql` to test UDFs, following the combination guide in [SPARK-27921](https://issues.apache.org/jira/browse/SPARK-27921).
<details><summary>Diff comparing to 'having.sql'</summary>
<p>

```diff
diff --git a/sql/core/src/test/resources/sql-tests/results/having.sql.out b/sql/core/src/test/resources/sql-tests/results/udf/udf-having.sql.out
index d87ee52..7cea2e5 100644
--- a/sql/core/src/test/resources/sql-tests/results/having.sql.out
+++ b/sql/core/src/test/resources/sql-tests/results/udf/udf-having.sql.out
 -16,34 +16,34  struct<>

 -- !query 1
-SELECT k, sum(v) FROM hav GROUP BY k HAVING sum(v) > 2
+SELECT udf(k) AS k, udf(sum(v)) FROM hav GROUP BY k HAVING udf(sum(v)) > 2
 -- !query 1 schema
-struct<k:string,sum(v):bigint>
+struct<k:string,udf(sum(cast(v as bigint))):string>
 -- !query 1 output
 one    6
 three  3

 -- !query 2
-SELECT count(k) FROM hav GROUP BY v + 1 HAVING v + 1 = 2
+SELECT udf(count(udf(k))) FROM hav GROUP BY v + 1 HAVING v + 1 = udf(2)
 -- !query 2 schema
-struct<count(k):bigint>
+struct<udf(count(udf(k))):string>
 -- !query 2 output
 1

 -- !query 3
-SELECT MIN(t.v) FROM (SELECT * FROM hav WHERE v > 0) t HAVING(COUNT(1) > 0)
+SELECT udf(MIN(t.v)) FROM (SELECT * FROM hav WHERE v > 0) t HAVING(udf(COUNT(udf(1))) > 0)
 -- !query 3 schema
-struct<min(v):int>
+struct<udf(min(v)):string>
 -- !query 3 output
 1

 -- !query 4
-SELECT a + b FROM VALUES (1L, 2), (3L, 4) AS T(a, b) GROUP BY a + b HAVING a + b > 1
+SELECT udf(a + b) FROM VALUES (1L, 2), (3L, 4) AS T(a, b) GROUP BY a + b HAVING a + b > udf(1)
 -- !query 4 schema
-struct<(a + CAST(b AS BIGINT)):bigint>
+struct<udf((a + cast(b as bigint))):string>
 -- !query 4 output
 3
 7

```

</p>
</details>

## How was this patch tested?

Tested as guided in SPARK-27921.

Closes #25093 from huaxingao/spark-28281.

Authored-by: Huaxin Gao <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
…INUTE|SECOND)' and 'MINUTE TO SECOND'

## What changes were proposed in this pull request?
The interval conversion behavior is the same as PostgreSQL's.

https://github.com/postgres/postgres/blob/REL_12_BETA2/src/test/regress/sql/interval.sql#L180-L203

## How was this patch tested?
UT.

Closes #25000 from lipzhu/SPARK-28107.

Lead-authored-by: Zhu, Lipeng <[email protected]>
Co-authored-by: Dongjoon Hyun <[email protected]>
Co-authored-by: Lipeng Zhu <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
## What changes were proposed in this pull request?

This fixes a problem where it was possible to create a v2 table using the default catalog that could not be loaded with the session catalog. A session catalog should be used when the v1 catalog is responsible for tables whose identifiers contain no catalog.

* Adds a v2 catalog implementation that delegates to the analyzer's SessionCatalog (the delegation idea is sketched after this list)
* Uses the v2 session catalog for CTAS and CreateTable when the provider is a v2 provider and no v2 catalog is in the table identifier
* Updates catalog lookup to always provide the default if it is set for consistent behavior
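
A heavily simplified sketch of the delegation idea only; the trait below is a hypothetical stand-in, not Spark's actual `TableCatalog` API:

```scala
// Hypothetical stand-in API, for illustrating the delegation pattern only;
// none of these signatures are Spark's real TableCatalog signatures.
trait SimpleTableCatalog {
  def createTable(name: String, schema: Seq[(String, String)]): Unit
  def loadTable(name: String): AnyRef
}

// The "v2 session catalog" forwards to the v1 SessionCatalog, so tables
// created through v2 code paths remain loadable through v1 lookups.
class V2SessionCatalogSketch(v1Delegate: SimpleTableCatalog) extends SimpleTableCatalog {
  override def createTable(name: String, schema: Seq[(String, String)]): Unit =
    v1Delegate.createTable(name, schema)
  override def loadTable(name: String): AnyRef =
    v1Delegate.loadTable(name)
}
```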

## How was this patch tested?

* Adds a new test suite for the v2 session catalog that validates the TableCatalog API
* Adds test cases in PlanResolutionSuite to validate the v2 session catalog is used
* Adds test suite for LookupCatalog with a default catalog

Closes #24768 from rdblue/SPARK-27919-add-v2-session-catalog.

Authored-by: Ryan Blue <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
… yyyy and yyyy-[m]m formats

## What changes were proposed in this pull request?

Fix `stringToDate()` for the formats `yyyy` and `yyyy-[m]m`, which assumed there are no additional characters after the last component (`yyyy` or `[m]m`). In this PR, I propose to check that the entire input was consumed for these formats.

After the fix, the input `1999 08 01` is invalid because it matches the pattern `yyyy` but the string contains the additional characters ` 08 01`.

From Spark 1.6.3 through 2.4.3, the behavior has been the same:
```
spark-sql> SELECT CAST('1999 08 01' AS DATE);
1999-01-01
```

This PR makes it return NULL like Hive.
```
spark-sql> SELECT CAST('1999 08 01' AS DATE);
NULL
```
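
A sketch of the "entire input must be consumed" idea using an anchored regex; this is an illustration only, not the actual `stringToDate()` code:

```scala
// Anchored pattern that accepts "yyyy", "yyyy-[m]m", and "yyyy-[m]m-[d]d",
// and rejects any input with trailing characters.
val DatePattern = """^(\d{4})(?:-(\d{1,2})(?:-(\d{1,2}))?)?$""".r

def parses(s: String): Boolean = DatePattern.pattern.matcher(s).matches()

assert(parses("1999"))        // ok: pattern `yyyy`
assert(parses("1999-8"))      // ok: pattern `yyyy-[m]m`
assert(!parses("1999 08 01")) // rejected: extra characters after `yyyy`
```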

## How was this patch tested?

Added new checks to `DateTimeUtilsSuite` for the `1999 08 01` and `1999 08` inputs.

Closes #25097 from MaxGekk/spark-28015-invalid-date-format.

Authored-by: Maxim Gekk <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
…ql' into UDF test base

## What changes were proposed in this pull request?

This PR adds some tests converted from `pgSQL/aggregates_part1.sql` to test UDFs. Please see the contribution guide of this umbrella ticket - [SPARK-27921](https://issues.apache.org/jira/browse/SPARK-27921).

This PR also contains two minor fixes:

1. Change the name of the Scala UDF from `UDF:name(...)` to `name(...)` to be consistent with Python's.

2. Fix the Scala UDF in `IntegratedUDFTestUtils.scala` to handle `null` in strings (a sketch of the shape of the fix follows below).
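
A minimal sketch of what handling `null` in strings means for a Scala test UDF; this is illustrative, not the actual `IntegratedUDFTestUtils` code:

```scala
import org.apache.spark.sql.functions.udf

// A string-returning test UDF that passes null through instead of
// failing when the input is null.
val echoStringUdf = udf { (s: String) => if (s == null) null else s }
```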

<details><summary>Diff comparing to 'pgSQL/aggregates_part1.sql'</summary>
<p>

```diff
diff --git a/sql/core/src/test/resources/sql-tests/results/pgSQL/aggregates_part1.sql.out b/sql/core/src/test/resources/sql-tests/results/udf/pgSQL/udf-aggregates_part1.sql.out
index 51ca1d5..124fdd6416e 100644
--- a/sql/core/src/test/resources/sql-tests/results/pgSQL/aggregates_part1.sql.out
+++ b/sql/core/src/test/resources/sql-tests/results/udf/pgSQL/udf-aggregates_part1.sql.out
 -3,7 +3,7

 -- !query 0
-SELECT avg(four) AS avg_1 FROM onek
+SELECT avg(udf(four)) AS avg_1 FROM onek
 -- !query 0 schema
 struct<avg_1:double>
 -- !query 0 output
 -11,15 +11,15  struct<avg_1:double>

 -- !query 1
-SELECT avg(a) AS avg_32 FROM aggtest WHERE a < 100
+SELECT udf(avg(a)) AS avg_32 FROM aggtest WHERE a < 100
 -- !query 1 schema
-struct<avg_32:double>
+struct<avg_32:string>
 -- !query 1 output
 32.666666666666664

 -- !query 2
-select CAST(avg(b) AS Decimal(10,3)) AS avg_107_943 FROM aggtest
+select CAST(avg(udf(b)) AS Decimal(10,3)) AS avg_107_943 FROM aggtest
 -- !query 2 schema
 struct<avg_107_943:decimal(10,3)>
 -- !query 2 output
 -27,285 +27,286  struct<avg_107_943:decimal(10,3)>

 -- !query 3
-SELECT sum(four) AS sum_1500 FROM onek
+SELECT sum(udf(four)) AS sum_1500 FROM onek
 -- !query 3 schema
-struct<sum_1500:bigint>
+struct<sum_1500:double>
 -- !query 3 output
-1500
+1500.0

 -- !query 4
-SELECT sum(a) AS sum_198 FROM aggtest
+SELECT udf(sum(a)) AS sum_198 FROM aggtest
 -- !query 4 schema
-struct<sum_198:bigint>
+struct<sum_198:string>
 -- !query 4 output
 198

 -- !query 5
-SELECT sum(b) AS avg_431_773 FROM aggtest
+SELECT udf(udf(sum(b))) AS avg_431_773 FROM aggtest
 -- !query 5 schema
-struct<avg_431_773:double>
+struct<avg_431_773:string>
 -- !query 5 output
 431.77260909229517

 -- !query 6
-SELECT max(four) AS max_3 FROM onek
+SELECT udf(max(four)) AS max_3 FROM onek
 -- !query 6 schema
-struct<max_3:int>
+struct<max_3:string>
 -- !query 6 output
 3

 -- !query 7
-SELECT max(a) AS max_100 FROM aggtest
+SELECT max(udf(a)) AS max_100 FROM aggtest
 -- !query 7 schema
-struct<max_100:int>
+struct<max_100:string>
 -- !query 7 output
-100
+56

 -- !query 8
-SELECT max(aggtest.b) AS max_324_78 FROM aggtest
+SELECT CAST(udf(udf(max(aggtest.b))) AS int) AS max_324_78 FROM aggtest
 -- !query 8 schema
-struct<max_324_78:float>
+struct<max_324_78:int>
 -- !query 8 output
-324.78
+324

 -- !query 9
-SELECT stddev_pop(b) FROM aggtest
+SELECT CAST(stddev_pop(udf(b)) AS int) FROM aggtest
 -- !query 9 schema
-struct<stddev_pop(CAST(b AS DOUBLE)):double>
+struct<CAST(stddev_pop(CAST(udf(b) AS DOUBLE)) AS INT):int>
 -- !query 9 output
-131.10703231895047
+131

 -- !query 10
-SELECT stddev_samp(b) FROM aggtest
+SELECT udf(stddev_samp(b)) FROM aggtest
 -- !query 10 schema
-struct<stddev_samp(CAST(b AS DOUBLE)):double>
+struct<udf(stddev_samp(cast(b as double))):string>
 -- !query 10 output
 151.38936080399804

 -- !query 11
-SELECT var_pop(b) FROM aggtest
+SELECT CAST(var_pop(udf(b)) as int) FROM aggtest
 -- !query 11 schema
-struct<var_pop(CAST(b AS DOUBLE)):double>
+struct<CAST(var_pop(CAST(udf(b) AS DOUBLE)) AS INT):int>
 -- !query 11 output
-17189.053923482323
+17189

 -- !query 12
-SELECT var_samp(b) FROM aggtest
+SELECT udf(var_samp(b)) FROM aggtest
 -- !query 12 schema
-struct<var_samp(CAST(b AS DOUBLE)):double>
+struct<udf(var_samp(cast(b as double))):string>
 -- !query 12 output
 22918.738564643096

 -- !query 13
-SELECT stddev_pop(CAST(b AS Decimal(38,0))) FROM aggtest
+SELECT udf(stddev_pop(CAST(b AS Decimal(38,0)))) FROM aggtest
 -- !query 13 schema
-struct<stddev_pop(CAST(CAST(b AS DECIMAL(38,0)) AS DOUBLE)):double>
+struct<udf(stddev_pop(cast(cast(b as decimal(38,0)) as double))):string>
 -- !query 13 output
 131.18117242958306

 -- !query 14
-SELECT stddev_samp(CAST(b AS Decimal(38,0))) FROM aggtest
+SELECT stddev_samp(CAST(udf(b) AS Decimal(38,0))) FROM aggtest
 -- !query 14 schema
-struct<stddev_samp(CAST(CAST(b AS DECIMAL(38,0)) AS DOUBLE)):double>
+struct<stddev_samp(CAST(CAST(udf(b) AS DECIMAL(38,0)) AS DOUBLE)):double>
 -- !query 14 output
 151.47497042966097

 -- !query 15
-SELECT var_pop(CAST(b AS Decimal(38,0))) FROM aggtest
+SELECT udf(var_pop(CAST(b AS Decimal(38,0)))) FROM aggtest
 -- !query 15 schema
-struct<var_pop(CAST(CAST(b AS DECIMAL(38,0)) AS DOUBLE)):double>
+struct<udf(var_pop(cast(cast(b as decimal(38,0)) as double))):string>
 -- !query 15 output
 17208.5

 -- !query 16
-SELECT var_samp(CAST(b AS Decimal(38,0))) FROM aggtest
+SELECT var_samp(udf(CAST(b AS Decimal(38,0)))) FROM aggtest
 -- !query 16 schema
-struct<var_samp(CAST(CAST(b AS DECIMAL(38,0)) AS DOUBLE)):double>
+struct<var_samp(CAST(udf(cast(b as decimal(38,0))) AS DOUBLE)):double>
 -- !query 16 output
 22944.666666666668

 -- !query 17
-SELECT var_pop(1.0), var_samp(2.0)
+SELECT udf(var_pop(1.0)), var_samp(udf(2.0))
 -- !query 17 schema
-struct<var_pop(CAST(1.0 AS DOUBLE)):double,var_samp(CAST(2.0 AS DOUBLE)):double>
+struct<udf(var_pop(cast(1.0 as double))):string,var_samp(CAST(udf(2.0) AS DOUBLE)):double>
 -- !query 17 output
 0.0    NaN

 -- !query 18
-SELECT stddev_pop(CAST(3.0 AS Decimal(38,0))), stddev_samp(CAST(4.0 AS Decimal(38,0)))
+SELECT stddev_pop(udf(CAST(3.0 AS Decimal(38,0)))), stddev_samp(CAST(udf(4.0) AS Decimal(38,0)))
 -- !query 18 schema
-struct<stddev_pop(CAST(CAST(3.0 AS DECIMAL(38,0)) AS DOUBLE)):double,stddev_samp(CAST(CAST(4.0 AS DECIMAL(38,0)) AS DOUBLE)):double>
+struct<stddev_pop(CAST(udf(cast(3.0 as decimal(38,0))) AS DOUBLE)):double,stddev_samp(CAST(CAST(udf(4.0) AS DECIMAL(38,0)) AS DOUBLE)):double>
 -- !query 18 output
 0.0    NaN

 -- !query 19
-select sum(CAST(null AS int)) from range(1,4)
+select sum(udf(CAST(null AS int))) from range(1,4)
 -- !query 19 schema
-struct<sum(CAST(NULL AS INT)):bigint>
+struct<sum(CAST(udf(cast(null as int)) AS DOUBLE)):double>
 -- !query 19 output
 NULL

 -- !query 20
-select sum(CAST(null AS long)) from range(1,4)
+select sum(udf(CAST(null AS long))) from range(1,4)
 -- !query 20 schema
-struct<sum(CAST(NULL AS BIGINT)):bigint>
+struct<sum(CAST(udf(cast(null as bigint)) AS DOUBLE)):double>
 -- !query 20 output
 NULL

 -- !query 21
-select sum(CAST(null AS Decimal(38,0))) from range(1,4)
+select sum(udf(CAST(null AS Decimal(38,0)))) from range(1,4)
 -- !query 21 schema
-struct<sum(CAST(NULL AS DECIMAL(38,0))):decimal(38,0)>
+struct<sum(CAST(udf(cast(null as decimal(38,0))) AS DOUBLE)):double>
 -- !query 21 output
 NULL

 -- !query 22
-select sum(CAST(null AS DOUBLE)) from range(1,4)
+select sum(udf(CAST(null AS DOUBLE))) from range(1,4)
 -- !query 22 schema
-struct<sum(CAST(NULL AS DOUBLE)):double>
+struct<sum(CAST(udf(cast(null as double)) AS DOUBLE)):double>
 -- !query 22 output
 NULL

 -- !query 23
-select avg(CAST(null AS int)) from range(1,4)
+select avg(udf(CAST(null AS int))) from range(1,4)
 -- !query 23 schema
-struct<avg(CAST(NULL AS INT)):double>
+struct<avg(CAST(udf(cast(null as int)) AS DOUBLE)):double>
 -- !query 23 output
 NULL

 -- !query 24
-select avg(CAST(null AS long)) from range(1,4)
+select avg(udf(CAST(null AS long))) from range(1,4)
 -- !query 24 schema
-struct<avg(CAST(NULL AS BIGINT)):double>
+struct<avg(CAST(udf(cast(null as bigint)) AS DOUBLE)):double>
 -- !query 24 output
 NULL

 -- !query 25
-select avg(CAST(null AS Decimal(38,0))) from range(1,4)
+select avg(udf(CAST(null AS Decimal(38,0)))) from range(1,4)
 -- !query 25 schema
-struct<avg(CAST(NULL AS DECIMAL(38,0))):decimal(38,4)>
+struct<avg(CAST(udf(cast(null as decimal(38,0))) AS DOUBLE)):double>
 -- !query 25 output
 NULL

 -- !query 26
-select avg(CAST(null AS DOUBLE)) from range(1,4)
+select avg(udf(CAST(null AS DOUBLE))) from range(1,4)
 -- !query 26 schema
-struct<avg(CAST(NULL AS DOUBLE)):double>
+struct<avg(CAST(udf(cast(null as double)) AS DOUBLE)):double>
 -- !query 26 output
 NULL

 -- !query 27
-select sum(CAST('NaN' AS DOUBLE)) from range(1,4)
+select sum(CAST(udf('NaN') AS DOUBLE)) from range(1,4)
 -- !query 27 schema
-struct<sum(CAST(NaN AS DOUBLE)):double>
+struct<sum(CAST(udf(NaN) AS DOUBLE)):double>
 -- !query 27 output
 NaN

 -- !query 28
-select avg(CAST('NaN' AS DOUBLE)) from range(1,4)
+select avg(CAST(udf('NaN') AS DOUBLE)) from range(1,4)
 -- !query 28 schema
-struct<avg(CAST(NaN AS DOUBLE)):double>
+struct<avg(CAST(udf(NaN) AS DOUBLE)):double>
 -- !query 28 output
 NaN

 -- !query 29
 SELECT avg(CAST(x AS DOUBLE)), var_pop(CAST(x AS DOUBLE))
-FROM (VALUES (CAST('1' AS DOUBLE)), (CAST('Infinity' AS DOUBLE))) v(x)
+FROM (VALUES (CAST(udf('1') AS DOUBLE)), (CAST(udf('Infinity') AS DOUBLE))) v(x)
 -- !query 29 schema
-struct<avg(CAST(x AS DOUBLE)):double,var_pop(CAST(x AS DOUBLE)):double>
+struct<>
 -- !query 29 output
-Infinity       NaN
+org.apache.spark.sql.AnalysisException
+cannot evaluate expression CAST(udf(1) AS DOUBLE) in inline table definition; line 2 pos 14

 -- !query 30
-SELECT avg(CAST(x AS DOUBLE)), var_pop(CAST(x AS DOUBLE))
+SELECT avg(CAST(udf(x) AS DOUBLE)), var_pop(CAST(udf(x) AS DOUBLE))
 FROM (VALUES ('Infinity'), ('1')) v(x)
 -- !query 30 schema
-struct<avg(CAST(x AS DOUBLE)):double,var_pop(CAST(x AS DOUBLE)):double>
+struct<avg(CAST(udf(x) AS DOUBLE)):double,var_pop(CAST(udf(x) AS DOUBLE)):double>
 -- !query 30 output
 Infinity       NaN

 -- !query 31
-SELECT avg(CAST(x AS DOUBLE)), var_pop(CAST(x AS DOUBLE))
+SELECT avg(CAST(udf(x) AS DOUBLE)), var_pop(CAST(udf(x) AS DOUBLE))
 FROM (VALUES ('Infinity'), ('Infinity')) v(x)
 -- !query 31 schema
-struct<avg(CAST(x AS DOUBLE)):double,var_pop(CAST(x AS DOUBLE)):double>
+struct<avg(CAST(udf(x) AS DOUBLE)):double,var_pop(CAST(udf(x) AS DOUBLE)):double>
 -- !query 31 output
 Infinity       NaN

 -- !query 32
-SELECT avg(CAST(x AS DOUBLE)), var_pop(CAST(x AS DOUBLE))
+SELECT avg(CAST(udf(x) AS DOUBLE)), var_pop(CAST(udf(x) AS DOUBLE))
 FROM (VALUES ('-Infinity'), ('Infinity')) v(x)
 -- !query 32 schema
-struct<avg(CAST(x AS DOUBLE)):double,var_pop(CAST(x AS DOUBLE)):double>
+struct<avg(CAST(udf(x) AS DOUBLE)):double,var_pop(CAST(udf(x) AS DOUBLE)):double>
 -- !query 32 output
 NaN    NaN

 -- !query 33
-SELECT avg(CAST(x AS DOUBLE)), var_pop(CAST(x AS DOUBLE))
+SELECT avg(udf(CAST(x AS DOUBLE))), udf(var_pop(CAST(x AS DOUBLE)))
 FROM (VALUES (100000003), (100000004), (100000006), (100000007)) v(x)
 -- !query 33 schema
-struct<avg(CAST(x AS DOUBLE)):double,var_pop(CAST(x AS DOUBLE)):double>
+struct<avg(CAST(udf(cast(x as double)) AS DOUBLE)):double,udf(var_pop(cast(x as double))):string>
 -- !query 33 output
 1.00000005E8   2.5

 -- !query 34
-SELECT avg(CAST(x AS DOUBLE)), var_pop(CAST(x AS DOUBLE))
+SELECT avg(udf(CAST(x AS DOUBLE))), udf(var_pop(CAST(x AS DOUBLE)))
 FROM (VALUES (7000000000005), (7000000000007)) v(x)
 -- !query 34 schema
-struct<avg(CAST(x AS DOUBLE)):double,var_pop(CAST(x AS DOUBLE)):double>
+struct<avg(CAST(udf(cast(x as double)) AS DOUBLE)):double,udf(var_pop(cast(x as double))):string>
 -- !query 34 output
 7.000000000006E12      1.0

 -- !query 35
-SELECT covar_pop(b, a), covar_samp(b, a) FROM aggtest
+SELECT CAST(udf(covar_pop(b, udf(a))) AS int), CAST(covar_samp(udf(b), a) as int) FROM aggtest
 -- !query 35 schema
-struct<covar_pop(CAST(b AS DOUBLE), CAST(a AS DOUBLE)):double,covar_samp(CAST(b AS DOUBLE), CAST(a AS DOUBLE)):double>
+struct<CAST(udf(covar_pop(cast(b as double), cast(udf(a) as double))) AS INT):int,CAST(covar_samp(CAST(udf(b) AS DOUBLE), CAST(a AS DOUBLE)) AS INT):int>
 -- !query 35 output
-653.6289553875104      871.5052738500139
+653    871

 -- !query 36
-SELECT corr(b, a) FROM aggtest
+SELECT corr(b, udf(a)) FROM aggtest
 -- !query 36 schema
-struct<corr(CAST(b AS DOUBLE), CAST(a AS DOUBLE)):double>
+struct<corr(CAST(b AS DOUBLE), CAST(udf(a) AS DOUBLE)):double>
 -- !query 36 output
 0.1396345165178734

 -- !query 37
-SELECT count(four) AS cnt_1000 FROM onek
+SELECT count(udf(four)) AS cnt_1000 FROM onek
 -- !query 37 schema
 struct<cnt_1000:bigint>
 -- !query 37 output
 -313,36 +314,36  struct<cnt_1000:bigint>

 -- !query 38
-SELECT count(DISTINCT four) AS cnt_4 FROM onek
+SELECT udf(count(DISTINCT four)) AS cnt_4 FROM onek
 -- !query 38 schema
-struct<cnt_4:bigint>
+struct<cnt_4:string>
 -- !query 38 output
 4

 -- !query 39
-select ten, count(*), sum(four) from onek
+select ten, udf(count(*)), sum(udf(four)) from onek
 group by ten order by ten
 -- !query 39 schema
-struct<ten:int,count(1):bigint,sum(four):bigint>
+struct<ten:int,udf(count(1)):string,sum(CAST(udf(four) AS DOUBLE)):double>
 -- !query 39 output
-0      100     100
-1      100     200
-2      100     100
-3      100     200
-4      100     100
-5      100     200
-6      100     100
-7      100     200
-8      100     100
-9      100     200
+0      100     100.0
+1      100     200.0
+2      100     100.0
+3      100     200.0
+4      100     100.0
+5      100     200.0
+6      100     100.0
+7      100     200.0
+8      100     100.0
+9      100     200.0

 -- !query 40
-select ten, count(four), sum(DISTINCT four) from onek
+select ten, count(udf(four)), udf(sum(DISTINCT four)) from onek
 group by ten order by ten
 -- !query 40 schema
-struct<ten:int,count(four):bigint,sum(DISTINCT four):bigint>
+struct<ten:int,count(udf(four)):bigint,udf(sum(distinct cast(four as bigint))):string>
 -- !query 40 output
 0      100     2
 1      100     4
 -357,11 +358,11  struct<ten:int,count(four):bigint,sum(DISTINCT four):bigint>

 -- !query 41
-select ten, sum(distinct four) from onek a
+select ten, udf(sum(distinct four)) from onek a
 group by ten
-having exists (select 1 from onek b where sum(distinct a.four) = b.four)
+having exists (select 1 from onek b where udf(sum(distinct a.four)) = b.four)
 -- !query 41 schema
-struct<ten:int,sum(DISTINCT four):bigint>
+struct<ten:int,udf(sum(distinct cast(four as bigint))):string>
 -- !query 41 output
 0      2
 2      2
 -374,23 +375,23  struct<ten:int,sum(DISTINCT four):bigint>
 select ten, sum(distinct four) from onek a
 group by ten
 having exists (select 1 from onek b
-               where sum(distinct a.four + b.four) = b.four)
+               where sum(distinct a.four + b.four) = udf(b.four))
 -- !query 42 schema
 struct<>
 -- !query 42 output
 org.apache.spark.sql.AnalysisException

 Aggregate/Window/Generate expressions are not valid in where clause of the query.
-Expression in where clause: [(sum(DISTINCT CAST((outer() + b.`four`) AS BIGINT)) = CAST(b.`four` AS BIGINT))]
+Expression in where clause: [(sum(DISTINCT CAST((outer() + b.`four`) AS BIGINT)) = CAST(udf(four) AS BIGINT))]
 Invalid expressions: [sum(DISTINCT CAST((outer() + b.`four`) AS BIGINT))];

 -- !query 43
 select
-  (select max((select i.unique2 from tenk1 i where i.unique1 = o.unique1)))
+  (select udf(max((select i.unique2 from tenk1 i where i.unique1 = o.unique1))))
 from tenk1 o
 -- !query 43 schema
 struct<>
 -- !query 43 output
 org.apache.spark.sql.AnalysisException
-cannot resolve '`o.unique1`' given input columns: [i.even, i.fivethous, i.four, i.hundred, i.odd, i.string4, i.stringu1, i.stringu2, i.ten, i.tenthous, i.thousand, i.twenty, i.two, i.twothousand, i.unique1, i.unique2]; line 2 pos 63
+cannot resolve '`o.unique1`' given input columns: [i.even, i.fivethous, i.four, i.hundred, i.odd, i.string4, i.stringu1, i.stringu2, i.ten, i.tenthous, i.thousand, i.twenty, i.two, i.twothousand, i.unique1, i.unique2]; line 2 pos 67
```

</p>
</details>

Note that, currently, `IntegratedUDFTestUtils.scala`'s UDFs only return strings. There are some differences between those UDFs (Scala, Pandas and Python):

  - Python's string representation of floats can make the tests flaky. (See https://docs.python.org/3/tutorial/floatingpoint.html). To work around this, I had to `CAST(... as int)`.
  - There are string representation differences: `Inf`/`-Inf` vs `Infinity`/`-Infinity`, and `nan` vs `NaN`.
  - Maybe we should add other type versions of UDFs if this makes adding tests difficult.

Note that one issue was found - [SPARK-28291](https://issues.apache.org/jira/browse/SPARK-28291). The test is commented out for now.

## How was this patch tested?

Tested as guided in [SPARK-27921](https://issues.apache.org/jira/browse/SPARK-27921).

Closes #25069 from HyukjinKwon/SPARK-28270.

Authored-by: HyukjinKwon <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
…name

## What changes were proposed in this pull request?
The new adaptive execution framework introduced the configuration `spark.sql.runtime.reoptimization.enabled`. We now rename it back to `spark.sql.adaptive.enabled` as the umbrella configuration for adaptive execution.
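
Usage changes only in the key name; for example, assuming an active `SparkSession` named `spark`:

```scala
// Enabling adaptive execution under the restored umbrella key.
spark.conf.set("spark.sql.adaptive.enabled", "true")
// Previously (with the initial framework patch):
// spark.conf.set("spark.sql.runtime.reoptimization.enabled", "true")
```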

## How was this patch tested?
Existing tests.

Closes #25102 from carsonwang/renameAE.

Authored-by: Carson Wang <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
…tgresSQL SQL tests

## What changes were proposed in this pull request?

This PR proposes to replace `REL_12_BETA1` with `REL_12_BETA2`, which is the latest.

## How was this patch tested?

Manually checked each link and checked via `git grep -r REL_12_BETA1` as well.

Closes #25105 from HyukjinKwon/SPARK-28342.

Authored-by: HyukjinKwon <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
## What changes were proposed in this pull request?
The optimizer rule `NormalizeFloatingNumbers` is not idempotent: it generates multiple `NormalizeNaNAndZero` and `ArrayTransform` expression nodes across multiple runs. This patch fixes the non-idempotence by adding a marking tag above normalized expressions. It also adds missing UTs for `NormalizeFloatingNumbers`.
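
A toy illustration of the non-idempotence and the marker fix, deliberately using a tiny hypothetical expression tree rather than Catalyst's classes:

```scala
// Toy model of the problem, not Catalyst code: a rule that wraps
// expressions in a Normalize marker node.
sealed trait Expr
case class Value(name: String) extends Expr
case class Normalize(child: Expr) extends Expr

// Non-idempotent: every run adds another wrapper.
def rewriteNaive(e: Expr): Expr = Normalize(e)

// Idempotent: the Normalize node doubles as the "marking tag",
// so a second run leaves an already-normalized tree unchanged.
def rewriteIdempotent(e: Expr): Expr = e match {
  case n: Normalize => n
  case other        => Normalize(other)
}

assert(rewriteNaive(rewriteNaive(Value("x"))) == Normalize(Normalize(Value("x"))))
assert(rewriteIdempotent(rewriteIdempotent(Value("x"))) == Normalize(Value("x")))
```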

## How was this patch tested?
New UTs.

Closes #25080 from yeshengm/spark-28306.

Authored-by: Yesheng Ma <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
…n udf-aggregates_part1.sql to avoid Python float limitation

## What changes were proposed in this pull request?

The tests added in #25069 seem flaky in some environments. See #25069 (comment)

Python's string representation of floats can make the tests flaky. See https://docs.python.org/3/tutorial/floatingpoint.html.

I think it's just better to explicitly cast everywhere a UDF returns a float (or a double), to stay safe. (Note that we're not targeting the Python <> Scala value conversions; there are inevitable differences between Python and Scala, so UDFs in other languages cannot guarantee the same results across Python and Scala.)

This PR proposes to cast cases to long, integer and decimal explicitly to make the test cases robust.

<details><summary>Diff comparing to 'pgSQL/aggregates_part1.sql'</summary>
<p>

```diff
diff --git a/sql/core/src/test/resources/sql-tests/results/pgSQL/aggregates_part1.sql.out b/sql/core/src/test/resources/sql-tests/results/udf/pgSQL/udf-aggregates_part1.sql.out
index 51ca1d5..734634b7388 100644
--- a/sql/core/src/test/resources/sql-tests/results/pgSQL/aggregates_part1.sql.out
+++ b/sql/core/src/test/resources/sql-tests/results/udf/pgSQL/udf-aggregates_part1.sql.out
 -3,23 +3,23

 -- !query 0
-SELECT avg(four) AS avg_1 FROM onek
+SELECT CAST(avg(udf(four)) AS decimal(10,3)) AS avg_1 FROM onek
 -- !query 0 schema
-struct<avg_1:double>
+struct<avg_1:decimal(10,3)>
 -- !query 0 output
 1.5

 -- !query 1
-SELECT avg(a) AS avg_32 FROM aggtest WHERE a < 100
+SELECT CAST(udf(avg(a)) AS decimal(10,3)) AS avg_32 FROM aggtest WHERE a < 100
 -- !query 1 schema
-struct<avg_32:double>
+struct<avg_32:decimal(10,3)>
 -- !query 1 output
-32.666666666666664
+32.667

 -- !query 2
-select CAST(avg(b) AS Decimal(10,3)) AS avg_107_943 FROM aggtest
+select CAST(avg(udf(b)) AS Decimal(10,3)) AS avg_107_943 FROM aggtest
 -- !query 2 schema
 struct<avg_107_943:decimal(10,3)>
 -- !query 2 output
 -27,39 +27,39  struct<avg_107_943:decimal(10,3)>

 -- !query 3
-SELECT sum(four) AS sum_1500 FROM onek
+SELECT CAST(sum(udf(four)) AS int) AS sum_1500 FROM onek
 -- !query 3 schema
-struct<sum_1500:bigint>
+struct<sum_1500:int>
 -- !query 3 output
 1500

 -- !query 4
-SELECT sum(a) AS sum_198 FROM aggtest
+SELECT udf(sum(a)) AS sum_198 FROM aggtest
 -- !query 4 schema
-struct<sum_198:bigint>
+struct<sum_198:string>
 -- !query 4 output
 198

 -- !query 5
-SELECT sum(b) AS avg_431_773 FROM aggtest
+SELECT CAST(udf(udf(sum(b))) AS decimal(10,3)) AS avg_431_773 FROM aggtest
 -- !query 5 schema
-struct<avg_431_773:double>
+struct<avg_431_773:decimal(10,3)>
 -- !query 5 output
-431.77260909229517
+431.773

 -- !query 6
-SELECT max(four) AS max_3 FROM onek
+SELECT udf(max(four)) AS max_3 FROM onek
 -- !query 6 schema
-struct<max_3:int>
+struct<max_3:string>
 -- !query 6 output
 3

 -- !query 7
-SELECT max(a) AS max_100 FROM aggtest
+SELECT max(CAST(udf(a) AS int)) AS max_100 FROM aggtest
 -- !query 7 schema
 struct<max_100:int>
 -- !query 7 output
 -67,245 +67,246  struct<max_100:int>

 -- !query 8
-SELECT max(aggtest.b) AS max_324_78 FROM aggtest
+SELECT CAST(udf(udf(max(aggtest.b))) AS decimal(10,3)) AS max_324_78 FROM aggtest
 -- !query 8 schema
-struct<max_324_78:float>
+struct<max_324_78:decimal(10,3)>
 -- !query 8 output
 324.78

 -- !query 9
-SELECT stddev_pop(b) FROM aggtest
+SELECT CAST(stddev_pop(udf(b)) AS decimal(10,3)) FROM aggtest
 -- !query 9 schema
-struct<stddev_pop(CAST(b AS DOUBLE)):double>
+struct<CAST(stddev_pop(CAST(udf(b) AS DOUBLE)) AS DECIMAL(10,3)):decimal(10,3)>
 -- !query 9 output
-131.10703231895047
+131.107

 -- !query 10
-SELECT stddev_samp(b) FROM aggtest
+SELECT CAST(udf(stddev_samp(b)) AS decimal(10,3)) FROM aggtest
 -- !query 10 schema
-struct<stddev_samp(CAST(b AS DOUBLE)):double>
+struct<CAST(udf(stddev_samp(cast(b as double))) AS DECIMAL(10,3)):decimal(10,3)>
 -- !query 10 output
-151.38936080399804
+151.389

 -- !query 11
-SELECT var_pop(b) FROM aggtest
+SELECT CAST(var_pop(udf(b)) AS decimal(10,3)) FROM aggtest
 -- !query 11 schema
-struct<var_pop(CAST(b AS DOUBLE)):double>
+struct<CAST(var_pop(CAST(udf(b) AS DOUBLE)) AS DECIMAL(10,3)):decimal(10,3)>
 -- !query 11 output
-17189.053923482323
+17189.054

 -- !query 12
-SELECT var_samp(b) FROM aggtest
+SELECT CAST(udf(var_samp(b)) AS decimal(10,3)) FROM aggtest
 -- !query 12 schema
-struct<var_samp(CAST(b AS DOUBLE)):double>
+struct<CAST(udf(var_samp(cast(b as double))) AS DECIMAL(10,3)):decimal(10,3)>
 -- !query 12 output
-22918.738564643096
+22918.739

 -- !query 13
-SELECT stddev_pop(CAST(b AS Decimal(38,0))) FROM aggtest
+SELECT CAST(udf(stddev_pop(CAST(b AS Decimal(38,0)))) AS decimal(10,3)) FROM aggtest
 -- !query 13 schema
-struct<stddev_pop(CAST(CAST(b AS DECIMAL(38,0)) AS DOUBLE)):double>
+struct<CAST(udf(stddev_pop(cast(cast(b as decimal(38,0)) as double))) AS DECIMAL(10,3)):decimal(10,3)>
 -- !query 13 output
-131.18117242958306
+131.181

 -- !query 14
-SELECT stddev_samp(CAST(b AS Decimal(38,0))) FROM aggtest
+SELECT CAST(stddev_samp(CAST(udf(b) AS Decimal(38,0))) AS decimal(10,3)) FROM aggtest
 -- !query 14 schema
-struct<stddev_samp(CAST(CAST(b AS DECIMAL(38,0)) AS DOUBLE)):double>
+struct<CAST(stddev_samp(CAST(CAST(udf(b) AS DECIMAL(38,0)) AS DOUBLE)) AS DECIMAL(10,3)):decimal(10,3)>
 -- !query 14 output
-151.47497042966097
+151.475

 -- !query 15
-SELECT var_pop(CAST(b AS Decimal(38,0))) FROM aggtest
+SELECT CAST(udf(var_pop(CAST(b AS Decimal(38,0)))) AS decimal(10,3)) FROM aggtest
 -- !query 15 schema
-struct<var_pop(CAST(CAST(b AS DECIMAL(38,0)) AS DOUBLE)):double>
+struct<CAST(udf(var_pop(cast(cast(b as decimal(38,0)) as double))) AS DECIMAL(10,3)):decimal(10,3)>
 -- !query 15 output
 17208.5

 -- !query 16
-SELECT var_samp(CAST(b AS Decimal(38,0))) FROM aggtest
+SELECT CAST(var_samp(udf(CAST(b AS Decimal(38,0)))) AS decimal(10,3)) FROM aggtest
 -- !query 16 schema
-struct<var_samp(CAST(CAST(b AS DECIMAL(38,0)) AS DOUBLE)):double>
+struct<CAST(var_samp(CAST(udf(cast(b as decimal(38,0))) AS DOUBLE)) AS DECIMAL(10,3)):decimal(10,3)>
 -- !query 16 output
-22944.666666666668
+22944.667

 -- !query 17
-SELECT var_pop(1.0), var_samp(2.0)
+SELECT CAST(udf(var_pop(1.0)) AS int), var_samp(udf(2.0))
 -- !query 17 schema
-struct<var_pop(CAST(1.0 AS DOUBLE)):double,var_samp(CAST(2.0 AS DOUBLE)):double>
+struct<CAST(udf(var_pop(cast(1.0 as double))) AS INT):int,var_samp(CAST(udf(2.0) AS DOUBLE)):double>
 -- !query 17 output
-0.0    NaN
+0      NaN

 -- !query 18
-SELECT stddev_pop(CAST(3.0 AS Decimal(38,0))), stddev_samp(CAST(4.0 AS Decimal(38,0)))
+SELECT CAST(stddev_pop(udf(CAST(3.0 AS Decimal(38,0)))) AS int), stddev_samp(CAST(udf(4.0) AS Decimal(38,0)))
 -- !query 18 schema
-struct<stddev_pop(CAST(CAST(3.0 AS DECIMAL(38,0)) AS DOUBLE)):double,stddev_samp(CAST(CAST(4.0 AS DECIMAL(38,0)) AS DOUBLE)):double>
+struct<CAST(stddev_pop(CAST(udf(cast(3.0 as decimal(38,0))) AS DOUBLE)) AS INT):int,stddev_samp(CAST(CAST(udf(4.0) AS DECIMAL(38,0)) AS DOUBLE)):double>
 -- !query 18 output
-0.0    NaN
+0      NaN

 -- !query 19
-select sum(CAST(null AS int)) from range(1,4)
+select sum(udf(CAST(null AS int))) from range(1,4)
 -- !query 19 schema
-struct<sum(CAST(NULL AS INT)):bigint>
+struct<sum(CAST(udf(cast(null as int)) AS DOUBLE)):double>
 -- !query 19 output
 NULL

 -- !query 20
-select sum(CAST(null AS long)) from range(1,4)
+select sum(udf(CAST(null AS long))) from range(1,4)
 -- !query 20 schema
-struct<sum(CAST(NULL AS BIGINT)):bigint>
+struct<sum(CAST(udf(cast(null as bigint)) AS DOUBLE)):double>
 -- !query 20 output
 NULL

 -- !query 21
-select sum(CAST(null AS Decimal(38,0))) from range(1,4)
+select sum(udf(CAST(null AS Decimal(38,0)))) from range(1,4)
 -- !query 21 schema
-struct<sum(CAST(NULL AS DECIMAL(38,0))):decimal(38,0)>
+struct<sum(CAST(udf(cast(null as decimal(38,0))) AS DOUBLE)):double>
 -- !query 21 output
 NULL

 -- !query 22
-select sum(CAST(null AS DOUBLE)) from range(1,4)
+select sum(udf(CAST(null AS DOUBLE))) from range(1,4)
 -- !query 22 schema
-struct<sum(CAST(NULL AS DOUBLE)):double>
+struct<sum(CAST(udf(cast(null as double)) AS DOUBLE)):double>
 -- !query 22 output
 NULL

 -- !query 23
-select avg(CAST(null AS int)) from range(1,4)
+select avg(udf(CAST(null AS int))) from range(1,4)
 -- !query 23 schema
-struct<avg(CAST(NULL AS INT)):double>
+struct<avg(CAST(udf(cast(null as int)) AS DOUBLE)):double>
 -- !query 23 output
 NULL

 -- !query 24
-select avg(CAST(null AS long)) from range(1,4)
+select avg(udf(CAST(null AS long))) from range(1,4)
 -- !query 24 schema
-struct<avg(CAST(NULL AS BIGINT)):double>
+struct<avg(CAST(udf(cast(null as bigint)) AS DOUBLE)):double>
 -- !query 24 output
 NULL

 -- !query 25
-select avg(CAST(null AS Decimal(38,0))) from range(1,4)
+select avg(udf(CAST(null AS Decimal(38,0)))) from range(1,4)
 -- !query 25 schema
-struct<avg(CAST(NULL AS DECIMAL(38,0))):decimal(38,4)>
+struct<avg(CAST(udf(cast(null as decimal(38,0))) AS DOUBLE)):double>
 -- !query 25 output
 NULL

 -- !query 26
-select avg(CAST(null AS DOUBLE)) from range(1,4)
+select avg(udf(CAST(null AS DOUBLE))) from range(1,4)
 -- !query 26 schema
-struct<avg(CAST(NULL AS DOUBLE)):double>
+struct<avg(CAST(udf(cast(null as double)) AS DOUBLE)):double>
 -- !query 26 output
 NULL

 -- !query 27
-select sum(CAST('NaN' AS DOUBLE)) from range(1,4)
+select sum(CAST(udf('NaN') AS DOUBLE)) from range(1,4)
 -- !query 27 schema
-struct<sum(CAST(NaN AS DOUBLE)):double>
+struct<sum(CAST(udf(NaN) AS DOUBLE)):double>
 -- !query 27 output
 NaN

 -- !query 28
-select avg(CAST('NaN' AS DOUBLE)) from range(1,4)
+select avg(CAST(udf('NaN') AS DOUBLE)) from range(1,4)
 -- !query 28 schema
-struct<avg(CAST(NaN AS DOUBLE)):double>
+struct<avg(CAST(udf(NaN) AS DOUBLE)):double>
 -- !query 28 output
 NaN

 -- !query 30
-SELECT avg(CAST(x AS DOUBLE)), var_pop(CAST(x AS DOUBLE))
+SELECT avg(CAST(udf(x) AS DOUBLE)), var_pop(CAST(udf(x) AS DOUBLE))
 FROM (VALUES ('Infinity'), ('1')) v(x)
 -- !query 30 schema
-struct<avg(CAST(x AS DOUBLE)):double,var_pop(CAST(x AS DOUBLE)):double>
+struct<avg(CAST(udf(x) AS DOUBLE)):double,var_pop(CAST(udf(x) AS DOUBLE)):double>
 -- !query 30 output
 Infinity       NaN

 -- !query 31
-SELECT avg(CAST(x AS DOUBLE)), var_pop(CAST(x AS DOUBLE))
+SELECT avg(CAST(udf(x) AS DOUBLE)), var_pop(CAST(udf(x) AS DOUBLE))
 FROM (VALUES ('Infinity'), ('Infinity')) v(x)
 -- !query 31 schema
-struct<avg(CAST(x AS DOUBLE)):double,var_pop(CAST(x AS DOUBLE)):double>
+struct<avg(CAST(udf(x) AS DOUBLE)):double,var_pop(CAST(udf(x) AS DOUBLE)):double>
 -- !query 31 output
 Infinity       NaN

 -- !query 32
-SELECT avg(CAST(x AS DOUBLE)), var_pop(CAST(x AS DOUBLE))
+SELECT avg(CAST(udf(x) AS DOUBLE)), var_pop(CAST(udf(x) AS DOUBLE))
 FROM (VALUES ('-Infinity'), ('Infinity')) v(x)
 -- !query 32 schema
-struct<avg(CAST(x AS DOUBLE)):double,var_pop(CAST(x AS DOUBLE)):double>
+struct<avg(CAST(udf(x) AS DOUBLE)):double,var_pop(CAST(udf(x) AS DOUBLE)):double>
 -- !query 32 output
 NaN    NaN

 -- !query 33
-SELECT avg(CAST(x AS DOUBLE)), var_pop(CAST(x AS DOUBLE))
+SELECT CAST(avg(udf(CAST(x AS DOUBLE))) AS int), CAST(udf(var_pop(CAST(x AS DOUBLE))) AS decimal(10,3))
 FROM (VALUES (100000003), (100000004), (100000006), (100000007)) v(x)
 -- !query 33 schema
-struct<avg(CAST(x AS DOUBLE)):double,var_pop(CAST(x AS DOUBLE)):double>
+struct<CAST(avg(CAST(udf(cast(x as double)) AS DOUBLE)) AS INT):int,CAST(udf(var_pop(cast(x as double))) AS DECIMAL(10,3)):decimal(10,3)>
 -- !query 33 output
-1.00000005E8   2.5
+100000005      2.5

 -- !query 34
-SELECT avg(CAST(x AS DOUBLE)), var_pop(CAST(x AS DOUBLE))
+SELECT CAST(avg(udf(CAST(x AS DOUBLE))) AS long), CAST(udf(var_pop(CAST(x AS DOUBLE))) AS decimal(10,3))
 FROM (VALUES (7000000000005), (7000000000007)) v(x)
 -- !query 34 schema
-struct<avg(CAST(x AS DOUBLE)):double,var_pop(CAST(x AS DOUBLE)):double>
+struct<CAST(avg(CAST(udf(cast(x as double)) AS DOUBLE)) AS BIGINT):bigint,CAST(udf(var_pop(cast(x as double))) AS DECIMAL(10,3)):decimal(10,3)>
 -- !query 34 output
-7.000000000006E12      1.0
+7000000000006  1

 -- !query 35
-SELECT covar_pop(b, a), covar_samp(b, a) FROM aggtest
+SELECT CAST(udf(covar_pop(b, udf(a))) AS decimal(10,3)), CAST(covar_samp(udf(b), a) as decimal(10,3)) FROM aggtest
 -- !query 35 schema
-struct<covar_pop(CAST(b AS DOUBLE), CAST(a AS DOUBLE)):double,covar_samp(CAST(b AS DOUBLE), CAST(a AS DOUBLE)):double>
+struct<CAST(udf(covar_pop(cast(b as double), cast(udf(a) as double))) AS DECIMAL(10,3)):decimal(10,3),CAST(covar_samp(CAST(udf(b) AS DOUBLE), CAST(a AS DOUBLE)) AS DECIMAL(10,3)):decimal(10,3)>
 -- !query 35 output
-653.6289553875104      871.5052738500139
+653.629        871.505

 -- !query 36
-SELECT corr(b, a) FROM aggtest
+SELECT CAST(corr(b, udf(a)) AS decimal(10,3)) FROM aggtest
 -- !query 36 schema
-struct<corr(CAST(b AS DOUBLE), CAST(a AS DOUBLE)):double>
+struct<CAST(corr(CAST(b AS DOUBLE), CAST(udf(a) AS DOUBLE)) AS DECIMAL(10,3)):decimal(10,3)>
 -- !query 36 output
-0.1396345165178734
+0.14

 -- !query 37
-SELECT count(four) AS cnt_1000 FROM onek
+SELECT count(udf(four)) AS cnt_1000 FROM onek
 -- !query 37 schema
 struct<cnt_1000:bigint>
 -- !query 37 output
 -313,18 +314,18  struct<cnt_1000:bigint>

 -- !query 38
-SELECT count(DISTINCT four) AS cnt_4 FROM onek
+SELECT udf(count(DISTINCT four)) AS cnt_4 FROM onek
 -- !query 38 schema
-struct<cnt_4:bigint>
+struct<cnt_4:string>
 -- !query 38 output
 4

 -- !query 39
-select ten, count(*), sum(four) from onek
+select ten, udf(count(*)), CAST(sum(udf(four)) AS int) from onek
 group by ten order by ten
 -- !query 39 schema
-struct<ten:int,count(1):bigint,sum(four):bigint>
+struct<ten:int,udf(count(1)):string,CAST(sum(CAST(udf(four) AS DOUBLE)) AS INT):int>
 -- !query 39 output
 0      100     100
 1      100     200
 -339,10 +340,10  struct<ten:int,count(1):bigint,sum(four):bigint>

 -- !query 40
-select ten, count(four), sum(DISTINCT four) from onek
+select ten, count(udf(four)), udf(sum(DISTINCT four)) from onek
 group by ten order by ten
 -- !query 40 schema
-struct<ten:int,count(four):bigint,sum(DISTINCT four):bigint>
+struct<ten:int,count(udf(four)):bigint,udf(sum(distinct cast(four as bigint))):string>
 -- !query 40 output
 0      100     2
 1      100     4
 -357,11 +358,11  struct<ten:int,count(four):bigint,sum(DISTINCT four):bigint>

 -- !query 41
-select ten, sum(distinct four) from onek a
+select ten, udf(sum(distinct four)) from onek a
 group by ten
-having exists (select 1 from onek b where sum(distinct a.four) = b.four)
+having exists (select 1 from onek b where udf(sum(distinct a.four)) = b.four)
 -- !query 41 schema
-struct<ten:int,sum(DISTINCT four):bigint>
+struct<ten:int,udf(sum(distinct cast(four as bigint))):string>
 -- !query 41 output
 0      2
 2      2
 -374,23 +375,23  struct<ten:int,sum(DISTINCT four):bigint>
 select ten, sum(distinct four) from onek a
 group by ten
 having exists (select 1 from onek b
-               where sum(distinct a.four + b.four) = b.four)
+               where sum(distinct a.four + b.four) = udf(b.four))
 -- !query 42 schema
 struct<>
 -- !query 42 output
 org.apache.spark.sql.AnalysisException

 Aggregate/Window/Generate expressions are not valid in where clause of the query.
-Expression in where clause: [(sum(DISTINCT CAST((outer() + b.`four`) AS BIGINT)) = CAST(b.`four` AS BIGINT))]
+Expression in where clause: [(sum(DISTINCT CAST((outer() + b.`four`) AS BIGINT)) = CAST(udf(four) AS BIGINT))]
 Invalid expressions: [sum(DISTINCT CAST((outer() + b.`four`) AS BIGINT))];

 -- !query 43
 select
-  (select max((select i.unique2 from tenk1 i where i.unique1 = o.unique1)))
+  (select udf(max((select i.unique2 from tenk1 i where i.unique1 = o.unique1))))
 from tenk1 o
 -- !query 43 schema
 struct<>
 -- !query 43 output
 org.apache.spark.sql.AnalysisException
-cannot resolve '`o.unique1`' given input columns: [i.even, i.fivethous, i.four, i.hundred, i.odd, i.string4, i.stringu1, i.stringu2, i.ten, i.tenthous, i.thousand, i.twenty, i.two, i.twothousand, i.unique1, i.unique2]; line 2 pos 63
+cannot resolve '`o.unique1`' given input columns: [i.even, i.fivethous, i.four, i.hundred, i.odd, i.string4, i.stringu1, i.stringu2, i.ten, i.tenthous, i.thousand, i.twenty, i.two, i.twothousand, i.unique1, i.unique2]; line 2 pos 67
```

</p>
</details>

## How was this patch tested?

Manually tested locally.

Also, with JDK 11:

```
Using /.../jdk-11.0.3.jdk/Contents/Home as default JAVA_HOME.
Note, this will be overridden by -java-home if it is set.
[info] Loading project definition from /.../spark/project
[info] Updating {file:/.../spark/project/}spark-build...
...
[info] SQLQueryTestSuite:
...
[info] - udf/pgSQL/udf-aggregates_part1.sql - Scala UDF (17 seconds, 228 milliseconds)
[info] - udf/pgSQL/udf-aggregates_part1.sql - Regular Python UDF (36 seconds, 170 milliseconds)
[info] - udf/pgSQL/udf-aggregates_part1.sql - Scalar Pandas UDF (41 seconds, 132 milliseconds)
...
```

Closes #25110 from HyukjinKwon/SPARK-28270-1.

Authored-by: HyukjinKwon <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
…umnar

## What changes were proposed in this pull request?

This is the second part of https://issues.apache.org/jira/browse/SPARK-27396 and a follow-on to #24795

## How was this patch tested?

I did some manual tests and ran/updated the automated tests.

I did some simple performance tests on a single node to try to verify that there is no performance impact, and I was not able to measure anything beyond noise.

Closes #25008 from revans2/columnar-remove-batch-scan.

Authored-by: Robert (Bobby) Evans <[email protected]>
Signed-off-by: Thomas Graves <[email protected]>
## What changes were proposed in this pull request?

Cleaned up (removed) code duplication in `ObjectProducerExec` operators so they use the trait's methods.

## How was this patch tested?

Local build. Waiting for Jenkins.

Closes #25065 from jaceklaskowski/ObjectProducerExec-operators-cleanup.

Authored-by: Jacek Laskowski <[email protected]>
Signed-off-by: Sean Owen <[email protected]>
…le expressions

## What changes were proposed in this pull request?

Reverted the initialization of date-time constants in `DateTimeUtils` introduced by #23878. As a comment in the [Delta repo](https://github.com/delta-io/delta) states, the compiler can do additional optimizations if values can be calculated at compile time: https://github.com/delta-io/delta/blob/master/src/main/scala/org/apache/spark/sql/delta/util/DateTimeUtils.scala#L63-L75

## How was this patch tested?

This was tested by existing test suites.

Closes #25116 from MaxGekk/datetime-consts-init.

Authored-by: Maxim Gekk <[email protected]>
Signed-off-by: herman <[email protected]>
…onfigurations.

## What changes were proposed in this pull request?

At the moment Kafka delegation tokens are fetched through `AdminClient` but there is no possibility to add custom configuration parameters. In [options](https://spark.apache.org/docs/2.4.3/structured-streaming-kafka-integration.html#kafka-specific-configurations) there is already a possibility to add custom configurations.

In this PR I've added a similar possibility to `AdminClient`.
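
For context, a PySpark sketch of the existing pass-through the PR refers to, where `kafka.`-prefixed options reach the underlying Kafka client; per this PR, the token-fetching `AdminClient` gains an analogous way to receive custom configuration (the broker address, timeout value, and topic below are placeholders):

```python
df = (spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")  # placeholder address
      .option("kafka.request.timeout.ms", "30000")       # custom client config
      .option("subscribe", "topic1")
      .load())
```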

## How was this patch tested?

Existing + added unit tests.

```
cd docs/
SKIP_API=1 jekyll build
```
Manual webpage check.

Closes #24875 from gaborgsomogyi/SPARK-28055.

Authored-by: Gabor Somogyi <[email protected]>
Signed-off-by: Marcelo Vanzin <[email protected]>
## What changes were proposed in this pull request?

This PR adds two new config properties: `spark.driver.defaultJavaOptions` and `spark.executor.defaultJavaOptions`. These are intended to be set by administrators in a file of defaults for options like the JVM garbage collection algorithm. Users will still set `extraJavaOptions` properties, and both sets of JVM options are used when starting the JVM (default options are prepended to the extra options).
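
A minimal PySpark sketch of the intended split, using the executor-side properties (in practice the default would be set by an administrator in `spark-defaults.conf`; the option values are illustrative):

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         # admin-provided default, e.g. the GC algorithm
         .config("spark.executor.defaultJavaOptions", "-XX:+UseG1GC")
         # user-provided extras, appended after the defaults
         .config("spark.executor.extraJavaOptions", "-verbose:gc")
         .getOrCreate())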

## How was this patch tested?

Existing + additional unit tests.
```
cd docs/
SKIP_API=1 jekyll build
```
Manual webpage check.

Closes #24804 from gaborgsomogyi/SPARK-23472.

Authored-by: Gabor Somogyi <[email protected]>
Signed-off-by: Marcelo Vanzin <[email protected]>
## What changes were proposed in this pull request?

This PR is to port select.sql from PostgreSQL regression tests. https://github.com/postgres/postgres/blob/REL_12_BETA2/src/test/regress/sql/select.sql

The expected results can be found in the link: https://github.com/postgres/postgres/blob/REL_12_BETA2/src/test/regress/expected/select.out

When porting the test cases, found four PostgreSQL specific features that do not exist in Spark SQL:
[SPARK-28010](https://issues.apache.org/jira/browse/SPARK-28010): Support ORDER BY ... USING syntax
[SPARK-28329](https://issues.apache.org/jira/browse/SPARK-28329): Support SELECT INTO syntax
[SPARK-28330](https://issues.apache.org/jira/browse/SPARK-28330): Enhance query limit
[SPARK-28296](https://issues.apache.org/jira/browse/SPARK-28296): Improved VALUES support

Also, found one inconsistent behavior:
[SPARK-28333](https://issues.apache.org/jira/browse/SPARK-28333): `NULLS FIRST` for `DESC` and `NULLS LAST` for `ASC`

## How was this patch tested?

N/A

Closes #25096 from wangyum/SPARK-28334.

Authored-by: Yuming Wang <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
## What changes were proposed in this pull request?

Implement `ALTER TABLE` for v2 tables:
* Add `AlterTable` logical plan and `AlterTableExec` physical plan
* Convert `ALTER TABLE` parsed plans to `AlterTable` when a v2 catalog is responsible for an identifier
* Validate that columns to alter exist in analyzer checks
* Fix nested type handling in `CatalogV2Util`
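
For instance, statements like the following would be converted to the new `AlterTable` plan when the identifier belongs to a v2 catalog (a PySpark sketch; `testcat`, `db.tbl`, and the property key are placeholders):

```python
# Both statements target a table owned by a v2 catalog (placeholder names).
spark.sql("ALTER TABLE testcat.db.tbl ADD COLUMNS (ts timestamp)")
spark.sql("ALTER TABLE testcat.db.tbl SET TBLPROPERTIES ('owner' = 'etl')")
```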

## How was this patch tested?

* Add extensive tests in `DataSourceV2SQLSuite`

Closes #24937 from rdblue/SPARK-28139-add-v2-alter-table.

Lead-authored-by: Ryan Blue <[email protected]>
Co-authored-by: Ryan Blue <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
nooberfsh and others added 17 commits July 16, 2019 16:35
## What changes were proposed in this pull request?

Add 4 additional `agg` functions to `KeyValueGroupedDataset`

## How was this patch tested?

New test in DatasetSuite for typed aggregation

Closes #24993 from nooberfsh/sqlagg.

Authored-by: nooberfsh <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
This change adds a new option that enables dynamic allocation without
the need for a shuffle service. This mode works by tracking which stages
generate shuffle files, and keeping executors that generate data for those
shuffles alive while the jobs that use them are active.

A separate timeout is also added for shuffle data, so that executors that
hold shuffle data use their own idle timeout before being removed. This
allows the shuffle data to be kept around in case it is needed by some new
job, or allows users to be more aggressive in timing out executors that
don't have shuffle data in active use.

The code also hooks up to the context cleaner so that shuffles that are
garbage collected are detected, and the respective executors not held
unnecessarily.
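
A sketch of enabling this mode from PySpark, assuming the `spark.dynamicAllocation.shuffleTracking.*` option names this change introduced (the commit message itself does not name them, so verify against your Spark version):

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.dynamicAllocation.enabled", "true")
         # track shuffle files instead of requiring an external shuffle service
         .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")
         # separate idle timeout for executors that hold shuffle data
         .config("spark.dynamicAllocation.shuffleTracking.timeout", "30min")
         .getOrCreate())
```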

Testing done with added unit tests, and also with TPC-DS workloads on
YARN without a shuffle service.

Closes #24817 from vanzin/SPARK-27963.

Authored-by: Marcelo Vanzin <[email protected]>
Signed-off-by: Marcelo Vanzin <[email protected]>
…ution_listener_on_collect'

## What changes were proposed in this pull request?

It fixes a flaky test:

```
ERROR [0.164s]: test_query_execution_listener_on_collect (pyspark.sql.tests.test_dataframe.QueryExecutionListenerTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/jenkins/python/pyspark/sql/tests/test_dataframe.py", line 758, in test_query_execution_listener_on_collect
    "The callback from the query execution listener should be called after 'collect'")
AssertionError: The callback from the query execution listener should be called after 'collect'
```

It seems the test can fail because the event was somehow delayed but was checked first.

## How was this patch tested?

Manually.

Closes #25177 from HyukjinKwon/SPARK-28418.

Authored-by: HyukjinKwon <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
… making UDFs (virtually) no-op

## What changes were proposed in this pull request?

The UDFs currently available in `IntegratedUDFTestUtils` are not exactly no-op: they convert the input column to strings and output strings.

This causes some issues when we convert and port the tests at SPARK-27921. Integrated UDF test cases share one output file, and it should output the same results. However:

1. Special values are converted into strings differently:

    | Scala      | Python |
    | ---------- | ------ |
    | `null`     | `None` |
    | `Infinity` | `inf`  |
    | `-Infinity`| `-inf` |
    | `NaN`      | `nan`  |

2. Due to float limitations in Python (see https://docs.python.org/3/tutorial/floatingpoint.html), if a float is passed into Python and sent back to the JVM, the values are potentially not exactly correct. See #25128 and #25110

To work around this, this PR changes the current UDFs to be wrapped by casts: the input column is cast to string, the UDF returns the strings as-is, and the output column is cast back to the input column's type.

Roughly:

**Before:**

```
JVM (col1) -> (cast to string within Python) Python (string) -> (string) JVM
```

**After:**

```
JVM (cast col1 to string) -> (string) Python (string) -> (cast back to col1's type) JVM
```

In this way, the UDF is virtually a no-op, although there might be some subtleties due to the string-cast roundtrip. I believe this is good enough.
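
A minimal PySpark sketch of this wrapping (the names are illustrative, not the actual `IntegratedUDFTestUtils` code):

```python
from pyspark.sql.functions import udf, col

# The test UDF itself: takes a string, returns it unchanged.
identity = udf(lambda s: s, "string")

def wrap(column, original_type):
    # cast to string -> apply the UDF -> cast back to the input column's type
    return identity(column.cast("string")).cast(original_type)

# e.g. wrap(col("b"), "double") produces CAST(udf(CAST(b AS STRING)) AS DOUBLE),
# matching the schemas in the diff below.
```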

Python native functions and Scala native functions will take strings and output strings as-is, so there will be no potential test failures due to conversion differences between Python and Scala.

After this fix, for instance, `udf-aggregates_part1.sql` outputs exactly the same results as `aggregates_part1.sql`:

<details><summary>Diff comparing to 'pgSQL/aggregates_part1.sql'</summary>
<p>

```diff
diff --git a/sql/core/src/test/resources/sql-tests/results/pgSQL/aggregates_part1.sql.out b/sql/core/src/test/resources/sql-tests/results/udf/pgSQL/udf-aggregates_part1.sql.out
index 51ca1d5..801735781c7 100644
--- a/sql/core/src/test/resources/sql-tests/results/pgSQL/aggregates_part1.sql.out
+++ b/sql/core/src/test/resources/sql-tests/results/udf/pgSQL/udf-aggregates_part1.sql.out
 -3,7 +3,7

 -- !query 0
-SELECT avg(four) AS avg_1 FROM onek
+SELECT avg(udf(four)) AS avg_1 FROM onek
 -- !query 0 schema
 struct<avg_1:double>
 -- !query 0 output
 -11,7 +11,7  struct<avg_1:double>

 -- !query 1
-SELECT avg(a) AS avg_32 FROM aggtest WHERE a < 100
+SELECT udf(avg(a)) AS avg_32 FROM aggtest WHERE a < 100
 -- !query 1 schema
 struct<avg_32:double>
 -- !query 1 output
 -19,7 +19,7  struct<avg_32:double>

 -- !query 2
-select CAST(avg(b) AS Decimal(10,3)) AS avg_107_943 FROM aggtest
+select CAST(avg(udf(b)) AS Decimal(10,3)) AS avg_107_943 FROM aggtest
 -- !query 2 schema
 struct<avg_107_943:decimal(10,3)>
 -- !query 2 output
 -27,7 +27,7  struct<avg_107_943:decimal(10,3)>

 -- !query 3
-SELECT sum(four) AS sum_1500 FROM onek
+SELECT sum(udf(four)) AS sum_1500 FROM onek
 -- !query 3 schema
 struct<sum_1500:bigint>
 -- !query 3 output
 -35,7 +35,7  struct<sum_1500:bigint>

 -- !query 4
-SELECT sum(a) AS sum_198 FROM aggtest
+SELECT udf(sum(a)) AS sum_198 FROM aggtest
 -- !query 4 schema
 struct<sum_198:bigint>
 -- !query 4 output
 -43,7 +43,7  struct<sum_198:bigint>

 -- !query 5
-SELECT sum(b) AS avg_431_773 FROM aggtest
+SELECT udf(udf(sum(b))) AS avg_431_773 FROM aggtest
 -- !query 5 schema
 struct<avg_431_773:double>
 -- !query 5 output
 -51,7 +51,7  struct<avg_431_773:double>

 -- !query 6
-SELECT max(four) AS max_3 FROM onek
+SELECT udf(max(four)) AS max_3 FROM onek
 -- !query 6 schema
 struct<max_3:int>
 -- !query 6 output
 -59,7 +59,7  struct<max_3:int>

 -- !query 7
-SELECT max(a) AS max_100 FROM aggtest
+SELECT max(udf(a)) AS max_100 FROM aggtest
 -- !query 7 schema
 struct<max_100:int>
 -- !query 7 output
 -67,7 +67,7  struct<max_100:int>

 -- !query 8
-SELECT max(aggtest.b) AS max_324_78 FROM aggtest
+SELECT udf(udf(max(aggtest.b))) AS max_324_78 FROM aggtest
 -- !query 8 schema
 struct<max_324_78:float>
 -- !query 8 output
 -75,237 +75,238  struct<max_324_78:float>

 -- !query 9
-SELECT stddev_pop(b) FROM aggtest
+SELECT stddev_pop(udf(b)) FROM aggtest
 -- !query 9 schema
-struct<stddev_pop(CAST(b AS DOUBLE)):double>
+struct<stddev_pop(CAST(CAST(udf(cast(b as string)) AS FLOAT) AS DOUBLE)):double>
 -- !query 9 output
 131.10703231895047

 -- !query 10
-SELECT stddev_samp(b) FROM aggtest
+SELECT udf(stddev_samp(b)) FROM aggtest
 -- !query 10 schema
-struct<stddev_samp(CAST(b AS DOUBLE)):double>
+struct<CAST(udf(cast(stddev_samp(cast(b as double)) as string)) AS DOUBLE):double>
 -- !query 10 output
 151.38936080399804

 -- !query 11
-SELECT var_pop(b) FROM aggtest
+SELECT var_pop(udf(b)) FROM aggtest
 -- !query 11 schema
-struct<var_pop(CAST(b AS DOUBLE)):double>
+struct<var_pop(CAST(CAST(udf(cast(b as string)) AS FLOAT) AS DOUBLE)):double>
 -- !query 11 output
 17189.053923482323

 -- !query 12
-SELECT var_samp(b) FROM aggtest
+SELECT udf(var_samp(b)) FROM aggtest
 -- !query 12 schema
-struct<var_samp(CAST(b AS DOUBLE)):double>
+struct<CAST(udf(cast(var_samp(cast(b as double)) as string)) AS DOUBLE):double>
 -- !query 12 output
 22918.738564643096

 -- !query 13
-SELECT stddev_pop(CAST(b AS Decimal(38,0))) FROM aggtest
+SELECT udf(stddev_pop(CAST(b AS Decimal(38,0)))) FROM aggtest
 -- !query 13 schema
-struct<stddev_pop(CAST(CAST(b AS DECIMAL(38,0)) AS DOUBLE)):double>
+struct<CAST(udf(cast(stddev_pop(cast(cast(b as decimal(38,0)) as double)) as string)) AS DOUBLE):double>
 -- !query 13 output
 131.18117242958306

 -- !query 14
-SELECT stddev_samp(CAST(b AS Decimal(38,0))) FROM aggtest
+SELECT stddev_samp(CAST(udf(b) AS Decimal(38,0))) FROM aggtest
 -- !query 14 schema
-struct<stddev_samp(CAST(CAST(b AS DECIMAL(38,0)) AS DOUBLE)):double>
+struct<stddev_samp(CAST(CAST(CAST(udf(cast(b as string)) AS FLOAT) AS DECIMAL(38,0)) AS DOUBLE)):double>
 -- !query 14 output
 151.47497042966097

 -- !query 15
-SELECT var_pop(CAST(b AS Decimal(38,0))) FROM aggtest
+SELECT udf(var_pop(CAST(b AS Decimal(38,0)))) FROM aggtest
 -- !query 15 schema
-struct<var_pop(CAST(CAST(b AS DECIMAL(38,0)) AS DOUBLE)):double>
+struct<CAST(udf(cast(var_pop(cast(cast(b as decimal(38,0)) as double)) as string)) AS DOUBLE):double>
 -- !query 15 output
 17208.5

 -- !query 16
-SELECT var_samp(CAST(b AS Decimal(38,0))) FROM aggtest
+SELECT var_samp(udf(CAST(b AS Decimal(38,0)))) FROM aggtest
 -- !query 16 schema
-struct<var_samp(CAST(CAST(b AS DECIMAL(38,0)) AS DOUBLE)):double>
+struct<var_samp(CAST(CAST(udf(cast(cast(b as decimal(38,0)) as string)) AS DECIMAL(38,0)) AS DOUBLE)):double>
 -- !query 16 output
 22944.666666666668

 -- !query 17
-SELECT var_pop(1.0), var_samp(2.0)
+SELECT udf(var_pop(1.0)), var_samp(udf(2.0))
 -- !query 17 schema
-struct<var_pop(CAST(1.0 AS DOUBLE)):double,var_samp(CAST(2.0 AS DOUBLE)):double>
+struct<CAST(udf(cast(var_pop(cast(1.0 as double)) as string)) AS DOUBLE):double,var_samp(CAST(CAST(udf(cast(2.0 as string)) AS DECIMAL(2,1)) AS DOUBLE)):double>
 -- !query 17 output
 0.0    NaN

 -- !query 18
-SELECT stddev_pop(CAST(3.0 AS Decimal(38,0))), stddev_samp(CAST(4.0 AS Decimal(38,0)))
+SELECT stddev_pop(udf(CAST(3.0 AS Decimal(38,0)))), stddev_samp(CAST(udf(4.0) AS Decimal(38,0)))
 -- !query 18 schema
-struct<stddev_pop(CAST(CAST(3.0 AS DECIMAL(38,0)) AS DOUBLE)):double,stddev_samp(CAST(CAST(4.0 AS DECIMAL(38,0)) AS DOUBLE)):double>
+struct<stddev_pop(CAST(CAST(udf(cast(cast(3.0 as decimal(38,0)) as string)) AS DECIMAL(38,0)) AS DOUBLE)):double,stddev_samp(CAST(CAST(CAST(udf(cast(4.0 as string)) AS DECIMAL(2,1)) AS DECIMAL(38,0)) AS DOUBLE)):double>
 -- !query 18 output
 0.0    NaN

 -- !query 19
-select sum(CAST(null AS int)) from range(1,4)
+select sum(udf(CAST(null AS int))) from range(1,4)
 -- !query 19 schema
-struct<sum(CAST(NULL AS INT)):bigint>
+struct<sum(CAST(udf(cast(cast(null as int) as string)) AS INT)):bigint>
 -- !query 19 output
 NULL

 -- !query 20
-select sum(CAST(null AS long)) from range(1,4)
+select sum(udf(CAST(null AS long))) from range(1,4)
 -- !query 20 schema
-struct<sum(CAST(NULL AS BIGINT)):bigint>
+struct<sum(CAST(udf(cast(cast(null as bigint) as string)) AS BIGINT)):bigint>
 -- !query 20 output
 NULL

 -- !query 21
-select sum(CAST(null AS Decimal(38,0))) from range(1,4)
+select sum(udf(CAST(null AS Decimal(38,0)))) from range(1,4)
 -- !query 21 schema
-struct<sum(CAST(NULL AS DECIMAL(38,0))):decimal(38,0)>
+struct<sum(CAST(udf(cast(cast(null as decimal(38,0)) as string)) AS DECIMAL(38,0))):decimal(38,0)>
 -- !query 21 output
 NULL

 -- !query 22
-select sum(CAST(null AS DOUBLE)) from range(1,4)
+select sum(udf(CAST(null AS DOUBLE))) from range(1,4)
 -- !query 22 schema
-struct<sum(CAST(NULL AS DOUBLE)):double>
+struct<sum(CAST(udf(cast(cast(null as double) as string)) AS DOUBLE)):double>
 -- !query 22 output
 NULL

 -- !query 23
-select avg(CAST(null AS int)) from range(1,4)
+select avg(udf(CAST(null AS int))) from range(1,4)
 -- !query 23 schema
-struct<avg(CAST(NULL AS INT)):double>
+struct<avg(CAST(udf(cast(cast(null as int) as string)) AS INT)):double>
 -- !query 23 output
 NULL

 -- !query 24
-select avg(CAST(null AS long)) from range(1,4)
+select avg(udf(CAST(null AS long))) from range(1,4)
 -- !query 24 schema
-struct<avg(CAST(NULL AS BIGINT)):double>
+struct<avg(CAST(udf(cast(cast(null as bigint) as string)) AS BIGINT)):double>
 -- !query 24 output
 NULL

 -- !query 25
-select avg(CAST(null AS Decimal(38,0))) from range(1,4)
+select avg(udf(CAST(null AS Decimal(38,0)))) from range(1,4)
 -- !query 25 schema
-struct<avg(CAST(NULL AS DECIMAL(38,0))):decimal(38,4)>
+struct<avg(CAST(udf(cast(cast(null as decimal(38,0)) as string)) AS DECIMAL(38,0))):decimal(38,4)>
 -- !query 25 output
 NULL

 -- !query 26
-select avg(CAST(null AS DOUBLE)) from range(1,4)
+select avg(udf(CAST(null AS DOUBLE))) from range(1,4)
 -- !query 26 schema
-struct<avg(CAST(NULL AS DOUBLE)):double>
+struct<avg(CAST(udf(cast(cast(null as double) as string)) AS DOUBLE)):double>
 -- !query 26 output
 NULL

 -- !query 27
-select sum(CAST('NaN' AS DOUBLE)) from range(1,4)
+select sum(CAST(udf('NaN') AS DOUBLE)) from range(1,4)
 -- !query 27 schema
-struct<sum(CAST(NaN AS DOUBLE)):double>
+struct<sum(CAST(CAST(udf(cast(NaN as string)) AS STRING) AS DOUBLE)):double>
 -- !query 27 output
 NaN

 -- !query 28
-select avg(CAST('NaN' AS DOUBLE)) from range(1,4)
+select avg(CAST(udf('NaN') AS DOUBLE)) from range(1,4)
 -- !query 28 schema
-struct<avg(CAST(NaN AS DOUBLE)):double>
+struct<avg(CAST(CAST(udf(cast(NaN as string)) AS STRING) AS DOUBLE)):double>
 -- !query 28 output
 NaN

 -- !query 30
-SELECT avg(CAST(x AS DOUBLE)), var_pop(CAST(x AS DOUBLE))
+SELECT avg(CAST(udf(x) AS DOUBLE)), var_pop(CAST(udf(x) AS DOUBLE))
 FROM (VALUES ('Infinity'), ('1')) v(x)
 -- !query 30 schema
-struct<avg(CAST(x AS DOUBLE)):double,var_pop(CAST(x AS DOUBLE)):double>
+struct<avg(CAST(CAST(udf(cast(x as string)) AS STRING) AS DOUBLE)):double,var_pop(CAST(CAST(udf(cast(x as string)) AS STRING) AS DOUBLE)):double>
 -- !query 30 output
 Infinity       NaN

 -- !query 31
-SELECT avg(CAST(x AS DOUBLE)), var_pop(CAST(x AS DOUBLE))
+SELECT avg(CAST(udf(x) AS DOUBLE)), var_pop(CAST(udf(x) AS DOUBLE))
 FROM (VALUES ('Infinity'), ('Infinity')) v(x)
 -- !query 31 schema
-struct<avg(CAST(x AS DOUBLE)):double,var_pop(CAST(x AS DOUBLE)):double>
+struct<avg(CAST(CAST(udf(cast(x as string)) AS STRING) AS DOUBLE)):double,var_pop(CAST(CAST(udf(cast(x as string)) AS STRING) AS DOUBLE)):double>
 -- !query 31 output
 Infinity       NaN

 -- !query 32
-SELECT avg(CAST(x AS DOUBLE)), var_pop(CAST(x AS DOUBLE))
+SELECT avg(CAST(udf(x) AS DOUBLE)), var_pop(CAST(udf(x) AS DOUBLE))
 FROM (VALUES ('-Infinity'), ('Infinity')) v(x)
 -- !query 32 schema
-struct<avg(CAST(x AS DOUBLE)):double,var_pop(CAST(x AS DOUBLE)):double>
+struct<avg(CAST(CAST(udf(cast(x as string)) AS STRING) AS DOUBLE)):double,var_pop(CAST(CAST(udf(cast(x as string)) AS STRING) AS DOUBLE)):double>
 -- !query 32 output
 NaN    NaN

 -- !query 33
-SELECT avg(CAST(x AS DOUBLE)), var_pop(CAST(x AS DOUBLE))
+SELECT avg(udf(CAST(x AS DOUBLE))), udf(var_pop(CAST(x AS DOUBLE)))
 FROM (VALUES (100000003), (100000004), (100000006), (100000007)) v(x)
 -- !query 33 schema
-struct<avg(CAST(x AS DOUBLE)):double,var_pop(CAST(x AS DOUBLE)):double>
+struct<avg(CAST(udf(cast(cast(x as double) as string)) AS DOUBLE)):double,CAST(udf(cast(var_pop(cast(x as double)) as string)) AS DOUBLE):double>
 -- !query 33 output
 1.00000005E8   2.5

 -- !query 34
-SELECT avg(CAST(x AS DOUBLE)), var_pop(CAST(x AS DOUBLE))
+SELECT avg(udf(CAST(x AS DOUBLE))), udf(var_pop(CAST(x AS DOUBLE)))
 FROM (VALUES (7000000000005), (7000000000007)) v(x)
 -- !query 34 schema
-struct<avg(CAST(x AS DOUBLE)):double,var_pop(CAST(x AS DOUBLE)):double>
+struct<avg(CAST(udf(cast(cast(x as double) as string)) AS DOUBLE)):double,CAST(udf(cast(var_pop(cast(x as double)) as string)) AS DOUBLE):double>
 -- !query 34 output
 7.000000000006E12      1.0

 -- !query 35
-SELECT covar_pop(b, a), covar_samp(b, a) FROM aggtest
+SELECT udf(covar_pop(b, udf(a))), covar_samp(udf(b), a) FROM aggtest
 -- !query 35 schema
-struct<covar_pop(CAST(b AS DOUBLE), CAST(a AS DOUBLE)):double,covar_samp(CAST(b AS DOUBLE), CAST(a AS DOUBLE)):double>
+struct<CAST(udf(cast(covar_pop(cast(b as double), cast(cast(udf(cast(a as string)) as int) as double)) as string)) AS DOUBLE):double,covar_samp(CAST(CAST(udf(cast(b as string)) AS FLOAT) AS DOUBLE), CAST(a AS DOUBLE)):double>
 -- !query 35 output
 653.6289553875104      871.5052738500139

 -- !query 36
-SELECT corr(b, a) FROM aggtest
+SELECT corr(b, udf(a)) FROM aggtest
 -- !query 36 schema
-struct<corr(CAST(b AS DOUBLE), CAST(a AS DOUBLE)):double>
+struct<corr(CAST(b AS DOUBLE), CAST(CAST(udf(cast(a as string)) AS INT) AS DOUBLE)):double>
 -- !query 36 output
 0.1396345165178734

 -- !query 37
-SELECT count(four) AS cnt_1000 FROM onek
+SELECT count(udf(four)) AS cnt_1000 FROM onek
 -- !query 37 schema
 struct<cnt_1000:bigint>
 -- !query 37 output
 -313,7 +314,7  struct<cnt_1000:bigint>

 -- !query 38
-SELECT count(DISTINCT four) AS cnt_4 FROM onek
+SELECT udf(count(DISTINCT four)) AS cnt_4 FROM onek
 -- !query 38 schema
 struct<cnt_4:bigint>
 -- !query 38 output
 -321,10 +322,10  struct<cnt_4:bigint>

 -- !query 39
-select ten, count(*), sum(four) from onek
+select ten, udf(count(*)), sum(udf(four)) from onek
 group by ten order by ten
 -- !query 39 schema
-struct<ten:int,count(1):bigint,sum(four):bigint>
+struct<ten:int,CAST(udf(cast(count(1) as string)) AS BIGINT):bigint,sum(CAST(udf(cast(four as string)) AS INT)):bigint>
 -- !query 39 output
 0      100     100
 1      100     200
 -339,10 +340,10  struct<ten:int,count(1):bigint,sum(four):bigint>

 -- !query 40
-select ten, count(four), sum(DISTINCT four) from onek
+select ten, count(udf(four)), udf(sum(DISTINCT four)) from onek
 group by ten order by ten
 -- !query 40 schema
-struct<ten:int,count(four):bigint,sum(DISTINCT four):bigint>
+struct<ten:int,count(CAST(udf(cast(four as string)) AS INT)):bigint,CAST(udf(cast(sum(distinct cast(four as bigint)) as string)) AS BIGINT):bigint>
 -- !query 40 output
 0      100     2
 1      100     4
 -357,11 +358,11  struct<ten:int,count(four):bigint,sum(DISTINCT four):bigint>

 -- !query 41
-select ten, sum(distinct four) from onek a
+select ten, udf(sum(distinct four)) from onek a
 group by ten
-having exists (select 1 from onek b where sum(distinct a.four) = b.four)
+having exists (select 1 from onek b where udf(sum(distinct a.four)) = b.four)
 -- !query 41 schema
-struct<ten:int,sum(DISTINCT four):bigint>
+struct<ten:int,CAST(udf(cast(sum(distinct cast(four as bigint)) as string)) AS BIGINT):bigint>
 -- !query 41 output
 0      2
 2      2
 -374,23 +375,23  struct<ten:int,sum(DISTINCT four):bigint>
 select ten, sum(distinct four) from onek a
 group by ten
 having exists (select 1 from onek b
-               where sum(distinct a.four + b.four) = b.four)
+               where sum(distinct a.four + b.four) = udf(b.four))
 -- !query 42 schema
 struct<>
 -- !query 42 output
 org.apache.spark.sql.AnalysisException

 Aggregate/Window/Generate expressions are not valid in where clause of the query.
-Expression in where clause: [(sum(DISTINCT CAST((outer() + b.`four`) AS BIGINT)) = CAST(b.`four` AS BIGINT))]
+Expression in where clause: [(sum(DISTINCT CAST((outer() + b.`four`) AS BIGINT)) = CAST(CAST(udf(cast(four as string)) AS INT) AS BIGINT))]
 Invalid expressions: [sum(DISTINCT CAST((outer() + b.`four`) AS BIGINT))];

 -- !query 43
 select
-  (select max((select i.unique2 from tenk1 i where i.unique1 = o.unique1)))
+  (select udf(max((select i.unique2 from tenk1 i where i.unique1 = o.unique1))))
 from tenk1 o
 -- !query 43 schema
 struct<>
 -- !query 43 output
 org.apache.spark.sql.AnalysisException
-cannot resolve '`o.unique1`' given input columns: [i.even, i.fivethous, i.four, i.hundred, i.odd, i.string4, i.stringu1, i.stringu2, i.ten, i.tenthous, i.thousand, i.twenty, i.two, i.twothousand, i.unique1, i.unique2]; line 2 pos 63
+cannot resolve '`o.unique1`' given input columns: [i.even, i.fivethous, i.four, i.hundred, i.odd, i.string4, i.stringu1, i.stringu2, i.ten, i.tenthous, i.thousand, i.twenty, i.two, i.twothousand, i.unique1, i.unique2]; line 2 pos 67
```
</p>
</details>

## How was this patch tested?

Manually tested.

Closes #25130 from HyukjinKwon/SPARK-28359.

Authored-by: HyukjinKwon <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
## What changes were proposed in this pull request?

PostgreSQL doesn't have `TINYINT`, which would map directly, but `SMALLINT`s are sufficient for uni-directional translation.

A side-effect of this fix is that `AggregatedDialect` is now usable with multiple dialects targeting `jdbc:postgresql`, as `PostgresDialect.getJDBCType` no longer throws (for which reason backporting this fix would be lovely):

https://github.com/apache/spark/blob/1217996f1574f758d8cccc1c4e3846452d24b35b/sql/core/src/main/scala/org/apache/spark/sql/jdbc/AggregatedDialect.scala#L42

`dialects.flatMap` currently throws on the first attempt to get a JDBC type preventing subsequent dialects in the chain from providing an alternative.
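
A PySpark sketch of the write path this unblocks, assuming the PostgreSQL JDBC driver is on the classpath (the URL and table name are placeholders):

```python
from pyspark.sql.types import StructType, StructField, ByteType

# A single ByteType column; with this fix it maps to SMALLINT in PostgreSQL.
df = spark.createDataFrame([(1,), (2,)],
                           StructType([StructField("b", ByteType())]))
(df.write.format("jdbc")
   .option("url", "jdbc:postgresql://localhost/test")  # placeholder URL
   .option("dbtable", "byte_values")                   # placeholder table
   .save())
```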

## How was this patch tested?

Unit tests.

Closes #24845 from mojodna/postgres-byte-type-mapping.

Authored-by: Seth Fitzsimmons <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
## What changes were proposed in this pull request?

In the PR, I propose to convert option values to strings by using `to_str()` for the following functions: `from_csv()`, `to_csv()`, `from_json()`, `to_json()`, `schema_of_csv()` and `schema_of_json()`. This will make the handling of function options consistent with option handling in `DataFrameReader`/`DataFrameWriter`.

For example:
```Python
from pyspark.sql.functions import from_csv
df.select(from_csv(df.value, "s string", {'ignoreLeadingWhiteSpace': True}))
```

## How was this patch tested?

Added an example for `from_csv()` which was tested by:
```Shell
./python/run-tests --testnames pyspark.sql.functions
```

Closes #25182 from MaxGekk/options_to_str.

Authored-by: Maxim Gekk <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
## What changes were proposed in this pull request?
In the following Python code
```
df.write.mode("overwrite").insertInto("table")
```
`insertInto` ignores `mode("overwrite")` and appends by default.
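
With the fix, both of the following request an overwriting insert (a sketch; `df` is an existing DataFrame and `table` a placeholder):

```python
df.write.mode("overwrite").insertInto("table")
df.write.insertInto("table", overwrite=True)  # explicit flag, same intent
```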

## How was this patch tested?

Add Unit test.

Closes #25175 from huaxingao/spark-28411.

Authored-by: Huaxin Gao <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
…to UDF test base

## What changes were proposed in this pull request?

This PR adds some tests converted from `cross-join.sql` to test UDFs.

<details><summary>Diff comparing to 'cross-join.sql'</summary>
<p>

```diff
diff --git a/sql/core/src/test/resources/sql-tests/results/cross-join.sql.out b/sql/core/src/test/resources/sql-tests/results/udf/udf-cross-join.sql.out
index 3833c42..11c1e01 100644
--- a/sql/core/src/test/resources/sql-tests/results/cross-join.sql.out
+++ b/sql/core/src/test/resources/sql-tests/results/udf/udf-cross-join.sql.out
 -43,7 +43,7  two   2       two     22

 -- !query 3
-SELECT * FROM nt1 cross join nt2 where nt1.k = nt2.k
+SELECT * FROM nt1 cross join nt2 where udf(nt1.k) = udf(nt2.k)
 -- !query 3 schema
 struct<k:string,v1:int,k:string,v2:int>
 -- !query 3 output
 -53,7 +53,7  two   2       two     22

 -- !query 4
-SELECT * FROM nt1 cross join nt2 on (nt1.k = nt2.k)
+SELECT * FROM nt1 cross join nt2 on (udf(nt1.k) = udf(nt2.k))
 -- !query 4 schema
 struct<k:string,v1:int,k:string,v2:int>
 -- !query 4 output
 -63,7 +63,7  two   2       two     22

 -- !query 5
-SELECT * FROM nt1 cross join nt2 where nt1.v1 = 1 and nt2.v2 = 22
+SELECT * FROM nt1 cross join nt2 where udf(nt1.v1) = "1" and udf(nt2.v2) = "22"
 -- !query 5 schema
 struct<k:string,v1:int,k:string,v2:int>
 -- !query 5 output
 -71,12 +71,12  one 1       two     22

-- !query 6
-SELECT a.key, b.key FROM
-(SELECT k key FROM nt1 WHERE v1 < 2) a
+SELECT udf(a.key), udf(b.key) FROM
+(SELECT udf(k) key FROM nt1 WHERE v1 < 2) a
 CROSS JOIN
-(SELECT k key FROM nt2 WHERE v2 = 22) b
+(SELECT udf(k) key FROM nt2 WHERE v2 = 22) b
 -- !query 6 schema
-struct<key:string,key:string>
+struct<udf(key):string,udf(key):string>
 -- !query 6 output
 one    two

 -114,23 +114,29  struct<>

-- !query 11
-select * from ((A join B on (a = b)) cross join C) join D on (a = d)
+select * from ((A join B on (udf(a) = udf(b))) cross join C) join D on (udf(a) = udf(d))
 -- !query 11 schema
-struct<a:string,va:int,b:string,vb:int,c:string,vc:int,d:string,vd:int>
+struct<>
 -- !query 11 output
-one    1       one     1       one     1       one     1
-one    1       one     1       three   3       one     1
-one    1       one     1       two     2       one     1
-three  3       three   3       one     1       three   3
-three  3       three   3       three   3       three   3
-three  3       three   3       two     2       three   3
-two    2       two     2       one     1       two     2
-two    2       two     2       three   3       two     2
-two    2       two     2       two     2       two     2
+org.apache.spark.sql.AnalysisException
+Detected implicit cartesian product for INNER join between logical plans
+Filter (udf(a#x) = udf(b#x))
++- Join Inner
+   :- Project [k#x AS a#x, v1#x AS va#x]
+   :  +- LocalRelation [k#x, v1#x]
+   +- Project [k#x AS b#x, v1#x AS vb#x]
+      +- LocalRelation [k#x, v1#x]
+and
+Project [k#x AS d#x, v1#x AS vd#x]
++- LocalRelation [k#x, v1#x]
+Join condition is missing or trivial.
+Either: use the CROSS JOIN syntax to allow cartesian products between these
+relations, or: enable implicit cartesian products by setting the configuration
+variable spark.sql.crossJoin.enabled=true;

 -- !query 12
-SELECT * FROM nt1 CROSS JOIN nt2 ON (nt1.k > nt2.k)
+SELECT * FROM nt1 CROSS JOIN nt2 ON (udf(nt1.k) > udf(nt2.k))
 -- !query 12 schema
 struct<k:string,v1:int,k:string,v2:int>
 -- !query 12 output
```

</p>
</details>

## How was this patch tested?

Added test.

Closes #25168 from viirya/SPARK-28276.

Authored-by: Liang-Chi Hsieh <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
…' into UDF test base

## What changes were proposed in this pull request?

This PR adds some tests converted from `intersect-all.sql` to test UDFs. Please see the contribution guide of this umbrella ticket - [SPARK-27921](https://issues.apache.org/jira/browse/SPARK-27921).

<details><summary>Diff comparing to 'intersect-all.sql'</summary>
<p>

```diff
diff --git a/sql/core/src/test/resources/sql-tests/results/intersect-all.sql.out b/sql/core/src/test/resources/sql-tests/results/udf/udf-intersect-all.sql.out
index 63dd56c..0cb82be 100644
--- a/sql/core/src/test/resources/sql-tests/results/intersect-all.sql.out
+++ b/sql/core/src/test/resources/sql-tests/results/udf/udf-intersect-all.sql.out
 -34,11 +34,11  struct<>

 -- !query 2
-SELECT * FROM tab1
+SELECT udf(k), v FROM tab1
 INTERSECT ALL
-SELECT * FROM tab2
+SELECT k, udf(v) FROM tab2
 -- !query 2 schema
-struct<k:int,v:int>
+struct<CAST(udf(cast(k as string)) AS INT):int,v:int>
 -- !query 2 output
 1	2
 1	2
 -48,11 +48,11  NULL	NULL

 -- !query 3
-SELECT * FROM tab1
+SELECT k, udf(v) FROM tab1
 INTERSECT ALL
-SELECT * FROM tab1 WHERE k = 1
+SELECT udf(k), v FROM tab1 WHERE udf(k) = 1
 -- !query 3 schema
-struct<k:int,v:int>
+struct<k:int,CAST(udf(cast(v as string)) AS INT):int>
 -- !query 3 output
 1	2
 1	2
 -61,39 +61,39  struct<k:int,v:int>

 -- !query 4
-SELECT * FROM tab1 WHERE k > 2
+SELECT udf(k), udf(v) FROM tab1 WHERE k > udf(2)
 INTERSECT ALL
-SELECT * FROM tab2
+SELECT udf(k), udf(v) FROM tab2
 -- !query 4 schema
-struct<k:int,v:int>
+struct<CAST(udf(cast(k as string)) AS INT):int,CAST(udf(cast(v as string)) AS INT):int>
 -- !query 4 output

 -- !query 5
-SELECT * FROM tab1
+SELECT udf(k), v FROM tab1
 INTERSECT ALL
-SELECT * FROM tab2 WHERE k > 3
+SELECT udf(k), v FROM tab2 WHERE udf(udf(k)) > 3
 -- !query 5 schema
-struct<k:int,v:int>
+struct<CAST(udf(cast(k as string)) AS INT):int,v:int>
 -- !query 5 output

 -- !query 6
-SELECT * FROM tab1
+SELECT udf(k), v FROM tab1
 INTERSECT ALL
-SELECT CAST(1 AS BIGINT), CAST(2 AS BIGINT)
+SELECT CAST(udf(1) AS BIGINT), CAST(udf(2) AS BIGINT)
 -- !query 6 schema
-struct<k:bigint,v:bigint>
+struct<CAST(udf(cast(k as string)) AS INT):bigint,v:bigint>
 -- !query 6 output
 1	2

 -- !query 7
-SELECT * FROM tab1
+SELECT k, udf(v) FROM tab1
 INTERSECT ALL
-SELECT array(1), 2
+SELECT array(1), udf(2)
 -- !query 7 schema
 struct<>
 -- !query 7 output
 -102,9 +102,9  IntersectAll can only be performed on tables with the compatible column types. a

 -- !query 8
-SELECT k FROM tab1
+SELECT udf(k) FROM tab1
 INTERSECT ALL
-SELECT k, v FROM tab2
+SELECT udf(k), udf(v) FROM tab2
 -- !query 8 schema
 struct<>
 -- !query 8 output
 -113,13 +113,13  IntersectAll can only be performed on tables with the same number of columns, bu

 -- !query 9
-SELECT * FROM tab2
+SELECT udf(k), v FROM tab2
 INTERSECT ALL
-SELECT * FROM tab1
+SELECT k, udf(v) FROM tab1
 INTERSECT ALL
-SELECT * FROM tab2
+SELECT udf(k), udf(v) FROM tab2
 -- !query 9 schema
-struct<k:int,v:int>
+struct<CAST(udf(cast(k as string)) AS INT):int,v:int>
 -- !query 9 output
 1	2
 1	2
 -129,15 +129,15  NULL	NULL

 -- !query 10
-SELECT * FROM tab1
+SELECT udf(k), v FROM tab1
 EXCEPT
-SELECT * FROM tab2
+SELECT k, udf(v) FROM tab2
 UNION ALL
-SELECT * FROM tab1
+SELECT k, udf(udf(v)) FROM tab1
 INTERSECT ALL
-SELECT * FROM tab2
+SELECT udf(k), v FROM tab2
 -- !query 10 schema
-struct<k:int,v:int>
+struct<CAST(udf(cast(k as string)) AS INT):int,v:int>
 -- !query 10 output
 1	2
 1	2
 -148,15 +148,15  NULL	NULL

 -- !query 11
-SELECT * FROM tab1
+SELECT udf(k), udf(v) FROM tab1
 EXCEPT
-SELECT * FROM tab2
+SELECT udf(k), v FROM tab2
 EXCEPT
-SELECT * FROM tab1
+SELECT k, udf(v) FROM tab1
 INTERSECT ALL
-SELECT * FROM tab2
+SELECT udf(k), udf(udf(v)) FROM tab2
 -- !query 11 schema
-struct<k:int,v:int>
+struct<CAST(udf(cast(k as string)) AS INT):int,CAST(udf(cast(v as string)) AS INT):int>
 -- !query 11 output
 1	3

 -165,38 +165,38  struct<k:int,v:int>
 (
   (
     (
-      SELECT * FROM tab1
+      SELECT udf(k), v FROM tab1
       EXCEPT
-      SELECT * FROM tab2
+      SELECT k, udf(v) FROM tab2
     )
     EXCEPT
-    SELECT * FROM tab1
+    SELECT udf(k), udf(v) FROM tab1
   )
   INTERSECT ALL
-  SELECT * FROM tab2
+  SELECT udf(k), udf(v) FROM tab2
 )
 -- !query 12 schema
-struct<k:int,v:int>
+struct<CAST(udf(cast(k as string)) AS INT):int,v:int>
 -- !query 12 output

 -- !query 13
 SELECT *
-FROM   (SELECT tab1.k,
-               tab2.v
+FROM   (SELECT udf(tab1.k),
+               udf(tab2.v)
         FROM   tab1
                JOIN tab2
-                 ON tab1.k = tab2.k)
+                 ON udf(udf(tab1.k)) = tab2.k)
 INTERSECT ALL
 SELECT *
-FROM   (SELECT tab1.k,
-               tab2.v
+FROM   (SELECT udf(tab1.k),
+               udf(tab2.v)
         FROM   tab1
                JOIN tab2
-                 ON tab1.k = tab2.k)
+                 ON udf(tab1.k) = udf(udf(tab2.k)))
 -- !query 13 schema
-struct<k:int,v:int>
+struct<CAST(udf(cast(k as string)) AS INT):int,CAST(udf(cast(v as string)) AS INT):int>
 -- !query 13 output
 1	2
 1	2
 -211,30 +211,30  struct<k:int,v:int>

 -- !query 14
 SELECT *
-FROM   (SELECT tab1.k,
-               tab2.v
+FROM   (SELECT udf(tab1.k),
+               udf(tab2.v)
         FROM   tab1
                JOIN tab2
-                 ON tab1.k = tab2.k)
+                 ON udf(tab1.k) = udf(tab2.k))
 INTERSECT ALL
 SELECT *
-FROM   (SELECT tab2.v AS k,
-               tab1.k AS v
+FROM   (SELECT udf(tab2.v) AS k,
+               udf(tab1.k) AS v
         FROM   tab1
                JOIN tab2
-                 ON tab1.k = tab2.k)
+                 ON tab1.k = udf(tab2.k))
 -- !query 14 schema
-struct<k:int,v:int>
+struct<CAST(udf(cast(k as string)) AS INT):int,CAST(udf(cast(v as string)) AS INT):int>
 -- !query 14 output

 -- !query 15
-SELECT v FROM tab1 GROUP BY v
+SELECT udf(v) FROM tab1 GROUP BY v
 INTERSECT ALL
-SELECT k FROM tab2 GROUP BY k
+SELECT udf(udf(k)) FROM tab2 GROUP BY k
 -- !query 15 schema
-struct<v:int>
+struct<CAST(udf(cast(v as string)) AS INT):int>
 -- !query 15 output
 2
 3
 -250,15 +250,15  spark.sql.legacy.setopsPrecedence.enabled	true

 -- !query 17
-SELECT * FROM tab1
+SELECT udf(k), v FROM tab1
 EXCEPT
-SELECT * FROM tab2
+SELECT k, udf(v) FROM tab2
 UNION ALL
-SELECT * FROM tab1
+SELECT udf(k), udf(v) FROM tab1
 INTERSECT ALL
-SELECT * FROM tab2
+SELECT udf(udf(k)), udf(v) FROM tab2
 -- !query 17 schema
-struct<k:int,v:int>
+struct<CAST(udf(cast(k as string)) AS INT):int,v:int>
 -- !query 17 output
 1	2
 1	2
 -268,15 +268,15  NULL	NULL

 -- !query 18
-SELECT * FROM tab1
+SELECT k, udf(v) FROM tab1
 EXCEPT
-SELECT * FROM tab2
+SELECT udf(k), v FROM tab2
 UNION ALL
-SELECT * FROM tab1
+SELECT udf(k), udf(v) FROM tab1
 INTERSECT
-SELECT * FROM tab2
+SELECT udf(k), udf(udf(v)) FROM tab2
 -- !query 18 schema
-struct<k:int,v:int>
+struct<k:int,CAST(udf(cast(v as string)) AS INT):int>
 -- !query 18 output
 1	2
 2	3

```
</p>
</details>

## How was this patch tested?

Tested as guided in [SPARK-27921](https://issues.apache.org/jira/browse/SPARK-27921).

Closes #25119 from imback82/intersect-all-sql.

Authored-by: Terry Kim <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
…nto UDF test base

## What changes were proposed in this pull request?

This PR adds some tests converted from `except-all.sql` to test UDFs. Please see the contribution guide of this umbrella ticket - [SPARK-27921](https://issues.apache.org/jira/browse/SPARK-27921).

<details><summary>Diff comparing to 'except-all.sql'</summary>
<p>

```diff
diff --git a/sql/core/src/test/resources/sql-tests/results/except-all.sql.out b/sql/core/src/test/resources/sql-tests/results/udf/udf-except-all.sql.out
index 01091a2..b7bfad0 100644
--- a/sql/core/src/test/resources/sql-tests/results/except-all.sql.out
+++ b/sql/core/src/test/resources/sql-tests/results/udf/udf-except-all.sql.out
 -49,11 +49,11  struct<>

 -- !query 4
-SELECT * FROM tab1
+SELECT udf(c1) FROM tab1
 EXCEPT ALL
-SELECT * FROM tab2
+SELECT udf(c1) FROM tab2
 -- !query 4 schema
-struct<c1:int>
+struct<CAST(udf(cast(c1 as string)) AS INT):int>
 -- !query 4 output
 0
 2
 -62,11 +62,11  NULL

 -- !query 5
-SELECT * FROM tab1
+SELECT udf(c1) FROM tab1
 MINUS ALL
-SELECT * FROM tab2
+SELECT udf(c1) FROM tab2
 -- !query 5 schema
-struct<c1:int>
+struct<CAST(udf(cast(c1 as string)) AS INT):int>
 -- !query 5 output
 0
 2
 -75,11 +75,11  NULL

 -- !query 6
-SELECT * FROM tab1
+SELECT udf(c1) FROM tab1
 EXCEPT ALL
-SELECT * FROM tab2 WHERE c1 IS NOT NULL
+SELECT udf(c1) FROM tab2 WHERE udf(c1) IS NOT NULL
 -- !query 6 schema
-struct<c1:int>
+struct<CAST(udf(cast(c1 as string)) AS INT):int>
 -- !query 6 output
 0
 2
 -89,21 +89,21  NULL

 -- !query 7
-SELECT * FROM tab1 WHERE c1 > 5
+SELECT udf(c1) FROM tab1 WHERE udf(c1) > 5
 EXCEPT ALL
-SELECT * FROM tab2
+SELECT udf(c1) FROM tab2
 -- !query 7 schema
-struct<c1:int>
+struct<CAST(udf(cast(c1 as string)) AS INT):int>
 -- !query 7 output

 -- !query 8
-SELECT * FROM tab1
+SELECT udf(c1) FROM tab1
 EXCEPT ALL
-SELECT * FROM tab2 WHERE c1 > 6
+SELECT udf(c1) FROM tab2 WHERE udf(c1 > udf(6))
 -- !query 8 schema
-struct<c1:int>
+struct<CAST(udf(cast(c1 as string)) AS INT):int>
 -- !query 8 output
 0
 1
 -117,11 +117,11  NULL

 -- !query 9
-SELECT * FROM tab1
+SELECT udf(c1) FROM tab1
 EXCEPT ALL
-SELECT CAST(1 AS BIGINT)
+SELECT CAST(udf(1) AS BIGINT)
 -- !query 9 schema
-struct<c1:bigint>
+struct<CAST(udf(cast(c1 as string)) AS INT):bigint>
 -- !query 9 output
 0
 2
 -134,7 +134,7  NULL

 -- !query 10
-SELECT * FROM tab1
+SELECT udf(c1) FROM tab1
 EXCEPT ALL
 SELECT array(1)
 -- !query 10 schema
 -145,62 +145,62  ExceptAll can only be performed on tables with the compatible column types. arra

 -- !query 11
-SELECT * FROM tab3
+SELECT udf(k), v FROM tab3
 EXCEPT ALL
-SELECT * FROM tab4
+SELECT k, udf(v) FROM tab4
 -- !query 11 schema
-struct<k:int,v:int>
+struct<CAST(udf(cast(k as string)) AS INT):int,v:int>
 -- !query 11 output
 1	2
 1	3

 -- !query 12
-SELECT * FROM tab4
+SELECT k, udf(v) FROM tab4
 EXCEPT ALL
-SELECT * FROM tab3
+SELECT udf(k), v FROM tab3
 -- !query 12 schema
-struct<k:int,v:int>
+struct<k:int,CAST(udf(cast(v as string)) AS INT):int>
 -- !query 12 output
 2	2
 2	20

 -- !query 13
-SELECT * FROM tab4
+SELECT udf(k), udf(v) FROM tab4
 EXCEPT ALL
-SELECT * FROM tab3
+SELECT udf(k), udf(v) FROM tab3
 INTERSECT DISTINCT
-SELECT * FROM tab4
+SELECT udf(k), udf(v) FROM tab4
 -- !query 13 schema
-struct<k:int,v:int>
+struct<CAST(udf(cast(k as string)) AS INT):int,CAST(udf(cast(v as string)) AS INT):int>
 -- !query 13 output
 2	2
 2	20

 -- !query 14
-SELECT * FROM tab4
+SELECT udf(k), v FROM tab4
 EXCEPT ALL
-SELECT * FROM tab3
+SELECT k, udf(v) FROM tab3
 EXCEPT DISTINCT
-SELECT * FROM tab4
+SELECT udf(k), udf(v) FROM tab4
 -- !query 14 schema
-struct<k:int,v:int>
+struct<CAST(udf(cast(k as string)) AS INT):int,v:int>
 -- !query 14 output

 -- !query 15
-SELECT * FROM tab3
+SELECT k, udf(v) FROM tab3
 EXCEPT ALL
-SELECT * FROM tab4
+SELECT udf(k), udf(v) FROM tab4
 UNION ALL
-SELECT * FROM tab3
+SELECT udf(k), v FROM tab3
 EXCEPT DISTINCT
-SELECT * FROM tab4
+SELECT k, udf(v) FROM tab4
 -- !query 15 schema
-struct<k:int,v:int>
+struct<k:int,CAST(udf(cast(v as string)) AS INT):int>
 -- !query 15 output
 1	3

 -217,83 +217,83  ExceptAll can only be performed on tables with the same number of columns, but t

 -- !query 17
-SELECT * FROM tab3
+SELECT udf(k), udf(v) FROM tab3
 EXCEPT ALL
-SELECT * FROM tab4
+SELECT udf(k), udf(v) FROM tab4
 UNION
-SELECT * FROM tab3
+SELECT udf(k), udf(v) FROM tab3
 EXCEPT DISTINCT
-SELECT * FROM tab4
+SELECT udf(k), udf(v) FROM tab4
 -- !query 17 schema
-struct<k:int,v:int>
+struct<CAST(udf(cast(k as string)) AS INT):int,CAST(udf(cast(v as string)) AS INT):int>
 -- !query 17 output
 1	3

 -- !query 18
-SELECT * FROM tab3
+SELECT udf(k), udf(v) FROM tab3
 MINUS ALL
-SELECT * FROM tab4
+SELECT k, udf(v) FROM tab4
 UNION
-SELECT * FROM tab3
+SELECT udf(k), udf(v) FROM tab3
 MINUS DISTINCT
-SELECT * FROM tab4
+SELECT k, udf(v) FROM tab4
 -- !query 18 schema
-struct<k:int,v:int>
+struct<CAST(udf(cast(k as string)) AS INT):int,CAST(udf(cast(v as string)) AS INT):int>
 -- !query 18 output
 1	3

 -- !query 19
-SELECT * FROM tab3
+SELECT k, udf(v) FROM tab3
 EXCEPT ALL
-SELECT * FROM tab4
+SELECT udf(k), v FROM tab4
 EXCEPT DISTINCT
-SELECT * FROM tab3
+SELECT k, udf(v) FROM tab3
 EXCEPT DISTINCT
-SELECT * FROM tab4
+SELECT udf(k), v FROM tab4
 -- !query 19 schema
-struct<k:int,v:int>
+struct<k:int,CAST(udf(cast(v as string)) AS INT):int>
 -- !query 19 output

 -- !query 20
 SELECT *
-FROM   (SELECT tab3.k,
-               tab4.v
+FROM   (SELECT tab3.k,
+               udf(tab4.v)
         FROM   tab3
                JOIN tab4
-                 ON tab3.k = tab4.k)
+                 ON udf(tab3.k) = tab4.k)
 EXCEPT ALL
 SELECT *
-FROM   (SELECT tab3.k,
-               tab4.v
+FROM   (SELECT udf(tab3.k),
+               tab4.v
         FROM   tab3
                JOIN tab4
-                 ON tab3.k = tab4.k)
+                 ON tab3.k = udf(tab4.k))
 -- !query 20 schema
-struct<k:int,v:int>
+struct<k:int,CAST(udf(cast(v as string)) AS INT):int>
 -- !query 20 output

 -- !query 21
 SELECT *
-FROM   (SELECT tab3.k,
-               tab4.v
+FROM   (SELECT udf(udf(tab3.k)),
+               udf(tab4.v)
         FROM   tab3
                JOIN tab4
-                 ON tab3.k = tab4.k)
+                 ON udf(udf(tab3.k)) = udf(tab4.k))
 EXCEPT ALL
 SELECT *
-FROM   (SELECT tab4.v AS k,
-               tab3.k AS v
+FROM   (SELECT udf(tab4.v) AS k,
+               udf(udf(tab3.k)) AS v
         FROM   tab3
                JOIN tab4
-                 ON tab3.k = tab4.k)
+                 ON udf(tab3.k) = udf(tab4.k))
 -- !query 21 schema
-struct<k:int,v:int>
+struct<CAST(udf(cast(cast(udf(cast(k as string)) as int) as string)) AS INT):int,CAST(udf(cast(v as string)) AS INT):int>
 -- !query 21 output
 1	2
 1	2
 -305,11 +305,11  struct<k:int,v:int>

 -- !query 22
-SELECT v FROM tab3 GROUP BY v
+SELECT udf(v) FROM tab3 GROUP BY v
 EXCEPT ALL
-SELECT k FROM tab4 GROUP BY k
+SELECT udf(k) FROM tab4 GROUP BY k
 -- !query 22 schema
-struct<v:int>
+struct<CAST(udf(cast(v as string)) AS INT):int>
 -- !query 22 output
 3

```
</p>
</details>

## How was this patch tested?

Tested as guided in [SPARK-27921](https://issues.apache.org/jira/browse/SPARK-27921).

Closes #25090 from imback82/except-all.

Authored-by: Terry Kim <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
…DF test base

## What changes were proposed in this pull request?

This PR adds some tests converted from `pivot.sql` to test UDFs, following the combination guide in [SPARK-27921](https://issues.apache.org/jira/browse/SPARK-27921).

<details><summary>Diff comparing to 'pivot.sql'</summary>
<p>

```diff
diff --git a/sql/core/src/test/resources/sql-tests/results/pivot.sql.out b/sql/core/src/test/resources/sql-tests/results/udf/udf-pivot.sql.out
index 9a8f783..cb9e4d7 100644
--- a/sql/core/src/test/resources/sql-tests/results/pivot.sql.out
+++ b/sql/core/src/test/resources/sql-tests/results/udf/udf-pivot.sql.out
 -1,5 +1,5
 -- Automatically generated by SQLQueryTestSuite
--- Number of queries: 32
+-- Number of queries: 30

 -- !query 0
 -40,14 +40,14  struct<>

 -- !query 3
 SELECT * FROM (
-  SELECT year, course, earnings FROM courseSales
+  SELECT udf(year), course, earnings FROM courseSales
 )
 PIVOT (
-  sum(earnings)
+  udf(sum(earnings))
   FOR course IN ('dotNET', 'Java')
 )
 -- !query 3 schema
-struct<year:int,dotNET:bigint,Java:bigint>
+struct<CAST(udf(cast(year as string)) AS INT):int,dotNET:bigint,Java:bigint>
 -- !query 3 output
 2012   15000   20000
 2013   48000   30000
 -56,7 +56,7  struct<year:int,dotNET:bigint,Java:bigint>
 -- !query 4
 SELECT * FROM courseSales
 PIVOT (
-  sum(earnings)
+  udf(sum(earnings))
   FOR year IN (2012, 2013)
 )
 -- !query 4 schema
 -71,11 +71,11  SELECT * FROM (
   SELECT year, course, earnings FROM courseSales
 )
 PIVOT (
-  sum(earnings), avg(earnings)
+  udf(sum(earnings)), udf(avg(earnings))
   FOR course IN ('dotNET', 'Java')
 )
 -- !query 5 schema
-struct<year:int,dotNET_sum(CAST(earnings AS BIGINT)):bigint,dotNET_avg(CAST(earnings AS BIGINT)):double,Java_sum(CAST(earnings AS BIGINT)):bigint,Java_avg(CAST(earnings AS BIGINT)):double>
+struct<year:int,dotNET_CAST(udf(cast(sum(cast(earnings as bigint)) as string)) AS BIGINT):bigint,dotNET_CAST(udf(cast(avg(cast(earnings as bigint)) as string)) AS DOUBLE):double,Java_CAST(udf(cast(sum(cast(earnings as bigint)) as string)) AS BIGINT):bigint,Java_CAST(udf(cast(avg(cast(earnings as bigint)) as string)) AS DOUBLE):double>
 -- !query 5 output
 2012   15000   7500.0  20000   20000.0
 2013   48000   48000.0 30000   30000.0
 -83,10 +83,10  struct<year:int,dotNET_sum(CAST(earnings AS BIGINT)):bigint,dotNET_avg(CAST(earn

 -- !query 6
 SELECT * FROM (
-  SELECT course, earnings FROM courseSales
+  SELECT udf(course) as course, earnings FROM courseSales
 )
 PIVOT (
-  sum(earnings)
+  udf(sum(earnings))
   FOR course IN ('dotNET', 'Java')
 )
 -- !query 6 schema
 -100,23 +100,23  SELECT * FROM (
   SELECT year, course, earnings FROM courseSales
 )
 PIVOT (
-  sum(earnings), min(year)
+  udf(sum(udf(earnings))), udf(min(year))
   FOR course IN ('dotNET', 'Java')
 )
 -- !query 7 schema
-struct<dotNET_sum(CAST(earnings AS BIGINT)):bigint,dotNET_min(year):int,Java_sum(CAST(earnings AS BIGINT)):bigint,Java_min(year):int>
+struct<dotNET_CAST(udf(cast(sum(cast(cast(udf(cast(earnings as string)) as int) as bigint)) as string)) AS BIGINT):bigint,dotNET_CAST(udf(cast(min(year) as string)) AS INT):int,Java_CAST(udf(cast(sum(cast(cast(udf(cast(earnings as string)) as int) as bigint)) as string)) AS BIGINT):bigint,Java_CAST(udf(cast(min(year) as string)) AS INT):int>
 -- !query 7 output
 63000  2012    50000   2012

 -- !query 8
 SELECT * FROM (
-  SELECT course, year, earnings, s
+  SELECT course, year, earnings, udf(s) as s
   FROM courseSales
   JOIN years ON year = y
 )
 PIVOT (
-  sum(earnings)
+  udf(sum(earnings))
   FOR s IN (1, 2)
 )
 -- !query 8 schema
 -135,11 +135,11  SELECT * FROM (
   JOIN years ON year = y
 )
 PIVOT (
-  sum(earnings), min(s)
+  udf(sum(earnings)), udf(min(s))
   FOR course IN ('dotNET', 'Java')
 )
 -- !query 9 schema
-struct<year:int,dotNET_sum(CAST(earnings AS BIGINT)):bigint,dotNET_min(s):int,Java_sum(CAST(earnings AS BIGINT)):bigint,Java_min(s):int>
+struct<year:int,dotNET_CAST(udf(cast(sum(cast(earnings as bigint)) as string)) AS BIGINT):bigint,dotNET_CAST(udf(cast(min(s) as string)) AS INT):int,Java_CAST(udf(cast(sum(cast(earnings as bigint)) as string)) AS BIGINT):bigint,Java_CAST(udf(cast(min(s) as string)) AS INT):int>
 -- !query 9 output
 2012   15000   1       20000   1
 2013   48000   2       30000   2
 -152,7 +152,7  SELECT * FROM (
   JOIN years ON year = y
 )
 PIVOT (
-  sum(earnings * s)
+  udf(sum(earnings * s))
   FOR course IN ('dotNET', 'Java')
 )
 -- !query 10 schema
 -167,7 +167,7  SELECT 2012_s, 2013_s, 2012_a, 2013_a, c FROM (
   SELECT year y, course c, earnings e FROM courseSales
 )
 PIVOT (
-  sum(e) s, avg(e) a
+  udf(sum(e)) s, udf(avg(e)) a
   FOR y IN (2012, 2013)
 )
 -- !query 11 schema
 -182,7 +182,7  SELECT firstYear_s, secondYear_s, firstYear_a, secondYear_a, c FROM (
   SELECT year y, course c, earnings e FROM courseSales
 )
 PIVOT (
-  sum(e) s, avg(e) a
+  udf(sum(e)) s, udf(avg(e)) a
   FOR y IN (2012 as firstYear, 2013 secondYear)
 )
 -- !query 12 schema
 -195,7 +195,7  struct<firstYear_s:bigint,secondYear_s:bigint,firstYear_a:double,secondYear_a:do
 -- !query 13
 SELECT * FROM courseSales
 PIVOT (
-  abs(earnings)
+  udf(abs(earnings))
   FOR year IN (2012, 2013)
 )
 -- !query 13 schema
 -210,7 +210,7  SELECT * FROM (
   SELECT year, course, earnings FROM courseSales
 )
 PIVOT (
-  sum(earnings), year
+  udf(sum(earnings)), year
   FOR course IN ('dotNET', 'Java')
 )
 -- !query 14 schema
 -225,7 +225,7  SELECT * FROM (
   SELECT course, earnings FROM courseSales
 )
 PIVOT (
-  sum(earnings)
+  udf(sum(earnings))
   FOR year IN (2012, 2013)
 )
 -- !query 15 schema
 -240,11 +240,11  SELECT * FROM (
   SELECT year, course, earnings FROM courseSales
 )
 PIVOT (
-  ceil(sum(earnings)), avg(earnings) + 1 as a1
+  udf(ceil(udf(sum(earnings)))), avg(earnings) + 1 as a1
   FOR course IN ('dotNET', 'Java')
 )
 -- !query 16 schema
-struct<year:int,dotNET_CEIL(sum(CAST(earnings AS BIGINT))):bigint,dotNET_a1:double,Java_CEIL(sum(CAST(earnings AS BIGINT))):bigint,Java_a1:double>
+struct<year:int,dotNET_CAST(udf(cast(CEIL(cast(udf(cast(sum(cast(earnings as bigint)) as string)) as bigint)) as string)) AS BIGINT):bigint,dotNET_a1:double,Java_CAST(udf(cast(CEIL(cast(udf(cast(sum(cast(earnings as bigint)) as string)) as bigint)) as string)) AS BIGINT):bigint,Java_a1:double>
 -- !query 16 output
 2012   15000   7501.0  20000   20001.0
 2013   48000   48001.0 30000   30001.0
 -255,7 +255,7  SELECT * FROM (
   SELECT year, course, earnings FROM courseSales
 )
 PIVOT (
-  sum(avg(earnings))
+  sum(udf(avg(earnings)))
   FOR course IN ('dotNET', 'Java')
 )
 -- !query 17 schema
 -272,7 +272,7  SELECT * FROM (
   JOIN years ON year = y
 )
 PIVOT (
-  sum(earnings)
+  udf(sum(earnings))
   FOR (course, year) IN (('dotNET', 2012), ('Java', 2013))
 )
 -- !query 18 schema
 -289,7 +289,7  SELECT * FROM (
   JOIN years ON year = y
 )
 PIVOT (
-  sum(earnings)
+  udf(sum(earnings))
   FOR (course, s) IN (('dotNET', 2) as c1, ('Java', 1) as c2)
 )
 -- !query 19 schema
 -306,7 +306,7  SELECT * FROM (
   JOIN years ON year = y
 )
 PIVOT (
-  sum(earnings)
+  udf(sum(earnings))
   FOR (course, year) IN ('dotNET', 'Java')
 )
 -- !query 20 schema
 -319,7 +319,7  Invalid pivot value 'dotNET': value data type string does not match pivot column
 -- !query 21
 SELECT * FROM courseSales
 PIVOT (
-  sum(earnings)
+  udf(sum(earnings))
   FOR year IN (s, 2013)
 )
 -- !query 21 schema
 -332,7 +332,7  cannot resolve '`s`' given input columns: [coursesales.course, coursesales.earni
 -- !query 22
 SELECT * FROM courseSales
 PIVOT (
-  sum(earnings)
+  udf(sum(earnings))
   FOR year IN (course, 2013)
 )
 -- !query 22 schema
@@ -343,151 +343,118 @@ Literal expressions required for pivot values, found 'course#x';

 -- !query 23
-SELECT * FROM (
-  SELECT course, year, a
-  FROM courseSales
-  JOIN yearsWithComplexTypes ON year = y
-)
-PIVOT (
-  min(a)
-  FOR course IN ('dotNET', 'Java')
-)
--- !query 23 schema
-struct<year:int,dotNET:array<int>,Java:array<int>>
--- !query 23 output
-2012   [1,1]   [1,1]
-2013   [2,2]   [2,2]
-
-
--- !query 24
-SELECT * FROM (
-  SELECT course, year, y, a
-  FROM courseSales
-  JOIN yearsWithComplexTypes ON year = y
-)
-PIVOT (
-  max(a)
-  FOR (y, course) IN ((2012, 'dotNET'), (2013, 'Java'))
-)
--- !query 24 schema
-struct<year:int,[2012, dotNET]:array<int>,[2013, Java]:array<int>>
--- !query 24 output
-2012   [1,1]   NULL
-2013   NULL    [2,2]
-
-
--- !query 25
 SELECT * FROM (
   SELECT earnings, year, a
   FROM courseSales
   JOIN yearsWithComplexTypes ON year = y
 )
 PIVOT (
-  sum(earnings)
+  udf(sum(earnings))
   FOR a IN (array(1, 1), array(2, 2))
 )
--- !query 25 schema
+-- !query 23 schema
 struct<year:int,[1, 1]:bigint,[2, 2]:bigint>
--- !query 25 output
+-- !query 23 output
 2012   35000   NULL
 2013   NULL    78000

--- !query 26
+-- !query 24
 SELECT * FROM (
-  SELECT course, earnings, year, a
+  SELECT course, earnings, udf(year) as year, a
   FROM courseSales
   JOIN yearsWithComplexTypes ON year = y
 )
 PIVOT (
-  sum(earnings)
+  udf(sum(earnings))
   FOR (course, a) IN (('dotNET', array(1, 1)), ('Java', array(2, 2)))
 )
--- !query 26 schema
+-- !query 24 schema
 struct<year:int,[dotNET, [1, 1]]:bigint,[Java, [2, 2]]:bigint>
--- !query 26 output
+-- !query 24 output
 2012   15000   NULL
 2013   NULL    30000

--- !query 27
+-- !query 25
 SELECT * FROM (
   SELECT earnings, year, s
   FROM courseSales
   JOIN yearsWithComplexTypes ON year = y
 )
 PIVOT (
-  sum(earnings)
+  udf(sum(earnings))
   FOR s IN ((1, 'a'), (2, 'b'))
 )
--- !query 27 schema
+-- !query 25 schema
 struct<year:int,[1, a]:bigint,[2, b]:bigint>
--- !query 27 output
+-- !query 25 output
 2012   35000   NULL
 2013   NULL    78000

--- !query 28
+-- !query 26
 SELECT * FROM (
   SELECT course, earnings, year, s
   FROM courseSales
   JOIN yearsWithComplexTypes ON year = y
 )
 PIVOT (
-  sum(earnings)
+  udf(sum(earnings))
   FOR (course, s) IN (('dotNET', (1, 'a')), ('Java', (2, 'b')))
 )
--- !query 28 schema
+-- !query 26 schema
 struct<year:int,[dotNET, [1, a]]:bigint,[Java, [2, b]]:bigint>
--- !query 28 output
+-- !query 26 output
 2012   15000   NULL
 2013   NULL    30000

--- !query 29
+-- !query 27
 SELECT * FROM (
   SELECT earnings, year, m
   FROM courseSales
   JOIN yearsWithComplexTypes ON year = y
 )
 PIVOT (
-  sum(earnings)
+  udf(sum(earnings))
   FOR m IN (map('1', 1), map('2', 2))
 )
--- !query 29 schema
+-- !query 27 schema
 struct<>
--- !query 29 output
+-- !query 27 output
 org.apache.spark.sql.AnalysisException
 Invalid pivot column 'm#x'. Pivot columns must be comparable.;

--- !query 30
+-- !query 28
 SELECT * FROM (
   SELECT course, earnings, year, m
   FROM courseSales
   JOIN yearsWithComplexTypes ON year = y
 )
 PIVOT (
-  sum(earnings)
+  udf(sum(earnings))
   FOR (course, m) IN (('dotNET', map('1', 1)), ('Java', map('2', 2)))
 )
--- !query 30 schema
+-- !query 28 schema
 struct<>
--- !query 30 output
+-- !query 28 output
 org.apache.spark.sql.AnalysisException
 Invalid pivot column 'named_struct(course, course#x, m, m#x)'. Pivot columns must be comparable.;

--- !query 31
+-- !query 29
 SELECT * FROM (
-  SELECT course, earnings, "a" as a, "z" as z, "b" as b, "y" as y, "c" as c, "x" as x, "d" as d, "w" as w
+  SELECT course, earnings, udf("a") as a, udf("z") as z, udf("b") as b, udf("y") as y,
+  udf("c") as c, udf("x") as x, udf("d") as d, udf("w") as w
   FROM courseSales
 )
 PIVOT (
-  sum(Earnings)
+  udf(sum(Earnings))
   FOR Course IN ('dotNET', 'Java')
 )
--- !query 31 schema
+-- !query 29 schema
 struct<a:string,z:string,b:string,y:string,c:string,x:string,d:string,w:string,dotNET:bigint,Java:bigint>
--- !query 31 output
+-- !query 29 output
 a      z       b       y       c       x       d       w       63000   50000

```

</p>
</details>

## How was this patch tested?

Tested as guided in [SPARK-27921](https://issues.apache.org/jira/browse/SPARK-27921).

Closes #25122 from chitralverma/SPARK-28286.

Authored-by: chitralverma <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
## What changes were proposed in this pull request?

This PR is to port timestamp.sql from PostgreSQL regression tests. https://github.com/postgres/postgres/blob/REL_12_BETA2/src/test/regress/sql/timestamp.sql

The expected results can be found in the link: https://github.com/postgres/postgres/blob/REL_12_BETA2/src/test/regress/expected/timestamp.out

When porting the test cases, found six PostgreSQL-specific features that do not exist in Spark SQL:
[SPARK-28141](https://issues.apache.org/jira/browse/SPARK-28141): Timestamp type can not accept special values
[SPARK-28259](https://issues.apache.org/jira/browse/SPARK-28259): Date/Time Output Styles and Date Order Conventions
[SPARK-28425](https://issues.apache.org/jira/browse/SPARK-28425): Add more Date/Time Operators
[SPARK-28420](https://issues.apache.org/jira/browse/SPARK-28420): Date/Time Functions: date_part
[SPARK-28137](https://issues.apache.org/jira/browse/SPARK-28137): Data Type Formatting Functions
[SPARK-28432](https://issues.apache.org/jira/browse/SPARK-28432): Date/Time Functions: make_date/make_timestamp

Also, found one inconsistent behavior:
[SPARK-27923](https://issues.apache.org/jira/browse/SPARK-27923): Spark SQL converts bad inputs to NULL

## How was this patch tested?

N/A

Closes #25181 from wangyum/SPARK-28138.

Authored-by: Yuming Wang <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
## What changes were proposed in this pull request?

This PR is to port select_implicit.sql from PostgreSQL regression tests. https://github.com/postgres/postgres/blob/REL_12_BETA2/src/test/regress/sql/select_implicit.sql

The expected results can be found in the link: https://github.com/postgres/postgres/blob/REL_12_BETA2/src/test/regress/expected/select_implicit.out

When porting the test cases, found one PostgreSQL-specific feature that does not exist in Spark SQL:
[SPARK-28329](https://issues.apache.org/jira/browse/SPARK-28329): SELECT INTO syntax

## How was this patch tested?

N/A

Closes #25152 from wangyum/SPARK-28388.

Authored-by: Yuming Wang <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
## What changes were proposed in this pull request?

This adds a simple check for the `count` argument:

- If it is a `Column`, we apply `_to_java_column` before invoking the JVM counterpart.
- Otherwise, we proceed as before.
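
Conceptually, the check looks like the following minimal sketch; `array_repeat` is used purely as an illustrative function name and is not necessarily the function patched here:

```python
# Sketch only: `array_repeat` is an illustrative example of a function
# taking a `count` argument; the patched function follows the same pattern.
from pyspark import SparkContext
from pyspark.sql.column import Column, _to_java_column


def array_repeat(col, count):
    sc = SparkContext._active_spark_context
    # Accept either a Python int or a Column for `count`.
    if isinstance(count, Column):
        count = _to_java_column(count)
    return Column(sc._jvm.functions.array_repeat(_to_java_column(col), count))
```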

## How was this patch tested?

Manual testing.

Closes #25193 from zero323/SPARK-28278.

Authored-by: zero323 <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
…are missing

## What changes were proposed in this pull request?

The Spark UI's stages table misrenders the input/output metrics columns when some tasks are missing input metrics. See the screenshot below for an example of the problem:

![image](https://user-images.githubusercontent.com/50748/61420042-a3abc100-a8b5-11e9-8a92-7986563ee712.png)

This is because those columns are defined as

```scala
 {if (hasInput(stage)) {
  metricInfo(task) { m =>
    ...
   <td>....</td>
  }
}}
```

where `metricInfo` renders the node returned by the closure when metrics are defined and returns `Nil` when they are not. If metrics are undefined, no `<td></td>` tag is emitted at all, causing the columns to become misaligned as shown in the screenshot.

To fix this, the patch changes the code to

```scala
 {if (hasInput(stage)) {
  <td>{
    metricInfo(task) { m =>
      ...
     Unparsed(...)
    }
  }</td>
}}
```

which is an idiom that's already in use for the shuffle read / write columns.

## How was this patch tested?

It isn't. I'm arguing for correctness because the modifications are consistent with rendering methods that work correctly for other columns.

Closes #25183 from JoshRosen/joshrosen/fix-task-table-with-partial-io-metrics.

Authored-by: Josh Rosen <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
## What changes were proposed in this pull request?

This PR is to port numeric.sql from PostgreSQL regression tests. https://github.com/postgres/postgres/blob/REL_12_BETA2/src/test/regress/sql/numeric.sql

The expected results can be found in the link: https://github.com/postgres/postgres/blob/REL_12_BETA2/src/test/regress/expected/numeric.out

When porting the test cases, found four PostgreSQL specific features that do not exist in Spark SQL:
[SPARK-28315](https://issues.apache.org/jira/browse/SPARK-28315): Decimal can not accept `NaN` as input
[SPARK-28317](https://issues.apache.org/jira/browse/SPARK-28317): Built-in Mathematical Functions: SCALE
[SPARK-28318](https://issues.apache.org/jira/browse/SPARK-28318): Decimal can only support precision up to 38
[SPARK-28322](https://issues.apache.org/jira/browse/SPARK-28322): DIV support decimal type

Also, found four inconsistent behaviors:
[SPARK-28316](https://issues.apache.org/jira/browse/SPARK-28316): Decimal precision issue
[SPARK-28324](https://issues.apache.org/jira/browse/SPARK-28324): The LOG function uses base 10 in PostgreSQL, but base e in Spark
[SPARK-27923](https://issues.apache.org/jira/browse/SPARK-27923): Spark SQL converts bad inputs to NULL
[SPARK-28007](https://issues.apache.org/jira/browse/SPARK-28007): Caret operator (^) means bitwise XOR in Spark/Hive and exponentiation in Postgres

## How was this patch tested?

N/A

Closes #25092 from wangyum/SPARK-28312.

Authored-by: Yuming Wang <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
## What changes were proposed in this pull request?

The `DateTimeUtils.timestampAddInterval` method was rewritten using the Java 8 time API. To add months and microseconds, I used the `plusMonths()` and `plus()` methods of `ZonedDateTime`. Also, the signature of `timestampAddInterval()` was changed to accept a `ZoneId` instance instead of `TimeZone`. Using `ZoneId` avoids the `TimeZone` -> `ZoneId` conversion on every invocation of `timestampAddInterval()`.
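
A minimal sketch of the approach (the epoch-microseconds encoding and the method shape are illustrative assumptions, not Spark's exact internals):

```scala
// Sketch only: adds an interval's months and microseconds to a timestamp
// given as microseconds since the epoch, using the Java 8 time API.
import java.time.{Instant, ZoneId, ZonedDateTime}
import java.time.temporal.ChronoUnit

def timestampAddInterval(start: Long, months: Int, microseconds: Long, zoneId: ZoneId): Long = {
  val instant = Instant.ofEpochSecond(
    Math.floorDiv(start, 1000000L), Math.floorMod(start, 1000000L) * 1000L)
  val result = ZonedDateTime.ofInstant(instant, zoneId)
    .plusMonths(months)                     // calendar-aware month addition
    .plus(microseconds, ChronoUnit.MICROS)  // exact microsecond addition
    .toInstant
  result.getEpochSecond * 1000000L + result.getNano / 1000L
}
```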

## How was this patch tested?

By existing test suites `DateExpressionsSuite`, `TypeCoercionSuite` and `CollectionExpressionsSuite`.

Closes #25173 from MaxGekk/timestamp-add-interval.

Authored-by: Maxim Gekk <[email protected]>
Signed-off-by: Sean Owen <[email protected]>
@ulysses-you ulysses-you merged commit 33a8b0e into ulysses-you:master Jul 19, 2019
ulysses-you pushed a commit that referenced this pull request Oct 31, 2019
### What changes were proposed in this pull request?
`org.apache.spark.sql.kafka010.KafkaDelegationTokenSuite` has been failing lately. A look at the logs shows only the following fact, without any details:
```
Caused by: sbt.ForkMain$ForkError: sun.security.krb5.KrbException: Server not found in Kerberos database (7) - Server not found in Kerberos database
```
Since the issue is intermittent and we are not able to reproduce it, we should add more debug information and wait for a reproduction with the extended logs.

### Why are the changes needed?
Failing test doesn't give enough debug information.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
I've started the test manually and checked that such additional debug messages show up:
```
>>> KrbApReq: APOptions are 00000000 00000000 00000000 00000000
>>> EType: sun.security.krb5.internal.crypto.Aes128CtsHmacSha1EType
Looking for keys for: kafka/localhost@EXAMPLE.COM
Added key: 17version: 0
Added key: 23version: 0
Added key: 16version: 0
Found unsupported keytype (3) for kafka/localhost@EXAMPLE.COM
>>> EType: sun.security.krb5.internal.crypto.Aes128CtsHmacSha1EType
Using builtin default etypes for permitted_enctypes
default etypes for permitted_enctypes: 17 16 23.
>>> EType: sun.security.krb5.internal.crypto.Aes128CtsHmacSha1EType
MemoryCache: add 1571936500/174770/16C565221B70AAB2BEFE31A83D13A2F4/client/localhost@EXAMPLE.COM to client/localhost@EXAMPLE.COM|kafka/localhost@EXAMPLE.COM
MemoryCache: Existing AuthList:
#3: 1571936493/200803/8CD70D280B0862C5DA1FF901ECAD39FE/client/localhost@EXAMPLE.COM
#2: 1571936499/985009/BAD33290D079DD4E3579A8686EC326B7/client/localhost@EXAMPLE.COM
#1: 1571936499/995208/B76B9D78A9BE283AC78340157107FD40/client/localhost@EXAMPLE.COM
```

Closes apache#26252 from gaborgsomogyi/SPARK-29580.

Authored-by: Gabor Somogyi <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
ulysses-you pushed a commit that referenced this pull request Mar 27, 2020
### What changes were proposed in this pull request?

Fix the error caused by interval output in ExtractBenchmark.
### Why are the changes needed?

Fix a bug in the test:

```scala
[info]   Running case: cast to interval
[error] Exception in thread "main" org.apache.spark.sql.AnalysisException: Cannot use interval type in the table schema.;;
[error] OverwriteByExpression RelationV2[] noop-table, true, true
[error] +- Project [(subtractdates(cast(cast(id#0L as timestamp) as date), -719162) + subtracttimestamps(cast(id#0L as timestamp), -30610249419876544)) AS ((CAST(CAST(id AS TIMESTAMP) AS DATE) - DATE '0001-01-01') + (CAST(id AS TIMESTAMP) - TIMESTAMP '1000-01-01 01:02:03.123456'))#2]
[error]    +- Range (1262304000, 1272304000, step=1, splits=Some(1))
[error]
[error] 	at org.apache.spark.sql.catalyst.util.TypeUtils$.failWithIntervalType(TypeUtils.scala:106)
[error] 	at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$25(CheckAnalysis.scala:389)
[error] 	at org.a
```
### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Re-ran the benchmark.

Closes apache#27867 from yaooqinn/SPARK-31111.

Authored-by: Kent Yao <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
ulysses-you pushed a commit that referenced this pull request Jun 18, 2020
…chmarks

### What changes were proposed in this pull request?
Replace `CAST(... AS TIMESTAMP)` with `TIMESTAMP_SECONDS` in the following benchmarks:
- ExtractBenchmark
- DateTimeBenchmark
- FilterPushdownBenchmark
- InExpressionBenchmark

### Why are the changes needed?
The benchmarks fail w/o the changes:
```
[info] Running benchmark: datetime +/- interval
[info]   Running case: date + interval(m)
[error] Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve 'CAST(`id` AS TIMESTAMP)' due to data type mismatch: cannot cast bigint to timestamp,you can enable the casting by setting spark.sql.legacy.allowCastNumericToTimestamp to true,but we strongly recommend using function TIMESTAMP_SECONDS/TIMESTAMP_MILLIS/TIMESTAMP_MICROS instead.; line 1 pos 5;
[error] 'Project [(cast(cast(id#0L as timestamp) as date) + 1 months) AS (CAST(CAST(id AS TIMESTAMP) AS DATE) + INTERVAL '1 months')#2]
[error] +- Range (0, 10000000, step=1, splits=Some(1))
```
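
Concretely, the fix amounts to rewriting the benchmark queries along these lines (illustrative, not the exact benchmark code):

```sql
-- Before: fails unless spark.sql.legacy.allowCastNumericToTimestamp is enabled
SELECT CAST(id AS TIMESTAMP) FROM range(10);

-- After: explicit seconds-to-timestamp conversion
SELECT TIMESTAMP_SECONDS(id) FROM range(10);
```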

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running the affected benchmarks.

Closes apache#28843 from MaxGekk/GuoPhilipse-31710-fix-compatibility-followup.

Authored-by: Max Gekk <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
ulysses-you pushed a commit that referenced this pull request Oct 17, 2020
### What changes were proposed in this pull request?

Fix the error caused by interval output in ExtractBenchmark.
### Why are the changes needed?

Fix a bug in the test:

```scala
[info]   Running case: cast to interval
[error] Exception in thread "main" org.apache.spark.sql.AnalysisException: Cannot use interval type in the table schema.;;
[error] OverwriteByExpression RelationV2[] noop-table, true, true
[error] +- Project [(subtractdates(cast(cast(id#0L as timestamp) as date), -719162) + subtracttimestamps(cast(id#0L as timestamp), -30610249419876544)) AS ((CAST(CAST(id AS TIMESTAMP) AS DATE) - DATE '0001-01-01') + (CAST(id AS TIMESTAMP) - TIMESTAMP '1000-01-01 01:02:03.123456'))#2]
[error]    +- Range (1262304000, 1272304000, step=1, splits=Some(1))
[error]
[error] 	at org.apache.spark.sql.catalyst.util.TypeUtils$.failWithIntervalType(TypeUtils.scala:106)
[error] 	at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$25(CheckAnalysis.scala:389)
[error] 	at org.a
```
### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Re-ran the benchmark.

Closes apache#27867 from yaooqinn/SPARK-31111.

Authored-by: Kent Yao <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit 2b46662)
Signed-off-by: Wenchen Fan <[email protected]>
ulysses-you pushed a commit that referenced this pull request Dec 21, 2020
…e are foldable boolean types

### What changes were proposed in this pull request?

Improve `SimplifyConditionals`:
- Simplify `If(cond, TrueLiteral, FalseLiteral)` to `cond`.
- Simplify `If(cond, FalseLiteral, TrueLiteral)` to `Not(cond)`.

The use case is:
```sql
create table t1 using parquet as select id from range(10);
select if (id > 2, false, true) from t1;
```
Before this pr:
```
== Physical Plan ==
*(1) Project [if ((id#1L > 2)) false else true AS (IF((id > CAST(2 AS BIGINT)), false, true))#2]
+- *(1) ColumnarToRow
   +- FileScan parquet default.t1[id#1L] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[file:/Users/yumwang/opensource/spark/spark-warehouse/org.apache.spark.sql.DataF..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:bigint>
```
After this pr:
```
== Physical Plan ==
*(1) Project [(id#1L <= 2) AS (IF((id > CAST(2 AS BIGINT)), false, true))#2]
+- *(1) ColumnarToRow
   +- FileScan parquet default.t1[id#1L] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[file:/Users/yumwang/opensource/spark/spark-warehouse/org.apache.spark.sql.DataF..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:bigint>
```
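
A minimal sketch of the two rewrites as a standalone function over Catalyst expressions (the real `SimplifyConditionals` rule covers more cases and nullability handling):

```scala
// Sketch only: assumes Spark's Catalyst classes on the classpath.
import org.apache.spark.sql.catalyst.expressions.{Expression, If, Not}
import org.apache.spark.sql.catalyst.expressions.Literal.{FalseLiteral, TrueLiteral}

def simplifyIf(e: Expression): Expression = e match {
  // Guard on nullability: when `cond` is null, If falls back to the false
  // branch, so the bare rewrite would change semantics for nullable conditions.
  case If(cond, TrueLiteral, FalseLiteral) if !cond.nullable => cond
  case If(cond, FalseLiteral, TrueLiteral) if !cond.nullable => Not(cond)
  case other => other
}
```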

### Why are the changes needed?

Improve query performance.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit test.

Closes apache#30849 from wangyum/SPARK-33798-2.

Authored-by: Yuming Wang <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
ulysses-you pushed a commit that referenced this pull request Dec 25, 2020
### What changes were proposed in this pull request?

This PR intends to fix flaky GitHub Actions (GA) tests below in `transform.sql` (this flakiness does not seem to happen in the Jenkins tests):
- https://github.com/apache/spark/runs/1592987501
- https://github.com/apache/spark/runs/1593196242
- https://github.com/apache/spark/runs/1595496305
- https://github.com/apache/spark/runs/1596309555

This is because the error message differs between test runs in GA (it seems to be truncated nondeterministically), e.g.,
```
# https://github.com/apache/spark/runs/1592987501
Expected "...h status 127. Error:[ /bin/bash: some_non_existent_command: command not found]", but got "...h status 127. Error:[]" Result did not match for query #2

# https://github.com/apache/spark/runs/1593196242
Expected "...istent_command: comm[and not found]", but got "...istent_command: comm[]" Result did not match for query #2
```
Though the root cause of this nondeterministic behaviour happening only in GA is not clear, the test throws a SparkException consistently even in GA. So, this PR proposes to make the test just check that the exception is thrown when it runs.

This PR comes from the dongjoon-hyun comment: https://github.com/apache/spark/pull/29414/files#r547414513

### Why are the changes needed?

Bugfix.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Added tests.

Closes apache#30896 from maropu/SPARK-32106-FOLLOWUP.

Authored-by: Takeshi Yamamuro <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
ulysses-you pushed a commit that referenced this pull request Jun 24, 2021
…pendently in Scala 2.13

### What changes were proposed in this pull request?
Similar to SPARK-35532, the main change of this PR is to add a `scala-2.13` profile to external/kafka-0-10-sql/pom.xml, external/avro/pom.xml and sql/hive-thriftserver/pom.xml. The `scala-2.13` profile includes a dependency on `scala-parallel-collections_2.13`, so that all (34) Spark modules can be maven-tested independently.
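
The profile added to each of the three modules is essentially of this shape (a sketch; exact contents follow the modules that already had it):

```xml
<profile>
  <id>scala-2.13</id>
  <dependencies>
    <dependency>
      <groupId>org.scala-lang.modules</groupId>
      <artifactId>scala-parallel-collections_${scala.binary.version}</artifactId>
    </dependency>
  </dependencies>
</profile>
```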

### Why are the changes needed?
Ensure all (34) Spark modules can be maven-tested independently in Scala 2.13.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
- Pass the GitHub Action Scala 2.13 job
- Manual test:

1. Execute
```
dev/change-scala-version.sh 2.13

mvn clean install -DskipTests -Phadoop-3.2 -Phive-2.3 -Phadoop-cloud -Pmesos -Pyarn -Pkinesis-asl -Phive-thriftserver -Pspark-ganglia-lgpl -Pkubernetes -Phive -Pscala-2.13
```

2. maven test `external/kafka-0-10-sql` module
```
mvn test -Phadoop-3.2 -Phive-2.3 -Phadoop-cloud -Pmesos -Pyarn -Pkinesis-asl -Phive-thriftserver -Pspark-ganglia-lgpl -Pkubernetes -Phive -Pscala-2.13 -pl external/kafka-0-10-sql
```

**before**

```
Discovery starting.
Discovery completed in 857 milliseconds.
Run starting. Expected test count is: 464
...
KafkaRelationSuiteV2:
- explicit earliest to latest offsets
- default starting and ending offsets
- explicit offsets
- default starting and ending offsets with headers
- timestamp provided for starting and ending
- timestamp provided for starting, offset provided for ending
- timestamp provided for ending, offset provided for starting
- timestamp provided for starting, ending not provided
- timestamp provided for ending, starting not provided
- global timestamp provided for starting and ending
- no matched offset for timestamp - startingOffsets
- preferences on offset related options
- no matched offset for timestamp - endingOffsets
*** RUN ABORTED ***
  java.lang.NoClassDefFoundError: scala/collection/parallel/TaskSupport
  at org.apache.spark.SparkContext.$anonfun$union$1(SparkContext.scala:1411)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
  at org.apache.spark.SparkContext.withScope(SparkContext.scala:788)
  at org.apache.spark.SparkContext.union(SparkContext.scala:1405)
  at org.apache.spark.sql.execution.UnionExec.doExecute(basicPhysicalOperators.scala:697)
  at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:182)
  at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:220)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:217)
  ...
  Cause: java.lang.ClassNotFoundException: scala.collection.parallel.TaskSupport
  at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
  at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
  at org.apache.spark.SparkContext.$anonfun$union$1(SparkContext.scala:1411)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
  at org.apache.spark.SparkContext.withScope(SparkContext.scala:788)
  at org.apache.spark.SparkContext.union(SparkContext.scala:1405)
  at org.apache.spark.sql.execution.UnionExec.doExecute(basicPhysicalOperators.scala:697)
  ...
```

**After**

```
Run completed in 33 minutes, 51 seconds.
Total number of tests run: 464
Suites: completed 31, aborted 0
Tests: succeeded 464, failed 0, canceled 0, ignored 0, pending 0
All tests passed.
```

3. maven test `external/avro` module

```
mvn test -Phadoop-3.2 -Phive-2.3 -Phadoop-cloud -Pmesos -Pyarn -Pkinesis-asl -Phive-thriftserver -Pspark-ganglia-lgpl -Pkubernetes -Phive -Pscala-2.13 -pl external/avro
```

**before**

```
Discovery starting.
Discovery completed in 2 seconds, 765 milliseconds.
Run starting. Expected test count is: 255
AvroReadSchemaSuite:
- append column at the end
- hide column at the end
- append column into middle
- hide column in the middle
- add a nested column at the end of the leaf struct column
- add a nested column in the middle of the leaf struct column
- add a nested column at the end of the middle struct column
- add a nested column in the middle of the middle struct column
- hide a nested column at the end of the leaf struct column
- hide a nested column in the middle of the leaf struct column
- hide a nested column at the end of the middle struct column
- hide a nested column in the middle of the middle struct column
*** RUN ABORTED ***
  java.lang.NoClassDefFoundError: scala/collection/parallel/TaskSupport
  at org.apache.spark.SparkContext.$anonfun$union$1(SparkContext.scala:1411)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
  at org.apache.spark.SparkContext.withScope(SparkContext.scala:788)
  at org.apache.spark.SparkContext.union(SparkContext.scala:1405)
  at org.apache.spark.sql.execution.UnionExec.doExecute(basicPhysicalOperators.scala:697)
  at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:182)
  at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:220)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:217)
  ...
  Cause: java.lang.ClassNotFoundException: scala.collection.parallel.TaskSupport
  at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
  at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
  at org.apache.spark.SparkContext.$anonfun$union$1(SparkContext.scala:1411)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
  at org.apache.spark.SparkContext.withScope(SparkContext.scala:788)
  at org.apache.spark.SparkContext.union(SparkContext.scala:1405)
  at org.apache.spark.sql.execution.UnionExec.doExecute(basicPhysicalOperators.scala:697)
  ...
```

**After**

```
Run completed in 1 minute, 42 seconds.
Total number of tests run: 255
Suites: completed 12, aborted 0
Tests: succeeded 255, failed 0, canceled 0, ignored 2, pending 0
All tests passed.
```

4.  maven test `sql/hive-thriftserver` module

```
mvn test -Phadoop-3.2 -Phive-2.3 -Phadoop-cloud -Pmesos -Pyarn -Pkinesis-asl -Phive-thriftserver -Pspark-ganglia-lgpl -Pkubernetes -Phive -Pscala-2.13 -pl sql/hive-thriftserver
```

**before**

```
- union.sql *** FAILED ***
  "1  a
  1 a
  2 b
  2 b" did not contain "Exception" Exception did not match for query #2
  SELECT *
  FROM   (SELECT * FROM t1
          UNION ALL
          SELECT * FROM t1), expected: 1  a
  1 a
  2 b
  2 b, but got: java.sql.SQLException
  org.apache.hive.service.cli.HiveSQLException: Error running query: java.lang.NoClassDefFoundError: scala/collection/parallel/TaskSupport
    at org.apache.spark.sql.hive.thriftserver.HiveThriftServerErrors$.runningQueryError(HiveThriftServerErrors.scala:38)
    at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.org$apache$spark$sql$hive$thriftserver$SparkExecuteStatementOperation$$execute(SparkExecuteStatementOperation.scala:324)
    at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2$$anon$3.$anonfun$run$2(SparkExecuteStatementOperation.scala:229)
    at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.scala:18)
    at org.apache.spark.sql.hive.thriftserver.SparkOperation.withLocalProperties(SparkOperation.scala:79)
    at org.apache.spark.sql.hive.thriftserver.SparkOperation.withLocalProperties$(SparkOperation.scala:63)
    at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.withLocalProperties(SparkExecuteStatementOperation.scala:43)
    at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2$$anon$3.run(SparkExecuteStatementOperation.scala:229)
    at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2$$anon$3.run(SparkExecuteStatementOperation.scala:224)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1878)
    at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2.run(SparkExecuteStatementOperation.scala:238)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
  Caused by: java.lang.NoClassDefFoundError: scala/collection/parallel/TaskSupport
    at org.apache.spark.SparkContext.$anonfun$union$1(SparkContext.scala:1411)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.SparkContext.withScope(SparkContext.scala:788)
    at org.apache.spark.SparkContext.union(SparkContext.scala:1405)
    at org.apache.spark.sql.execution.UnionExec.doExecute(basicPhysicalOperators.scala:697)
    at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:182)
    at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:220)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:217)
    at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:178)
    at org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:323)
    at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:389)
    at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:3719)
    at org.apache.spark.sql.Dataset.$anonfun$collect$1(Dataset.scala:2987)
    at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3710)
    at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
    at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
    at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
    at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:774)
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
    at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3708)
    at org.apache.spark.sql.Dataset.collect(Dataset.scala:2987)
    at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.org$apache$spark$sql$hive$thriftserver$SparkExecuteStatementOperation$$execute(SparkExecuteStatementOperation.scala:299)
    ... 16 more
  Caused by: java.lang.ClassNotFoundException: scala.collection.parallel.TaskSupport
    at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
    ... 40 more (ThriftServerQueryTestSuite.scala:209)
```

**After**

```
Run completed in 29 minutes, 17 seconds.
Total number of tests run: 535
Suites: completed 20, aborted 0
Tests: succeeded 535, failed 0, canceled 0, ignored 17, pending 0
All tests passed.
```

Closes apache#32994 from LuciferYang/SPARK-35838.

Authored-by: YangJie <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
ulysses-you pushed a commit that referenced this pull request Jun 9, 2022
…iEnabled in 'cast string to date #2'

### What changes were proposed in this pull request?

This PR fixes the test to make `CastWithAnsiOffSuite` properly respect `ansiEnabled` in the `cast string to date #2` test by using `CastWithAnsiOffSuite.cast` instead of the `Cast` expression.

### Why are the changes needed?

To make the tests pass. Currently it fails when ANSI mode is on:

https://github.com/apache/spark/runs/6786744647

### Does this PR introduce _any_ user-facing change?

No, test-only.

### How was this patch tested?

Manually tested in my IDE.

Closes apache#36802 from HyukjinKwon/SPARK-39321-followup.

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
ulysses-you pushed a commit that referenced this pull request Oct 14, 2022
…ly equivalent children in `RewriteDistinctAggregates`

### What changes were proposed in this pull request?

In `RewriteDistinctAggregates`, when grouping aggregate expressions by function children, treat children that are semantically equivalent as the same.

### Why are the changes needed?

This PR will reduce the number of projections in the Expand operator when there are multiple distinct aggregations with superficially different children. In some cases, it will eliminate the need for an Expand operator.

Example: In the following query, the Expand operator creates 3\*n rows (where n is the number of incoming rows) because it has a projection for each of function children `b + 1`, `1 + b` and `c`.

```
create or replace temp view v1 as
select * from values
(1, 2, 3.0),
(1, 3, 4.0),
(2, 4, 2.5),
(2, 3, 1.0)
v1(a, b, c);

select
  a,
  count(distinct b + 1),
  avg(distinct 1 + b) filter (where c > 0),
  sum(c)
from
  v1
group by a;
```
The Expand operator has three projections (each producing a row for each incoming row):
```
[a#87, null, null, 0, null, UnscaledValue(c#89)], <== projection #1 (for regular aggregation)
[a#87, (b#88 + 1), null, 1, null, null],          <== projection #2 (for distinct aggregation of b + 1)
[a#87, null, (1 + b#88), 2, (c#89 > 0.0), null]], <== projection #3 (for distinct aggregation of 1 + b)
```
In reality, the Expand only needs one projection for `1 + b` and `b + 1`, because they are semantically equivalent.

With the proposed change, the Expand operator's projections look like this:
```
[a#67, null, 0, null, UnscaledValue(c#69)],  <== projection #1 (for regular aggregations)
[a#67, (b#68 + 1), 1, (c#69 > 0.0), null]],  <== projection #2 (for distinct aggregation on b + 1 and 1 + b)
```
With one less projection, Expand produces 2\*n rows instead of 3\*n rows, but still produces the correct result.

In the case where all distinct aggregates have semantically equivalent children, the Expand operator is not needed at all.
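
Conceptually, the grouping change keys on each child's canonicalized form instead of plain expression equality; a minimal sketch (not the rule itself):

```scala
// Sketch only: `b + 1` and `1 + b` canonicalize to the same expression,
// so they land in one group and share a single Expand projection.
import org.apache.spark.sql.catalyst.expressions.Expression

def groupBySemantics(children: Seq[Expression]): Map[Expression, Seq[Expression]] =
  children.groupBy(_.canonicalized)
```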

Benchmark code in the JIRA (SPARK-40382).

Before the PR:
```
distinct aggregates:                      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
all semantically equivalent                       14721          14859         195          5.7         175.5       1.0X
some semantically equivalent                      14569          14572           5          5.8         173.7       1.0X
none semantically equivalent                      14408          14488         113          5.8         171.8       1.0X
```
After the PR:
```
distinct aggregates:                      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
all semantically equivalent                        3658           3692          49         22.9          43.6       1.0X
some semantically equivalent                       9124           9214         127          9.2         108.8       0.4X
none semantically equivalent                      14601          14777         250          5.7         174.1       0.3X
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

New unit tests.

Closes apache#37825 from bersprockets/rewritedistinct_issue.

Authored-by: Bruce Robbins <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
ulysses-you pushed a commit that referenced this pull request Jan 17, 2023
### What changes were proposed in this pull request?
This PR introduces a SASL retry count in `RetryingBlockTransferor`.

### Why are the changes needed?
Previously a boolean variable, saslTimeoutSeen, was used. However, the boolean variable wouldn't cover the following scenario:

1. SaslTimeoutException
2. IOException
3. SaslTimeoutException
4. IOException

Even though the IOException at step #2 is retried (resulting in an increment of `retryCount`), `retryCount` would be cleared at step #4.
Since the intention of `saslTimeoutSeen` is to undo the increment due to retrying `SaslTimeoutException`, we should keep a counter for `SaslTimeoutException` retries and subtract the value of this counter from `retryCount`.
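
A minimal, self-contained sketch of the described bookkeeping (field and method names here are assumptions for illustration; the real logic lives in `RetryingBlockTransferor` and is more involved):

```java
// Sketch only: demonstrates subtracting SASL-caused retries from the
// overall retry count when a non-SASL error occurs.
class SaslRetryBookkeeping {
  static class SaslTimeoutException extends RuntimeException {}

  private final int maxRetries = 3;
  private int retryCount = 0;      // retries counted against maxRetries
  private int saslRetryCount = 0;  // retries caused by SASL timeouts only

  synchronized boolean shouldRetry(Throwable t) {
    if (t instanceof SaslTimeoutException) {
      saslRetryCount += 1;         // remember SASL-caused increments
    } else if (saslRetryCount > 0) {
      // Undo increments that came from retried SASL timeouts, so a later
      // IOException still gets its full retry budget.
      retryCount -= saslRetryCount;
      saslRetryCount = 0;
    }
    retryCount += 1;
    return retryCount <= maxRetries;
  }
}
```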

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
New test is added, courtesy of Mridul.

Closes apache#39611 from tedyu/sasl-cnt.

Authored-by: Ted Yu <[email protected]>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>