Skip to content

Conversation

@imback82
Copy link
Contributor

@imback82 imback82 commented Jul 11, 2019

What changes were proposed in this pull request?

This PR adds some tests converted from intersect-all.sql to test UDFs. Please see contribution guide of this umbrella ticket - SPARK-27921.

Diff comparing to 'intersect-all.sql'

diff --git a/sql/core/src/test/resources/sql-tests/results/intersect-all.sql.out b/sql/core/src/test/resources/sql-tests/results/udf/udf-intersect-all.sql.out
index 63dd56ce46..0cb82be2da 100644
--- a/sql/core/src/test/resources/sql-tests/results/intersect-all.sql.out
+++ b/sql/core/src/test/resources/sql-tests/results/udf/udf-intersect-all.sql.out
@@ -34,11 +34,11 @@ struct<>
 
 
 -- !query 2
-SELECT * FROM tab1
+SELECT udf(k), v FROM tab1
 INTERSECT ALL
-SELECT * FROM tab2
+SELECT k, udf(v) FROM tab2
 -- !query 2 schema
-struct<k:int,v:int>
+struct<CAST(udf(cast(k as string)) AS INT):int,v:int>
 -- !query 2 output
 1	2
 1	2
@@ -48,11 +48,11 @@ NULL	NULL
 
 
 -- !query 3
-SELECT * FROM tab1
+SELECT k, udf(v) FROM tab1
 INTERSECT ALL
-SELECT * FROM tab1 WHERE k = 1
+SELECT udf(k), v FROM tab1 WHERE udf(k) = 1
 -- !query 3 schema
-struct<k:int,v:int>
+struct<k:int,CAST(udf(cast(v as string)) AS INT):int>
 -- !query 3 output
 1	2
 1	2
@@ -61,39 +61,39 @@ struct<k:int,v:int>
 
 
 -- !query 4
-SELECT * FROM tab1 WHERE k > 2
+SELECT udf(k), udf(v) FROM tab1 WHERE k > udf(2)
 INTERSECT ALL
-SELECT * FROM tab2
+SELECT udf(k), udf(v) FROM tab2
 -- !query 4 schema
-struct<k:int,v:int>
+struct<CAST(udf(cast(k as string)) AS INT):int,CAST(udf(cast(v as string)) AS INT):int>
 -- !query 4 output
 
 
 
 -- !query 5
-SELECT * FROM tab1
+SELECT udf(k), v FROM tab1
 INTERSECT ALL
-SELECT * FROM tab2 WHERE k > 3
+SELECT udf(k), v FROM tab2 WHERE udf(udf(k)) > 3
 -- !query 5 schema
-struct<k:int,v:int>
+struct<CAST(udf(cast(k as string)) AS INT):int,v:int>
 -- !query 5 output
 
 
 
 -- !query 6
-SELECT * FROM tab1
+SELECT udf(k), v FROM tab1
 INTERSECT ALL
-SELECT CAST(1 AS BIGINT), CAST(2 AS BIGINT)
+SELECT CAST(udf(1) AS BIGINT), CAST(udf(2) AS BIGINT)
 -- !query 6 schema
-struct<k:bigint,v:bigint>
+struct<CAST(udf(cast(k as string)) AS INT):bigint,v:bigint>
 -- !query 6 output
 1	2
 
 
 -- !query 7
-SELECT * FROM tab1
+SELECT k, udf(v) FROM tab1
 INTERSECT ALL
-SELECT array(1), 2
+SELECT array(1), udf(2)
 -- !query 7 schema
 struct<>
 -- !query 7 output
@@ -102,9 +102,9 @@ IntersectAll can only be performed on tables with the compatible column types. a
 
 
 -- !query 8
-SELECT k FROM tab1
+SELECT udf(k) FROM tab1
 INTERSECT ALL
-SELECT k, v FROM tab2
+SELECT udf(k), udf(v) FROM tab2
 -- !query 8 schema
 struct<>
 -- !query 8 output
@@ -113,13 +113,13 @@ IntersectAll can only be performed on tables with the same number of columns, bu
 
 
 -- !query 9
-SELECT * FROM tab2
+SELECT udf(k), v FROM tab2
 INTERSECT ALL
-SELECT * FROM tab1
+SELECT k, udf(v) FROM tab1
 INTERSECT ALL
-SELECT * FROM tab2
+SELECT udf(k), udf(v) FROM tab2
 -- !query 9 schema
-struct<k:int,v:int>
+struct<CAST(udf(cast(k as string)) AS INT):int,v:int>
 -- !query 9 output
 1	2
 1	2
@@ -129,15 +129,15 @@ NULL	NULL
 
 
 -- !query 10
-SELECT * FROM tab1
+SELECT udf(k), v FROM tab1
 EXCEPT
-SELECT * FROM tab2
+SELECT k, udf(v) FROM tab2
 UNION ALL
-SELECT * FROM tab1
+SELECT k, udf(udf(v)) FROM tab1
 INTERSECT ALL
-SELECT * FROM tab2
+SELECT udf(k), v FROM tab2
 -- !query 10 schema
-struct<k:int,v:int>
+struct<CAST(udf(cast(k as string)) AS INT):int,v:int>
 -- !query 10 output
 1	2
 1	2
@@ -148,15 +148,15 @@ NULL	NULL
 
 
 -- !query 11
-SELECT * FROM tab1
+SELECT udf(k), udf(v) FROM tab1
 EXCEPT
-SELECT * FROM tab2
+SELECT udf(k), v FROM tab2
 EXCEPT
-SELECT * FROM tab1
+SELECT k, udf(v) FROM tab1
 INTERSECT ALL
-SELECT * FROM tab2
+SELECT udf(k), udf(udf(v)) FROM tab2
 -- !query 11 schema
-struct<k:int,v:int>
+struct<CAST(udf(cast(k as string)) AS INT):int,CAST(udf(cast(v as string)) AS INT):int>
 -- !query 11 output
 1	3
 
@@ -165,38 +165,38 @@ struct<k:int,v:int>
 (
   (
     (
-      SELECT * FROM tab1
+      SELECT udf(k), v FROM tab1
       EXCEPT
-      SELECT * FROM tab2
+      SELECT k, udf(v) FROM tab2
     )
     EXCEPT
-    SELECT * FROM tab1
+    SELECT udf(k), udf(v) FROM tab1
   )
   INTERSECT ALL
-  SELECT * FROM tab2
+  SELECT udf(k), udf(v) FROM tab2
 )
 -- !query 12 schema
-struct<k:int,v:int>
+struct<CAST(udf(cast(k as string)) AS INT):int,v:int>
 -- !query 12 output
 
 
 
 -- !query 13
 SELECT * 
-FROM   (SELECT tab1.k, 
-               tab2.v 
+FROM   (SELECT udf(tab1.k),
+               udf(tab2.v)
         FROM   tab1 
                JOIN tab2 
-                 ON tab1.k = tab2.k)
+                 ON udf(udf(tab1.k)) = tab2.k)
 INTERSECT ALL 
 SELECT * 
-FROM   (SELECT tab1.k, 
-               tab2.v 
+FROM   (SELECT udf(tab1.k),
+               udf(tab2.v)
         FROM   tab1 
                JOIN tab2 
-                 ON tab1.k = tab2.k)
+                 ON udf(tab1.k) = udf(udf(tab2.k)))
 -- !query 13 schema
-struct<k:int,v:int>
+struct<CAST(udf(cast(k as string)) AS INT):int,CAST(udf(cast(v as string)) AS INT):int>
 -- !query 13 output
 1	2
 1	2
@@ -211,30 +211,30 @@ struct<k:int,v:int>
 
 -- !query 14
 SELECT * 
-FROM   (SELECT tab1.k, 
-               tab2.v 
+FROM   (SELECT udf(tab1.k),
+               udf(tab2.v)
         FROM   tab1 
                JOIN tab2 
-                 ON tab1.k = tab2.k) 
+                 ON udf(tab1.k) = udf(tab2.k))
 INTERSECT ALL 
 SELECT * 
-FROM   (SELECT tab2.v AS k, 
-               tab1.k AS v 
+FROM   (SELECT udf(tab2.v) AS k,
+               udf(tab1.k) AS v
         FROM   tab1 
                JOIN tab2 
-                 ON tab1.k = tab2.k)
+                 ON tab1.k = udf(tab2.k))
 -- !query 14 schema
-struct<k:int,v:int>
+struct<CAST(udf(cast(k as string)) AS INT):int,CAST(udf(cast(v as string)) AS INT):int>
 -- !query 14 output
 
 
 
 -- !query 15
-SELECT v FROM tab1 GROUP BY v
+SELECT udf(v) FROM tab1 GROUP BY v
 INTERSECT ALL
-SELECT k FROM tab2 GROUP BY k
+SELECT udf(udf(k)) FROM tab2 GROUP BY k
 -- !query 15 schema
-struct<v:int>
+struct<CAST(udf(cast(v as string)) AS INT):int>
 -- !query 15 output
 2
 3
@@ -250,15 +250,15 @@ spark.sql.legacy.setopsPrecedence.enabled	true
 
 
 -- !query 17
-SELECT * FROM tab1
+SELECT udf(k), v FROM tab1
 EXCEPT
-SELECT * FROM tab2
+SELECT k, udf(v) FROM tab2
 UNION ALL
-SELECT * FROM tab1
+SELECT udf(k), udf(v) FROM tab1
 INTERSECT ALL
-SELECT * FROM tab2
+SELECT udf(udf(k)), udf(v) FROM tab2
 -- !query 17 schema
-struct<k:int,v:int>
+struct<CAST(udf(cast(k as string)) AS INT):int,v:int>
 -- !query 17 output
 1	2
 1	2
@@ -268,15 +268,15 @@ NULL	NULL
 
 
 -- !query 18
-SELECT * FROM tab1
+SELECT k, udf(v) FROM tab1
 EXCEPT
-SELECT * FROM tab2
+SELECT udf(k), v FROM tab2
 UNION ALL
-SELECT * FROM tab1
+SELECT udf(k), udf(v) FROM tab1
 INTERSECT
-SELECT * FROM tab2
+SELECT udf(k), udf(udf(v)) FROM tab2
 -- !query 18 schema
-struct<k:int,v:int>
+struct<k:int,CAST(udf(cast(v as string)) AS INT):int>
 -- !query 18 output
 1	2
 2	3

How was this patch tested?

Tested as guided in SPARK-27921.

@SparkQA
Copy link

SparkQA commented Jul 11, 2019

Test build #107544 has finished for PR 25119 at commit 6a392b4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Copy link
Member

retest this please

udf(tab1.k) AS v
FROM tab1
JOIN tab2
ON CAST(udf(tab1.k) AS BIGINT) = CAST(udf(tab2.k) AS BIGINT));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could try udf(udf(tab1.k) = udf(tab2.k)) or udf(udf(tab1.k) = tab2.k)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@HyukjinKwon, did you really want this? This removes the join condition, and I get the following message:

Join condition is missing or trivial.
Either: use the CROSS JOIN syntax to allow cartesian products between these
relations, or: enable implicit cartesian products by setting the configuration
variable spark.sql.crossJoin.enabled=true;

Did you mean udf(udf(tab1.k)) = udf(tab2.k) or udf(udf(tab1.k)) = tab2.k.

@SparkQA
Copy link

SparkQA commented Jul 18, 2019

Test build #107814 has finished for PR 25119 at commit 6a392b4.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@imback82
Copy link
Contributor Author

@HyukjinKwon, I think I addressed all your comments. Please re-review this. Thanks!

@SparkQA
Copy link

SparkQA commented Jul 18, 2019

Test build #107830 has finished for PR 25119 at commit 42cf16b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Copy link
Member

LGTM

Merged to master.

@HyukjinKwon
Copy link
Member

Thanks for working on this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants