[SPARK-19548][SQL] Support Hive UDFs which return typed Lists/Maps #16886

hvanhovell · 2017-02-10T11:26:55Z

What changes were proposed in this pull request?

This PR adds support for Hive UDFs that return fully typed Java Lists or Maps, for example List<String> or Map<String, Integer>. These structures can also be nested, for example Map<String, List<Integer>>. Raw collections and collections using wildcards are still not supported, and cannot be supported because they carry no element type information.
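To illustrate why typed collections can be supported while raw and wildcard ones cannot, here is a minimal sketch of what Java reflection exposes for each kind of return type. It is not the code in HiveInspectors.scala: the `Samples` interface, `describe`, and `describeReturn` are hypothetical names used only for this example, and the SQL-like type strings are an illustrative stand-in for Spark's DataType.

```java
import java.lang.reflect.ParameterizedType;
import java.lang.reflect.Type;
import java.lang.reflect.WildcardType;
import java.util.List;
import java.util.Map;

public class ReturnTypeSketch {
  // Hypothetical stand-ins for UDF evaluate() signatures; these are not Hive classes.
  interface Samples {
    List<String> typedList();
    Map<String, List<Integer>> nestedMap();
    @SuppressWarnings("rawtypes")
    List rawList();
    List<?> wildcardList();
  }

  // Renders a reflected type as a SQL-like name, or throws when type information is missing.
  static String describe(Type t) {
    if (t == String.class) return "string";
    if (t == Integer.class) return "int";
    if (t instanceof ParameterizedType) {
      ParameterizedType p = (ParameterizedType) t;
      Type[] args = p.getActualTypeArguments();
      if (p.getRawType() == List.class) {
        return "array<" + describe(args[0]) + ">";
      }
      if (p.getRawType() == Map.class) {
        return "map<" + describe(args[0]) + "," + describe(args[1]) + ">";
      }
    }
    // A wildcard argument such as the ? in List<?> names no concrete element type.
    if (t instanceof WildcardType) {
      throw new IllegalArgumentException("wildcard: " + t.getTypeName());
    }
    // A raw List or Map shows up as a plain Class with no type arguments at all.
    if (t == List.class || t == Map.class) {
      throw new IllegalArgumentException("raw collection: " + t.getTypeName());
    }
    throw new IllegalArgumentException("unsupported: " + t.getTypeName());
  }

  // Looks up a method on Samples by name and describes its generic return type.
  static String describeReturn(String methodName) {
    try {
      return describe(Samples.class.getMethod(methodName).getGenericReturnType());
    } catch (NoSuchMethodException e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println(describeReturn("typedList"));  // array<string>
    System.out.println(describeReturn("nestedMap"));  // map<string,array<int>>
  }
}
```

Calling `describeReturn("rawList")` or `describeReturn("wildcardList")` throws, because in both cases reflection has no element type to report.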

How was this patch tested?

Modified the existing tests in HiveUDFSuite and added test cases for raw collections and collections using wildcards.

hvanhovell · 2017-02-10T11:27:59Z

cc @cloud-fan @yhuai @maropu

SparkQA · 2017-02-10T11:29:31Z

Test build #72703 has finished for PR 16886 at commit 56cdabd.

This patch fails RAT tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-02-10T13:29:34Z

Test build #72706 has finished for PR 16886 at commit d84074e.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

maropu · 2017-02-10T14:02:53Z

Looks great to me, since Hive itself supports these return types for UDFs.

maropu · 2017-02-10T14:05:18Z

sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveInspectors.scala

      throw new AnalysisException(
-        "List type in java is unsupported because " +
-        "JVM type erasure makes spark fail to catch a component type in List<>")
+        "Raw list type in java is unsupported because Spark cannot infer the element type.")


Do we need this error handling? Isn't the "Unsupported java type interface java.util.List" error thrown by the last case enough?

It is quite likely that a user/developer will make a mistake for either a list or a map. I think these errors are more informative than the generic error, so I would like to retain them for the sake of user experience.

Nit: it would be better to place all the error-handling cases at the bottom.

maropu · 2017-02-10T14:10:43Z

sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveInspectors.scala

+    case _: WildcardType =>
+      throw new AnalysisException(
+        "Collection types with wildcards (e.g. List<?> or Map<?, ?>) are unsupported because " +
+          "Spark cannot infer the data type for these type parameters.")


I think this explicit error message for the special case is good for helping users understand the problem. Would it be better to add similar error handling for BoundedType, too?

BoundedType is a Mockito class, not a JVM class. A bounded type that cannot be translated to a DataType is caught by the final case in the match.
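As a rough illustration of the reply above: java.lang.reflect has no BoundedType, and bounds appear either on a WildcardType or on a TypeVariable. The sketch below shows what reflection reports for each. The `Samples` interface and `elementKind` are hypothetical names for this example only, and it does not reproduce the match in HiveInspectors.scala.

```java
import java.lang.reflect.ParameterizedType;
import java.lang.reflect.Type;
import java.lang.reflect.TypeVariable;
import java.lang.reflect.WildcardType;
import java.util.List;

public class BoundedTypeSketch {
  // Hypothetical signatures with bounded element types; these are not Hive classes.
  interface Samples {
    List<? extends Number> upperBounded();
    <T extends Number> List<T> typeVariable();
  }

  // Reports how reflection represents the element type of the named method's List return type.
  static String elementKind(String methodName) {
    try {
      ParameterizedType p =
          (ParameterizedType) Samples.class.getMethod(methodName).getGenericReturnType();
      Type arg = p.getActualTypeArguments()[0];
      if (arg instanceof WildcardType) {
        // "? extends Number" is a WildcardType that carries its upper bound.
        Type bound = ((WildcardType) arg).getUpperBounds()[0];
        return "WildcardType:" + bound.getTypeName();
      }
      if (arg instanceof TypeVariable<?>) {
        // "T extends Number" is a TypeVariable that carries its declared bound.
        Type bound = ((TypeVariable<?>) arg).getBounds()[0];
        return "TypeVariable:" + bound.getTypeName();
      }
      return "Other:" + arg.getTypeName();
    } catch (NoSuchMethodException e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println(elementKind("upperBounded"));  // WildcardType:java.lang.Number
    System.out.println(elementKind("typeVariable"));  // TypeVariable:java.lang.Number
  }
}
```

Under this reading, a bounded wildcard would hit the WildcardType case quoted above, while a type variable would presumably fall through to the final case, as the reply describes.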

SparkQA · 2017-02-10T16:42:15Z

Test build #72711 has finished for PR 16886 at commit e9f9be5.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2017-02-10T18:15:33Z

sql/hive/src/test/java/org/apache/spark/sql/hive/execution/UDFRawList.java

+/**
+ * UDF that returns a raw (non-parameterized) java List.
+ */
+public class UDFRawList extends UDF {


nit: in Spark, Java files should be indented with 2 spaces.

Ok, all files in that dir are indented with 4 spaces. I can modify those if you want me to.

It seems half of them are indented with 4 spaces. Yes, let's fix them together.

cloud-fan · 2017-02-10T18:16:46Z

LGTM

SparkQA · 2017-02-10T22:33:38Z

Test build #72719 has finished for PR 16886 at commit 8cf25b9.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2017-02-10T22:47:39Z

thanks, merging to master!

## What changes were proposed in this pull request?

This PR adds support for Hive UDFs that return fully typed java Lists or Maps, for example `List<String>` or `Map<String, Integer>`. It is also allowed to nest these structures, for example `Map<String, List<Integer>>`. Raw collections or collections using wildcards are still not supported, and cannot be supported due to the lack of type information.

## How was this patch tested?

Modified existing tests in `HiveUDFSuite`, and I have added test cases for raw collection and collection using wildcards.

Author: Herman van Hovell <[email protected]>

Closes apache#16886 from hvanhovell/SPARK-19548.

Support Hive UDFs which return typed Lists/Maps

56cdabd

Add license header.

d84074e

maropu reviewed Feb 10, 2017

hvanhovell added 2 commits February 10, 2017 15:56

Make Hive UDF support allow List/Map subclasses.

93628e6

Code Review

e9f9be5

cloud-fan reviewed Feb 10, 2017

Java Style.

8cf25b9

asfgit closed this in 226d388 Feb 10, 2017

4 participants