Skip to content

Conversation

@hvanhovell
Copy link
Contributor

What changes were proposed in this pull request?

This PR adds support for Hive UDFs that return fully typed java Lists or Maps, for example List<String> or Map<String, Integer>. It is also allowed to nest these structures, for example Map<String, List<Integer>>. Raw collections or collections using wildcards are still not supported, and cannot be supported due to the lack of type information.

How was this patch tested?

Modified existing tests in HiveUDFSuite, and I have added test cases for raw collection and collection using wildcards.

@hvanhovell
Copy link
Contributor Author

cc @cloud-fan @yhuai @maropu

@SparkQA
Copy link

SparkQA commented Feb 10, 2017

Test build #72703 has finished for PR 16886 at commit 56cdabd.

  • This patch fails RAT tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Feb 10, 2017

Test build #72706 has finished for PR 16886 at commit d84074e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Copy link
Member

maropu commented Feb 10, 2017

Looks great to me because Hive actually supports these types for UDF.

throw new AnalysisException(
"List type in java is unsupported because " +
"JVM type erasure makes spark fail to catch a component type in List<>")
"Raw list type in java is unsupported because Spark cannot infer the element type.")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need this error handling? "Unsupported java type interface java.util.List" thrown in the bottom entry is not enough?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is quite likely that a user/developer will make a mistake for either a list or a map. I think these errors are more informative than the generic error, so I would like to retain them for the sake of user experience.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, ok

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: All the entries for error handling would be better to be placed in the bottom

case _: WildcardType =>
throw new AnalysisException(
"Collection types with wildcards (e.g. List<?> or Map<?, ?>) are unsupported because " +
"Spark cannot infer the data type for these type parameters.")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this explicit error message for the special case seems good to make users understood. So, would it be better to need an additional error handling for BoundedType, too?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BoundedType is a mockito class and not a JVM class. A bound type that cannot be translated to a DataType is caught by the final case in the match.

@SparkQA
Copy link

SparkQA commented Feb 10, 2017

Test build #72711 has finished for PR 16886 at commit e9f9be5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

/**
* UDF that returns a raw (non-parameterized) java List.
*/
public class UDFRawList extends UDF {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: in Spark java files should be indented with 2 spaces.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, all files in that dir are indented with 4 spaces. I can modify those if you want me to.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems half of them indented with 4 spaces, yes let's fix them together.

@cloud-fan
Copy link
Contributor

LGTM

@SparkQA
Copy link

SparkQA commented Feb 10, 2017

Test build #72719 has finished for PR 16886 at commit 8cf25b9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor

thanks, merging to master!

@asfgit asfgit closed this in 226d388 Feb 10, 2017
cmonkey pushed a commit to cmonkey/spark that referenced this pull request Feb 15, 2017
## What changes were proposed in this pull request?
This PR adds support for Hive UDFs that return fully typed java Lists or Maps, for example `List<String>` or `Map<String, Integer>`.  It is also allowed to nest these structures, for example `Map<String, List<Integer>>`. Raw collections or collections using wildcards are still not supported, and cannot be supported due to the lack of type information.

## How was this patch tested?
Modified existing tests in `HiveUDFSuite`, and I have added test cases for raw collection and collection using wildcards.

Author: Herman van Hovell <[email protected]>

Closes apache#16886 from hvanhovell/SPARK-19548.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants