[SPARK-19165][PYTHON][SQL] PySpark APIs using columns as arguments should validate input types for column #19027

Closed

HyukjinKwon wants to merge 5 commits into apache:master from HyukjinKwon:SPARK-19165

Conversation

@HyukjinKwon
Member

What changes were proposed in this pull request?

While preparing to take over #16537, I realised a (I think) better approach: handle the exception in a single place.

This PR proposes to fix _to_java_column in pyspark.sql.column, which most functions in functions.py and some other APIs use. _to_java_column basically does not work with types other than pyspark.sql.column.Column or string (str and unicode).

If the input is not a Column, it calls _create_column_from_name, which calls functions.col within the JVM:

def col(colName: String): Column = Column(colName)

And col only has the String variant.
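
For context, a simplified sketch of that dispatch, mirroring the shape of the code in pyspark/sql/column.py (abbreviated, not the verbatim source):

# Simplified sketch of the pre-PR logic in pyspark/sql/column.py (not verbatim).
from pyspark import SparkContext
from pyspark.sql.column import Column

def _to_java_column(col):
    if isinstance(col, Column):
        return col._jc                      # already wraps a JVM Column
    return _create_column_from_name(col)    # everything else falls through here

def _create_column_from_name(name):
    sc = SparkContext._active_spark_context
    # Only a string matches functions.col(colName: String) on the JVM side;
    # any other type surfaces as a confusing Py4J error.
    return sc._jvm.functions.col(name)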

So, these should work:

>>> from pyspark.sql.column import _to_java_column, Column
>>> _to_java_column("a")
JavaObject id=o28
>>> _to_java_column(u"a")
JavaObject id=o29
>>> _to_java_column(spark.range(1).id)
JavaObject id=o33

whereas these do not:

>>> _to_java_column(1)
...
py4j.protocol.Py4JError: An error occurred while calling z:org.apache.spark.sql.functions.col. Trace:
py4j.Py4JException: Method col([class java.lang.Integer]) does not exist
    ...
>>> _to_java_column([])
...
py4j.protocol.Py4JError: An error occurred while calling z:org.apache.spark.sql.functions.col. Trace:
py4j.Py4JException: Method col([class java.util.ArrayList]) does not exist
    ...
>>> class A(): pass
>>> _to_java_column(A())
...
AttributeError: 'A' object has no attribute '_get_object_id'

This means that most functions using _to_java_column, such as udf or to_json, and some other APIs throw an exception as below:

>>> from pyspark.sql.functions import udf
>>> udf(lambda x: x)(None)
...
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.sql.functions.col.
: java.lang.NullPointerException
    ...
>>> from pyspark.sql.functions import to_json
>>> to_json(None)
...
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.sql.functions.col.
: java.lang.NullPointerException
    ...

After this PR:

>>> from pyspark.sql.functions import udf
>>> udf(lambda x: x)(None)
...
TypeError: Invalid argument, not a string or column: None of type <type 'NoneType'>. For column literals, use 'lit', 'array', 'struct' or 'create_map' functions.
>>> from pyspark.sql.functions import to_json
>>> to_json(None)
...
TypeError: Invalid argument, not a string or column: None of type <type 'NoneType'>. For column literals, use 'lit', 'array', 'struct' or 'create_map' functions.
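
In short, the fix whitelists Column and basestring in _to_java_column and fails fast otherwise. Roughly (a sketch of the change, abbreviated):

def _to_java_column(col):
    if isinstance(col, Column):
        jcol = col._jc
    elif isinstance(col, basestring):       # covers both str and unicode on Python 2
        jcol = _create_column_from_name(col)
    else:
        raise TypeError(
            "Invalid argument, not a string or column: "
            "{0} of type {1}. "
            "For column literals, use 'lit', 'array', 'struct' or "
            "'create_map' functions.".format(col, type(col)))
    return jcol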

How was this patch tested?

Unit tests added in python/pyspark/sql/tests.py and manual tests.
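
For example, the new validation can be exercised along these lines (a minimal sketch; the class and method names are illustrative, not necessarily those used in tests.py):

import unittest

from pyspark.sql import SparkSession
from pyspark.sql.functions import to_json

class ColumnValidationTests(unittest.TestCase):
    @classmethod
    def setUpClass(cls):
        cls.spark = SparkSession.builder.master("local[1]").getOrCreate()

    @classmethod
    def tearDownClass(cls):
        cls.spark.stop()

    def test_to_json_rejects_invalid_column_args(self):
        # None, numbers and lists are neither Column nor string, so the
        # validation should raise TypeError instead of an opaque Py4J error.
        for invalid in [None, 1, []]:
            with self.assertRaises(TypeError) as cm:
                to_json(invalid)
            self.assertIn("not a string or column", str(cm.exception))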

@HyukjinKwon
Member Author

cc @zero323, @rdblue, @nchammas, @holdenk, @ueshin and @felixcheung. Could you take a look, please? I think it is a small fix, but the benefit is quite large.

@SparkQA

SparkQA commented Aug 23, 2017

Test build #81030 has finished for PR 19027 at commit d14c2cc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@felixcheung
Member

felixcheung commented Aug 23, 2017 via email

@holdenk
Contributor

holdenk commented Aug 23, 2017

I like this approach @HyukjinKwon :D!

@HyukjinKwon
Member Author

Thanks @felixcheung and @holdenk. I just added a simple test with numpy.float.

@SparkQA

SparkQA commented Aug 23, 2017

Test build #81053 has finished for PR 19027 at commit 5e21a7e.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member Author

Oops, looks like I need to check whether numpy is available. Let me take that test out here, as I am only trying to whitelist basestring, if you don't mind. I tested it with numpy locally to address your concern, @felixcheung, and it looks fine.
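
For reference, guarding such a test on numpy availability would look roughly like this (a minimal sketch; the class and method names are illustrative):

import unittest

from pyspark.sql.column import _to_java_column

try:
    import numpy as np
    have_numpy = True
except ImportError:
    have_numpy = False

@unittest.skipIf(not have_numpy, "numpy is not installed")
class NumpyColumnArgTests(unittest.TestCase):
    def test_numpy_scalar_is_rejected(self):
        # A numpy scalar is neither a Column nor a string, so the
        # whitelisting raises TypeError before anything reaches Py4J.
        with self.assertRaises(TypeError):
            _to_java_column(np.float64(1.0))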

@SparkQA

SparkQA commented Aug 24, 2017

Test build #81056 has finished for PR 19027 at commit 4abaef7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@felixcheung
Member

felixcheung commented Aug 24, 2017 via email

@HyukjinKwon
Member Author

I will probably look through the problem, including hard dependencies and so on, in the near future. I took a quick look, but I think I need more time; that said, it looks like a valid point.

@ueshin
Member

ueshin commented Aug 24, 2017

LGTM.
Btw, I'm just curious why we need tests with numpy here.

@felixcheung
Member

It's not specific to this change, but it is fairly common for people to call numpy in a UDF and return its scalar type as-is. These scalars "look" like Python native types (numpy.float_ vs float).

That's the case reported in JIRA and what I've run into.
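
For example, a quick standalone illustration of how numpy scalars masquerade as native types:

import numpy as np

f64 = np.float64(1.0)
f32 = np.float32(1.0)
print(f64, f32)                # both render as 1.0, just like a native float
print(isinstance(f64, float))  # True: np.float64 actually subclasses float
print(isinstance(f32, float))  # False: np.float32 is only float-like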

@ueshin
Member

ueshin commented Aug 24, 2017

@felixcheung I'm sorry if I'm missing something, but it sounds like a different problem from this PR?

@HyukjinKwon
Member Author

That's fine, @ueshin and @felixcheung. Adding a few tests with numpy types might be a bit extra and (possibly) unrelated; on the other hand, it's easy to add a test, and this is a (possibly) common case users would try first. Of course, supporting numpy types properly should be orthogonal.

@HyukjinKwon
Member Author

Will merge this one, BTW. Sounds like we are fine.

@HyukjinKwon
Member Author

Merged to master.

@asfgit asfgit closed this in dc5d34d Aug 24, 2017
@felixcheung
Member

felixcheung commented Aug 24, 2017 via email

@HyukjinKwon HyukjinKwon deleted the SPARK-19165 branch January 2, 2018 03:37