[SPARK-19165][PYTHON][SQL] PySpark APIs using columns as arguments should validate input types for column #19027

Closed

HyukjinKwon wants to merge 5 commits into apache:master from HyukjinKwon:SPARK-19165

Conversation

@HyukjinKwon
Member

What changes were proposed in this pull request?

While preparing to take over #16537, I realised a (I think) better approach: handle the exception in a single place.

This PR proposes to fix _to_java_column in pyspark.sql.column, which most functions in functions.py and some other APIs use. _to_java_column basically does not work with types other than pyspark.sql.column.Column or string (str and unicode).

If the input is not a Column, it calls _create_column_from_name, which calls functions.col within the JVM:

def col(colName: String): Column = Column(colName)

And col only has the String variant.
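
For context, a simplified sketch of that dispatch, mirroring the shape of the code in pyspark/sql/column.py (abbreviated, not the verbatim source):

# Simplified sketch of the pre-PR logic in pyspark/sql/column.py (not verbatim).
from pyspark import SparkContext
from pyspark.sql.column import Column

def _to_java_column(col):
    if isinstance(col, Column):
        return col._jc                      # already wraps a JVM Column
    return _create_column_from_name(col)    # everything else falls through here

def _create_column_from_name(name):
    sc = SparkContext._active_spark_context
    # Only a string matches functions.col(colName: String) on the JVM side;
    # any other type surfaces as a confusing Py4J error.
    return sc._jvm.functions.col(name)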

So, these should work:

>>> from pyspark.sql.column import _to_java_column, Column
>>> _to_java_column("a")
JavaObject id=o28
>>> _to_java_column(u"a")
JavaObject id=o29
>>> _to_java_column(spark.range(1).id)
JavaObject id=o33

whereas these do not:

>>> _to_java_column(1)
...
py4j.protocol.Py4JError: An error occurred while calling z:org.apache.spark.sql.functions.col. Trace:
py4j.Py4JException: Method col([class java.lang.Integer]) does not exist
    ...
>>> _to_java_column([])
...
py4j.protocol.Py4JError: An error occurred while calling z:org.apache.spark.sql.functions.col. Trace:
py4j.Py4JException: Method col([class java.util.ArrayList]) does not exist
    ...
>>> class A(): pass
>>> _to_java_column(A())
...
AttributeError: 'A' object has no attribute '_get_object_id'

This means that most functions using _to_java_column, such as udf or to_json, and some other APIs throw an exception as below:

>>> from pyspark.sql.functions import udf
>>> udf(lambda x: x)(None)
...
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.sql.functions.col.
: java.lang.NullPointerException
    ...
>>> from pyspark.sql.functions import to_json
>>> to_json(None)
...
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.sql.functions.col.
: java.lang.NullPointerException
    ...

After this PR:

>>> from pyspark.sql.functions import udf
>>> udf(lambda x: x)(None)
...
TypeError: Invalid argument, not a string or column: None of type <type 'NoneType'>. For column literals, use 'lit', 'array', 'struct' or 'create_map' functions.
>>> from pyspark.sql.functions import to_json
>>> to_json(None)
...
TypeError: Invalid argument, not a string or column: None of type <type 'NoneType'>. For column literals, use 'lit', 'array', 'struct' or 'create_map' functions.
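
In short, the fix whitelists Column and basestring in _to_java_column and fails fast otherwise. Roughly (a sketch of the change, abbreviated):

def _to_java_column(col):
    if isinstance(col, Column):
        jcol = col._jc
    elif isinstance(col, basestring):       # covers both str and unicode on Python 2
        jcol = _create_column_from_name(col)
    else:
        raise TypeError(
            "Invalid argument, not a string or column: "
            "{0} of type {1}. "
            "For column literals, use 'lit', 'array', 'struct' or "
            "'create_map' functions.".format(col, type(col)))
    return jcol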

How was this patch tested?

Unit tests added in python/pyspark/sql/tests.py and manual tests.
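
For example, the new validation can be exercised along these lines (a minimal sketch; the class and method names are illustrative, not necessarily those used in tests.py):

import unittest

from pyspark.sql import SparkSession
from pyspark.sql.functions import to_json

class ColumnValidationTests(unittest.TestCase):
    @classmethod
    def setUpClass(cls):
        cls.spark = SparkSession.builder.master("local[1]").getOrCreate()

    @classmethod
    def tearDownClass(cls):
        cls.spark.stop()

    def test_to_json_rejects_invalid_column_args(self):
        # None, numbers and lists are neither Column nor string, so the
        # validation should raise TypeError instead of an opaque Py4J error.
        for invalid in [None, 1, []]:
            with self.assertRaises(TypeError) as cm:
                to_json(invalid)
            self.assertIn("not a string or column", str(cm.exception))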

@HyukjinKwon
Member Author

cc @zero323, @rdblue, @nchammas, @holdenk, @ueshin and @felixcheung. Could you take a look, please? I think it is a small fix, but the benefit is quite large.

@SparkQA

SparkQA commented Aug 23, 2017

Test build #81030 has finished for PR 19027 at commit d14c2cc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@felixcheung
Member

felixcheung commented Aug 23, 2017 via email

@holdenk
Contributor

holdenk commented Aug 23, 2017

I like this approach @HyukjinKwon :D!

@HyukjinKwon
Member Author

Thanks @felixcheung and @holdenk. I just added a simple test with numpy.float.

@SparkQA

SparkQA commented Aug 23, 2017

Test build #81053 has finished for PR 19027 at commit 5e21a7e.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member Author

Oops, looks like I need to check whether numpy is available. Let me take that test out here, as I am only trying to whitelist basestring, if you don't mind. I tested it with numpy locally to address your concern, @felixcheung, and it looks fine.
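
For reference, guarding such a test on numpy availability would look roughly like this (a minimal sketch; the class and method names are illustrative):

import unittest

from pyspark.sql.column import _to_java_column

try:
    import numpy as np
    have_numpy = True
except ImportError:
    have_numpy = False

@unittest.skipIf(not have_numpy, "numpy is not installed")
class NumpyColumnArgTests(unittest.TestCase):
    def test_numpy_scalar_is_rejected(self):
        # A numpy scalar is neither a Column nor a string, so the
        # whitelisting raises TypeError before anything reaches Py4J.
        with self.assertRaises(TypeError):
            _to_java_column(np.float64(1.0))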

@SparkQA

SparkQA commented Aug 24, 2017

Test build #81056 has finished for PR 19027 at commit 4abaef7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@felixcheung
Member

felixcheung commented Aug 24, 2017 via email

@HyukjinKwon
Member Author

I will probably look through the problem, including hard dependencies and so on, in the near future. I took a quick look, but I think I need more time; that said, it looks like a valid point.

@ueshin
Member

ueshin commented Aug 24, 2017

LGTM.
Btw, I'm just curious why we need tests with numpy here.

@felixcheung
Member

It's not specific to this change, but it is fairly common for people to call numpy in a UDF and return its scalar type as-is. These scalars "look" like Python native types (numpy.float_ vs float).

That's the case reported in JIRA and what I've run into.
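
For example, a quick standalone illustration of how numpy scalars masquerade as native types:

import numpy as np

f64 = np.float64(1.0)
f32 = np.float32(1.0)
print(f64, f32)                # both render as 1.0, just like a native float
print(isinstance(f64, float))  # True: np.float64 actually subclasses float
print(isinstance(f32, float))  # False: np.float32 is only float-like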

@ueshin
Member

ueshin commented Aug 24, 2017

@felixcheung I'm sorry if I'm missing something, but it sounds like a different problem from this PR?

@HyukjinKwon
Member Author

That's fine, @ueshin and @felixcheung. Adding a few tests with numpy types might be a bit extra and (possibly) unrelated; on the other hand, it's easy to add a test, and this is a (possibly) common case users would try first. Of course, supporting numpy types properly should be orthogonal.

@HyukjinKwon
Member Author

Will merge this one, BTW. Sounds like we are fine.

@HyukjinKwon
Member Author

Merged to master.

@asfgit asfgit closed this in dc5d34d Aug 24, 2017
@felixcheung
Member

felixcheung commented Aug 24, 2017 via email

@HyukjinKwon HyukjinKwon deleted the SPARK-19165 branch January 2, 2018 03:37