[SPARK-19165][PYTHON][SQL] PySpark APIs using columns as arguments should validate input types for column #19027
HyukjinKwon wants to merge 5 commits into apache:master from
Conversation
Test build #81030 has finished for PR 19027 at commit
Cool, looks to me like a very reasonable fix.
Could we perhaps add a test for numpy.bool_ or numpy.float_ (that it should fail)?
I like this approach @HyukjinKwon :D!
Thanks @felixcheung and @holdenk. I just added a simple test with numpy.float.
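For reference, a minimal sketch of the kind of test being discussed (not the exact test that was added; it assumes numpy and PySpark are installed and that this PR's validation is in place):

```python
# Sketch only: checks that a numpy scalar, which is neither a Column nor a
# string, is rejected with a clear TypeError rather than an obscure Py4J error.
import numpy as np
from pyspark.sql import SparkSession, functions

spark = SparkSession.builder.master("local[1]").getOrCreate()
try:
    functions.to_json(np.float_(1.0))
except TypeError as e:
    print(e)  # expect a message saying the argument is not a string or column
finally:
    spark.stop()
```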
Test build #81053 has finished for PR 19027 at commit
Oops, looks like I need to check if numpy is available. Let me rather take this one out here, as I am trying to whitelist
This reverts commit 5e21a7e.
Test build #81056 has finished for PR 19027 at commit
I'm ok without the test since this is unlikely to break in the future. We do have tests that (optionally) depend on numpy (and Arrow) - seems like we should be able to take on dependencies more formally so we could test them properly?
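For reference, the usual pattern for optional test dependencies (my sketch, not a quote of `tests.py`) is to probe for the module and skip:

```python
# Sketch: skip numpy-dependent tests when numpy is not installed.
import unittest

try:
    import numpy  # noqa: F401
    _have_numpy = True
except ImportError:
    _have_numpy = False

@unittest.skipIf(not _have_numpy, "numpy not installed")
class NumpyColumnArgTests(unittest.TestCase):
    def test_numpy_scalar_is_rejected(self):
        pass  # placeholder for a test like the sketch above
```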
Will probably take a look through the problem in the near future, including hard dependencies etc. I took a quick look, but I think I need more time; still, yes, it apparently looks like a valid point.
LGTM.
It's not specific to it, but it is fairly common when people call numpy in a UDF and return its scalar type as-is. These scalars "look" like Python native types (numpy.float_ vs float). That's the case reported in the JIRA and what I've run into.
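For illustration (my example, not from the thread), numpy scalars print like Python natives but are not always instances of them:

```python
import numpy as np

x = np.float32(1.0)
print(x, type(x))            # 1.0 <class 'numpy.float32'>
print(isinstance(x, float))  # False: not a Python float
# numpy.float_ is an alias of numpy.float64, which does subclass Python float,
# but many other numpy scalars (e.g. numpy.float32, numpy.bool_) do not.
print(isinstance(np.bool_(True), bool))  # False
```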
@felixcheung I'm sorry if I'm missing something, but it sounds like it's a different problem from this PR?
That's fine, @ueshin and @felixcheung. Adding a few tests with
Will merge this one, BTW. Sounds like we are fine.
Merged to master. |
Sure - I think there are a number of different situations reported in the JIRA that could be separated into different fixes.
Let me know what I can help with!
What changes were proposed in this pull request?
While preparing to take over #16537, I realised a (I think) better approach: handle the exception in a single place.
This PR proposes to fix `_to_java_column` in `pyspark.sql.column`, which most of the functions in `functions.py` and some other APIs use. This `_to_java_column` basically does not work with types other than `pyspark.sql.column.Column` or string (`str` and `unicode`). If the input is not a `Column`, it calls `_create_column_from_name`, which calls `functions.col` within the JVM:
spark/sql/core/src/main/scala/org/apache/spark/sql/functions.scala (line 76 in 42b9eda)
And it looks like we only have the `String` variant for `col`. So, calls passing a `Column` or a string should work, whereas calls passing other types should not.
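A quick sketch (mine, assuming an active SparkSession `spark`; `length` is just one of the many `functions.py` APIs that go through `_to_java_column`):

```python
from pyspark.sql import functions

df = spark.range(1).selectExpr("cast(id as string) as id")

df.select(functions.length(df.id)).show()  # works: a Column
df.select(functions.length("id")).show()   # works: a column name as a string
df.select(functions.length(1)).show()      # fails: neither a string nor a Column
```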
Meaning most of the functions that use `_to_java_column`, such as `udf` or `to_json`, and some other APIs throw an obscure error for such inputs. After this PR, they raise a `TypeError` with a clear message instead:
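The change is along these lines (a sketch of the validation, not necessarily the exact merged code; `Column` and `_create_column_from_name` are the existing helpers in `pyspark/sql/column.py`):

```python
def _to_java_column(col):
    if isinstance(col, Column):
        jcol = col._jc
    elif isinstance(col, basestring):  # str and unicode on Python 2
        jcol = _create_column_from_name(col)
    else:
        # Fail fast with a clear message instead of an obscure Py4J error.
        raise TypeError(
            "Invalid argument, not a string or column: "
            "{0} of type {1}. "
            "For column literals, use 'lit', 'array', 'struct' or "
            "'create_map' function.".format(col, type(col)))
    return jcol
```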
How was this patch tested?
Unit tests added in `python/pyspark/sql/tests.py` and manual tests.