[SPARK-23672][PYTHON] Document support for nested return types in scalar with arrow udfs #20908
Conversation
cc @BryanCutler

Test build #88600 has finished for PR 20908 at commit
python/pyspark/sql/functions.py (Outdated)

      A scalar UDF defines a transformation: One or more `pandas.Series` -> A `pandas.Series`.
    - The returnType should be a primitive data type, e.g., :class:`DoubleType`.
    + The returnType should be a primitive data type, e.g., :class:`DoubleType` or
    + arrays of a primitive data type (e.g. :class:`ArrayType`).
It could now be more than just primitive types. I believe it's all of the pyspark.sql.types DataTypes except for MapType, StructType, BinaryType, and nested arrays (that needs to be checked).
Should (e.g. :class:`ArrayType`) be (e.g. :class:`ArrayType(DoubleType)`)?
Checked; nested arrays do not currently work. I'll add an explicit test to check that this fails, and when the test starts passing we can update the documentation.
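For context, a minimal sketch of the kind of UDF the updated docstring describes; the names and data here are illustrative, assuming a SparkSession `spark` with Arrow available (Spark 2.3/2.4-era API):

    from pyspark.sql.functions import pandas_udf, PandasUDFType
    from pyspark.sql.types import ArrayType, DoubleType

    # Scalar UDF returning an array of a primitive type: one pandas.Series
    # of doubles in, one pandas.Series of lists of doubles out.
    @pandas_udf(ArrayType(DoubleType()), PandasUDFType.SCALAR)
    def repeat_twice(s):
        return s.apply(lambda v: [v, v])

    df = spark.createDataFrame([(1.0,), (2.0,)], ["x"])
    df.select(repeat_twice("x")).show()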
python/pyspark/sql/tests.py

    def test_pandas_udf_tokenize(self):
        from pyspark.sql.functions import pandas_udf
        tokenize = pandas_udf(lambda s: s.apply(lambda str: str.split(' ')),
Instead of s.apply with a lambda, you could do `lambda s: s.str.split(' ')`; both do the same thing, just a little more compact.
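(Both spellings are equivalent on a plain pandas Series; a quick self-contained check, independent of Spark:)

    import pandas as pd

    s = pd.Series(["hi boo", "bye boo"])
    # apply with a lambda vs. the vectorized .str accessor
    assert s.apply(lambda v: v.split(' ')).tolist() == \
           s.str.split(' ').tolist() == [['hi', 'boo'], ['bye', 'boo']]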
Hm, I thought this PR targets clarifying array types with primitive types. Can we improve the test case here (https://github.com/holdenk/spark/blob/342d2228a5c68fd2c07bd8c1b518da6135ce1bf6/python/pyspark/sql/tests.py#L3998) and remove this test case?
I prefer to have a test case for a primitive array type as well.
I think tokenizing is a pretty common use case, so I think it's fine to have an explicit test for this.
I don't think this PR targets fixing or supporting tokenizing in a UDF ..
@HyukjinKwon It doesn't, but given that the old documentation implied that the tokenization use case wouldn't work, I thought it would be good to illustrate in a test that it does.
        not _have_pandas or not _have_pyarrow,
        _pandas_requirement_message or _pyarrow_requirement_message)
    class PandasUDFTests(ReusedSQLTestCase):
seems unrelated ..
It is, but it makes this fit with the style of the rest of the file.
python/pyspark/sql/tests.py (Outdated)

    self.assertEqual(udf.returnType, ArrayType(StringType()))
    df = self.spark.createDataFrame([("hi boo",), ("bye boo",)], ["vals"])
    result = df.select(tokenize("vals").alias("hi"))
    self.assertEqual([], result.collect())
Am I missing something? Is it equal to []?
@viirya good point -- I meant to update this.
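(For reference, the assertion as it ends up in the later diff, comparing against the actual tokenized rows rather than []:)

    self.assertEqual([Row(hi=[u'hi', u'boo']), Row(hi=[u'bye', u'boo'])],
                     result.collect())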
…/nested arrays and add a test expected to fail for nested arrays, and fix test for tokenizing.
Test build #88765 has finished for PR 20908 at commit
python/pyspark/sql/tests.py (Outdated)

    ArrayType(ArrayType(StringType())))
    result = df.select(tokenize("vals").alias("hi"))
    # If we start supporting nested arrays we should update the documentation in functions.py
    self.assertRaises(ArrowTypeError, result.collect())
Could you put this under with QuietTest(self.sc): to suppress the error?
Sure, sounds good.
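(A sketch of the suggested pattern; QuietTest is the helper in pyspark.sql.tests that silences executor log noise. Note that assertRaises expects a callable, so the collect should be passed unevaluated:)

    with QuietTest(self.sc):
        self.assertRaises(Exception, result.collect)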
Test build #88824 has finished for PR 20908 at commit
python/pyspark/sql/tests.py (Outdated)

    result = df.select(tokenize("vals").alias("hi"))
    self.assertEqual([Row(hi=[u'hi', u'boo']), Row(hi=[u'bye', u'boo'])], result.collect())

    def test_pandas_udf_nested_arrays_does_not_work(self):
Sorry @holdenk, I should have been clearer about ArrayType support. Nested arrays actually do work OK; it's primarily use with timestamps/dates that needs to be adjusted, plus a lack of actual testing to verify it. I'll update SPARK-21187 to reflect this.
I ran the test below and it does work; you just need to define df from above, and then the collected result is:
[Row(hi=[[u'hi', u'boo']]), Row(hi=[[u'bye', u'boo']])]
(Also, ArrowTypeError isn't defined; it should just be Exception. And assertRaises expects a callable, whereas result.collect() is already a call.)
If you could fix this up to test that nested arrays of types other than dates/timestamps work, that would be great!
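(Reconstructing that: a hedged sketch of the working nested-array version, assuming the df with a "vals" column from the earlier snippet:)

    from pyspark.sql.functions import pandas_udf
    from pyspark.sql.types import ArrayType, StringType

    # Wrap each token list in another list to produce array<array<string>>.
    tokenize = pandas_udf(lambda s: s.apply(lambda v: [v.split(' ')]),
                          ArrayType(ArrayType(StringType())))
    result = df.select(tokenize("vals").alias("hi"))
    # Per the comment above, collecting yields:
    # [Row(hi=[[u'hi', u'boo']]), Row(hi=[[u'bye', u'boo']])]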
Awesome, that makes more sense.
…turn-types-in-scalar-with-arrow-udfs
Test build #89452 has finished for PR 20908 at commit

Jenkins retest this please.

Test build #90235 has finished for PR 20908 at commit

Jenkins retest this please.

Test build #92976 has finished for PR 20908 at commit
…turn-types-in-scalar-with-arrow-udfs
Test build #94697 has finished for PR 20908 at commit

Test build #94698 has finished for PR 20908 at commit

Jenkins retest this please.

Test build #95850 has finished for PR 20908 at commit
LGTM. I ran the tests with pyarrow 0.10.0 just to verify there was no regression.
…lar with arrow udfs

## What changes were proposed in this pull request?
Clarify docstring for Scalar functions

## How was this patch tested?
Adds a unit test showing use similar to wordcount; there's an existing unit test for array of floats as well.

Closes #20908 from holdenk/SPARK-23672-document-support-for-nested-return-types-in-scalar-with-arrow-udfs.

Authored-by: Holden Karau <[email protected]>
Signed-off-by: Bryan Cutler <[email protected]>
(cherry picked from commit da5685b)
Signed-off-by: Bryan Cutler <[email protected]>
merged to master and branch-2.4, thanks @holdenk !
What changes were proposed in this pull request?
Clarify the docstring for scalar functions.
How was this patch tested?
Adds a unit test showing use similar to wordcount; there's an existing unit test for an array of floats as well.