[SPARK-23672][PYTHON] Document support for nested return types in scalar with arrow udfs #20908
Changes from all commits: da8dbaf, 342d222, 88b65c5, 091a761, 729abc9, f5aeafc, 7a42096, 03798e0
```
@@ -4358,6 +4358,7 @@ def test_timestamp_dst(self):
    not _have_pandas or not _have_pyarrow,
    _pandas_requirement_message or _pyarrow_requirement_message)
class PandasUDFTests(ReusedSQLTestCase):

    def test_pandas_udf_basic(self):
        from pyspark.rdd import PythonEvalType
        from pyspark.sql.functions import pandas_udf, PandasUDFType

@@ -4573,6 +4574,24 @@ def random_udf(v):
        random_udf = random_udf.asNondeterministic()
        return random_udf

    def test_pandas_udf_tokenize(self):
        from pyspark.sql.functions import pandas_udf
        tokenize = pandas_udf(lambda s: s.apply(lambda str: str.split(' ')),
```
Member: Instead of s.apply with a lambda, you could do
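The suggestion above is cut off in the page. One plausible completion (an assumption, not the reviewer's recorded words) is pandas' vectorized `.str` accessor, which produces the same token lists without a per-element Python lambda:

```python
import pandas as pd

s = pd.Series(["hi boo", "bye boo"])

# Per-element lambda, as written in the test:
tokens_apply = s.apply(lambda x: x.split(' '))

# Vectorized alternative via the .str accessor (a guess at the
# reviewer's intent, not taken from the thread):
tokens_str = s.str.split(' ')

print(tokens_apply.tolist())  # [['hi', 'boo'], ['bye', 'boo']]
print(tokens_str.tolist())    # same result
```

Both forms yield one list of tokens per input row, which is the shape a scalar pandas UDF with an ArrayType(StringType()) return type expects.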
Member: Hm, I thought this PR targets clarifying array types with primitive types. Can we improve the test case at https://github.com/holdenk/spark/blob/342d2228a5c68fd2c07bd8c1b518da6135ce1bf6/python/pyspark/sql/tests.py#L3998 and remove this test case?
Contributor: I prefer to have a test case for the primitive array type as well.
Member: I think tokenizing is a pretty common use case, so it's fine to have an explicit test for it.
Member: I don't think this PR targets fixing or supporting tokenizing in a UDF.
Contributor (Author): @HyukjinKwon It doesn't, but given that the old documentation implied the tokenization use case wouldn't work, I thought it would be good to illustrate in a test that it does.
```
                              ArrayType(StringType()))
        self.assertEqual(tokenize.returnType, ArrayType(StringType()))
        df = self.spark.createDataFrame([("hi boo",), ("bye boo",)], ["vals"])
        result = df.select(tokenize("vals").alias("hi"))
        self.assertEqual([Row(hi=[u'hi', u'boo']), Row(hi=[u'bye', u'boo'])], result.collect())

    def test_pandas_udf_nested_arrays(self):
        from pyspark.sql.functions import pandas_udf
        tokenize = pandas_udf(lambda s: s.apply(lambda str: [str.split(' ')]),
                              ArrayType(ArrayType(StringType())))
        self.assertEqual(tokenize.returnType, ArrayType(ArrayType(StringType())))
        df = self.spark.createDataFrame([("hi boo",), ("bye boo",)], ["vals"])
        result = df.select(tokenize("vals").alias("hi"))
        self.assertEqual([Row(hi=[[u'hi', u'boo']]), Row(hi=[[u'bye', u'boo']])], result.collect())

    def test_vectorized_udf_basic(self):
        from pyspark.sql.functions import pandas_udf, col, array
        df = self.spark.range(10).select(
```
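Taken out of Spark, the Series-to-Series function that test_pandas_udf_nested_arrays wraps in pandas_udf can be exercised with plain pandas (a minimal sketch, assuming only that pandas is installed); each input string maps to a one-element list of token lists, which is the Python shape of ArrayType(ArrayType(StringType())):

```python
import pandas as pd

# The UDF body from test_pandas_udf_nested_arrays, applied directly to a
# pandas Series: each string becomes a one-element list holding its tokens.
def nested_tokenize(s):
    return s.apply(lambda x: [x.split(' ')])

vals = pd.Series(["hi boo", "bye boo"])
print(nested_tokenize(vals).tolist())  # [[['hi', 'boo']], [['bye', 'boo']]]
```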
Comment: Seems unrelated.
Reply (Author): It is, but it makes this fit the style of the rest of the file.