[SPARK-23672][PYTHON] Document support for nested return types in scalar with arrow udfs #20908
Conversation
cc @BryanCutler

Test build #88600 has finished for PR 20908 at commit
python/pyspark/sql/functions.py (Outdated)

      A scalar UDF defines a transformation: One or more `pandas.Series` -> A `pandas.Series`.
    - The returnType should be a primitive data type, e.g., :class:`DoubleType`.
    + The returnType should be a primitive data type, e.g., :class:`DoubleType` or
    + arrays of a primitive data type (e.g. :class:`ArrayType`).
It could now be more than just primitive types. I believe it's all of the pyspark.sql.types DataTypes except for MapType, StructType, BinaryType, and nested arrays (that needs to be checked).
Should (e.g. :class:`ArrayType`) be (e.g. :class:`ArrayType(DoubleType)`)?
Checked; nested arrays do not currently work. I'll add an explicit test to check that this fails, and when the test starts passing we can update the documentation.
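For context, a minimal sketch of the kind of UDF the updated docstring describes; the names and data here are illustrative, assuming a SparkSession `spark` with Arrow available (Spark 2.3/2.4-era API):

    from pyspark.sql.functions import pandas_udf, PandasUDFType
    from pyspark.sql.types import ArrayType, DoubleType

    # Scalar UDF returning an array of a primitive type: one pandas.Series
    # of doubles in, one pandas.Series of lists of doubles out.
    @pandas_udf(ArrayType(DoubleType()), PandasUDFType.SCALAR)
    def repeat_twice(s):
        return s.apply(lambda v: [v, v])

    df = spark.createDataFrame([(1.0,), (2.0,)], ["x"])
    df.select(repeat_twice("x")).show()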
python/pyspark/sql/tests.py

    def test_pandas_udf_tokenize(self):
        from pyspark.sql.functions import pandas_udf
        tokenize = pandas_udf(lambda s: s.apply(lambda str: str.split(' ')),
Instead of s.apply with a lambda, you could do `lambda s: s.str.split(' ')`; both do the same thing, just a little more compact.
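(Both spellings are equivalent on a plain pandas Series; a quick self-contained check, independent of Spark:)

    import pandas as pd

    s = pd.Series(["hi boo", "bye boo"])
    # apply with a lambda vs. the vectorized .str accessor
    assert s.apply(lambda v: v.split(' ')).tolist() == \
           s.str.split(' ').tolist() == [['hi', 'boo'], ['bye', 'boo']]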
Hm, I thought this PR targets clarifying array types with primitive types. Can we improve the test case here (https://github.com/holdenk/spark/blob/342d2228a5c68fd2c07bd8c1b518da6135ce1bf6/python/pyspark/sql/tests.py#L3998) and remove this test case?
I prefer to have a test case for a primitive array type as well.
I think tokenizing is a pretty common use case, so I think it's fine to have an explicit test for this.
I don't think this PR targets fixing or supporting tokenizing in a UDF ..
@HyukjinKwon It doesn't, but given that the old documentation implied that the tokenization use case wouldn't work, I thought it would be good to illustrate in a test that it does.
        not _have_pandas or not _have_pyarrow,
        _pandas_requirement_message or _pyarrow_requirement_message)
    class PandasUDFTests(ReusedSQLTestCase):
seems unrelated ..
It is, but it makes this fit with the style of the rest of the file.
python/pyspark/sql/tests.py (Outdated)

    self.assertEqual(udf.returnType, ArrayType(StringType()))
    df = self.spark.createDataFrame([("hi boo",), ("bye boo",)], ["vals"])
    result = df.select(tokenize("vals").alias("hi"))
    self.assertEqual([], result.collect())
Am I missing something? Is it equal to []?
@viirya good point -- I meant to update this.
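(For reference, the assertion as it ends up in the later diff, comparing against the actual tokenized rows rather than []:)

    self.assertEqual([Row(hi=[u'hi', u'boo']), Row(hi=[u'bye', u'boo'])],
                     result.collect())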
…/nested arrays and add a test expected to fail for nested arrays, and fix test for tokenizing.
Test build #88765 has finished for PR 20908 at commit
python/pyspark/sql/tests.py (Outdated)

    ArrayType(ArrayType(StringType())))
    result = df.select(tokenize("vals").alias("hi"))
    # If we start supporting nested arrays we should update the documentation in functions.py
    self.assertRaises(ArrowTypeError, result.collect())
Could you put this under with QuietTest(self.sc): to suppress the error?
Sure, sounds good.
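(A sketch of the suggested pattern; QuietTest is the helper in pyspark.sql.tests that silences executor log noise. Note that assertRaises expects a callable, so the collect should be passed unevaluated:)

    with QuietTest(self.sc):
        self.assertRaises(Exception, result.collect)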
Test build #88824 has finished for PR 20908 at commit
python/pyspark/sql/tests.py (Outdated)

    result = df.select(tokenize("vals").alias("hi"))
    self.assertEqual([Row(hi=[u'hi', u'boo']), Row(hi=[u'bye', u'boo'])], result.collect())

    def test_pandas_udf_nested_arrays_does_not_work(self):
Sorry @holdenk, I should have been clearer about ArrayType support. Nested arrays actually do work OK; it's primarily use with timestamps/dates that needs to be adjusted, plus a lack of actual testing to verify it. I'll update SPARK-21187 to reflect this.
I ran the test below and it does work; you just need to define df from above, and then the collected result is:
[Row(hi=[[u'hi', u'boo']]), Row(hi=[[u'bye', u'boo']])]
(Also, ArrowTypeError isn't defined; it should just be Exception. And assertRaises expects a callable, whereas result.collect() is already a call.)
If you could fix this up to test that nested arrays of types other than dates/timestamps work, that would be great!
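(Reconstructing that: a hedged sketch of the working nested-array version, assuming the df with a "vals" column from the earlier snippet:)

    from pyspark.sql.functions import pandas_udf
    from pyspark.sql.types import ArrayType, StringType

    # Wrap each token list in another list to produce array<array<string>>.
    tokenize = pandas_udf(lambda s: s.apply(lambda v: [v.split(' ')]),
                          ArrayType(ArrayType(StringType())))
    result = df.select(tokenize("vals").alias("hi"))
    # Per the comment above, collecting yields:
    # [Row(hi=[[u'hi', u'boo']]), Row(hi=[[u'bye', u'boo']])]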
Awesome, that makes more sense.
…turn-types-in-scalar-with-arrow-udfs
Test build #89452 has finished for PR 20908 at commit

Jenkins retest this please.

Test build #90235 has finished for PR 20908 at commit

Jenkins retest this please.

Test build #92976 has finished for PR 20908 at commit
…turn-types-in-scalar-with-arrow-udfs
Test build #94697 has finished for PR 20908 at commit

Test build #94698 has finished for PR 20908 at commit

Jenkins retest this please.

Test build #95850 has finished for PR 20908 at commit
LGTM. I ran the tests with pyarrow 0.10.0 just to verify there was no regression.
…lar with arrow udfs

## What changes were proposed in this pull request?
Clarify docstring for Scalar functions

## How was this patch tested?
Adds a unit test showing use similar to wordcount; there's an existing unit test for array of floats as well.

Closes #20908 from holdenk/SPARK-23672-document-support-for-nested-return-types-in-scalar-with-arrow-udfs.

Authored-by: Holden Karau <[email protected]>
Signed-off-by: Bryan Cutler <[email protected]>
(cherry picked from commit da5685b)
Signed-off-by: Bryan Cutler <[email protected]>
merged to master and branch-2.4, thanks @holdenk !
What changes were proposed in this pull request?
Clarify the docstring for scalar functions.
How was this patch tested?
Adds a unit test showing use similar to wordcount; there's an existing unit test for an array of floats as well.