Conversation

@holdenk (Contributor) commented Mar 26, 2018:

What changes were proposed in this pull request?

Clarify docstring for Scalar functions

How was this patch tested?

Adds a unit test showing usage similar to wordcount; there's an existing unit test for arrays of floats as well.

@holdenk (Contributor, Author) commented Mar 26, 2018:

cc @BryanCutler

@SparkQA commented Mar 26, 2018:

Test build #88600 has finished for PR 20908 at commit 342d222.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

A scalar UDF defines a transformation: One or more `pandas.Series` -> A `pandas.Series`.
-The returnType should be a primitive data type, e.g., :class:`DoubleType`.
+The returnType should be a primitive data type, e.g., :class:`DoubleType` or
+arrays of a primitive data type (e.g. :class:`ArrayType`).
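The Series-to-Series transformation the docstring describes can be sketched with plain pandas (a minimal sketch assuming only pandas is installed; the `pandas_udf` wrapper, the Spark session, and the `tokenize_body` name are illustrative, not from the PR):

```python
import pandas as pd

# Body of a scalar pandas UDF: one pandas.Series in, one pandas.Series out.
# Returning a list per element corresponds to an ArrayType(StringType())
# returnType once wrapped with pandas_udf.
def tokenize_body(s: pd.Series) -> pd.Series:
    return s.str.split(' ')

result = tokenize_body(pd.Series(["hi boo", "bye boo"]))
```

Spark would apply this function batch-by-batch to Arrow-backed Series; the pure-pandas body is what the returnType in the docstring has to describe.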
Member:

It could now be more than just primitive types; I believe all of the types in pyspark.sql.types work except MapType, StructType, BinaryType, and nested arrays (that needs to be checked).

@icexelloss (Contributor) commented Mar 28, 2018:

Should (e.g. :class:`ArrayType`) be (e.g. :class:`ArrayType(DoubleType)`)?

@holdenk (Contributor, Author):

Checked; nested arrays do not currently work. I'll add an explicit test to check that this fails, and when the test starts passing we can update the documentation.


def test_pandas_udf_tokenize(self):
    from pyspark.sql.functions import pandas_udf
    tokenize = pandas_udf(lambda s: s.apply(lambda str: str.split(' ')),
Member:

Instead of `s.apply` with a lambda, you could do `lambda s: s.str.split(' ')`; both do the same, just a little more compact.
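Both spellings produce the same result on a string Series; a quick plain-pandas check (assuming pandas is available; the variable names are illustrative):

```python
import pandas as pd

s = pd.Series(["hi boo", "bye boo"])

# Per-element apply with a lambda, as written in the test under review.
via_apply = s.apply(lambda text: text.split(' '))
# The suggested string-accessor form; same output, a little more compact.
via_str = s.str.split(' ')

assert via_apply.tolist() == via_str.tolist()
```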

Member:

Hm, I thought this PR targeted clarifying array types with primitive types. Can we improve the test case at https://github.com/holdenk/spark/blob/342d2228a5c68fd2c07bd8c1b518da6135ce1bf6/python/pyspark/sql/tests.py#L3998 and remove this test case?

Contributor:

I prefer to have a test case for primitive array type as well.

Member:

I think tokenizing is a pretty common use case, so it's fine to have an explicit test for this.

Member:

I don't think this PR targets fixing or supporting tokenizing in a UDF.

@holdenk (Contributor, Author):

@HyukjinKwon It doesn't, but given that the old documentation implied the tokenization use case wouldn't work, I thought it would be good to illustrate in a test that it does.

not _have_pandas or not _have_pyarrow,
_pandas_requirement_message or _pyarrow_requirement_message)
class PandasUDFTests(ReusedSQLTestCase):

Member:

Seems unrelated.

@holdenk (Contributor, Author):

It is, but it makes this fit with the style of the rest of the file.

self.assertEqual(udf.returnType, ArrayType(StringType()))
df = self.spark.createDataFrame([("hi boo",), ("bye boo",)], ["vals"])
result = df.select(tokenize("vals").alias("hi"))
self.assertEqual([], result.collect())
Member:

Am I missing something? Is it equal to []?

@holdenk (Contributor, Author):

@viirya good point, I meant to update this.

…/nested arrays and add a test expected to fail for nested arrays, and fix test for tokenizing.
@holdenk holdenk changed the title [SPARK-23672][PYTHON] Document support for nested return types in scalar with arrow udfs [WIP][SPARK-23672][PYTHON] Document support for nested return types in scalar with arrow udfs Mar 30, 2018
@SparkQA commented Mar 30, 2018:

Test build #88765 has finished for PR 20908 at commit 88b65c5.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

ArrayType(ArrayType(StringType())))
result = df.select(tokenize("vals").alias("hi"))
# If we start supporting nested arrays we should update the documentation in functions.py
self.assertRaises(ArrowTypeError, result.collect())
Member:

Could you put this under `with QuietTest(self.sc):` to suppress the error?

@holdenk (Contributor, Author):

Sure, sounds good.

@SparkQA commented Apr 2, 2018:

Test build #88824 has finished for PR 20908 at commit 091a761.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

result = df.select(tokenize("vals").alias("hi"))
self.assertEqual([Row(hi=[u'hi', u'boo']), Row(hi=[u'bye', u'boo'])], result.collect())

def test_pandas_udf_nested_arrays_does_not_work(self):
@BryanCutler (Member) commented Apr 3, 2018:

Sorry @holdenk, I should have been clearer about ArrayType support. Nested arrays actually do work ok; the concern is primarily use with timestamps/dates, which need to be adjusted, plus a lack of actual testing to verify it. I'll update SPARK-21187 to reflect this.

I ran the test below and it does work, you just need to define df from above and then the collected result is:
[Row(hi=[[u'hi', u'boo']]), Row(hi=[[u'bye', u'boo']])]

(Also, ArrowTypeError isn't defined here; it should just be Exception. And assertRaises expects a callable, whereas result.collect() passes the already-evaluated result.)

If you could fix this up to test that nested arrays for types other than date/timestamps work, that would be great!
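The `assertRaises` point is worth spelling out: it needs the callable itself (or the context-manager form), not the result of calling it. A stdlib-only sketch (the `collect` stand-in is hypothetical, not Spark's):

```python
import unittest

def collect():
    # Hypothetical stand-in for result.collect(): raises when evaluated.
    raise ValueError("boom")

class CallableVsCalled(unittest.TestCase):
    def test_pass_the_callable(self):
        # Correct: assertRaises invokes the callable itself.
        self.assertRaises(ValueError, collect)

    def test_context_manager_form(self):
        # Equivalent: the exception is caught inside the with block.
        with self.assertRaises(ValueError):
            collect()

# By contrast, self.assertRaises(ValueError, collect()) evaluates collect()
# first, so the exception escapes before assertRaises can catch it.
```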

@holdenk (Contributor, Author):

Awesome, that makes more sense.

@SparkQA commented Apr 17, 2018:

Test build #89452 has finished for PR 20908 at commit f5aeafc.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@holdenk (Contributor, Author) commented May 4, 2018:

Jenkins retest this please.

@SparkQA commented May 5, 2018:

Test build #90235 has finished for PR 20908 at commit f5aeafc.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@holdenk (Contributor, Author) commented Jul 13, 2018:

Jenkins retest this please

@SparkQA commented Jul 13, 2018:

Test build #92976 has finished for PR 20908 at commit f5aeafc.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@holdenk holdenk changed the title [WIP][SPARK-23672][PYTHON] Document support for nested return types in scalar with arrow udfs [SPARK-23672][PYTHON] Document support for nested return types in scalar with arrow udfs Aug 13, 2018
@SparkQA commented Aug 13, 2018:

Test build #94697 has finished for PR 20908 at commit 7a42096.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Aug 13, 2018:

Test build #94698 has finished for PR 20908 at commit 03798e0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@holdenk (Contributor, Author) commented Sep 9, 2018:

Jenkins retest this please.

@SparkQA commented Sep 9, 2018:

Test build #95850 has finished for PR 20908 at commit 03798e0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@BryanCutler (Member) left a comment:

LGTM. I ran the tests with pyarrow 0.10.0 just to verify there was no regression.

asfgit pushed a commit that referenced this pull request Sep 10, 2018
…lar with arrow udfs

## What changes were proposed in this pull request?

Clarify docstring for Scalar functions

## How was this patch tested?

Adds a unit test showing usage similar to wordcount; there's an existing unit test for arrays of floats as well.

Closes #20908 from holdenk/SPARK-23672-document-support-for-nested-return-types-in-scalar-with-arrow-udfs.

Authored-by: Holden Karau <[email protected]>
Signed-off-by: Bryan Cutler <[email protected]>
(cherry picked from commit da5685b)
Signed-off-by: Bryan Cutler <[email protected]>
@asfgit asfgit closed this in da5685b Sep 10, 2018
@BryanCutler (Member):

merged to master and branch-2.4, thanks @holdenk !
