[WIP][SPARK-21190][SQL][PYTHON] Vectorized UDFs in Python #19147
Conversation
Test build #81452 has finished for PR 19147 at commit
```scala
private val accumulator = funcs.head.funcs.head.accumulator

// todo: return column batch?
def compute(
```
This class duplicates quite a bit of logic from PythonRDD. I think the only difference is how they serialize/deserialize data (non-Arrow vs. Arrow). @ueshin @BryanCutler what are your thoughts on refactoring this and PythonRDD?
I agree with you that we should refactor the PythonRunners, but I'm not sure what you mean by refactoring PythonRDD. I think it's already simple enough.
Yes, I meant PythonRunner in PythonRDD.scala.
Yes, there is a lot of duplicated code from PythonRunner that could be refactored. I'm guessing you did not use the existing code because of the Arrow stream format? While I would love to start using that in Spark, I think it would be better to do so at a later time, when the required code could be refactored and the Arrow stream format could replace our current uses of the file format.
Also, the good part about using the iterator-based file format is that each iteration allows Python to communicate back an error code and exit gracefully. In my own tests with the streaming format, if an error occurred after the stream had started, Spark could lock up in a waiting state. These are the reasons I did not use the streaming format in my implementation. Would this VectorizedPythonRunner be able to handle these types of errors?
@icexelloss Ah, I see, thanks! I still agree with refactoring PythonRunner.
@BryanCutler As for the error, do you mean a case like test_vectorized_udf_exception? If not, could you please let me know which case you mean so I can think about it?
I was referring to the protocol between Scala and Python that is changed here and could act differently under some circumstances. Here is the behavior of the PythonRunner protocol versus the VectorizedPythonRunner protocol that you introduce here:
PythonRunner
Data blocks are framed by a special length integer. Scala reads each data block one at a time and checks the length code. If the code indicates a PythonException, the error is read from Python and a SparkException is thrown with it as the cause.
VectorizedPythonRunner
A data stream is opened in Scala with ArrowStreamReader, and batches are transferred until ArrowStreamReader returns false, indicating there is no more data. Only at this point are the special length codes checked to handle an error from Python.
This behavior change would probably only cause problems when things are not working normally. For example, what would happen if pyarrow were not installed on an executor? With PythonRunner, the ImportError would cause a PythonException to be transferred and thrown in Scala. In VectorizedPythonRunner, I believe the ArrowStreamReader would try to read the special length code and then fail somewhere internal to Arrow, never surfacing the ImportError.
My point was that this type of behavior change should probably be implemented in a separate JIRA where we could make sure to handle all of these cases.
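To make the framing difference concrete, here is a minimal Python sketch of the per-block protocol described above, from the worker's side. The constant values and the helpers `write_block`, `run_udf_loop`, and `read_blocks` are hypothetical illustrations, not the actual code in PySpark's worker or serializers:

```python
import struct
import traceback

# Hypothetical special length codes for illustration; the real values live in
# PythonRDD.scala and pyspark/serializers.py.
END_OF_DATA_SECTION = -1
PYTHON_EXCEPTION_THROWN = -2

def write_block(outfile, payload):
    # Each data block is framed by its length, so the Scala side can inspect
    # the length code before reading the block itself.
    outfile.write(struct.pack("!i", len(payload)))
    outfile.write(payload)

def read_blocks(infile):
    # Hypothetical reader: yields length-prefixed blocks until the
    # end-of-data code appears.
    while True:
        (length,) = struct.unpack("!i", infile.read(4))
        if length == END_OF_DATA_SECTION:
            return
        yield infile.read(length)

def run_udf_loop(infile, outfile, eval_udf):
    try:
        for block in read_blocks(infile):
            write_block(outfile, eval_udf(block))
        outfile.write(struct.pack("!i", END_OF_DATA_SECTION))
    except Exception:
        # On any error (e.g. an ImportError for pyarrow), the special code is
        # written first, so Scala can raise a SparkException carrying this
        # traceback instead of failing somewhere inside a stream reader.
        outfile.write(struct.pack("!i", PYTHON_EXCEPTION_THROWN))
        write_block(outfile, traceback.format_exc().encode("utf-8"))
```

With an always-on stream reader, by contrast, there is no per-block point at which the error code can be noticed, which is the failure mode described above.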
```scala
private def loadNextBatch(): Boolean = {
  batchLoaded = reader.loadNextBatch()
  if (batchLoaded) {
    val batch = new ColumnarBatch(schema, vectors, root.getRowCount)
```
A side note: how is the performance of ColumnarBatch in terms of converting to an Iterator[InternalRow]? As far as I remember, it doesn't return unsafe rows at the moment, right?
That's right, it doesn't return unsafe rows.
But I believe it's performant enough, because it can return values in a row directly from the column vectors without copying them into unsafe rows.
Interesting, does this mean it's more performant than copying the column vectors into unsafe rows? To get any value out, it has to access the memory region of the column vectors anyway, so copying bytes from column vectors into unsafe rows doesn't improve performance?
Why would copying bytes from column vectors into unsafe rows improve performance? Isn't accessing the data directly from the column vectors faster, since it avoids the cost of copying bytes?
Test build #81453 has finished for PR 19147 at commit
Relaying my questions from the dev@ thread: Would it be correct to assume there will be a data type check, for example that the returned pandas data frame column data types match what is specified? We have seen quite a few issues/confusions with that in R. Would it make sense to have a more generic decorator name so that it could also be usable for other efficient vectorized formats in the future? Or do we anticipate the decorator to be format-specific, with more of them in the future?
@felixcheung Thank you for your comment. As for the decorator name, we could have a more generic one, but we would still need the format name or something similar to know which format users want, and some formats might need additional options. IMO, it's OK to have format-specific decorators so that users understand each format's spec.
Test build #81516 has finished for PR 19147 at commit
Test build #81519 has finished for PR 19147 at commit
The test failure above should be fixed by #19158.
Jenkins, retest this please.
Test build #81538 has finished for PR 19147 at commit
retest this please
Test build #81545 has finished for PR 19147 at commit
```python
with self.assertRaisesRegexp(
        Exception,
        'The length of returned value should be the same as input value'):
    df.select(raise_exception()).collect()
```
Also add a test for mixing udf and vectorized udf?
Sure, I'll add a test.
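Something along these lines, perhaps (a sketch assuming the pandas_udf API proposed in this PR; the method name and the final test in the PR may differ):

```python
from pyspark.sql.functions import pandas_udf, udf
from pyspark.sql.types import LongType

def test_vectorized_udf_mixed_with_udf(self):
    df = self.spark.range(10)
    # A row-by-row UDF and a vectorized UDF used in the same projection.
    plus_one = udf(lambda x: x + 1, LongType())
    times_two = pandas_udf(lambda s: s * 2, LongType())
    res = df.select(df.id, plus_one(df.id).alias('plus'),
                    times_two(df.id).alias('times'))
    for row in res.collect():
        self.assertEqual(row['plus'], row['id'] + 1)
        self.assertEqual(row['times'], row['id'] * 2)
```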
```scala
 * there should be always some rows buffered in the socket or Python process, so the pulling from
 * RowQueue ALWAYS happened after pushing into it.
 */
case class BatchEvalPythonExec(udfs: Seq[PythonUDF], output: Seq[Attribute], child: SparkPlan)
```
Maybe rename BatchEvalPythonExec, as it is not just for batched Python UDFs now.
How about BlockedEvalPythonExec or something?
I feel it is better than BatchEvalPythonExec. I don't know if others have any suggestions.
Thanks! Let's wait a while and see if others have any suggestions.
BlockedEvalPythonExec sounds better to me too, but I don't have a strong preference.
Test build #81621 has finished for PR 19147 at commit
retest this please
Test build #81629 has finished for PR 19147 at commit
Test build #81635 has finished for PR 19147 at commit
| return "UTF8Deserializer(%s)" % self.use_unicode | ||
|
|
||
|
|
||
| class VectorizedSerializer(Serializer): |
ArrowVectorizedSerializer?
```python
if _have_pandas and _have_arrow:

    @since(2.3)
    def pandas_udf(f=None, returnType=StringType()):
```
Instead of hiding pandas_udf when pandas and Arrow are not installed, should we throw an exception if users try to use it without them?
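For example, a guard along these lines (just a sketch; the actual check, message, and placement would be up to this PR):

```python
from pyspark.sql.types import StringType

def pandas_udf(f=None, returnType=StringType()):
    # Fail fast with a clear error instead of hiding the function entirely
    # when the optional dependencies are missing.
    try:
        import pandas   # noqa: F401
        import pyarrow  # noqa: F401
    except ImportError as e:
        raise ImportError(
            "pandas_udf requires pandas and pyarrow to be installed: %s" % e)
    ...  # proceed to create the vectorized UDF as before
```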
```
.. note:: The vectorized user-defined functions must be deterministic. Due to optimization,
    duplicate invocations may be eliminated or the function may even be invoked more times
    than it is present in the query.
```
Should we explain more about what a vectorized UDF is and what its expected input parameters and outputs are?
I'd close this in favor of #18659.
What changes were proposed in this pull request?
This PR introduces vectorized UDFs in Python.
Note that this PR focuses on APIs for vectorized UDFs, not APIs for vectorized UDAFs or window operations.
Proposed API
We introduce a @pandas_udf decorator (or annotation) to define vectorized UDFs, which take one or more pandas.Series, or, for 0-parameter UDFs, one integer value meaning the length of the input. The return value should be a pandas.Series of the specified type, and its length should be the same as that of the input. We can define a vectorized UDF as follows.
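A sketch of such a definition, based on the API described above (the function name and types are illustrative):

```python
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import LongType

@pandas_udf(LongType())
def plus(v1, v2):
    # v1 and v2 arrive as pandas.Series; the result must be a pandas.Series
    # of the declared type with the same length as the input.
    return v1 + v2
```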
or we can define it without the decorator:
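Continuing the sketch above:

```python
plus = pandas_udf(lambda v1, v2: v1 + v2, LongType())
```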
We can use it similarly to row-by-row UDFs:
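For instance (a sketch assuming a DataFrame with long columns `a` and `b`):

```python
df = spark.range(10).selectExpr("id AS a", "id AS b")
df.select(plus(df.a, df.b)).show()
```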
As for 0-parameter UDFs, we can define and use one as follows:
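A sketch, assuming the 0-parameter UDF receives the number of input rows as described above:

```python
import pandas as pd

@pandas_udf(LongType())
def f0(size):
    # A 0-parameter UDF receives the input length and must return a
    # pandas.Series of exactly that length.
    return pd.Series(1, index=range(size))

df.select(f0()).show()
```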
How was this patch tested?
Added tests and existing tests.
TBD