[WIP][SPARK-21190][SQL][PYTHON] Vectorized UDFs in Python #19147
Conversation
Test build #81452 has finished for PR 19147 at commit
```scala
private val accumulator = funcs.head.funcs.head.accumulator

// todo: return column batch?
def compute(
```
This class duplicates quite a bit of logic from PythonRDD. I think the only difference is how they serialize/deserialize data (non-Arrow vs. Arrow). @ueshin @BryanCutler what are your thoughts on refactoring this and PythonRDD?
I agree with you that we should refactor the PythonRunners, but I'm not sure what you mean by refactoring PythonRDD. I think it's already simple enough.
Yes, I meant PythonRunner in PythonRDD.scala.
Yes, there is a lot of duplicated code from PythonRunner that could be refactored. I'm guessing you did not use the existing code because of the Arrow stream format? While I would love to start using that in Spark, I think it would be better to do so at a later time, when the required code could be refactored and the Arrow stream format could replace our current uses of the file format.
Also, the good part about using the iterator-based file format is that each iteration allows Python to communicate back an error code and exit gracefully. In my own tests with the streaming format, if an error occurred after the stream had started, Spark could lock up in a waiting state. These are the reasons I did not use the streaming format in my implementation. Would this VectorizedPythonRunner be able to handle these types of errors?
@icexelloss Ah, I see, thanks! I still agree with refactoring PythonRunner.
@BryanCutler As for the error, do you mean a case like test_vectorized_udf_exception? If not, could you please let me know which case you mean so I can think about it?
I was referring to the protocol between Scala and Python that is changed here and could act differently under some circumstances. Here is the behavior of the PythonRunner protocol versus the VectorizedPythonRunner protocol that you introduce here:
PythonRunner
Data blocks are framed by a special length integer. Scala reads each data block one at a time and checks the length code. If the code indicates a PythonException, the error is read from Python and a SparkException is thrown with it as the cause.
VectorizedPythonRunner
A data stream is opened in Scala with ArrowStreamReader, and batches are transferred until ArrowStreamReader returns false, indicating there is no more data. Only at this point are the special length codes checked to handle an error from Python.
This behavior change would probably only cause problems when things are not working normally. For example, what would happen if pyarrow were not installed on an executor? With PythonRunner, the ImportError would cause a PythonException to be transferred and thrown in Scala. In VectorizedPythonRunner, I believe the ArrowStreamReader would try to read the special length code and then fail somewhere internal to Arrow, never surfacing the ImportError.
My point was that this type of behavior change should probably be implemented in a separate JIRA where we could make sure to handle all of these cases.
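To make the framing difference concrete, here is a minimal Python sketch of the per-block protocol described above, from the worker's side. The constant values and the helpers `write_block`, `run_udf_loop`, and `read_blocks` are hypothetical illustrations, not the actual code in PySpark's worker or serializers:

```python
import struct
import traceback

# Hypothetical special length codes for illustration; the real values live in
# PythonRDD.scala and pyspark/serializers.py.
END_OF_DATA_SECTION = -1
PYTHON_EXCEPTION_THROWN = -2

def write_block(outfile, payload):
    # Each data block is framed by its length, so the Scala side can inspect
    # the length code before reading the block itself.
    outfile.write(struct.pack("!i", len(payload)))
    outfile.write(payload)

def read_blocks(infile):
    # Hypothetical reader: yields length-prefixed blocks until the
    # end-of-data code appears.
    while True:
        (length,) = struct.unpack("!i", infile.read(4))
        if length == END_OF_DATA_SECTION:
            return
        yield infile.read(length)

def run_udf_loop(infile, outfile, eval_udf):
    try:
        for block in read_blocks(infile):
            write_block(outfile, eval_udf(block))
        outfile.write(struct.pack("!i", END_OF_DATA_SECTION))
    except Exception:
        # On any error (e.g. an ImportError for pyarrow), the special code is
        # written first, so Scala can raise a SparkException carrying this
        # traceback instead of failing somewhere inside a stream reader.
        outfile.write(struct.pack("!i", PYTHON_EXCEPTION_THROWN))
        write_block(outfile, traceback.format_exc().encode("utf-8"))
```

With an always-on stream reader, by contrast, there is no per-block point at which the error code can be noticed, which is the failure mode described above.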
```scala
private def loadNextBatch(): Boolean = {
  batchLoaded = reader.loadNextBatch()
  if (batchLoaded) {
    val batch = new ColumnarBatch(schema, vectors, root.getRowCount)
```
A side note: how is the performance of ColumnarBatch in terms of converting to an Iterator[InternalRow]? As far as I remember, it doesn't return unsafe rows at the moment, right?
That's right, it doesn't return unsafe rows.
But I believe it's performant enough, because it can return values in a row directly from the column vectors without copying them into unsafe rows.
Interesting, does this mean it's more performant than copying the column vectors into unsafe rows? To get any value out, it has to access the memory region of the column vectors anyway, so copying bytes from column vectors into unsafe rows doesn't improve performance?
Why would copying bytes from column vectors into unsafe rows improve performance? Isn't accessing the data directly from the column vectors faster, since it avoids the cost of copying bytes?
Test build #81453 has finished for PR 19147 at commit
Relaying my questions from the dev@ thread: Would it be correct to assume there will be a data type check, for example that the returned pandas data frame column data types match what is specified? We have seen quite a few issues/confusions with that in R. Would it make sense to have a more generic decorator name so that it could also be usable for other efficient vectorized formats in the future? Or do we anticipate the decorator to be format-specific, with more of them in the future?
@felixcheung Thank you for your comment. As for the decorator name, we could have a more generic one, but we would still need the format name or something similar to know which format users want, and some formats might need additional options. IMO, it's OK to have format-specific decorators so that users understand each format's spec.
Test build #81516 has finished for PR 19147 at commit
Test build #81519 has finished for PR 19147 at commit
The test failure above should be fixed by #19158.
Jenkins, retest this please.
Test build #81538 has finished for PR 19147 at commit
retest this please
Test build #81545 has finished for PR 19147 at commit
```python
with self.assertRaisesRegexp(
        Exception,
        'The length of returned value should be the same as input value'):
    df.select(raise_exception()).collect()
```
Also add a test for mixing udf and vectorized udf?
Sure, I'll add a test.
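Something along these lines, perhaps (a sketch assuming the pandas_udf API proposed in this PR; the method name and the final test in the PR may differ):

```python
from pyspark.sql.functions import pandas_udf, udf
from pyspark.sql.types import LongType

def test_vectorized_udf_mixed_with_udf(self):
    df = self.spark.range(10)
    # A row-by-row UDF and a vectorized UDF used in the same projection.
    plus_one = udf(lambda x: x + 1, LongType())
    times_two = pandas_udf(lambda s: s * 2, LongType())
    res = df.select(df.id, plus_one(df.id).alias('plus'),
                    times_two(df.id).alias('times'))
    for row in res.collect():
        self.assertEqual(row['plus'], row['id'] + 1)
        self.assertEqual(row['times'], row['id'] * 2)
```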
```scala
 * there should be always some rows buffered in the socket or Python process, so the pulling from
 * RowQueue ALWAYS happened after pushing into it.
 */
case class BatchEvalPythonExec(udfs: Seq[PythonUDF], output: Seq[Attribute], child: SparkPlan)
```
Maybe rename BatchEvalPythonExec, as it is not just for batched Python UDFs now.
How about BlockedEvalPythonExec or something?
I feel it is better than BatchEvalPythonExec. I don't know if others have any suggestions.
Thanks! Let's wait a while and see if others have any suggestions.
BlockedEvalPythonExec sounds better to me too, but I don't have a strong preference.
Test build #81621 has finished for PR 19147 at commit
retest this please
Test build #81629 has finished for PR 19147 at commit
Test build #81635 has finished for PR 19147 at commit
| return "UTF8Deserializer(%s)" % self.use_unicode | ||
|
|
||
|
|
||
| class VectorizedSerializer(Serializer): |
ArrowVectorizedSerializer?
```python
if _have_pandas and _have_arrow:

    @since(2.3)
    def pandas_udf(f=None, returnType=StringType()):
```
Instead of hiding pandas_udf when pandas and Arrow are not installed, should we throw an exception if users try to use it without them?
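For example, a guard along these lines (just a sketch; the actual check, message, and placement would be up to this PR):

```python
from pyspark.sql.types import StringType

def pandas_udf(f=None, returnType=StringType()):
    # Fail fast with a clear error instead of hiding the function entirely
    # when the optional dependencies are missing.
    try:
        import pandas   # noqa: F401
        import pyarrow  # noqa: F401
    except ImportError as e:
        raise ImportError(
            "pandas_udf requires pandas and pyarrow to be installed: %s" % e)
    ...  # proceed to create the vectorized UDF as before
```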
```
.. note:: The vectorized user-defined functions must be deterministic. Due to optimization,
    duplicate invocations may be eliminated or the function may even be invoked more times
    than it is present in the query.
```
Should we explain more about what a vectorized UDF is and what its expected input parameters and outputs are?
I'd close this in favor of #18659.
What changes were proposed in this pull request?
This PR introduces vectorized UDFs in Python.
Note that this PR focuses on APIs for vectorized UDFs, not APIs for vectorized UDAFs or window operations.
Proposed API
We introduce a @pandas_udf decorator (or annotation) to define vectorized UDFs, which take one or more pandas.Series, or, for 0-parameter UDFs, one integer value meaning the length of the input. The return value should be a pandas.Series of the specified type, and its length should be the same as that of the input. We can define a vectorized UDF as follows.
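A sketch of such a definition, based on the API described above (the function name and types are illustrative):

```python
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import LongType

@pandas_udf(LongType())
def plus(v1, v2):
    # v1 and v2 arrive as pandas.Series; the result must be a pandas.Series
    # of the declared type with the same length as the input.
    return v1 + v2
```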
or we can define it without the decorator:
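Continuing the sketch above:

```python
plus = pandas_udf(lambda v1, v2: v1 + v2, LongType())
```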
We can use it similarly to row-by-row UDFs:
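For instance (a sketch assuming a DataFrame with long columns `a` and `b`):

```python
df = spark.range(10).selectExpr("id AS a", "id AS b")
df.select(plus(df.a, df.b)).show()
```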
As for 0-parameter UDFs, we can define and use one as follows:
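A sketch, assuming the 0-parameter UDF receives the number of input rows as described above:

```python
import pandas as pd

@pandas_udf(LongType())
def f0(size):
    # A 0-parameter UDF receives the input length and must return a
    # pandas.Series of exactly that length.
    return pd.Series(1, index=range(size))

df.select(f0()).show()
```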
How was this patch tested?
Added tests and existing tests.
TBD