[SPARK-22901][PYTHON] Add deterministic flag to pyspark UDF #19929
Conversation
Test build #84660 has finished for PR 19929 at commit
@gatorsmile sorry, I saw that you did the patch for the Scala UDF. Might you help review this, please? Thanks.
cc @cloud-fan @HyukjinKwon @zero323, maybe you can help review this too, thanks.
kindly ping @cloud-fan @gatorsmile @HyukjinKwon @zero323
We need test cases; manual tests are not enough. Also update … I will try to review this tomorrow. Thanks!
@gatorsmile I added the test, but I didn't get what needs to be updated in …
```diff
     children: Seq[Expression],
-    evalType: Int)
+    evalType: Int,
+    udfDeterministic: Boolean = true)
```
do we need the default value?
no, it is not needed
Test build #85309 has finished for PR 19929 at commit
thank you @cloud-fan, changed.
Test build #85316 has finished for PR 19929 at commit
Jenkins, retest this please
```python
random.seed(1234)
udf_add_ten = udf(lambda rand: rand + 10, IntegerType())
[row] = df.withColumn('RAND_PLUS_TEN', udf_add_ten('RAND')).collect()
self.assertEqual(row[0] + 10, row[1])
```
Compare the values, since you already set the seed?
I am not sure why, but setting the seed doesn't seem to take effect. I will remove setting the seed.
```python
    def asNondeterministic(self):
        """
        Updates UserDefinedFunction to nondeterministic.
```
"""
Updates UserDefinedFunction to nondeterministic.
.. versionadded:: 2.3
"""
```scala
       | envVars: ${udf.func.envVars}
       | pythonIncludes: ${udf.func.pythonIncludes}
       | pythonExec: ${udf.func.pythonExec}
       | dataType: ${udf.dataType}
```
Could you also print out pythonEvalType?
```python
@since(1.3)
def udf(f=None, returnType=StringType()):
```
Do we need to just add a parameter for deterministic? Is adding it to the end OK for PySpark, without breaking existing apps? cc @ueshin
I followed what was done for the Scala UDF, where this parameter is not added; instead, there is a method to set it. If we add a parameter here, I'd suggest adding it to the Scala API as well.
Scala and Python are different, because the Scala one is also used for the Java API.
@gatorsmile, however, wouldn't it be better to keep them consistent if possible?
I am saying this because I had a few talks about this before, and I am pretty sure we usually keep them the same whenever possible.
Using asNondeterministic is not straightforward for users. On the Scala side, we have no choice if we want to avoid breaking the API. Anyway, I am fine with keeping it as it is now.
Test build #85322 has finished for PR 19929 at commit
Test build #85337 has finished for PR 19929 at commit
python/pyspark/sql/functions.py
```diff
     duplicate invocations may be eliminated or the function may even be invoked more times than
-    it is present in the query.
+    it is present in the query. If your function is not deterministic, call
+    `asNondeterministic`.
```
Let's say this more explicitly, like "... call `asNondeterministic()` in the user-defined function". It's partly because I think `UserDefinedFunction` is not exposed in the PySpark API doc.
Yeah, I think it is not clear here what to call asNondeterministic on. Maybe add a simple example, like the sketch below.
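A simple example of the kind being asked for might read like this (a sketch, not the exact doc text; assumes a SparkSession is available):

```python
import random

from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

# A UDF whose result varies between calls must be marked non-deterministic,
# otherwise the optimizer may eliminate or duplicate its invocations.
random_udf = udf(lambda: random.randint(0, 100), IntegerType()).asNondeterministic()
```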
```scala
import org.apache.spark.api.python.PythonEvalType
import org.apache.spark.internal.Logging
import org.apache.spark.sql.api.java._
import org.apache.spark.sql.catalyst.{JavaTypeInference, ScalaReflection}
```
UDFRegistration's doc:
```scala
 * @note The user-defined functions must be deterministic.
```
Looks obsolete.
```diff
     dataType: DataType,
-    pythonEvalType: Int) {
+    pythonEvalType: Int,
+    udfDeterministic: Boolean = true) {
```
Don't we always pass in this parameter? Remove the default value?
python/pyspark/sql/functions.py
| """Creates a user defined function (UDF). | ||
| .. note:: The user-defined functions must be deterministic. Due to optimization, | ||
| .. note:: The user-defined functions are considered deterministic. Due to optimization, |
... are considered deterministic by default.
Test build #85358 has finished for PR 19929 at commit
Test build #85359 has finished for PR 19929 at commit
Any update on https://github.com/apache/spark/pull/19929/files/cc309b0ce2496365afd8c602c282e3d84aeed940#r158579661? Why does the seed not work?
@gatorsmile, yes, the reason the seed doesn't work lies in the way Python UDFs are executed: a new Python process is created for each partition to evaluate the UDF, so the seed is set only on the driver, not in the process where the UDF is executed. This can easily be confirmed with a check like the sketch below. Therefore there is no easy way to set the seed: if I set it inside the UDF, the UDF would become deterministic. So I think the best option is the current test.
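A minimal sketch of the kind of check described above (illustrative names; assumes a pyspark shell where `spark` is the active SparkSession):

```python
import random

from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

random.seed(42)  # seeds the random module in the driver process only
rand_udf = udf(lambda: random.randint(0, 100), IntegerType())

# The lambda is evaluated in fresh Python worker processes, one per
# partition, where random.seed(42) was never called, so the output
# still differs from run to run despite the seed set above.
spark.range(4).select(rand_udf().alias('r')).show()
```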
Could you change the JIRA number to https://issues.apache.org/jira/browse/SPARK-22901?
@gatorsmile done, thanks!
LGTM, thanks! Merged to master.
What changes were proposed in this pull request?
In SPARK-20586 the flag `deterministic` was added to the Scala UDF, but it is not available for Python UDFs. This flag is useful for cases where the UDF's code can return different results with the same input. Due to optimization, duplicate invocations may be eliminated or the function may even be invoked more times than it is present in the query, which can lead to unexpected behavior.

This PR adds the deterministic flag, via the `asNondeterministic` method, to let the user mark the function as non-deterministic and therefore avoid the optimizations which might lead to strange behaviors.

How was this patch tested?
Manual tests:
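A sketch of the kind of pyspark session such a manual test could use (illustrative names; assumes `spark` is the active SparkSession):

```python
import random

from pyspark.sql import Row
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

random_udf = udf(lambda: random.randint(0, 100), IntegerType()).asNondeterministic()

df = spark.createDataFrame([Row(id=1), Row(id=2)])
# With the flag set, the optimizer must not collapse the two UDF calls
# below into one, so 'a' and 'b' can legitimately differ per row.
df.select(random_udf().alias('a'), random_udf().alias('b')).show()
```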