[SPARK-22901][PYTHON] Add deterministic flag to pyspark UDF #19929
Conversation
Test build #84660 has finished for PR 19929 at commit
@gatorsmile sorry, I saw that you did the patch for the Scala UDF. Might you help review this, please? Thanks.
cc @cloud-fan @HyukjinKwon @zero323, maybe you can help review this too, thanks.
kindly ping @cloud-fan @gatorsmile @HyukjinKwon @zero323
We need test cases; manual tests are not enough. Also update … I will try to review this tomorrow. Thanks!
@gatorsmile I added the test, but I didn't get what needs to be updated in …
```diff
     children: Seq[Expression],
-    evalType: Int)
+    evalType: Int,
+    udfDeterministic: Boolean = true)
```
do we need the default value?
no, it is not needed
Test build #85309 has finished for PR 19929 at commit
thank you @cloud-fan, changed.
Test build #85316 has finished for PR 19929 at commit
Jenkins, retest this please
```python
random.seed(1234)
udf_add_ten = udf(lambda rand: rand + 10, IntegerType())
[row] = df.withColumn('RAND_PLUS_TEN', udf_add_ten('RAND')).collect()
self.assertEqual(row[0] + 10, row[1])
```
Compare the values, since you already set the seed?
I am not sure why, but setting the seed doesn't seem to take effect. I will remove setting the seed.
```python
    def asNondeterministic(self):
        """
        Updates UserDefinedFunction to nondeterministic.
```
"""
Updates UserDefinedFunction to nondeterministic.
.. versionadded:: 2.3
"""
```scala
       | envVars: ${udf.func.envVars}
       | pythonIncludes: ${udf.func.pythonIncludes}
       | pythonExec: ${udf.func.pythonExec}
       | dataType: ${udf.dataType}
```
Could you also print out pythonEvalType?
```python
@since(1.3)
def udf(f=None, returnType=StringType()):
```
Do we need to just add a parameter for deterministic? Is adding it to the end OK for PySpark, without breaking existing apps? cc @ueshin
I followed what was done for the Scala UDF, where this parameter is not added; instead, there is a method to set it. If we add a parameter here, I'd suggest adding it to the Scala API as well.
Scala and Python are different, because the Scala one is also used for the Java API.
@gatorsmile, however, wouldn't it be better to keep them consistent if possible?
I am saying this because I had a few talks about this before, and I am pretty sure we usually keep them the same whenever possible.
Using asNondeterministic is not straightforward for users. On the Scala side, we have no choice if we want to avoid breaking the API. Anyway, I am fine with keeping it as it is now.
Test build #85322 has finished for PR 19929 at commit
Test build #85337 has finished for PR 19929 at commit
python/pyspark/sql/functions.py
```diff
     duplicate invocations may be eliminated or the function may even be invoked more times than
-    it is present in the query.
+    it is present in the query. If your function is not deterministic, call
+    `asNondeterministic`.
```
Let's say this more explicitly, like "... call `asNondeterministic()` in the user-defined function". It's partly because I think `UserDefinedFunction` is not exposed in the PySpark API doc.
Yeah, I think it is not clear here what to call asNondeterministic on. Maybe add a simple example, like the sketch below.
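A simple example of the kind being asked for might read like this (a sketch, not the exact doc text; assumes a SparkSession is available):

```python
import random

from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

# A UDF whose result varies between calls must be marked non-deterministic,
# otherwise the optimizer may eliminate or duplicate its invocations.
random_udf = udf(lambda: random.randint(0, 100), IntegerType()).asNondeterministic()
```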
```scala
import org.apache.spark.api.python.PythonEvalType
import org.apache.spark.internal.Logging
import org.apache.spark.sql.api.java._
import org.apache.spark.sql.catalyst.{JavaTypeInference, ScalaReflection}
```
UDFRegistration's doc:
```scala
 * @note The user-defined functions must be deterministic.
```
Looks obsolete.
```diff
     dataType: DataType,
-    pythonEvalType: Int) {
+    pythonEvalType: Int,
+    udfDeterministic: Boolean = true) {
```
Don't we always pass in this parameter? Remove the default value?
python/pyspark/sql/functions.py
| """Creates a user defined function (UDF). | ||
| .. note:: The user-defined functions must be deterministic. Due to optimization, | ||
| .. note:: The user-defined functions are considered deterministic. Due to optimization, |
... are considered deterministic by default.
Test build #85358 has finished for PR 19929 at commit
Test build #85359 has finished for PR 19929 at commit
Any update on https://github.com/apache/spark/pull/19929/files/cc309b0ce2496365afd8c602c282e3d84aeed940#r158579661? Why does the seed not work?
@gatorsmile, yes, the reason the seed doesn't work lies in the way Python UDFs are executed: a new Python process is created for each partition to evaluate the UDF, so the seed is set only on the driver, not in the process where the UDF is executed. This can easily be confirmed with a check like the sketch below. Therefore there is no easy way to set the seed: if I set it inside the UDF, the UDF would become deterministic. So I think the best option is the current test.
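A minimal sketch of the kind of check described above (illustrative names; assumes a pyspark shell where `spark` is the active SparkSession):

```python
import random

from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

random.seed(42)  # seeds the random module in the driver process only
rand_udf = udf(lambda: random.randint(0, 100), IntegerType())

# The lambda is evaluated in fresh Python worker processes, one per
# partition, where random.seed(42) was never called, so the output
# still differs from run to run despite the seed set above.
spark.range(4).select(rand_udf().alias('r')).show()
```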
Could you change the JIRA number to https://issues.apache.org/jira/browse/SPARK-22901?
@gatorsmile done, thanks!
LGTM, thanks! Merged to master.
What changes were proposed in this pull request?
In SPARK-20586 the flag `deterministic` was added to the Scala UDF, but it is not available for Python UDFs. This flag is useful for cases where the UDF's code can return different results with the same input. Due to optimization, duplicate invocations may be eliminated or the function may even be invoked more times than it is present in the query, which can lead to unexpected behavior.

This PR adds the deterministic flag, via the `asNondeterministic` method, to let the user mark the function as non-deterministic and therefore avoid the optimizations which might lead to strange behaviors.

How was this patch tested?
Manual tests:
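A sketch of the kind of pyspark session such a manual test could use (illustrative names; assumes `spark` is the active SparkSession):

```python
import random

from pyspark.sql import Row
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

random_udf = udf(lambda: random.randint(0, 100), IntegerType()).asNondeterministic()

df = spark.createDataFrame([Row(id=1), Row(id=2)])
# With the flag set, the optimizer must not collapse the two UDF calls
# below into one, so 'a' and 'b' can legitimately differ per row.
df.select(random_udf().alias('a'), random_udf().alias('b')).show()
```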