[SPARK-44131][SQL] Add call_function and deprecate call_udf for Scala API #41687
|
ping @cloud-fan @zhengruifeng cc @HyukjinKwon |
|
I'm fine with it as the API is the same as invoking a SQL function. cc @dongjoon-hyun @HyukjinKwon |
So this new method is not in the udf_funcs group, right?
Yes. it's not.
It seems that we need @scala.annotation.varargs.
@scala.annotation.varargs?
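For context on the annotation being requested: a Scala method with a repeated parameter (such as `cols: Column*`) compiles to a method that takes a `Seq`, which is awkward to call from Java. `@scala.annotation.varargs` asks the compiler to also emit a Java-style varargs forwarder. A minimal sketch with no Spark dependency; `VarargsDemo` and `joinAll` are hypothetical names used only for illustration:

```scala
import scala.annotation.varargs

object VarargsDemo {
  // Without @varargs, Java callers would have to build a Scala Seq themselves.
  // With it, scalac also generates a forwarder with the Java signature
  // joinAll(String, String...).
  @varargs
  def joinAll(sep: String, parts: String*): String = parts.mkString(sep)
}

object VarargsDemoApp extends App {
  // Scala callers are unaffected by the annotation.
  println(VarargsDemo.joinAll("-", "a", "b", "c"))
  // An existing collection is passed with the `: _*` ascription.
  println(VarargsDemo.joinAll("-", Seq("x", "y"): _*))
}
```

The annotation only changes what Java callers see, which is why it matters for a public method in `functions` that takes `Column*`.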
qq:
1, does the isDistinct parameter work for all functions?
2, since call_function can invoke both built-in functions and UDFs, shall we add a new parameter (which might support the options built-in-only, udf-only, and global) to specify where to look up the function? That way we could invoke a built-in function even if it is overridden by a UDF, e.g.
scala> import org.apache.spark.sql.functions._
import org.apache.spark.sql.functions._
scala> val df = Seq(("a", "A"), ("b", "B")).toDF("key", "value")
df: org.apache.spark.sql.DataFrame = [key: string, value: string]
scala> df.selectExpr("upper(key)").show
+----------+
|upper(key)|
+----------+
|         A|
|         B|
+----------+
scala> val foo = udf{str: String => str.toUpperCase() + "_X"}
foo: org.apache.spark.sql.expressions.UserDefinedFunction = SparkUserDefinedFunction($Lambda$3089/0x000000080140f040@6ccb2162,StringType,List(Some(class[value[0]: string])),Some(class[value[0]: string]),None,true,true)
scala> spark.udf.register("upper", foo.asNondeterministic())
23/06/26 15:15:09 WARN SimpleFunctionRegistry: The function upper replaced a previously registered function.
res1: org.apache.spark.sql.expressions.UserDefinedFunction = SparkUserDefinedFunction($Lambda$3089/0x000000080140f040@6ccb2162,StringType,List(Some(class[value[0]: string])),Some(class[value[0]: string]),Some(upper),true,false)
scala> df.selectExpr("upper(key)").show
+----------+
|upper(key)|
+----------+
|       A_X|
|       B_X|
+----------+
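The same name-based lookup applies to `call_function`, which is the reviewer's concern. A minimal sketch, assuming Spark 3.5 or later (where `call_function` is available) and a local session; the object name `CallFunctionShadowing` is hypothetical:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{call_function, col, udf}

object CallFunctionShadowing {
  // Returns the values produced by call_function("upper", key) after a UDF
  // named "upper" has been registered in the session.
  def run(spark: SparkSession): Seq[String] = {
    import spark.implicits._
    val df = Seq(("a", "A"), ("b", "B")).toDF("key", "value")

    // Same shadowing UDF as in the transcript above.
    val foo = udf((str: String) => str.toUpperCase + "_X")
    spark.udf.register("upper", foo)

    // The function is resolved by name, like upper(key) in selectExpr,
    // so the registered UDF is expected to win over the built-in.
    df.select(call_function("upper", col("key"))).as[String].collect().toSeq
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("call_function-shadowing")
      .getOrCreate()
    try println(run(spark)) finally spark.stop()
  }
}
```

A lookup-scope parameter such as the one proposed above would be the only way to reach the built-in by name once it has been shadowed in the session.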
@zhengruifeng Good point. But it seems the behavior is consistent with SQL syntax. cc @cloud-fan
For the first point, isDistinct doesn't work for all functions. isDistinct is used by aggregate or window functions.
For the second point, according to the discussion between @cloud-fan and me, it's not consistent with SQL syntax.
1, since isDistinct doesn't work for all functions, what is the behavior if a user invokes an unsupported function, e.g. call_function("abs", true)?
2, I guess we can simply make it private?
For the first point, could we add call_aggregate_function and call_window_function with an isDistinct parameter?
I think we should not expose isDistinct
OK, let's not expose isDistinct for now. If we need isDistinct in the future, we can add it then.
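Callers who need DISTINCT semantics are not blocked by this decision, since existing APIs already express it. A minimal sketch, assuming Spark 3.2 or later (for `count_distinct`) and a local session; the object name `DistinctWithoutFlag` and the sample data are illustrative only:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, count_distinct, expr}

object DistinctWithoutFlag {
  // Returns the distinct count of `key` computed two ways:
  // via the typed helper and via a SQL expression string.
  def run(spark: SparkSession): (Long, Long) = {
    import spark.implicits._
    val df = Seq("a", "a", "b").toDF("key")

    val row = df.agg(
      count_distinct(col("key")),   // dedicated Scala function
      expr("count(DISTINCT key)")   // SQL syntax carries DISTINCT itself
    ).head()

    (row.getLong(0), row.getLong(1))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("distinct-without-flag")
      .getOrCreate()
    try println(run(spark)) finally spark.stop()
  }
}
```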
I just realized that it may be problematic in such cases if some users happen to register a UDF with the same name, ceil.
Good point. If we want to avoid this issue, it seems we should add the built-in-only, udf-only, and global options as you said.
OK, if the goal of this PR is to replace call_udf with call_function, we can resolve this naming conflict issue in another PR.
HyukjinKwon
left a comment
I am fine with this too
|
ping @zhengruifeng Please take another look. cc @cloud-fan |
can we also add this new function to Spark Connect in this PR?
OK
|
merged to master |
|
@zhengruifeng @HyukjinKwon @cloud-fan @dongjoon-hyun Thank you all! |
|
Since we are adding a new API, shall we make it more like the SQL syntax? e.g. the function name can be qualified, so that people can use it to invoke persistent functions as well. |
|
Why would we deprecate the old one? This just creates tons of warning messages for users. If I was an end user, I'd be super annoyed that Spark just decided to rename things and generate tons of warnings with an upgrade. We don't really gain much by deprecating the existing one. |
|
BTW it's even more "amusing" and frustrating if you look at the history. We decided to deprecate callUDF, which has existed since Spark 1.5, in favor of call_udf in Spark 3.2, and then 3 or 4 releases later we decided to deprecate call_udf because "call_udf" can also call built-in functions, so now users need to deal with all these deprecation warnings to update their codebase to use call_function. |
|
ok, let me remove the warning messages introduced in this PR. |
### What changes were proposed in this pull request?
Revert the deprecation message changes

### Why are the changes needed?
to address #41687 (comment)

### Does this PR introduce _any_ user-facing change?
yes

### How was this patch tested?
existing CI

Closes #41950 from zhengruifeng/sql_call_function_warning.

Authored-by: Ruifeng Zheng <[email protected]>
Signed-off-by: Ruifeng Zheng <[email protected]>
…ion name for call_function

### What changes were proposed in this pull request?
#41687 added `call_function` and deprecated `call_udf` for the Scala API. Sometimes the function name can be qualified, so we should let users use it to invoke persistent functions as well.

### Why are the changes needed?
Support qualified function names for `call_function`.

### Does this PR introduce _any_ user-facing change?
'No'. New feature.

### How was this patch tested?
New test cases.

Closes #41932 from beliefer/SPARK-44131_followup.

Authored-by: Jiaan Geng <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
…ion name for call_function

### What changes were proposed in this pull request?
#41687 added `call_function` and deprecated `call_udf` for the Scala API. Sometimes the function name can be qualified, so we should let users use it to invoke persistent functions as well.

### Why are the changes needed?
Support qualified function names for `call_function`.

### Does this PR introduce _any_ user-facing change?
'No'. New feature.

### How was this patch tested?
New test cases.

Closes #41932 from beliefer/SPARK-44131_followup.

Authored-by: Jiaan Geng <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit d97a4e2)
Signed-off-by: Wenchen Fan <[email protected]>
What changes were proposed in this pull request?
The Scala API has a method, `call_udf`, that is used to call user-defined functions. In fact, `call_udf` can also call built-in functions, which is confusing for users.
This PR adds `call_function` to replace `call_udf` and deprecates `call_udf` in the Scala API.
Why are the changes needed?
To fix the confusion around `call_udf`.
Does this PR introduce any user-facing change?
'No'.
New feature.
How was this patch tested?
Existing test cases.
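To illustrate the change described above, here is a minimal sketch of calling a UDF and a built-in function by name through both entry points. It assumes Spark 3.5 or later and a local session; the object name `CallFunctionMigration` and the UDF name `add_suffix` are hypothetical:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{call_function, call_udf, col}

object CallFunctionMigration {
  // Returns, per input row: the UDF via call_udf, the UDF via call_function,
  // and the built-in upper via call_function.
  def run(spark: SparkSession): Seq[(String, String, String)] = {
    import spark.implicits._
    val df = Seq("a", "b").toDF("key")

    // Hypothetical UDF registered under a name that does not clash with
    // any built-in function.
    spark.udf.register("add_suffix", (s: String) => s + "_X")

    df.select(
      call_udf("add_suffix", col("key")),      // existing entry point
      call_function("add_suffix", col("key")), // new entry point, same UDF
      call_function("upper", col("key"))       // built-in, resolved by name
    ).as[(String, String, String)].collect().toSeq
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("call_function-migration")
      .getOrCreate()
    try run(spark).foreach(println) finally spark.stop()
  }
}
```

The new name simply states what the method already did: it resolves any registered function by name, not only UDFs.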