Skip to content

Conversation

@beliefer
Copy link
Contributor

What changes were proposed in this pull request?

The Scala API exists a method call_udf used to call the user-defined functions.
In fact, call_udf also could call the builtin functions.
The behavior is confused for users.

This PR adds call_function to replace call_udf and deprecate call_udf for Scala API.

Why are the changes needed?

Fix the confusion of call_udf.

Does this PR introduce any user-facing change?

'No'.
New feature.

How was this patch tested?

Exists test cases.

@beliefer
Copy link
Contributor Author

ping @cloud-fan @zhengruifeng cc @HyukjinKwon

@cloud-fan
Copy link
Contributor

I'm fine with it as the API is the same as invoking a SQL function. cc @dongjoon-hyun @HyukjinKwon

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, this new method is not udf_funcs group, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. it's not.

Copy link
Member

@dongjoon-hyun dongjoon-hyun Jun 26, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that we need @scala.annotation.varargs.

Copy link
Member

@dongjoon-hyun dongjoon-hyun Jun 26, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@scala.annotation.varargs?

Copy link
Contributor

@zhengruifeng zhengruifeng Jun 26, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

qq:
1, dose parameter isDistinct works for all the functions?
2, since call_function can invoke both built-in functions and udfs, shall we add a new parameter (may support options built-in-only, udf-only, global) to specify where to look up the function? so that we can invoke a built-in function even if it is overrided by udfs, e.g.

scala> import org.apache.spark.sql.functions._
import org.apache.spark.sql.functions._

scala> val df = Seq(("a", "A"), ("b", "B")).toDF("key", "value")
df: org.apache.spark.sql.DataFrame = [key: string, value: string]

scala> df.selectExpr("upper(key)").show
+----------+
|upper(key)|
+----------+
|         A|
|         B|
+----------+


scala> val foo = udf{str: String => str.toUpperCase() + "_X"}
foo: org.apache.spark.sql.expressions.UserDefinedFunction = SparkUserDefinedFunction($Lambda$3089/0x000000080140f040@6ccb2162,StringType,List(Some(class[value[0]: string])),Some(class[value[0]: string]),None,true,true)

scala> spark.udf.register("upper", foo.asNondeterministic())
23/06/26 15:15:09 WARN SimpleFunctionRegistry: The function upper replaced a previously registered function.
res1: org.apache.spark.sql.expressions.UserDefinedFunction = SparkUserDefinedFunction($Lambda$3089/0x000000080140f040@6ccb2162,StringType,List(Some(class[value[0]: string])),Some(class[value[0]: string]),Some(upper),true,false)

scala> df.selectExpr("upper(key)").show
+----------+
|upper(key)|
+----------+
|       A_X|
|       B_X|
+----------+

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zhengruifeng Good point. But it seems the behavior is consistent with SQL syntax. cc @cloud-fan

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the first point, isDistinct doesn't works for all functions. isDistinct is used by aggregate or window functions.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the second point, according to the discussion between @cloud-fan and me, it's not consistent with SQL syntax.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1, since isDistinct doesn't works for all functions, what is the behavior if user invoke unsupported functions like call_function("abs", true) ?

2, I guess we can simply make it private?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the first point, could we add call_aggregate_function and call_window_function with isDistinct parameter ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should not expose isDistinct

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. let's not expose isDistinct now. If we need isDistinct in future, add it then.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just realize that it maybe problematic in such cases, if some users happen to register a udf with the same name ceil

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. If we want avoid this issue, it seems we should make the built-in-only, udf-only, global as you said.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, if the goal of this PR is to replace call_udf with call_function, we can resolve this naming conflict issue in another PRs.

Copy link
Member

@HyukjinKwon HyukjinKwon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am fine with this too

@beliefer
Copy link
Contributor Author

beliefer commented Jul 7, 2023

ping @zhengruifeng Please take a review again. cc @cloud-fan

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we also add this new function to Spark Connect in this PR?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK

@zhengruifeng
Copy link
Contributor

merged to master

@beliefer
Copy link
Contributor Author

@zhengruifeng @HyukjinKwon @cloud-fan @dongjoon-hyun Thank you for all!

@cloud-fan
Copy link
Contributor

Since we are adding a new API, shall we make it more like the SQL syntax? e.g. the function name can be qualified, so that people can use it to invoke persistent functions as well.

@rxin
Copy link
Contributor

rxin commented Jul 12, 2023

Why would we deprecate the old one? This just creates tons of warning messages for users. If I was an end user, I'd be super annoyed that Spark just decided to rename things and generate tons of warnings with an upgrade. We don't really gain much by deprecating the existing one.

@rxin
Copy link
Contributor

rxin commented Jul 12, 2023

BTW it's even more "amusing" and frustrating if you look at the history.

We decided to deprecate callUDF which has existed since Spark 1.5 for call_udf in Spark 3.2, and then 3 or 4 releases later we decided to depreciate call_udf because "call_udf" can also call built-in functions, so now users need to deal with all these deprecation warnings to update their codebase to use call_function.

@zhengruifeng
Copy link
Contributor

ok, let me remove the warning messages introduced in this PR.
For a user who just want to call a UDF (I think users are likely unware of the fact that call_udf also call built-in functions), it is meaningless to suggest he switch to a new API.

zhengruifeng added a commit that referenced this pull request Jul 12, 2023
### What changes were proposed in this pull request?
Revert the deprecation message changes

### Why are the changes needed?
to address #41687 (comment)

### Does this PR introduce _any_ user-facing change?
yes

### How was this patch tested?
existing CI

Closes #41950 from zhengruifeng/sql_call_function_warning.

Authored-by: Ruifeng Zheng <[email protected]>
Signed-off-by: Ruifeng Zheng <[email protected]>
cloud-fan pushed a commit that referenced this pull request Jul 25, 2023
…ion name for call_function

### What changes were proposed in this pull request?
#41687 added `call_function` and deprecate `call_udf` for Scala API.

Some times, the function name can be qualified, we should let users use it to invoke persistent functions as well.

### Why are the changes needed?
Support qualified function name for `call_function`.

### Does this PR introduce _any_ user-facing change?
'No'.
New feature.

### How was this patch tested?
New test cases.

Closes #41932 from beliefer/SPARK-44131_followup.

Authored-by: Jiaan Geng <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
cloud-fan pushed a commit that referenced this pull request Jul 25, 2023
…ion name for call_function

### What changes were proposed in this pull request?
#41687 added `call_function` and deprecate `call_udf` for Scala API.

Some times, the function name can be qualified, we should let users use it to invoke persistent functions as well.

### Why are the changes needed?
Support qualified function name for `call_function`.

### Does this PR introduce _any_ user-facing change?
'No'.
New feature.

### How was this patch tested?
New test cases.

Closes #41932 from beliefer/SPARK-44131_followup.

Authored-by: Jiaan Geng <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit d97a4e2)
Signed-off-by: Wenchen Fan <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants