ScalarUDF: Remove `supports_zero_argument` and avoid creating null array for empty args #10193

jayzhan211 · 2024-04-23T09:10:00Z

Which issue does this PR close?

Issue from https://github.com/apache/datafusion/pull/10098/files/a422b3ab10519efea3ec671fef944a1dcf8aaf96#r1573855672

Closes #10205.
Closes #10247

Rationale for this change

What changes are included in this PR?

Remove supports_zero_argument
~~Introduce support_randomness~~
Introduce invoke_no_args

Are these changes tested?

Are there any user-facing changes?

Signed-off-by: jayzhan211 <[email protected]>

datafusion/physical-expr/src/scalar_function.rs

Signed-off-by: jayzhan211 <[email protected]>

alamb

Looking very nice

alamb · 2024-04-23T09:53:32Z

datafusion/functions/src/math/random.rs

+    let len = if args.is_empty() {
+        1
+    } else {
+        return exec_err!("Expect random function to take no param");


This might be more consistent with the other changes in this PR:

Suggested change

return exec_err!("Expect random function to take no param");

return exec_err!("Expect {} function to take no parameters", self.name());

I tried this earlier, but random is also used in a test, so I had to move it out into a standalone function. Passing the name as an extra argument looked odd, so I kept the message as it is.

We can even return a ScalarValue::Float64 instead of an Array. Let me fix these too.

alamb · 2024-04-23T09:54:05Z

datafusion/core/tests/user_defined/user_defined_scalar_functions.rs

+        let len = if args.is_empty() {
+            1
+        } else {
+            return internal_err!("Invalid argument type");


It might be good to have the test match the rest of the functions for consistency

Suggested change

return internal_err!("Invalid argument type");

return exec_err!("Expect function to take no parameters");

alamb · 2024-04-23T09:54:58Z

datafusion/functions/src/math/pi.rs

-            return exec_err!("Expect pi function to take no param");
-        }
+        if !args.is_empty() {
+            return exec_err!("Expect {} function to take no param", self.name());


Suggested change

return exec_err!("Expect {} function to take no param", self.name());

return exec_err!("Expect {} function to take no parameters", self.name());

alamb · 2024-04-23T09:55:48Z

datafusion/functions/src/string/uuid.rs

+        let len = if args.is_empty() {
+            1
+        } else {
+            return exec_err!("Expect {} function to take no param", self.name());


Suggested change

return exec_err!("Expect {} function to take no param", self.name());

return exec_err!("Expect {} function to take no parameters", self.name());

datafusion/physical-expr/src/scalar_function.rs

jayzhan211 · 2024-04-23T12:19:51Z

datafusion/functions/src/math/random.rs


    #[test]
    fn test_random_expression() {
-        let args = vec![ColumnarValue::Array(Arc::new(NullArray::new(1)))];
-        let array = random(&args)
+        let array = random(&[])


remove this test, since it is already covered

datafusion/datafusion/sqllogictest/test_files/expr.slt

Lines 788 to 790 in da82cec

SELECT
  random() BETWEEN 0.0 AND 1.0,
  random() = random()

Signed-off-by: jayzhan211 <[email protected]>

jayzhan211 · 2024-04-23T12:43:54Z

datafusion/core/tests/user_defined/user_defined_scalar_functions.rs

@@ -403,123 +398,6 @@ async fn test_user_defined_functions_with_alias() -> Result<()> {
    Ok(())
 }

-#[derive(Debug)]
-pub struct RandomUDF {


We already have RandomFunc, which does the same thing, so I removed RandomUDF.

Signed-off-by: jayzhan211 <[email protected]>

jayzhan211 · 2024-04-23T13:44:30Z

datafusion/functions/src/math/pi.rs

-        }
-        let array = Float64Array::from_value(std::f64::consts::PI, 1);
-        Ok(ColumnarValue::Array(Arc::new(array)))
+    fn invoke(&self, _args: &[ColumnarValue]) -> Result<ColumnarValue> {


I think signature check is enough, so just ignore args

alamb · 2024-04-23T17:54:21Z

datafusion/core/tests/user_defined/user_defined_scalar_functions.rs

-}
-
-#[tokio::test]
-async fn deregister_udf() -> Result<()> {


This function was just moved; the test remains.

alamb · 2024-04-23T17:54:30Z

datafusion/core/tests/user_defined/user_defined_scalar_functions.rs

@@ -615,6 +493,22 @@ async fn test_user_defined_functions_cast_to_i64() -> Result<()> {
    Ok(())
 }

+#[tokio::test]
+async fn deregister_udf() -> Result<()> {


alamb · 2024-04-23T17:55:57Z

datafusion/expr/src/udf.rs

-    /// # Zero Argument Functions
-    /// If the function has zero parameters (e.g. `now()`) it will be passed a
-    /// single element slice which is a a null array to indicate the batch's row
-    /// count (so the function can know the resulting array size).
-    ///


Can we please leave a note about what is required to implement Zero Argument Functions? I think the expectation is that the output is a single ColumnarValue::Scalar, rather than an Array
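To illustrate the expectation described above, here is a minimal sketch of a zero-argument function that returns a single scalar and leaves expansion to the caller. `ColumnarValue` below is a simplified stand-in restricted to `f64` (an assumption for illustration, not DataFusion's real type), and `expand_to_rows` is a hypothetical helper.

```rust
/// Simplified stand-in for DataFusion's `ColumnarValue` (assumption: f64 only).
#[derive(Debug, PartialEq)]
enum ColumnarValue {
    Scalar(f64),
    Array(Vec<f64>),
}

/// Toy `pi()`: rejects any argument and returns one scalar for the whole batch.
fn pi(args: &[ColumnarValue]) -> Result<ColumnarValue, String> {
    if !args.is_empty() {
        return Err("Expect pi function to take no parameters".to_string());
    }
    Ok(ColumnarValue::Scalar(std::f64::consts::PI))
}

/// Hypothetical helper: the caller, which knows the row count, expands a scalar.
fn expand_to_rows(value: ColumnarValue, num_rows: usize) -> Vec<f64> {
    match value {
        ColumnarValue::Scalar(v) => vec![v; num_rows],
        ColumnarValue::Array(values) => values,
    }
}
```

This only works for functions whose result is the same for every row; a function such as `random()` cannot be expressed this way.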

alamb · 2024-04-23T18:01:59Z

datafusion/functions/src/math/random.rs

-
-        assert_eq!(floats.len(), 1);
-        assert!(0.0 <= floats.value(0) && floats.value(0) < 1.0);
+    fn invoke(&self, _args: &[ColumnarValue]) -> Result<ColumnarValue> {


I think the expectation here is that random() is invoked once per row rather than once per batch, and the only way it knew how many rows to produce was to be passed a null array 🤔

For example, when I run datafusion-cli from this PR to call random() the same value is returned for each row:

> create table foo as values (1), (2), (3), (4), (5);
0 row(s) fetched.
Elapsed 0.018 seconds.

> select column1, random() from foo;
+---------+--------------------+
| column1 | random()           |
+---------+--------------------+
| 1       | 0.9594375709000513 |
| 2       | 0.9594375709000513 |
| 3       | 0.9594375709000513 |
| 4       | 0.9594375709000513 |
| 5       | 0.9594375709000513 |
+---------+--------------------+
5 row(s) fetched.
Elapsed 0.012 seconds.

But I expect that each row has a different value for random()

However, since none of the tests failed, clearly we have a gap in test coverage 🤔

Nice catch! Let me think about how to design it. I would prefer something like support_random to special-case random().

What about adding an invoke_no_args(num_rows: usize) method to ScalarUDFImpl -- with a default implementation that returns a "not implemented" error?

That might make it clear what was happening and would provide clear semantics about what to do in this case 🤔
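A rough sketch of that idea, for illustration only. `ScalarUdfSketch` is a hypothetical, trimmed-down trait (the real `ScalarUDFImpl` has more methods and uses `ColumnarValue`), values are plain `f64`, and the xorshift generator is a stand-in so the example needs no external crate.

```rust
use std::cell::Cell;

/// Hypothetical, trimmed-down trait; not the real `ScalarUDFImpl`.
trait ScalarUdfSketch {
    fn name(&self) -> &str;

    /// Called when the function receives at least one argument.
    fn invoke(&self, args: &[f64]) -> Result<Vec<f64>, String>;

    /// Called for zero-argument calls; `num_rows` is the batch's row count.
    /// The default rejects the call, so only functions that opt in support it.
    fn invoke_no_args(&self, num_rows: usize) -> Result<Vec<f64>, String> {
        let _ = num_rows;
        Err(format!("Not implemented: {} with zero arguments", self.name()))
    }
}

/// A function that takes arguments and keeps the default `invoke_no_args`.
struct Negate;

impl ScalarUdfSketch for Negate {
    fn name(&self) -> &str {
        "negate"
    }
    fn invoke(&self, args: &[f64]) -> Result<Vec<f64>, String> {
        Ok(args.iter().map(|v| -v).collect())
    }
}

/// Toy `random()`; `state` must be seeded with a non-zero value.
struct Random {
    state: Cell<u64>,
}

impl Random {
    /// One xorshift64 step, mapped to the half-open range [0, 1).
    fn next_f64(&self) -> f64 {
        let mut x = self.state.get();
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.state.set(x);
        (x >> 11) as f64 / (1u64 << 53) as f64
    }
}

impl ScalarUdfSketch for Random {
    fn name(&self) -> &str {
        "random"
    }
    fn invoke(&self, _args: &[f64]) -> Result<Vec<f64>, String> {
        Err(format!("Expect {} function to take no parameters", self.name()))
    }
    /// Produces one fresh value per row instead of one value per batch.
    fn invoke_no_args(&self, num_rows: usize) -> Result<Vec<f64>, String> {
        Ok((0..num_rows).map(|_| self.next_f64()).collect())
    }
}
```

With this shape, `Random` receives the row count directly, and functions that never take zero arguments need no extra code.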

I hadn't seen this message at the time. It is also a good idea.

Do we test the uuid function? I think uuid has the same property as random.
cc @jayzhan211

We have a test like this:

query II
SELECT octet_length(uuid()), length(uuid())
----
36 36

I mean that the uuid function should also be invoked once per row rather than once per batch, like the random() function alamb mentioned.

I tested the function in Spark:

spark-sql> desc test;
col1 int NULL
Time taken: 0.065 seconds, Fetched 1 row(s)
spark-sql> select * from test;
1
2
spark-sql> select *,uuid() from test;
1 6b04b66c-2e6c-4925-8b18-a9d51d5ed80a
2 3e0be0c2-9ff2-422a-8f3f-5cdb6551264b

I filed an issue for this: #10247

Signed-off-by: jayzhan211 <[email protected]>

datafusion/physical-expr/src/scalar_function.rs

jayzhan211 · 2024-04-24T10:49:38Z

How about adding num_rows to the invoke function and letting the user decide how to deal with args and num_rows?

```rust
fn invoke(&self, args: &[ColumnarValue], num_rows: usize) -> Result<ColumnarValue>;
```

It also covers the case where both args and num_rows are needed at once.

On Wed, Apr 24, 2024, 6:27 PM Kun Liu wrote:

In datafusion/physical-expr/src/scalar_function.rs:

         // evaluate the function
         match self.fun {
-            ScalarFunctionDefinition::UDF(ref fun) => fun.invoke(&inputs),
+            ScalarFunctionDefinition::UDF(ref fun) => {
+                if fun.support_randomness() {
+                    fun.invoke_no_args(batch.num_rows())

Do we have any method to hide the special behavior?
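One possible way to hide the special case from implementors is to let the call site branch on whether the argument list is empty, so no `support_randomness` flag is needed. The following is only a sketch under that assumption and is not the code merged in this PR: `Batch`, `Udf`, `evaluate`, and `RowNumberOrDouble` are hypothetical names, and columns are plain `Vec<f64>`.

```rust
/// Simplified stand-in for a record batch: only the row count matters here.
struct Batch {
    num_rows: usize,
}

/// Hypothetical trimmed-down UDF interface (assumption, not the real trait).
trait Udf {
    fn invoke(&self, args: &[Vec<f64>]) -> Result<Vec<f64>, String>;
    fn invoke_no_args(&self, num_rows: usize) -> Result<Vec<f64>, String>;
}

/// Call-site dispatch: zero-argument calls get the row count, all other
/// calls get their evaluated input columns.
fn evaluate(fun: &dyn Udf, inputs: &[Vec<f64>], batch: &Batch) -> Result<Vec<f64>, String> {
    if inputs.is_empty() {
        fun.invoke_no_args(batch.num_rows)
    } else {
        fun.invoke(inputs)
    }
}

/// Toy function exercising both paths: row index with no arguments,
/// otherwise the first column doubled.
struct RowNumberOrDouble;

impl Udf for RowNumberOrDouble {
    fn invoke(&self, args: &[Vec<f64>]) -> Result<Vec<f64>, String> {
        Ok(args[0].iter().map(|v| v * 2.0).collect())
    }
    fn invoke_no_args(&self, num_rows: usize) -> Result<Vec<f64>, String> {
        Ok((0..num_rows).map(|i| i as f64).collect())
    }
}
```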

Signed-off-by: jayzhan211 <[email protected]>

alamb

Thanks @jayzhan211 -- I think this now looks good. I retested with datafusion-cli on this PR, and random() and uuid() now produce a different value for each row.

DataFusion CLI v37.1.0
> create table t as values (1), (2);
0 row(s) fetched.
Elapsed 0.023 seconds.

> select random(), uuid() from t;
+--------------------+--------------------------------------+
| random()           | uuid()                               |
+--------------------+--------------------------------------+
| 0.7899418736375414 | 0e35ed7f-a059-43c1-9fe2-52afaf38c1d7 |
| 0.3470811911542333 | 42be02cc-cf5a-4736-b768-536cff69eb9c |
+--------------------+--------------------------------------+
2 row(s) fetched.
Elapsed 0.014 seconds.

I am going to add some additional tests to ensure this case is covered
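As a rough idea of what such a check could look like, the sketch below tests whether every row holds a different value. `all_rows_distinct` is a hypothetical helper, and the sample values are copied from the datafusion-cli outputs earlier in this thread; a real test would run the queries rather than hard-code results.

```rust
use std::collections::HashSet;

/// Returns true when no two rows hold the same value.
fn all_rows_distinct<T: std::hash::Hash + Eq>(rows: &[T]) -> bool {
    let mut seen = HashSet::new();
    // `insert` returns false on the first repeated value, which stops `all`.
    rows.iter().all(|row| seen.insert(row))
}

/// `random()` column from the first datafusion-cli run: one value repeated.
const BEFORE_FIX: [&str; 5] = ["0.9594375709000513"; 5];

/// `random()` column from the retest: one value per row.
const AFTER_FIX: [&str; 2] = ["0.7899418736375414", "0.3470811911542333"];
```

Here `all_rows_distinct(&BEFORE_FIX)` is false and `all_rows_distinct(&AFTER_FIX)` is true, which is the property the existing single-row tests could not detect.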

alamb · 2024-04-26T11:19:38Z

Here is a PR with tests that validate we get the right answer for multiple rows: #10248

Thanks again @jayzhan211 for cleaning this up and @liukun4515 for your reviews

jayzhan211 · 2024-04-26T12:38:12Z

Thanks @alamb and @liukun4515

Avoid create null array for empty args

eabfe68

Signed-off-by: jayzhan211 <[email protected]>

github-actions bot added the physical-expr Physical Expressions label Apr 23, 2024

jayzhan211 commented Apr 23, 2024

fix test

7b529d9

Signed-off-by: jayzhan211 <[email protected]>

jayzhan211 mentioned this pull request Apr 23, 2024

Move coalesce to datafusion-functions and remove BuiltInScalarFunction #10098

Merged

fix test

03ec8b5

Signed-off-by: jayzhan211 <[email protected]>

github-actions bot added the core Core DataFusion crate label Apr 23, 2024

jayzhan211 marked this pull request as ready for review April 23, 2024 09:53

alamb reviewed Apr 23, 2024

jayzhan211 marked this pull request as draft April 23, 2024 12:07

jayzhan211 changed the title ~~Avoid creating null array for empty args~~ ScalarUDF: Remove supports_zero_argument and avoid creating null array for empty args Apr 23, 2024

jayzhan211 commented Apr 23, 2024

jayzhan211 added 3 commits April 23, 2024 20:22

return scalar instead of array

36e685e

Signed-off-by: jayzhan211 <[email protected]>

remove supports 0 args in scalarudf

7b04c0b

Signed-off-by: jayzhan211 <[email protected]>

cleanup

7c10382

Signed-off-by: jayzhan211 <[email protected]>

github-actions bot added logical-expr Logical plan and expressions sqllogictest SQL Logic Tests (.slt) labels Apr 23, 2024

jayzhan211 commented Apr 23, 2024

rm test1

864d197

Signed-off-by: jayzhan211 <[email protected]>

github-actions bot removed the sqllogictest SQL Logic Tests (.slt) label Apr 23, 2024

jayzhan211 commented Apr 23, 2024

jayzhan211 marked this pull request as ready for review April 23, 2024 13:44

alamb reviewed Apr 23, 2024

jayzhan211 mentioned this pull request Apr 24, 2024

ScalarUDF: Remove supports_zero_argument and avoid creating null array for empty args #10205

Closed

jayzhan211 marked this pull request as draft April 24, 2024 00:44

invoke no args and support randomness

5b51fb7

Signed-off-by: jayzhan211 <[email protected]>

jayzhan211 marked this pull request as ready for review April 24, 2024 03:05

jayzhan211 requested a review from alamb April 24, 2024 03:05

liukun4515 reviewed Apr 24, 2024

This was referenced Apr 25, 2024

DataFusion weekly project plan (Andrew Lamb) - April 22, 2024 #10172

Closed

Simplify no argument handling jayzhan211/arrow-datafusion#2

Closed

jayzhan211 added 2 commits April 26, 2024 08:03

rm randomness

88d2a33

Signed-off-by: jayzhan211 <[email protected]>

add func with no args

7c81776

Signed-off-by: jayzhan211 <[email protected]>

liukun4515 mentioned this pull request Apr 26, 2024

uuid and random need return different value in different row #10247

Closed

jayzhan211 marked this pull request as draft April 26, 2024 09:22

array

bd4c65b

Signed-off-by: jayzhan211 <[email protected]>

jayzhan211 marked this pull request as ready for review April 26, 2024 09:46

alamb approved these changes Apr 26, 2024

alamb mentioned this pull request Apr 26, 2024

Add tests that random() and uuid() produce unique values for each row #10248

Merged

jayzhan211 merged commit c9bd291 into apache:main Apr 26, 2024
24 checks passed

alamb added the api change Changes the API exposed to users of the crate label Apr 26, 2024

alamb mentioned this pull request Apr 29, 2024

DataFusion weekly project plan (Andrew Lamb) - April 29, 2024 #10283

Closed

8 tasks

This pull request was closed.
