
Validate ScalarUDF output rows and fix nulls for array_has and get_field for Map #10148

Merged: 22 commits merged into apache:main on Apr 29, 2024

Conversation

@duongcongtoai (Contributor) commented Apr 20, 2024:

Which issue does this PR close?

Closes #5735.

Adds a constraint that each UDF must produce the same number of output rows as input rows, with "arrow_typeof" as an exception.

Two implementations failed the constraint (also fixed in this PR):

  • array_has
  • get_field

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

@github-actions github-actions bot added physical-expr Physical Expressions core Core DataFusion crate labels Apr 20, 2024
@duongcongtoai duongcongtoai changed the title Validate UDF input and ouput size Minor: Validate UDF input and ouput size Apr 20, 2024
ScalarFunctionDefinition::UDF(ref fun) => fun.invoke(&inputs),
ScalarFunctionDefinition::UDF(ref fun) => {
let output = fun.invoke(&inputs)?;
let output_count = match &output {
Contributor:

let result = (inner)(&args);

I think we should check it inside the function wrapper
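A minimal sketch of how a check inside the function wrapper might look (the helper name and signature below are hypothetical, not the actual make_scalar_function internals):

use arrow::array::{Array, ArrayRef};
use datafusion_common::{internal_err, Result};

// Hypothetical helper: invoke the wrapped kernel and verify the output has
// exactly as many rows as the (array) inputs.
fn invoke_and_validate(
    inner: &dyn Fn(&[ArrayRef]) -> Result<ArrayRef>,
    args: &[ArrayRef],
) -> Result<ArrayRef> {
    let result = inner(args)?;
    if let Some(expected_rows) = args.first().map(|a| a.len()) {
        if result.len() != expected_rows {
            return internal_err!(
                "UDF returned a different number of rows than expected. Expected: {expected_rows}, Got: {}",
                result.len()
            );
        }
    }
    Ok(result)
}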

Contributor Author (duongcongtoai):

then anyone who implements the ScalarUDFImpl trait without using this util could miss this validation, right?

Contributor Author (duongcongtoai):

for example this function:

fn invoke(&self, args: &[ColumnarValue]) -> Result<ColumnarValue> {

@jayzhan211 (Contributor), Apr 20, 2024:

then anyone who implements the ScalarUDFImpl trait without using this util could miss this validation, right?

True, but the other way around, every function is then forced to check the length, even those that don't need to.

But I can't think of any example where the row count would change; probably only an aggregate function like arrayagg might change the row count 🤔

Contributor Author (duongcongtoai):

I think the initial purpose of this validation is to alert users who define their own UDF that they must follow this constraint (they may provide a wrong implementation, and our code would otherwise just panic with an ambiguous error, for example the error in the reported issue #5635).

After adding this validation, I found one UDF violating this rule, which is arrow_typeof.
Can we consider this function a UDAF? I need your opinion here @alamb

@duongcongtoai (Contributor Author), Apr 21, 2024:

ColumnarValue::Scalar is used for single row

Hmm, maybe this is the point where I got confused, because from the way I see ColumnarValue work (e.g. in the projection stream above), this is not correct. My understanding is that ColumnarValue::Scalar represents a single value shared by all n rows.

@jayzhan211 (Contributor), Apr 21, 2024:

Skipping the validation for arrow_typeof is fine with me, but skipping it for ScalarValue in general is not. I think most functions expect scalar in, scalar out, so we still need to check that the result is a scalar (single row).

will be done somewhere else, depends on the plan

The assumption here could change in the future, and if it changes, the code breaks.

Contributor Author (duongcongtoai):

thank you for the review :D understood

Contributor Author (duongcongtoai):

Btw, I also tested Spark and DuckDB with their version of "typeof".
DuckDB:

toai@ToaiPC:~/proj/rust/arrow-datafusion/datafusion/core/tests/data$ duckdb 
v0.10.2 1601d94f94
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
D select typeof(ints) from parquet_map.parquet
  ;
┌──────────────────────┐
│     typeof(ints)     │
│       varchar        │
├──────────────────────┤
│ MAP(VARCHAR, BIGINT) │
│ MAP(VARCHAR, BIGINT) │
│ MAP(VARCHAR, BIGINT) │
│ MAP(VARCHAR, BIGINT) │
│ MAP(VARCHAR, BIGINT) │
│ MAP(VARCHAR, BIGINT) │

and Spark

val data = Seq(
  ("Alice", Map("age" -> "25", "email" -> "[email protected]")),
  ("Bob", Map("age" -> "30", "email" -> "[email protected]")),
  ("Carol", Map("age" -> "35", "email" -> "[email protected]"))
)
val df = data.toDF("name", "attributes")
val types = df.select($"name", typeof($"name").as("name_type"), $"attributes", typeof($"attributes").as("attributes_type"))
types.show(false)
+-----+---------+---------------------------------------+------------------+
|name |name_type|attributes                             |attributes_type   |
+-----+---------+---------------------------------------+------------------+
|Alice|string   |{age -> 25, email -> [email protected]}|map<string,string>|
|Bob  |string   |{age -> 30, email -> [email protected]}  |map<string,string>|
|Carol|string   |{age -> 35, email -> [email protected]}|map<string,string>|
+-----+---------+---------------------------------------+------------------+

Looks like they also respect the number of input rows.

@duongcongtoai (Contributor Author), Apr 26, 2024:

Oh, actually the current arrow_typeof also respects this behavior (added an extra test case in arrow_typeof.slt):

query T
select arrow_typeof(col) from (select 1 as col union select 2 as col) as table_a;
----
Int64
Int64

=> ArrowTypeOfFunc always returns a Scalar:

        Ok(ColumnarValue::Scalar(ScalarValue::from(format!(
            "{input_data_type}"
        ))))

but the rendered record batch shows n rows anyway, so does this mean "wrong implementation but correct result"?
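For reference, a tiny sketch of why it renders n rows anyway (assumed API shapes; in some DataFusion versions into_array returns ArrayRef directly rather than Result): a Scalar is logically one value shared by every row and is only expanded when an array of a concrete length is needed.

use arrow::array::Array;
use datafusion_common::{Result, ScalarValue};
use datafusion_expr::ColumnarValue;

fn broadcast_example() -> Result<()> {
    // arrow_typeof returns something like this: one logical value.
    let out = ColumnarValue::Scalar(ScalarValue::from("Int64"));
    // When the projection materializes a RecordBatch of, say, 6 rows, the
    // scalar is expanded to 6 identical values, so the rendered output still
    // shows one row per input row.
    let array = out.into_array(6)?;
    assert_eq!(array.len(), 6);
    Ok(())
}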

@duongcongtoai (Contributor Author) commented Apr 21, 2024:

External error: query failed: DataFusion error: Internal error: UDF returned a different number of rows than expected. Expected: 7, Got: 6.
This was likely caused by a bug in DataFusion's code and we would welcome that you file an bug report in our issue tracker
[SQL] select array_has(column1, make_array(5, 6)),
       array_has(column1, make_array(7, NULL)),
       array_has(column2, 5.5),
       array_has(column3, 'o')
from arrays;
at test_files/array.slt:5161

The omitted row is where column1 = null

@github-actions github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Apr 21, 2024
@@ -5169,8 +5169,9 @@ false false false true
true false true false
true false false true
false true false false
false false false false
false false false false
Contributor Author (duongcongtoai):

The test result does not look correct, because it ignores some null rows in between.

@duongcongtoai duongcongtoai marked this pull request as draft April 23, 2024 21:07
@duongcongtoai duongcongtoai marked this pull request as ready for review April 25, 2024 19:43
@duongcongtoai duongcongtoai marked this pull request as draft April 25, 2024 19:44
@duongcongtoai duongcongtoai marked this pull request as ready for review April 25, 2024 19:54
@duongcongtoai duongcongtoai marked this pull request as draft April 25, 2024 19:55
@duongcongtoai duongcongtoai marked this pull request as ready for review April 25, 2024 20:00
ScalarFunctionDefinition::UDF(ref fun) => {
let output = fun.invoke(&inputs)?;
// Only arrow_typeof can bypass this rule
if fun.name() != "arrow_typeof" {
Contributor:

I think we should introduce validate_number_of_rows to ScalarUDFImpl with a default value of true.

Contributor Author (duongcongtoai):

I added it to the trait and the wrapper.
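A sketch of the opt-out shape being discussed (the trait and struct names below are stand-ins for illustration, and the thread later drops this method in favor of checking Array outputs only): a default trait method returning true that individual UDFs can override.

// Stand-in trait to illustrate the idea; the real trait is
// datafusion_expr::ScalarUDFImpl.
trait ValidatesRowCount {
    /// Whether the engine should verify output rows == input rows.
    fn validate_number_of_rows(&self) -> bool {
        true
    }
}

struct ArrowTypeOfSketch;

impl ValidatesRowCount for ArrowTypeOfSketch {
    // arrow_typeof was initially treated as the exception here.
    fn validate_number_of_rows(&self) -> bool {
        false
    }
}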

}
// respect null input
(_, _) => {
boolean_builder.append_null();
Contributor:

👍
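A minimal sketch of the null-respecting pattern this hunk applies (illustrative types only, not the PR's exact array_has code): when either the haystack row or the needle row is null, append null instead of skipping the row, so the output keeps exactly one row per input row.

use arrow::array::{Array, BooleanArray, BooleanBuilder, ListArray, StringArray};

// Illustrative: haystacks is a List<Utf8> column, needles a Utf8 column of
// the same length; returns one boolean (or null) per input row.
fn array_has_like(haystacks: &ListArray, needles: &StringArray) -> BooleanArray {
    let mut builder = BooleanBuilder::with_capacity(haystacks.len());
    for row in 0..haystacks.len() {
        match (haystacks.is_valid(row), needles.is_valid(row)) {
            (true, true) => {
                let items = haystacks.value(row);
                let items = items
                    .as_any()
                    .downcast_ref::<StringArray>()
                    .expect("utf8 items");
                let found = (0..items.len())
                    .any(|i| items.is_valid(i) && items.value(i) == needles.value(row));
                builder.append_value(found);
            }
            // respect null input: emit null instead of dropping the row
            _ => builder.append_null(),
        }
    }
    builder.finish()
}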

@@ -44,6 +44,7 @@ DELETE 24
query T
SELECT strings['not_found'] FROM data LIMIT 1;
----
NULL
@jayzhan211 (Contributor), Apr 26, 2024:

I'm not familiar with Map. Why should we return null here?
Without the change in Map, what is the error like?

@duongcongtoai (Contributor Author), Apr 26, 2024:

It will throw the validation error I added in this PR. I think the correct behavior is to return null for every input row that does not have the associated key. I took a look at DuckDB and Spark, and they also have this behavior:

import org.apache.spark.sql.functions._
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("Spark SQL Map Example").getOrCreate()
import spark.implicits._

val data = Seq(
  ("Alice", Map("age" -> "25", "email" -> "[email protected]")),
  ("Bob", Map("age" -> "30", "email" -> "[email protected]")),
  ("Carol", Map("age" -> "35", "email" -> "[email protected]"))
)
val df = data.toDF("name", "attributes")
val result = df.select($"name", $"attributes.email".as("email"),$"attributes.notfound".as("should_be_null"))
// Show the DataFrame
result.show(false)

+-----+-----------------+--------------+
|name |email            |should_be_null|
+-----+-----------------+--------------+
|Alice|[email protected]|NULL          |
|Bob  |[email protected]  |NULL          |
|Carol|[email protected]|NULL          |
+-----+-----------------+--------------+

Also, a similar implementation in DataFusion, array_element, likewise returns null if the index goes out of range.

Contributor:

Here is how it works on main:

> select strings['not_found'] from 'datafusion/core/tests/data/parquet_map.parquet';
0 row(s) fetched.
Elapsed 0.006 seconds.

Here is how it works on this PR (aka has a single row for each input row)

DataFusion CLI v37.1.0
> select strings['not_found'] from '../datafusion/core/tests/data/parquet_map.parquet';
+----------------------------------------------------------------------+
| ../datafusion/core/tests/data/parquet_map.parquet.strings[not_found] |
+----------------------------------------------------------------------+
|                                                                      |
|                                                                      |
|                                                                      |
| .                                                                    |
| .                                                                    |
| .                                                                    |
+----------------------------------------------------------------------+
209 row(s) fetched. (First 40 displayed. Use --maxrows to adjust)
Elapsed 0.033 seconds.

let map_array = as_map_array(array.as_ref())?;
let key_scalar = Scalar::new(StringArray::from(vec![k.clone()]));
let keys = arrow::compute::kernels::cmp::eq(&key_scalar, map_array.keys())?;
let entries = arrow::compute::filter(map_array.entries(), &keys)?;
Contributor Author (duongcongtoai):

Using filter reduces the output to only the rows whose keys match the input key. But we want to preserve the number of input rows, and return null for any row without a matching key.

Contributor:

I don't understand this

If the input is like this (two rows, each three elements)

{ a: 1, b: 2, c: 100}
{ a: 3, b: 4, c: 200}

An expression like col['c'] will still return 2 rows (but each row will have only a single element)

{ c: 100 }
{ c: 200 }

@duongcongtoai (Contributor Author), Apr 27, 2024:

Previous implementation

map_array.entries() has the type:

pub struct StructArray {
    len: usize,
    data_type: DataType,
    nulls: Option<NullBuffer>,
    fields: Vec<ArrayRef>,
}

With the example above, the layout of the "fields" field will be a vector of 2 arrays, where the first array is the list of keys and the second array is the list of values:

[0]: ["a","b","c","a","b","c"]
[1]: [1,2,100,3,4,200]
                    let keys = arrow::compute::kernels::cmp::eq(&key_scalar, map_array.keys())?;

With this computation, the result is a boolean array marking where key = "c":

[false,false,true,false,false,true]

and thus this operation reduces the rows to:

                    let entries = arrow::compute::filter(map_array.entries(), &keys)?;
[0]: ["c,"c"]
[1]: [100,200]

Problem

However, let's add a row where the map does not have key "c" in between

{ a: 1, b: 2, c: 100}
{ a: 1, b: 2}
{ a: 3, b: 4, c: 200}

Underneath, map_array.entries() is represented as:

[0]: ["a","b","c","a","b","a","b","c"]
[1]: [1,2,100,1,2,3,4,200]

                    let entries = arrow::compute::filter(map_array.entries(), &keys)?;
Now the rows after filtering will be:
[0]: ["c","c"]
[1]: [100,200]

and the returned result will be:

{ c: 100 }
{ c: 200 }

instead of

{ c: 100 }
null
{ c: 200 }
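A row-preserving sketch of the idea (not the PR's exact get_field code; it assumes Utf8 keys and Int64 values purely for illustration): search each map row's own slice of the flattened entries and emit null when the key is absent, so the output always has one value per input row.

use arrow::array::{Array, Int64Array, Int64Builder, MapArray, StringArray};

fn map_get_per_row(map: &MapArray, key: &str) -> Int64Array {
    let keys = map
        .keys()
        .as_any()
        .downcast_ref::<StringArray>()
        .expect("utf8 keys");
    let values = map
        .values()
        .as_any()
        .downcast_ref::<Int64Array>()
        .expect("int64 values");

    let mut builder = Int64Builder::with_capacity(map.len());
    for row in 0..map.len() {
        if map.is_null(row) {
            builder.append_null();
            continue;
        }
        // Entries belonging to this row live in [start, end) of the flat
        // child arrays, so the lookup never crosses row boundaries.
        let start = map.value_offsets()[row] as usize;
        let end = map.value_offsets()[row + 1] as usize;
        match (start..end).find(|&i| keys.is_valid(i) && keys.value(i) == key) {
            Some(i) if values.is_valid(i) => builder.append_value(values.value(i)),
            _ => builder.append_null(),
        }
    }
    builder.finish()
}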

Contributor:

I would expect the result of evaluating col['c'] on

{ a: 1, b: 2, c: 100}
{ a: 1, b: 2}
{ a: 3, b: 4, c: 200}

to be:

{ c: 100 }
null
{ c: 200 }

For example, in duckdb:

D create table foo as values (MAP {'a':1, 'b':2, 'c':100}), (MAP{ 'a':1, 'b':2}), (MAP {'a':1, 'b':2, 'c':200});
D select * from foo;
┌───────────────────────┐
│         col0          │
│ map(varchar, integer) │
├───────────────────────┤
│ {a=1, b=2, c=100}     │
│ {a=1, b=2}            │
│ {a=1, b=2, c=200}     │
└───────────────────────┘
D select col0['c'] from foo;
┌───────────┐
│ col0['c'] │
│  int32[]  │
├───────────┤
│ [100]     │
│ []        │
│ [200]     │
└───────────┘

Basically, a scalar function has the invariant that each input row produces exactly 1 output row.

Contributor Author (duongcongtoai):

I also explained it in this discussion: #10148 (comment)

@jayzhan211 jayzhan211 changed the title Minor: Validate UDF input and ouput size Validate ScalarUDF output rows and support nulls for array_has and get_field for Map Apr 26, 2024
@github-actions github-actions bot added the logical-expr Logical plan and expressions label Apr 26, 2024
@jayzhan211 (Contributor) left a comment:

👍

@alamb (Contributor) left a comment:

Thank you for the work here @duongcongtoai and @jayzhan211 . I don't think this PR handles scalar values correctly -- I left some comments explaining why. Let me know what you think

if fun.validate_number_of_rows() {
let output_size = match &output {
ColumnarValue::Array(array) => array.len(),
ColumnarValue::Scalar(_) => 1,
Contributor:

I don't think this logic is correct -- specifically ColumnarValue::Scalar represents (logically) a single value for all the rows.

Thus I think if the function returns ColumnarValue::Scalar that implicitly can be any number of rows.

Therefore I think the check could be something like

if let ColumnarValue::Array(array) = &output {
  if output_size != array.len() {
    return internal_err...
  }
}

I'll try and make a PR to improve the documentation on ColumnarValue to make this clearer: https://docs.rs/datafusion/latest/datafusion/logical_expr/enum.ColumnarValue.html
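A compilable version of that sketch (the helper name and the input_rows parameter are assumed): only an Array output has a definite length to compare against the input, while a Scalar logically covers any number of rows.

use arrow::array::Array;
use datafusion_common::{internal_err, Result};
use datafusion_expr::ColumnarValue;

fn check_output_rows(input_rows: usize, output: &ColumnarValue) -> Result<()> {
    if let ColumnarValue::Array(array) = output {
        if array.len() != input_rows {
            return internal_err!(
                "UDF returned a different number of rows than expected. Expected: {input_rows}, Got: {}",
                array.len()
            );
        }
    }
    // Scalars are skipped: they are broadcast to whatever length is needed.
    Ok(())
}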

Contributor Author (duongcongtoai):

Ah yes, so we don't need to check the length if the output is a scalar value; we actually had a discussion here: #10148 (comment)

Contributor:

Here is a PR to improve the docs #10265

@duongcongtoai (Contributor Author), Apr 27, 2024:

Oh, then perhaps the current implementation of arrow_typeof can be wrong. If we have a map like this:

{id:1,name:"bob"}
{id:"string_id",name:"alice"}

then arrow_typeof(map.id) will return

int64
int64

Is it possible to have this kind of input, since a map's schema can be dynamic per row?

Contributor:

I see. I agree that we can check with Array only.

@jayzhan211 (Contributor), Apr 27, 2024:

Why is a map key allowed to have different types? Which DB has a similar model?

Contributor Author (duongcongtoai):

Aha, I've read DuckDB's map definition again:

A dictionary of multiple named values, each key having the same type and each value having the same type. Keys and values can be any type and can be different types from one another

I am new to this kind of project, but I think we don't have and should not have that use case 🤔. Otherwise it would affect a lot of execution logic.

Contributor:

In DuckDB, map keys can have different values for each row, but the types should not differ.
In an OLAP system with a columnar format (we use Arrow), it does not make sense to have different types on different rows.

Contributor:

MAPs must have a single type for all keys, and a single type for all values

Contributor Author (duongcongtoai):

👍

@@ -69,4 +69,8 @@ impl ScalarUDFImpl for ArrowTypeOfFunc {
"{input_data_type}"
))))
}

fn validate_number_of_rows(&self) -> bool {
Contributor:

I don't think this is correct -- arrow_typeof always returns the same number of rows as its input...

Contributor Author (duongcongtoai):

Then I think this new method is no longer needed.

@github-actions github-actions bot removed the logical-expr Logical plan and expressions label Apr 27, 2024
@duongcongtoai (Contributor Author):

Requested changes made, please check again @alamb.

@alamb (Contributor) left a comment:

Thank you @duongcongtoai -- this looks good to me. Thank you @jayzhan211 for the review

All in all this is a really nice change 🏆 : thank you for fixing the underlying problem that the supposedly easy-to-add check exposed

.err()
.unwrap()
.to_string(),
"UDF returned a different number of rows than expected"
Contributor:

👌 -- very nice

@@ -44,6 +44,7 @@ DELETE 24
query T
Contributor:

I had to remind myself what this data looked like. Here it is for anyone else who may be interested

DataFusion CLI v37.1.0
> select * from 'datafusion/core/tests/data/parquet_map.parquet';
+----------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------+
| ints           | strings                                                                                                                                                                                               | timestamp            |
+----------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------+
| {bytes: 38906} | {host: 198.194.132.41, method: GET, protocol: HTTP/1.0, referer: https://some.com/this/endpoint/prints/money, request: /observability/metrics/production, status: 400, user-identifier: shaneIxD}     | 06/Oct/2023:17:53:45 |
| {bytes: 44606} | {host: 140.115.224.194, method: PATCH, protocol: HTTP/1.0, referer: https://we.org/user/booperbot124, request: /booper/bopper/mooper/mopper, status: 304, user-identifier: jesseddy}                  | 06/Oct/2023:17:53:45 |
| {bytes: 23517} | {host: 63.69.43.67, method: GET, protocol: HTTP/2.0, referer: https://random.net/booper/bopper/mooper/mopper, request: /booper/bopper/mooper/mopper, status: 550, user-identifier: jesseddy}          | 06/Oct/2023:17:53:45 |
| {bytes: 44876} | {host: 69.4.253.156, method: PATCH, protocol: HTTP/1.1, referer: https://some.net/booper/bopper/mooper/mopper, request: /user/booperbot124, status: 403, user-identifier: Karimmove}                  | 06/Oct/2023:17:53:45 |
| {bytes: 34122} | {host: 239.152.196.123, method: DELETE, protocol: HTTP/2.0, referer: https://for.com/observability/metrics/production, request: /apps/deploy, status: 403, user-identifier: meln1ks}                  | 06/Oct/2023:17:53:45 |
| {bytes: 37438} | {host: 95.243.186.123, method: DELETE, protocol: HTTP/1.1, referer: https://make.de/wp-admin, request: /wp-admin, status: 550, user-identifier: Karimmove}                                            | 06/Oct/2023:17:53:45 |
| {bytes: 45784} | {host: 66.142.251.66, method: PUT, protocol: HTTP/2.0, referer: https://some.org/apps/deploy, request: /secret-info/open-sesame, status: 403, user-identifier: benefritz}                             | 06/Oct/2023:17:53:45 |
| {bytes: 27788} | {host: 157.85.140.215, method: GET, protocol: HTTP/1.1, referer: https://random.de/booper/bopper/mooper/mopper, request: /booper/bopper/mooper/mopper, status: 401, user-identifier: devankoshal}     | 06/Oct/2023:17:53:45 |
| {bytes: 5344}  | {host: 62.191.179.3, method: POST, protocol: HTTP/1.0, referer: https://random.org/booper/bopper/mooper/mopper, request: /observability/metrics/production, status: 400, user-identifier: jesseddy}   | 06/Oct/2023:17:53:45 |
| {bytes: 9136}  | {host: 237.213.221.20, method: PUT, protocol: HTTP/2.0, referer: https://some.us/this/endpoint/prints/money, request: /observability/metrics/production, status: 304, user-identifier: ahmadajmi}     | 06/Oct/2023:17:53:46 |
| {bytes: 5640}  | {host: 38.148.115.2, method: GET, protocol: HTTP/1.0, referer: https://for.net/apps/deploy, request: /do-not-access/needs-work, status: 301, user-identifier: benefritz}                              | 06/Oct/2023:17:53:46 |
...


@@ -5197,8 +5199,9 @@ false false false true
true false true false
true false false true
false true false false
false false false false
false false false false
NULL NULL false false
Contributor:

Likewise, I agree this should have 7 output rows.

statement ok
CREATE TABLE fixed_size_arrays
AS VALUES
(arrow_cast(make_array(make_array(NULL, 2),make_array(3, NULL)), 'FixedSizeList(2, List(Int64))'), arrow_cast(make_array(1.1, 2.2, 3.3), 'FixedSizeList(3, Float64)'), arrow_cast(make_array('L', 'o', 'r', 'e', 'm'), 'FixedSizeList(5, Utf8)')),
(arrow_cast(make_array(make_array(3, 4),make_array(5, 6)), 'FixedSizeList(2, List(Int64))'), arrow_cast(make_array(NULL, 5.5, 6.6), 'FixedSizeList(3, Float64)'), arrow_cast(make_array('i', 'p', NULL, 'u', 'm'), 'FixedSizeList(5, Utf8)')),
(arrow_cast(make_array(make_array(5, 6),make_array(7, 8)), 'FixedSizeList(2, List(Int64))'), arrow_cast(make_array(7.7, 8.8, 9.9), 'FixedSizeList(3, Float64)'), arrow_cast(make_array('d', NULL, 'l', 'o', 'r'), 'FixedSizeList(5, Utf8)')),
(arrow_cast(make_array(make_array(7, NULL),make_array(9, 10)), 'FixedSizeList(2, List(Int64))'), arrow_cast(make_array(10.1, NULL, 12.2), 'FixedSizeList(3, Float64)'), arrow_cast(make_array('s', 'i', 't', 'a', 'b'), 'FixedSizeList(5, Utf8)')),
(NULL, arrow_cast(make_array(13.3, 14.4, 15.5), 'FixedSizeList(3, Float64)'), arrow_cast(make_array('a', 'm', 'e', 't', 'x'), 'FixedSizeList(5, Utf8)')),
(arrow_cast(make_array(make_array(11, 12),make_array(13, 14)), 'FixedSizeList(2, List(Int64))'), NULL, arrow_cast(make_array(',','a','b','c','d'), 'FixedSizeList(5, Utf8)')),
(arrow_cast(make_array(make_array(15, 16),make_array(NULL, 18)), 'FixedSizeList(2, List(Int64))'), arrow_cast(make_array(16.6, 17.7, 18.8), 'FixedSizeList(3, Float64)'), NULL)
;

@@ -5183,8 +5184,9 @@ false false false true
true false true false
true false false true
false true false false
false false false false
false false false false
NULL NULL false false
Contributor:

I double checked and the arrays table has 7 rows, so I agree the correct answer has 7 output rows as well

statement ok
CREATE TABLE arrays
AS VALUES
(make_array(make_array(NULL, 2),make_array(3, NULL)), make_array(1.1, 2.2, 3.3), make_array('L', 'o', 'r', 'e', 'm')),
(make_array(make_array(3, 4),make_array(5, 6)), make_array(NULL, 5.5, 6.6), make_array('i', 'p', NULL, 'u', 'm')),
(make_array(make_array(5, 6),make_array(7, 8)), make_array(7.7, 8.8, 9.9), make_array('d', NULL, 'l', 'o', 'r')),
(make_array(make_array(7, NULL),make_array(9, 10)), make_array(10.1, NULL, 12.2), make_array('s', 'i', 't')),
(NULL, make_array(13.3, 14.4, 15.5), make_array('a', 'm', 'e', 't')),
(make_array(make_array(11, 12),make_array(13, 14)), NULL, make_array(',')),
(make_array(make_array(15, 16),make_array(NULL, 18)), make_array(16.6, 17.7, 18.8), NULL)
;

@alamb alamb changed the title Validate ScalarUDF output rows and support nulls for array_has and get_field for Map Validate ScalarUDF output rows and fix nulls for array_has and get_field for Map Apr 29, 2024
@alamb alamb merged commit 0f2a68e into apache:main Apr 29, 2024
24 checks passed
This pull request was closed.
Labels
core Core DataFusion crate physical-expr Physical Expressions sqllogictest SQL Logic Tests (.slt)