
Validate ScalarUDF output rows and fix nulls for array_has and get_field for Map #10148

Merged: 22 commits merged into apache:main on Apr 29, 2024

Conversation

@duongcongtoai (Contributor) commented Apr 20, 2024:

Which issue does this PR close?

Closes #5735.

Adds a constraint that each UDF must produce the same number of output rows as input rows, with "arrow_typeof" as an exception.

Two implementations failed the constraint (also fixed in this PR):

  • array_has
  • get_field

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

@github-actions github-actions bot added physical-expr Physical Expressions core Core DataFusion crate labels Apr 20, 2024
@duongcongtoai duongcongtoai changed the title Validate UDF input and ouput size Minor: Validate UDF input and ouput size Apr 20, 2024
ScalarFunctionDefinition::UDF(ref fun) => fun.invoke(&inputs),
ScalarFunctionDefinition::UDF(ref fun) => {
let output = fun.invoke(&inputs)?;
let output_count = match &output {
Contributor:

let result = (inner)(&args);

I think we should check it inside the function wrapper
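A minimal sketch of how a check inside the function wrapper might look (the helper name and signature below are hypothetical, not the actual make_scalar_function internals):

use arrow::array::{Array, ArrayRef};
use datafusion_common::{internal_err, Result};

// Hypothetical helper: invoke the wrapped kernel and verify the output has
// exactly as many rows as the (array) inputs.
fn invoke_and_validate(
    inner: &dyn Fn(&[ArrayRef]) -> Result<ArrayRef>,
    args: &[ArrayRef],
) -> Result<ArrayRef> {
    let result = inner(args)?;
    if let Some(expected_rows) = args.first().map(|a| a.len()) {
        if result.len() != expected_rows {
            return internal_err!(
                "UDF returned a different number of rows than expected. Expected: {expected_rows}, Got: {}",
                result.len()
            );
        }
    }
    Ok(result)
}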

Contributor Author (duongcongtoai):

then anyone who implements the ScalarUDFImpl trait without using this util could miss this validation, right?

Contributor Author (duongcongtoai):

for example this function:

fn invoke(&self, args: &[ColumnarValue]) -> Result<ColumnarValue> {

@jayzhan211 (Contributor), Apr 20, 2024:

then anyone who implements the ScalarUDFImpl trait without using this util could miss this validation, right?

True, but the other way around, every function is then forced to check the length, even those that don't need to.

But I can't think of any example where the row count would change; probably only an aggregate function like arrayagg might change the row count 🤔

Contributor Author (duongcongtoai):

I think the initial purpose of this validation is to alert users who define their own UDF that they must follow this constraint (they may provide a wrong implementation, and our code would otherwise just panic with an ambiguous error, for example the error in the reported issue #5635).

After adding this validation, I found one UDF violating this rule, which is arrow_typeof.
Can we consider this function a UDAF? I need your opinion here @alamb

@duongcongtoai (Contributor Author), Apr 21, 2024:

ColumnarValue::Scalar is used for single row

Hmm, maybe this is the point where I got confused, because from the way I see ColumnarValue work (e.g. in the projection stream above), this is not correct. My understanding is that ColumnarValue::Scalar represents a single value shared by all n rows.

@jayzhan211 (Contributor), Apr 21, 2024:

Skipping the validation for arrow_typeof is fine with me, but skipping it for ScalarValue in general is not. I think most functions expect scalar in, scalar out, so we still need to check that the result is a scalar (single row).

will be done somewhere else, depends on the plan

The assumption here could change in the future, and if it changes, the code breaks.

Contributor Author (duongcongtoai):

thank you for the review :D understood

Contributor Author (duongcongtoai):

Btw, I also tested Spark and DuckDB with their version of "typeof".
DuckDB:

toai@ToaiPC:~/proj/rust/arrow-datafusion/datafusion/core/tests/data$ duckdb 
v0.10.2 1601d94f94
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
D select typeof(ints) from parquet_map.parquet
  ;
┌──────────────────────┐
│     typeof(ints)     │
│       varchar        │
├──────────────────────┤
│ MAP(VARCHAR, BIGINT) │
│ MAP(VARCHAR, BIGINT) │
│ MAP(VARCHAR, BIGINT) │
│ MAP(VARCHAR, BIGINT) │
│ MAP(VARCHAR, BIGINT) │
│ MAP(VARCHAR, BIGINT) │

and Spark

val data = Seq(
  ("Alice", Map("age" -> "25", "email" -> "[email protected]")),
  ("Bob", Map("age" -> "30", "email" -> "[email protected]")),
  ("Carol", Map("age" -> "35", "email" -> "[email protected]"))
)
val df = data.toDF("name", "attributes")
val types = df.select($"name", typeof($"name").as("name_type"), $"attributes", typeof($"attributes").as("attributes_type"))
types.show(false)
+-----+---------+---------------------------------------+------------------+
|name |name_type|attributes                             |attributes_type   |
+-----+---------+---------------------------------------+------------------+
|Alice|string   |{age -> 25, email -> [email protected]}|map<string,string>|
|Bob  |string   |{age -> 30, email -> [email protected]}  |map<string,string>|
|Carol|string   |{age -> 35, email -> [email protected]}|map<string,string>|
+-----+---------+---------------------------------------+------------------+

Looks like they also respect the number of input rows.

@duongcongtoai (Contributor Author), Apr 26, 2024:

Oh, actually the current arrow_typeof also respects this behavior (added an extra test case in arrow_typeof.slt):

query T
select arrow_typeof(col) from (select 1 as col union select 2 as col) as table_a;
----
Int64
Int64

=> ArrowTypeOfFunc always returns a Scalar:

        Ok(ColumnarValue::Scalar(ScalarValue::from(format!(
            "{input_data_type}"
        ))))

but the rendered record batch shows n rows anyway, so does this mean "wrong implementation but correct result"?
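For reference, a tiny sketch of why it renders n rows anyway (assumed API shapes; in some DataFusion versions into_array returns ArrayRef directly rather than Result): a Scalar is logically one value shared by every row and is only expanded when an array of a concrete length is needed.

use arrow::array::Array;
use datafusion_common::{Result, ScalarValue};
use datafusion_expr::ColumnarValue;

fn broadcast_example() -> Result<()> {
    // arrow_typeof returns something like this: one logical value.
    let out = ColumnarValue::Scalar(ScalarValue::from("Int64"));
    // When the projection materializes a RecordBatch of, say, 6 rows, the
    // scalar is expanded to 6 identical values, so the rendered output still
    // shows one row per input row.
    let array = out.into_array(6)?;
    assert_eq!(array.len(), 6);
    Ok(())
}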

@duongcongtoai (Contributor Author) commented Apr 21, 2024:

External error: query failed: DataFusion error: Internal error: UDF returned a different number of rows than expected. Expected: 7, Got: 6.
This was likely caused by a bug in DataFusion's code and we would welcome that you file an bug report in our issue tracker
[SQL] select array_has(column1, make_array(5, 6)),
       array_has(column1, make_array(7, NULL)),
       array_has(column2, 5.5),
       array_has(column3, 'o')
from arrays;
at test_files/array.slt:5161

The omitted row is where column1 = null

@github-actions github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Apr 21, 2024
@@ -5169,8 +5169,9 @@ false false false true
true false true false
true false false true
false true false false
false false false false
false false false false
Contributor Author (duongcongtoai):

The test result does not look correct, because it ignores some null rows in between.

@duongcongtoai duongcongtoai marked this pull request as draft April 23, 2024 21:07
@duongcongtoai duongcongtoai marked this pull request as ready for review April 25, 2024 19:43
@duongcongtoai duongcongtoai marked this pull request as draft April 25, 2024 19:44
@duongcongtoai duongcongtoai marked this pull request as ready for review April 25, 2024 19:54
@duongcongtoai duongcongtoai marked this pull request as draft April 25, 2024 19:55
@duongcongtoai duongcongtoai marked this pull request as ready for review April 25, 2024 20:00
ScalarFunctionDefinition::UDF(ref fun) => {
let output = fun.invoke(&inputs)?;
// Only arrow_typeof can bypass this rule
if fun.name() != "arrow_typeof" {
Contributor:

I think we should introduce validate_number_of_rows to ScalarUDFImpl with a default value of true.

Contributor Author (duongcongtoai):

I added it to the trait and the wrapper.
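A sketch of the opt-out shape being discussed (the trait and struct names below are stand-ins for illustration, and the thread later drops this method in favor of checking Array outputs only): a default trait method returning true that individual UDFs can override.

// Stand-in trait to illustrate the idea; the real trait is
// datafusion_expr::ScalarUDFImpl.
trait ValidatesRowCount {
    /// Whether the engine should verify output rows == input rows.
    fn validate_number_of_rows(&self) -> bool {
        true
    }
}

struct ArrowTypeOfSketch;

impl ValidatesRowCount for ArrowTypeOfSketch {
    // arrow_typeof was initially treated as the exception here.
    fn validate_number_of_rows(&self) -> bool {
        false
    }
}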

}
// respect null input
(_, _) => {
boolean_builder.append_null();
Contributor:

👍
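A minimal sketch of the null-respecting pattern this hunk applies (illustrative types only, not the PR's exact array_has code): when either the haystack row or the needle row is null, append null instead of skipping the row, so the output keeps exactly one row per input row.

use arrow::array::{Array, BooleanArray, BooleanBuilder, ListArray, StringArray};

// Illustrative: haystacks is a List<Utf8> column, needles a Utf8 column of
// the same length; returns one boolean (or null) per input row.
fn array_has_like(haystacks: &ListArray, needles: &StringArray) -> BooleanArray {
    let mut builder = BooleanBuilder::with_capacity(haystacks.len());
    for row in 0..haystacks.len() {
        match (haystacks.is_valid(row), needles.is_valid(row)) {
            (true, true) => {
                let items = haystacks.value(row);
                let items = items
                    .as_any()
                    .downcast_ref::<StringArray>()
                    .expect("utf8 items");
                let found = (0..items.len())
                    .any(|i| items.is_valid(i) && items.value(i) == needles.value(row));
                builder.append_value(found);
            }
            // respect null input: emit null instead of dropping the row
            _ => builder.append_null(),
        }
    }
    builder.finish()
}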

@@ -44,6 +44,7 @@ DELETE 24
query T
SELECT strings['not_found'] FROM data LIMIT 1;
----
NULL
@jayzhan211 (Contributor), Apr 26, 2024:

I'm not familiar with Map. Why should we return null here?
Without the change in Map, what is the error like?

@duongcongtoai (Contributor Author), Apr 26, 2024:

It will throw the validation error I added in this PR. I think the correct behavior is to return null for every input row that does not have the associated key. I took a look at DuckDB and Spark, and they also have this behavior:

import org.apache.spark.sql.functions._
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("Spark SQL Map Example").getOrCreate()
import spark.implicits._

val data = Seq(
  ("Alice", Map("age" -> "25", "email" -> "[email protected]")),
  ("Bob", Map("age" -> "30", "email" -> "[email protected]")),
  ("Carol", Map("age" -> "35", "email" -> "[email protected]"))
)
val df = data.toDF("name", "attributes")
val result = df.select($"name", $"attributes.email".as("email"),$"attributes.notfound".as("should_be_null"))
// Show the DataFrame
result.show(false)

+-----+-----------------+--------------+
|name |email            |should_be_null|
+-----+-----------------+--------------+
|Alice|[email protected]|NULL          |
|Bob  |[email protected]  |NULL          |
|Carol|[email protected]|NULL          |
+-----+-----------------+--------------+

Also, a similar implementation in DataFusion, array_element, likewise returns null if the index goes out of range.

Contributor:

Here is how it works on main:

> select strings['not_found'] from 'datafusion/core/tests/data/parquet_map.parquet';
0 row(s) fetched.
Elapsed 0.006 seconds.

Here is how it works on this PR (aka has a single row for each input row)

DataFusion CLI v37.1.0
> select strings['not_found'] from '../datafusion/core/tests/data/parquet_map.parquet';
+----------------------------------------------------------------------+
| ../datafusion/core/tests/data/parquet_map.parquet.strings[not_found] |
+----------------------------------------------------------------------+
|                                                                      |
|                                                                      |
|                                                                      |
| .                                                                    |
| .                                                                    |
| .                                                                    |
+----------------------------------------------------------------------+
209 row(s) fetched. (First 40 displayed. Use --maxrows to adjust)
Elapsed 0.033 seconds.

let map_array = as_map_array(array.as_ref())?;
let key_scalar = Scalar::new(StringArray::from(vec![k.clone()]));
let keys = arrow::compute::kernels::cmp::eq(&key_scalar, map_array.keys())?;
let entries = arrow::compute::filter(map_array.entries(), &keys)?;
Contributor Author (duongcongtoai):

Using filter reduces the output to only the rows whose keys match the input key. But we want to preserve the number of input rows, and return null for any row without a matching key.

Contributor:

I don't understand this

If the input is like this (two rows, each three elements)

{ a: 1, b: 2, c: 100}
{ a: 3, b: 4, c: 200}

An expression like col['c'] will still return 2 rows (but each row will have only a single element)

{ c: 100 }
{ c: 200 }

@duongcongtoai (Contributor Author), Apr 27, 2024:

Previous implementation

map_array.entries() has the type:

pub struct StructArray {
    len: usize,
    data_type: DataType,
    nulls: Option<NullBuffer>,
    fields: Vec<ArrayRef>,
}

With the example above, the layout of the "fields" field will be a vector of 2 arrays, where the first array is the list of keys and the second array is the list of values:

[0]: ["a","b","c","a","b","c"]
[1]: [1,2,100,3,4,200]
                    let keys = arrow::compute::kernels::cmp::eq(&key_scalar, map_array.keys())?;

With this computation, the result is a boolean array marking where key = "c":

[false,false,true,false,false,true]

and thus this operation reduces the rows to:

                    let entries = arrow::compute::filter(map_array.entries(), &keys)?;
[0]: ["c,"c"]
[1]: [100,200]

Problem

However, let's add a row where the map does not have key "c" in between

{ a: 1, b: 2, c: 100}
{ a: 1, b: 2}
{ a: 3, b: 4, c: 200}

Underneath, map_array.entries() is represented as:

[0]: ["a","b","c","a","b","a","b","c"]
[1]: [1,2,100,1,2,3,4,200]

                    let entries = arrow::compute::filter(map_array.entries(), &keys)?;
Now the rows after filtering will be:
[0]: ["c","c"]
[1]: [100,200]

and the returned result will be:

{ c: 100 }
{ c: 200 }

instead of

{ c: 100 }
null
{ c: 200 }
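A row-preserving sketch of the idea (not the PR's exact get_field code; it assumes Utf8 keys and Int64 values purely for illustration): search each map row's own slice of the flattened entries and emit null when the key is absent, so the output always has one value per input row.

use arrow::array::{Array, Int64Array, Int64Builder, MapArray, StringArray};

fn map_get_per_row(map: &MapArray, key: &str) -> Int64Array {
    let keys = map
        .keys()
        .as_any()
        .downcast_ref::<StringArray>()
        .expect("utf8 keys");
    let values = map
        .values()
        .as_any()
        .downcast_ref::<Int64Array>()
        .expect("int64 values");

    let mut builder = Int64Builder::with_capacity(map.len());
    for row in 0..map.len() {
        if map.is_null(row) {
            builder.append_null();
            continue;
        }
        // Entries belonging to this row live in [start, end) of the flat
        // child arrays, so the lookup never crosses row boundaries.
        let start = map.value_offsets()[row] as usize;
        let end = map.value_offsets()[row + 1] as usize;
        match (start..end).find(|&i| keys.is_valid(i) && keys.value(i) == key) {
            Some(i) if values.is_valid(i) => builder.append_value(values.value(i)),
            _ => builder.append_null(),
        }
    }
    builder.finish()
}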

Contributor:

I would expect the result of evaluating col['c'] on

{ a: 1, b: 2, c: 100}
{ a: 1, b: 2}
{ a: 3, b: 4, c: 200}

to be:

{ c: 100 }
null
{ c: 200 }

For example, in duckdb:

D create table foo as values (MAP {'a':1, 'b':2, 'c':100}), (MAP{ 'a':1, 'b':2}), (MAP {'a':1, 'b':2, 'c':200});
D select * from foo;
┌───────────────────────┐
│         col0          │
│ map(varchar, integer) │
├───────────────────────┤
│ {a=1, b=2, c=100}     │
│ {a=1, b=2}            │
│ {a=1, b=2, c=200}     │
└───────────────────────┘
D select col0['c'] from foo;
┌───────────┐
│ col0['c'] │
│  int32[]  │
├───────────┤
│ [100]     │
│ []        │
│ [200]     │
└───────────┘

Basically, a scalar function has the invariant that each input row produces exactly 1 output row.

Contributor Author (duongcongtoai):

I also explained it in this discussion: #10148 (comment)

@jayzhan211 jayzhan211 changed the title Minor: Validate UDF input and ouput size Validate ScalarUDF output rows and support nulls for array_has and get_field for Map Apr 26, 2024
@github-actions github-actions bot added the logical-expr Logical plan and expressions label Apr 26, 2024
@jayzhan211 (Contributor) left a comment:

👍

@alamb (Contributor) left a comment:

Thank you for the work here @duongcongtoai and @jayzhan211 . I don't think this PR handles scalar values correctly -- I left some comments explaining why. Let me know what you think

if fun.validate_number_of_rows() {
let output_size = match &output {
ColumnarValue::Array(array) => array.len(),
ColumnarValue::Scalar(_) => 1,
Contributor:

I don't think this logic is correct -- specifically ColumnarValue::Scalar represents (logically) a single value for all the rows.

Thus I think if the function returns ColumnarValue::Scalar that implicitly can be any number of rows.

Therefore I think the check could be something like

if let ColumnarValue::Array(array) = &output {
  if output_size != array.len() {
    return internal_err...
  }
}

I'll try and make a PR to improve the documentation on ColumnarValue to make this clearer: https://docs.rs/datafusion/latest/datafusion/logical_expr/enum.ColumnarValue.html
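A compilable version of that sketch (the helper name and the input_rows parameter are assumed): only an Array output has a definite length to compare against the input, while a Scalar logically covers any number of rows.

use arrow::array::Array;
use datafusion_common::{internal_err, Result};
use datafusion_expr::ColumnarValue;

fn check_output_rows(input_rows: usize, output: &ColumnarValue) -> Result<()> {
    if let ColumnarValue::Array(array) = output {
        if array.len() != input_rows {
            return internal_err!(
                "UDF returned a different number of rows than expected. Expected: {input_rows}, Got: {}",
                array.len()
            );
        }
    }
    // Scalars are skipped: they are broadcast to whatever length is needed.
    Ok(())
}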

Contributor Author (duongcongtoai):

Ah yes, so we don't need to check the length if the output is a scalar value; we actually had a discussion here: #10148 (comment)

Contributor:

Here is a PR to improve the docs #10265

@duongcongtoai (Contributor Author), Apr 27, 2024:

Oh, then perhaps the current implementation of arrow_typeof can be wrong. If we have a map like this:

{id:1,name:"bob"}
{id:"string_id",name:"alice"}

then arrow_typeof(map.id) will return

int64
int64

Is it possible to have this kind of input, since a map's schema can be dynamic per row?

Contributor:

I see. I agree that we can check with Array only.

@jayzhan211 (Contributor), Apr 27, 2024:

Why is a map key allowed to have different types? Which DB has a similar model?

Contributor Author (duongcongtoai):

Aha, I've read DuckDB's map definition again:

A dictionary of multiple named values, each key having the same type and each value having the same type. Keys and values can be any type and can be different types from one another

I am new to this kind of project, but I think we don't have and should not have that use case 🤔. Otherwise it would affect a lot of execution logic.

Contributor:

In DuckDB, map keys can have different values for each row, but the types should not differ.
In an OLAP system with a columnar format (we use Arrow), it does not make sense to have different types on different rows.

Contributor:

MAPs must have a single type for all keys, and a single type for all values

Contributor Author (duongcongtoai):

👍

@@ -69,4 +69,8 @@ impl ScalarUDFImpl for ArrowTypeOfFunc {
"{input_data_type}"
))))
}

fn validate_number_of_rows(&self) -> bool {
Contributor:

I don't think this is correct -- arrow_typeof always returns the same number of rows as its input...

Contributor Author (duongcongtoai):

Then I think this new method is no longer needed.

@github-actions github-actions bot removed the logical-expr Logical plan and expressions label Apr 27, 2024
@duongcongtoai (Contributor Author):

Requested changes made, please check again @alamb.

@alamb (Contributor) left a comment:

Thank you @duongcongtoai -- this looks good to me. Thank you @jayzhan211 for the review

All in all this is a really nice change 🏆 : thank you for fixing the underlying problem that the supposedly easy-to-add check exposed

.err()
.unwrap()
.to_string(),
"UDF returned a different number of rows than expected"
Contributor:

👌 -- very nice

@@ -44,6 +44,7 @@ DELETE 24
query T
Contributor:

I had to remind myself what this data looked like. Here it is for anyone else who may be interested

DataFusion CLI v37.1.0
> select * from 'datafusion/core/tests/data/parquet_map.parquet';
+----------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------+
| ints           | strings                                                                                                                                                                                               | timestamp            |
+----------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------+
| {bytes: 38906} | {host: 198.194.132.41, method: GET, protocol: HTTP/1.0, referer: https://some.com/this/endpoint/prints/money, request: /observability/metrics/production, status: 400, user-identifier: shaneIxD}     | 06/Oct/2023:17:53:45 |
| {bytes: 44606} | {host: 140.115.224.194, method: PATCH, protocol: HTTP/1.0, referer: https://we.org/user/booperbot124, request: /booper/bopper/mooper/mopper, status: 304, user-identifier: jesseddy}                  | 06/Oct/2023:17:53:45 |
| {bytes: 23517} | {host: 63.69.43.67, method: GET, protocol: HTTP/2.0, referer: https://random.net/booper/bopper/mooper/mopper, request: /booper/bopper/mooper/mopper, status: 550, user-identifier: jesseddy}          | 06/Oct/2023:17:53:45 |
| {bytes: 44876} | {host: 69.4.253.156, method: PATCH, protocol: HTTP/1.1, referer: https://some.net/booper/bopper/mooper/mopper, request: /user/booperbot124, status: 403, user-identifier: Karimmove}                  | 06/Oct/2023:17:53:45 |
| {bytes: 34122} | {host: 239.152.196.123, method: DELETE, protocol: HTTP/2.0, referer: https://for.com/observability/metrics/production, request: /apps/deploy, status: 403, user-identifier: meln1ks}                  | 06/Oct/2023:17:53:45 |
| {bytes: 37438} | {host: 95.243.186.123, method: DELETE, protocol: HTTP/1.1, referer: https://make.de/wp-admin, request: /wp-admin, status: 550, user-identifier: Karimmove}                                            | 06/Oct/2023:17:53:45 |
| {bytes: 45784} | {host: 66.142.251.66, method: PUT, protocol: HTTP/2.0, referer: https://some.org/apps/deploy, request: /secret-info/open-sesame, status: 403, user-identifier: benefritz}                             | 06/Oct/2023:17:53:45 |
| {bytes: 27788} | {host: 157.85.140.215, method: GET, protocol: HTTP/1.1, referer: https://random.de/booper/bopper/mooper/mopper, request: /booper/bopper/mooper/mopper, status: 401, user-identifier: devankoshal}     | 06/Oct/2023:17:53:45 |
| {bytes: 5344}  | {host: 62.191.179.3, method: POST, protocol: HTTP/1.0, referer: https://random.org/booper/bopper/mooper/mopper, request: /observability/metrics/production, status: 400, user-identifier: jesseddy}   | 06/Oct/2023:17:53:45 |
| {bytes: 9136}  | {host: 237.213.221.20, method: PUT, protocol: HTTP/2.0, referer: https://some.us/this/endpoint/prints/money, request: /observability/metrics/production, status: 304, user-identifier: ahmadajmi}     | 06/Oct/2023:17:53:46 |
| {bytes: 5640}  | {host: 38.148.115.2, method: GET, protocol: HTTP/1.0, referer: https://for.net/apps/deploy, request: /do-not-access/needs-work, status: 301, user-identifier: benefritz}                              | 06/Oct/2023:17:53:46 |
...


@@ -5197,8 +5199,9 @@ false false false true
true false true false
true false false true
false true false false
false false false false
false false false false
NULL NULL false false
Contributor:

Likewise, I agree this should have 7 output rows.

statement ok
CREATE TABLE fixed_size_arrays
AS VALUES
(arrow_cast(make_array(make_array(NULL, 2),make_array(3, NULL)), 'FixedSizeList(2, List(Int64))'), arrow_cast(make_array(1.1, 2.2, 3.3), 'FixedSizeList(3, Float64)'), arrow_cast(make_array('L', 'o', 'r', 'e', 'm'), 'FixedSizeList(5, Utf8)')),
(arrow_cast(make_array(make_array(3, 4),make_array(5, 6)), 'FixedSizeList(2, List(Int64))'), arrow_cast(make_array(NULL, 5.5, 6.6), 'FixedSizeList(3, Float64)'), arrow_cast(make_array('i', 'p', NULL, 'u', 'm'), 'FixedSizeList(5, Utf8)')),
(arrow_cast(make_array(make_array(5, 6),make_array(7, 8)), 'FixedSizeList(2, List(Int64))'), arrow_cast(make_array(7.7, 8.8, 9.9), 'FixedSizeList(3, Float64)'), arrow_cast(make_array('d', NULL, 'l', 'o', 'r'), 'FixedSizeList(5, Utf8)')),
(arrow_cast(make_array(make_array(7, NULL),make_array(9, 10)), 'FixedSizeList(2, List(Int64))'), arrow_cast(make_array(10.1, NULL, 12.2), 'FixedSizeList(3, Float64)'), arrow_cast(make_array('s', 'i', 't', 'a', 'b'), 'FixedSizeList(5, Utf8)')),
(NULL, arrow_cast(make_array(13.3, 14.4, 15.5), 'FixedSizeList(3, Float64)'), arrow_cast(make_array('a', 'm', 'e', 't', 'x'), 'FixedSizeList(5, Utf8)')),
(arrow_cast(make_array(make_array(11, 12),make_array(13, 14)), 'FixedSizeList(2, List(Int64))'), NULL, arrow_cast(make_array(',','a','b','c','d'), 'FixedSizeList(5, Utf8)')),
(arrow_cast(make_array(make_array(15, 16),make_array(NULL, 18)), 'FixedSizeList(2, List(Int64))'), arrow_cast(make_array(16.6, 17.7, 18.8), 'FixedSizeList(3, Float64)'), NULL)
;

@@ -5183,8 +5184,9 @@ false false false true
true false true false
true false false true
false true false false
false false false false
false false false false
NULL NULL false false
Contributor:

I double checked and the arrays table has 7 rows, so I agree the correct answer has 7 output rows as well

statement ok
CREATE TABLE arrays
AS VALUES
(make_array(make_array(NULL, 2),make_array(3, NULL)), make_array(1.1, 2.2, 3.3), make_array('L', 'o', 'r', 'e', 'm')),
(make_array(make_array(3, 4),make_array(5, 6)), make_array(NULL, 5.5, 6.6), make_array('i', 'p', NULL, 'u', 'm')),
(make_array(make_array(5, 6),make_array(7, 8)), make_array(7.7, 8.8, 9.9), make_array('d', NULL, 'l', 'o', 'r')),
(make_array(make_array(7, NULL),make_array(9, 10)), make_array(10.1, NULL, 12.2), make_array('s', 'i', 't')),
(NULL, make_array(13.3, 14.4, 15.5), make_array('a', 'm', 'e', 't')),
(make_array(make_array(11, 12),make_array(13, 14)), NULL, make_array(',')),
(make_array(make_array(15, 16),make_array(NULL, 18)), make_array(16.6, 17.7, 18.8), NULL)
;

@alamb alamb changed the title Validate ScalarUDF output rows and support nulls for array_has and get_field for Map Validate ScalarUDF output rows and fix nulls for array_has and get_field for Map Apr 29, 2024
@alamb alamb merged commit 0f2a68e into apache:main Apr 29, 2024
24 checks passed
This pull request was closed.
Labels
core Core DataFusion crate physical-expr Physical Expressions sqllogictest SQL Logic Tests (.slt)