Implement `array_slice` and `array_element`, remove `array_trim` #6936

izveigor · 2023-07-12T19:06:18Z

Which issue does this PR close?

Part of #6935
Closes #7147
Closes #6975
Closes #6974

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Yes

Are there any user-facing changes?

Yes

…ay_slice

izveigor · 2023-07-28T19:39:57Z

datafusion/physical-expr/src/expressions/get_indexed_field.rs

+                (DataType::List(_), DataType::Int64, DataType::Int64)
+                | (DataType::List(_), DataType::Int64, DataType::Null)
+                | (DataType::List(_), DataType::Null, DataType::Int64)
+                | (DataType::List(_), DataType::Null, DataType::Null) => Ok(ColumnarValue::Array(array_slice(&[


It does not work with columns, but all functions support column format.
@alamb if you have free time can you give me some advices how to solve the problem?

I think this would entail supporting something like SELECT array_column[int_column] FROM t which while cool I think would be much less common than SELECT array_column[3] FROM t (aka a constant) as you have done here

I think writing the code that takes columnar inputs is likely something we could work out upstream (aka in arrow-rs) as a list_take or similar kernel.

Thus I would suggest we get this PR in (which errors with a non constant) and file a ticket to support columnar field access as a follow on PR

izveigor · 2023-07-28T19:46:30Z

datafusion/proto/src/logical_plan/to_proto.rs

@@ -951,11 +951,24 @@ impl TryFrom<&Expr> for protobuf::LogicalExprNode {
                // see discussion in https://github.com/apache/arrow-datafusion/issues/2565
                return Err(Error::General("Proto serialization error: Expr::ScalarSubquery(_) | Expr::InSubquery(_) | Expr::Exists { .. } | Exp:OuterReferenceColumn not supported".to_string()));
            }
-            Expr::GetIndexedField(GetIndexedField { key, expr }) => Self {
+            Expr::GetIndexedField(GetIndexedField {


I don't know how to parse GetIndexedField in proto.
Should I create a separate ticket for this issue and allow other users to solve it or this problem may cause many problems in the future?

If you can't figure out the protobuf part, I think we have in the past added a "NotImplemented" error and filed a follow on ticket to add support

Though maybe this comment is outdated as it looks like you have the code here

I have created a separate ticket: #7146.

izveigor · 2023-07-28T19:53:27Z

I think PR is enough for an initial review. @alamb @jayzhan211 if you have free time, can you review it?

jayzhan211 · 2023-07-29T13:33:56Z

datafusion/core/tests/sqllogictests/test_files/array.slt

+query error
+select array_slice(make_array(1, 2, 3, 4, 5), 2, NULL), array_slice(make_array('h', 'e', 'l', 'l', 'o'), 3, NULL);
+----
+[4, 5] [l, l, o]


Is this intended?

This implementation is currently not working. See the issue: #7142.

jayzhan211 · 2023-07-29T13:34:12Z

datafusion/core/tests/sqllogictests/test_files/array.slt

+[4, 5] [l, l, o]
+
+# array_slice scalar function #13 (with NULL and negative number)
+query error


jayzhan211 · 2023-07-29T13:35:55Z

datafusion/core/tests/sqllogictests/test_files/array.slt

+----
+[] []
+
+# array_slice scalar function #19 (with negative indexes; nested array)


Does not seems any negative indexes

jayzhan211 · 2023-07-29T13:36:52Z

datafusion/core/tests/sqllogictests/test_files/array.slt

+----
+[[1, 2, 3, 4, 5]]
+
+# array_slice scalar function #20 (with first positive index and last negative index)


first neg then pos?

jayzhan211 · 2023-07-29T13:37:04Z

datafusion/core/tests/sqllogictests/test_files/array.slt

+----
+[4, 5] [l, l]
+
+# array_slice scalar function #21 (with first negative index and last positive index)


first pos then neg?

jayzhan211 · 2023-07-29T13:39:02Z

datafusion/core/tests/sqllogictests/test_files/struct.slt

+
+# struct[i]
+query III
+select struct(1, 3.14, 'h')['c0'], struct(3, 2.55, 'b')['c1'], struct(2, 6.43, 'a')['c2'];


quite curious, why index with 'c0' works

Because we start counting from 0.
Example:

❯ CREATE TABLE values( a INT, b INT, c INT, d FLOAT, e VARCHAR, f VARCHAR ) AS VALUES (1, 1, 2, 1.1, 'Lorem', 'A'), (2, 3, 4, 2.2, 'ipsum', ''), (3, 5, 6, 3.3, 'dolor', 'BB'), (4, 7, 8, 4.4, 'sit', NULL), (NULL, 9, 10, 5.5, 'amet', 'CCC'), (5, NULL, 12, 6.6, ',', 'DD'), (6, 11, NULL, 7.7, 'consectetur', 'E'), (7, 13, 14, NULL, 'adipiscing', 'F'), (8, 15, 16, 8.8, NULL, '') ;

❯ select struct(a, b, c) from values; +------------------------------------+ | struct(values.a,values.b,values.c) | +------------------------------------+ | {c0: 1, c1: 1, c2: 2} | | {c0: 2, c1: 3, c2: 4} | | {c0: 3, c1: 5, c2: 6} | | {c0: 4, c1: 7, c2: 8} | | {c0: , c1: 9, c2: 10} | | {c0: 5, c1: , c2: 12} | | {c0: 6, c1: 11, c2: } | | {c0: 7, c1: 13, c2: 14} | | {c0: 8, c1: 15, c2: 16} | +------------------------------------+

I can understand 0, but is c hard coding somewhere?

Lets me ask another way, can we index it with col0, column0? If not, there might be a hard coding index with the prefix c.

The c0 is hardcoded and part of the output of the struct function:

❯ select struct(1,2); +---------------------------+ | struct(Int64(1),Int64(2)) | +---------------------------+ | {c0: 1, c1: 2} | +---------------------------+

jayzhan211 · 2023-07-30T03:47:56Z

datafusion/core/tests/sql/select.rs

+        "+--------+",
+        "| i0     |",
+        "+--------+",
+        "| [0, 1] |",


Not sure why only one left.

See the ticket: #7147

jayzhan211 · 2023-07-31T11:41:14Z

datafusion/core/tests/sqllogictests/test_files/array.slt

@@ -591,103 +611,109 @@ select array_slice(make_array(1, 2, 3, 4, 5), 2, 6), array_slice(make_array('h',
 ----
 [2, 3, 4, 5] [l, l, o]

-# array_slice scalar function #6 (with zero and positive number)
+# array_slice scalar function #6 (with negative indexes; nested array)


positive indexes?

jayzhan211

LGTM!

izveigor · 2023-07-31T18:26:36Z

@alamb I fixed conflicts and CI, so can you review this PR again if you have free time?

alamb

Thank you for the effort @izveigor -- I don't think we should merge this PR with the regression in field access #7147 -- among other things we use this feature in IOx so adding a known regression seems like a step backwards

I also don't fully understand the implementation of slice -- I think there is a significantly easier and general way to implement this using take which I tried to write up in detal

Let me know what you think

alamb · 2023-07-31T17:27:13Z

datafusion/core/tests/user_defined/user_defined_aggregates.rs

@@ -152,21 +152,6 @@ async fn test_udaf_returning_struct() {
    assert_batches_eq!(expected, &execute(&ctx, sql).await.unwrap());
 }

-/// Demonstrate extracting the fields from a structure using a subquery


Why was this test removed?

(this is an important usecase for us in IOx where we have a UDF that returns a struct that is accessed by name.

datafusion/expr/src/field_util.rs

alamb · 2023-07-31T17:36:01Z

datafusion/physical-expr/src/array_expressions.rs

@@ -2550,69 +3060,6 @@ mod tests {
        assert_eq!("1-*-3-*-*-6-7-*", result.value(0));
    }

-    #[test]


why was this test removed?

Update: the answer is that trim_array was removed

alamb · 2023-07-31T18:26:35Z

datafusion/physical-expr/src/array_expressions.rs

+    }};
+}
+
+macro_rules! slice {


I didn't fully understand this implementation -- it seem like picking a single value from a list array can be implemented by calculating the appropriate offsets and then calling take

For example, given A ListArray like this

┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┌ ─ ─ ─ ─ ─ ─ ┐ │ ┌─────────────┐ ┌───────┐ │ ┌───┐ ┌───┐ ┌───┐ ┌───┐ │ [A,B,C] │ │ (0,3) │ │ 1 │ │ 0 │ │ │ 1 │ │ A │ │ 0 │ ├─────────────┤ ├───────┤ │ ├───┤ ├───┤ ├───┤ ├───┤ │ [] (empty) │ │ (3,3) │ │ 1 │ │ 3 │ │ │ 1 │ │ B │ │ 1 │ ├─────────────┤ ├───────┤ │ ├───┤ ├───┤ ├───┤ ├───┤ │ NULL │ │ (3,4) │ │ 0 │ │ 4 │ │ │ 1 │ │ C │ │ 2 │ ├─────────────┤ ├───────┤ │ ├───┤ ├───┤ ├───┤ ├───┤ │ [D] │ │ 4,5 │ │ 1 │ │ 5 │ │ │ 1 │ │ ? │ │ 3 │ ├─────────────┤ ├───────┤ │ ├───┤ ├───┤ ├───┤ ├───┤ │ [NULL, F] │ │ 5,7 │ │ 1 │ │ 5 │ │ │ 0 │ │ D │ │ 4 │ └─────────────┘ └───────┘ │ └───┘ ├───┤ ├───┤ ├───┤ │ 7 │ │ │ 1 │ │ ? │ │ 5 │ │ Validity └───┘ ├───┤ ├───┤ Logical Logical (nulls) Offsets │ │ 1 │ │ F │ │ 6 │ Values Offsets │ └───┘ └───┘ │ Values │ │ (offsets[i], │ ListArray (Array) offsets[i+1]) └ ─ ─ ─ ─ ─ ─ ┘ │ └ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─

Computing col[1] (aka take the second element) could be computed by adding 1 to the first logical offset, checking if that is still in the array

So in this example, the take indicies would be

[ Some(1), (add 1 to start of (0, 3) is 1 None , // (adding 1 start of (3,3) is 4 which is outside, None, // input was null, None, // adding 1 start of (4,5 ) is 5 which is outside Some(6), // adding 1 to start of (5,7) is 6 ]

This calculation does not depend on datatype of the elements and would be fully general

A similar calculation could be done for slice though in that case you would have to build up the offsets of the new list array as well as the indices of the take of the values.

In general I think many of these array calculations and manipulations can be done in terms of the take kernel

I don't quite comprehend your comment. If I use take kernel with ListArray, the program returns me the elements of ListArray (i. e. Array instances). So if it's really worth it can you give me more information how should I take data?

Example:

use arrow::array::*; use arrow::compute::take; use arrow::datatypes::Int32Type; fn main() { let data = vec![ Some(vec![Some(0), Some(1), Some(2)]), None, Some(vec![Some(3), None, Some(5)]), Some(vec![Some(6), Some(7)]), ]; let values = ListArray::from_iter_primitive::<Int32Type, _, _>(data); let indices = UInt32Array::from(vec![0, 3]); let taken = take(&values, &indices, None).unwrap(); let taken: &GenericListArray<i32> = taken.as_list(); let data = vec![ Some(vec![Some(0), Some(1), Some(2)]), Some(vec![Some(6), Some(7)]), ]; let expected = ListArray::from_iter_primitive::<Int32Type, _, _>(data); assert_eq!(expected, taken.clone()); }

I don't quite comprehend your comment. If I use take kernel with ListArray, the program returns me the elements of ListArray (i. e. Array instances). So if it's really worth it can you give me more information how should I take data?

I was thinking we call take() with the values array (like take(list_array.values(), indices)

So in your example above, something more like

fn main() { let list_array = todo!(); // define ListArray here let data = vec![ Some(vec![Some(0), Some(1), Some(2)]), None, Some(vec![Some(3), None, Some(5)]), Some(vec![Some(6), Some(7)]), ]; let indices = UInt32::from_iter_primitive::<Int32Type, _, _>(data); let taken = take(&list_array.values(), &indices, None).unwrap();

izveigor · 2023-07-31T18:51:50Z

Thank you for the effort @izveigor -- I don't think we should merge this PR with the regression in field access #7147 -- among other things we use this feature in IOx so adding a known regression seems like a step backwards

I am fully accept the fact that we should prevent the regression.

I didn't fully understand this implementation -- it seem like picking a single value from a list array can be implemented by calculating the appropriate offsets and then calling take

I plan for examine the difference between array.slice() and take tomorrow.

izveigor · 2023-08-01T18:33:25Z

@alamb I think we should remain array.slice() instead of take kernel. There are some reasons:

I didn't find the optimal decision for saving the structure for array_slice and array_value (use common parts of the code for both expressions) (because array_value returns NULL statement instead of empty array, and if we don't want to inflate code, we should remain array.slice() statement) (anyway, array_value would use the same way like the current implementation)
I don't think the current implementation is more slowly than take analog.
if take analogue could theoretically be faster, we could fix it at any time.

I understood my mistake about index column support. Now, in my opinion I should direct all forces to fix the bug with struct (which is very important for IOx). I think it is better development path than comparing take and slice.

The main problem with bug related to struct is difference between circumstances for List and Struct cases. For example, DuckDB can accept column support for List statement (like: select make_array(1, 2, 3, 4, 5)[a] from index)), but can not for Struct. So, I can solve this problem by using union struct (struct with two optional options):

#[derive(Clone, PartialEq, Eq, Hash, Debug)]
pub struct GetIndexedFieldKey {
    ///
    pub list_key: Option<Expr>,
    /// 
    pub struct_key: Option<ScalarValue>,
}

impl GetIndexedFieldKey {
    /// Create a new GetIndexedFieldKey expression
    pub fn new(list_key: Option<Expr>, struct_key: Option<ScalarValue>) -> Self {
        // value must be either `list_key` or `struct_key`
        assert_ne!(list_key.is_some(), struct_key.is_some());
        assert_ne!(list_key.is_none(), struct_key.is_none());

        Self {
            list_key,
            struct_key,
        }
    }
}

/// expression to get a field of a struct array.
#[derive(Debug, Hash)]
pub struct GetIndexedFieldExpr {
    /// The expression to find
    arg: Arc<dyn PhysicalExpr>,
    /// The key statement
    key: GetIndexedFieldKey,
    /// The extra key (it can be used only with `DataType::List`)
    extra_key: Option<Arc<dyn PhysicalExpr>>,
}

What do you think about that approach, @alamb?

P.S. I really want that DataFusion index system matches with DuckDB and ClickHouse.

izveigor · 2023-08-02T19:43:42Z

@alamb @jayzhan211 can you review it again if you have free time.
I fixed all regressions: now there is no corrupted test.
Also I remind you that the function array_trim has been deprecated.

alamb · 2023-08-03T15:05:47Z

I plan to review this PR later today

alamb

Thank you very much @izveigor -- I think this looks good to go -- it is well tested and documented. I do think there is

I plan to run this PR through the IOx tests as one last verification, but then I think we should merge it.

Thank you for pushing this through 🦾

cc @vincev @smiklos who I think are also interested in this area

alamb · 2023-08-03T19:32:24Z

datafusion/core/tests/sqllogictests/test_files/struct.slt

+
+# struct[i]
+query III
+select struct(1, 3.14, 'h')['c0'], struct(3, 2.55, 'b')['c1'], struct(2, 6.43, 'a')['c2'];


The c0 is hardcoded and part of the output of the struct function:

❯ select struct(1,2); +---------------------------+ | struct(Int64(1),Int64(2)) | +---------------------------+ | {c0: 1, c1: 2} | +---------------------------+

alamb · 2023-08-03T19:34:00Z

docs/source/user-guide/expressions.md

 | array_to_string(array, delimeter)     | Converts each element to its text representation. `array_to_string([1, 2, 3, 4], ',') -> 1,2,3,4`                                                                        |
 | cardinality(array)                    | Returns the total number of elements in the array. `cardinality([[1, 2, 3], [4, 5, 6]]) -> 6`                                                                            |
 | make_array(value1, [value2 [, ...]])  | Returns an Arrow array using the specified input expressions. `make_array(1, 2, 3) -> [1, 2, 3]`                                                                         |
-| trim_array(array, n)                  | Removes the last n elements from the array.                                                                                                                              |
+| trim_array(array, n)                  | Deprecated                                                                                                                                                               |


alamb · 2023-08-03T19:34:14Z

datafusion/physical-expr/src/array_expressions.rs

@@ -2550,69 +3060,6 @@ mod tests {
        assert_eq!("1-*-3-*-*-6-7-*", result.value(0));
    }

-    #[test]


Update: the answer is that trim_array was removed

alamb · 2023-08-03T20:41:29Z

datafusion/expr/src/expr.rs

+/// If we use index with `DataType::Struct`, then we use the `struct_key` argument with `list_key` equal to `None`.
+/// `list_key` can be any expression, unlike `struct_key` which can only be `ScalarValue::Utf8`.
+#[derive(Clone, PartialEq, Eq, Hash, Debug)]
+pub struct GetIndexedFieldKey {


I found this API to be awkward -- I think we can do something to make it better. I am working on a PR now...

filed #7193 to track improving this API

jayzhan211 · 2023-08-04T12:12:59Z

datafusion/physical-expr/src/array_expressions.rs

@@ -1952,6 +2104,364 @@ mod tests {
        )
    }

+    #[test]
+    fn test_array_element() {


Nit: I think this kind of test is better to write in array.slt

jayzhan211 · 2023-08-04T12:19:18Z

datafusion/physical-expr/src/array_expressions.rs

+
+macro_rules! slice {
+    ($ARRAY:expr, $KEY:expr, $EXTRA_KEY:expr, $RETURN_ELEMENT:expr, $ARRAY_TYPE:ident) => {{
+        let sliced_array: Vec<Arc<dyn Array>> = $ARRAY


Nit: ArrayRef

jayzhan211

LGTM :)

alamb · 2023-08-04T15:14:23Z

I tested this with IOx test suite and everything passed: https://github.com/influxdata/influxdb_iox/pull/8419

While I do think the logical plan node to access fields can and should be cleaned up I see no reason that should gate this PR.

Nicely done @izveigor and thank you for all the help reviewing @jayzhan211

feat: reading array elements

282278d

izveigor marked this pull request as draft July 12, 2023 19:06

github-actions bot added sql SQL Planner logical-expr Logical plan and expressions physical-expr Physical Expressions core Core DataFusion crate sqllogictest SQL Logic Tests (.slt) labels Jul 12, 2023

izveigor added 4 commits July 14, 2023 18:39

feat: array_element, array_slice, struct_extract

415be26

feat: support for positive and zero numbers for array_element and arr…

2f8e992

…ay_slice

feat: support negative numbers for array_element and array_slice

ebff134

feat: add more sql logic tests for array_element and array_slice

11a8d55

izveigor mentioned this pull request Jul 15, 2023

Column support for array functions #6804

Closed

11 tasks

izveigor marked this pull request as ready for review July 18, 2023 11:04

izveigor marked this pull request as draft July 18, 2023 11:04

izveigor added 9 commits July 18, 2023 14:08

Merge branch 'main' into index

862a5f4

feat: unit tests for get indexed fields

2145b7e

feat: sqllogictests for index

1782fed

docs: array_element and array_slice

ec8cfa5

feat: some TODOS

929f169

feat: proto

fe08230

Merge branch 'main' into index

2078d23

feat: proto

0fcd411

fix: tests

518a733

izveigor commented Jul 28, 2023

View reviewed changes

fix: query_nested_get_indexed_field_on_struct

7ad25bf

izveigor marked this pull request as ready for review July 28, 2023 19:53

jayzhan211 requested changes Jul 29, 2023

View reviewed changes

jayzhan211 reviewed Jul 30, 2023

View reviewed changes

jayzhan211 reviewed Jul 31, 2023

View reviewed changes

jayzhan211 approved these changes Jul 31, 2023

View reviewed changes

izveigor added 2 commits July 31, 2023 14:52

fix: typo

de5b671

test: remove test_udaf_returning_struct_subquery

74c69f2

alamb reviewed Jul 31, 2023

View reviewed changes

alamb mentioned this pull request Jul 31, 2023

Add more docs and examples for ListArray and OffsetsBuffer apache/arrow-rs#4607

Merged

feat: column support for list with single index

baef3f9

izveigor added 2 commits August 2, 2023 22:36

fix: list index does not support with columns

4fed361

fix: merge conflicts

b68342d

izveigor added 4 commits August 2, 2023 23:13

fix: proto

65dbcc4

fix: proto test

f91fb30

fix: clippy

502b2c7

fix: test_udaf_returning_struct_subquery

c754a29

izveigor requested review from jayzhan211 and alamb August 3, 2023 05:55

alamb changed the title ~~feat: reading array elements~~ Implement array_slice and array_element, remove array_trim Aug 3, 2023

alamb approved these changes Aug 3, 2023

View reviewed changes

alamb reviewed Aug 3, 2023

View reviewed changes

jayzhan211 reviewed Aug 4, 2023

View reviewed changes

jayzhan211 approved these changes Aug 4, 2023

View reviewed changes

alamb added the api change Changes the API exposed to users of the crate label Aug 4, 2023

alamb merged commit 2354758 into apache:main Aug 4, 2023
23 checks passed

alamb mentioned this pull request Aug 4, 2023

Improve the ergonomics of creating field and list array accesses #7193

Closed

This pull request was closed.

Implement array_slice and array_element, remove array_trim #6936

Implement array_slice and array_element, remove array_trim #6936

Conversation

izveigor commented Jul 12, 2023 • edited Loading

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

izveigor commented Jul 28, 2023

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jayzhan211 Jul 31, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jayzhan211 left a comment

Choose a reason for hiding this comment

izveigor commented Jul 31, 2023

alamb left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

izveigor commented Jul 31, 2023

izveigor commented Aug 1, 2023

izveigor commented Aug 2, 2023

alamb commented Aug 3, 2023

alamb left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jayzhan211 Aug 4, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jayzhan211 left a comment • edited Loading

Choose a reason for hiding this comment

alamb commented Aug 4, 2023

Implement `array_slice` and `array_element`, remove `array_trim` #6936

Implement `array_slice` and `array_element`, remove `array_trim` #6936

izveigor commented Jul 12, 2023 •

edited

Loading

jayzhan211 Jul 31, 2023 •

edited

Loading

jayzhan211 Aug 4, 2023 •

edited

Loading

jayzhan211 left a comment •

edited

Loading