Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement array_slice and array_element, remove array_trim #6936

Merged
merged 26 commits into from
Aug 4, 2023
Merged

Implement array_slice and array_element, remove array_trim #6936

merged 26 commits into from
Aug 4, 2023

Conversation

izveigor
Copy link
Contributor

@izveigor izveigor commented Jul 12, 2023

Which issue does this PR close?

Part of #6935
Closes #7147
Closes #6975
Closes #6974

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Yes

Are there any user-facing changes?

Yes

@izveigor izveigor marked this pull request as draft July 12, 2023 19:06
@github-actions github-actions bot added sql SQL Planner logical-expr Logical plan and expressions physical-expr Physical Expressions core Core DataFusion crate sqllogictest SQL Logic Tests (.slt) labels Jul 12, 2023
@izveigor izveigor mentioned this pull request Jul 15, 2023
11 tasks
@izveigor izveigor marked this pull request as ready for review July 18, 2023 11:04
@izveigor izveigor marked this pull request as draft July 18, 2023 11:04
(DataType::List(_), DataType::Int64, DataType::Int64)
| (DataType::List(_), DataType::Int64, DataType::Null)
| (DataType::List(_), DataType::Null, DataType::Int64)
| (DataType::List(_), DataType::Null, DataType::Null) => Ok(ColumnarValue::Array(array_slice(&[
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does not work with columns, but all functions support column format.
@alamb if you have free time can you give me some advices how to solve the problem?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this would entail supporting something like SELECT array_column[int_column] FROM t which while cool I think would be much less common than SELECT array_column[3] FROM t (aka a constant) as you have done here

I think writing the code that takes columnar inputs is likely something we could work out upstream (aka in arrow-rs) as a list_take or similar kernel.

Thus I would suggest we get this PR in (which errors with a non constant) and file a ticket to support columnar field access as a follow on PR

@@ -951,11 +951,24 @@ impl TryFrom<&Expr> for protobuf::LogicalExprNode {
// see discussion in https://github.com/apache/arrow-datafusion/issues/2565
return Err(Error::General("Proto serialization error: Expr::ScalarSubquery(_) | Expr::InSubquery(_) | Expr::Exists { .. } | Exp:OuterReferenceColumn not supported".to_string()));
}
Expr::GetIndexedField(GetIndexedField { key, expr }) => Self {
Expr::GetIndexedField(GetIndexedField {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know how to parse GetIndexedField in proto.
Should I create a separate ticket for this issue and allow other users to solve it or this problem may cause many problems in the future?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you can't figure out the protobuf part, I think we have in the past added a "NotImplemented" error and filed a follow on ticket to add support

Though maybe this comment is outdated as it looks like you have the code here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have created a separate ticket: #7146.

@izveigor
Copy link
Contributor Author

I think PR is enough for an initial review. @alamb @jayzhan211 if you have free time, can you review it?

@izveigor izveigor marked this pull request as ready for review July 28, 2023 19:53
query error
select array_slice(make_array(1, 2, 3, 4, 5), 2, NULL), array_slice(make_array('h', 'e', 'l', 'l', 'o'), 3, NULL);
----
[4, 5] [l, l, o]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this intended?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This implementation is currently not working. See the issue: #7142.

[4, 5] [l, l, o]

# array_slice scalar function #13 (with NULL and negative number)
query error
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This too.

----
[] []

# array_slice scalar function #19 (with negative indexes; nested array)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does not seems any negative indexes

----
[[1, 2, 3, 4, 5]]

# array_slice scalar function #20 (with first positive index and last negative index)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

first neg then pos?

----
[4, 5] [l, l]

# array_slice scalar function #21 (with first negative index and last positive index)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

first pos then neg?


# struct[i]
query III
select struct(1, 3.14, 'h')['c0'], struct(3, 2.55, 'b')['c1'], struct(2, 6.43, 'a')['c2'];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

quite curious, why index with 'c0' works

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because we start counting from 0.
Example:

❯ CREATE TABLE values(
  a INT,
  b INT,
  c INT,
  d FLOAT,
  e VARCHAR,
  f VARCHAR
) AS VALUES
  (1,    1,    2,    1.1,  'Lorem',       'A'),
  (2,    3,    4,    2.2,  'ipsum',       ''),
  (3,    5,    6,    3.3,  'dolor',       'BB'),
  (4,    7,    8,    4.4,  'sit',          NULL),
  (NULL, 9,    10,   5.5,  'amet',        'CCC'),
  (5,    NULL, 12,   6.6,  ',',           'DD'),
  (6,    11,   NULL, 7.7,  'consectetur', 'E'),
  (7,    13,   14,   NULL, 'adipiscing',  'F'),
  (8,    15,   16,   8.8,   NULL,          '')
;
❯ select struct(a, b, c) from values;
+------------------------------------+
| struct(values.a,values.b,values.c) |
+------------------------------------+
| {c0: 1, c1: 1, c2: 2}              |
| {c0: 2, c1: 3, c2: 4}              |
| {c0: 3, c1: 5, c2: 6}              |
| {c0: 4, c1: 7, c2: 8}              |
| {c0: , c1: 9, c2: 10}              |
| {c0: 5, c1: , c2: 12}              |
| {c0: 6, c1: 11, c2: }              |
| {c0: 7, c1: 13, c2: 14}            |
| {c0: 8, c1: 15, c2: 16}            |
+------------------------------------+

Copy link
Contributor

@jayzhan211 jayzhan211 Jul 31, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can understand 0, but is c hard coding somewhere?

Lets me ask another way, can we index it with col0, column0? If not, there might be a hard coding index with the prefix c.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The c0 is hardcoded and part of the output of the struct function:

❯ select struct(1,2);
+---------------------------+
| struct(Int64(1),Int64(2)) |
+---------------------------+
| {c0: 1, c1: 2}            |
+---------------------------+

"+--------+",
"| i0 |",
"+--------+",
"| [0, 1] |",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure why only one left.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See the ticket: #7147

@@ -591,103 +611,109 @@ select array_slice(make_array(1, 2, 3, 4, 5), 2, 6), array_slice(make_array('h',
----
[2, 3, 4, 5] [l, l, o]

# array_slice scalar function #6 (with zero and positive number)
# array_slice scalar function #6 (with negative indexes; nested array)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

positive indexes?

Copy link
Contributor

@jayzhan211 jayzhan211 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@izveigor
Copy link
Contributor Author

@alamb I fixed conflicts and CI, so can you review this PR again if you have free time?

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the effort @izveigor -- I don't think we should merge this PR with the regression in field access #7147 -- among other things we use this feature in IOx so adding a known regression seems like a step backwards

I also don't fully understand the implementation of slice -- I think there is a significantly easier and general way to implement this using take which I tried to write up in detal

Let me know what you think

@@ -152,21 +152,6 @@ async fn test_udaf_returning_struct() {
assert_batches_eq!(expected, &execute(&ctx, sql).await.unwrap());
}

/// Demonstrate extracting the fields from a structure using a subquery
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why was this test removed?

(this is an important usecase for us in IOx where we have a UDF that returns a struct that is accessed by name.

datafusion/expr/src/field_util.rs Show resolved Hide resolved
@@ -2550,69 +3060,6 @@ mod tests {
assert_eq!("1-*-3-*-*-6-7-*", result.value(0));
}

#[test]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why was this test removed?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update: the answer is that trim_array was removed

}};
}

macro_rules! slice {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't fully understand this implementation -- it seem like picking a single value from a list array can be implemented by calculating the appropriate offsets and then calling take

For example, given A ListArray like this

                               ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ 
                                                       ┌ ─ ─ ─ ─ ─ ─ ┐    │
 ┌─────────────┐  ┌───────┐    │     ┌───┐   ┌───┐       ┌───┐ ┌───┐       
 │   [A,B,C]   │  │ (0,3) │          │ 1 │   │ 0 │     │ │ 1 │ │ A │ │ 0  │
 ├─────────────┤  ├───────┤    │     ├───┤   ├───┤       ├───┤ ├───┤       
 │ [] (empty)  │  │ (3,3) │          │ 1 │   │ 3 │     │ │ 1 │ │ B │ │ 1  │
 ├─────────────┤  ├───────┤    │     ├───┤   ├───┤       ├───┤ ├───┤       
 │    NULL     │  │ (3,4) │          │ 0 │   │ 4 │     │ │ 1 │ │ C │ │ 2  │
 ├─────────────┤  ├───────┤    │     ├───┤   ├───┤       ├───┤ ├───┤       
 │     [D]     │  │  4,5  │          │ 1 │   │ 5 │     │ │ 1 │ │ ? │ │ 3  │
 ├─────────────┤  ├───────┤    │     ├───┤   ├───┤       ├───┤ ├───┤       
 │  [NULL, F]  │  │  5,7  │          │ 1 │   │ 5 │     │ │ 0 │ │ D │ │ 4  │
 └─────────────┘  └───────┘    │     └───┘   ├───┤       ├───┤ ├───┤       
                                             │ 7 │     │ │ 1 │ │ ? │ │ 5  │
                               │  Validity   └───┘       ├───┤ ├───┤       
    Logical       Logical         (nulls)   Offsets    │ │ 1 │ │ F │ │ 6  │
     Values       Offsets      │                         └───┘ └───┘       
                                                       │    Values   │    │
                (offsets[i],   │   ListArray               (Array)         
               offsets[i+1])                           └ ─ ─ ─ ─ ─ ─ ┘    │
                               └ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ 

Computing col[1] (aka take the second element) could be computed by adding 1 to the first logical offset, checking if that is still in the array

So in this example, the take indicies would be

[ 
  Some(1),  (add 1 to start of (0, 3) is 1
  None , //   (adding 1 start of (3,3) is 4 which is outside,
  None, // input was null,
  None,  // adding 1 start of (4,5 ) is 5 which is outside
  Some(6), // adding 1 to start of (5,7) is 6
]

This calculation does not depend on datatype of the elements and would be fully general

A similar calculation could be done for slice though in that case you would have to build up the offsets of the new list array as well as the indices of the take of the values.

In general I think many of these array calculations and manipulations can be done in terms of the take kernel

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't quite comprehend your comment. If I use take kernel with ListArray, the program returns me the elements of ListArray (i. e. Array instances). So if it's really worth it can you give me more information how should I take data?

Example:

use arrow::array::*;
use arrow::compute::take;
use arrow::datatypes::Int32Type;

fn main() {
    let data = vec![
        Some(vec![Some(0), Some(1), Some(2)]),
        None,
        Some(vec![Some(3), None, Some(5)]),
        Some(vec![Some(6), Some(7)]),
    ];
    let values = ListArray::from_iter_primitive::<Int32Type, _, _>(data);

    let indices = UInt32Array::from(vec![0, 3]);
    let taken = take(&values, &indices, None).unwrap();
    let taken: &GenericListArray<i32> = taken.as_list();

    let data = vec![
        Some(vec![Some(0), Some(1), Some(2)]),
        Some(vec![Some(6), Some(7)]),
    ];
    let expected = ListArray::from_iter_primitive::<Int32Type, _, _>(data);
    assert_eq!(expected, taken.clone());
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't quite comprehend your comment. If I use take kernel with ListArray, the program returns me the elements of ListArray (i. e. Array instances). So if it's really worth it can you give me more information how should I take data?

I was thinking we call take() with the values array (like take(list_array.values(), indices)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So in your example above, something more like

fn main() {
    let list_array = todo!(); // define ListArray here
    let data = vec![
        Some(vec![Some(0), Some(1), Some(2)]),
        None,
        Some(vec![Some(3), None, Some(5)]),
        Some(vec![Some(6), Some(7)]),
    ];
    let indices = UInt32::from_iter_primitive::<Int32Type, _, _>(data);
    let taken = take(&list_array.values(), &indices, None).unwrap();

@izveigor
Copy link
Contributor Author

Thank you for the effort @izveigor -- I don't think we should merge this PR with the regression in field access #7147 -- among other things we use this feature in IOx so adding a known regression seems like a step backwards

I am fully accept the fact that we should prevent the regression.

I didn't fully understand this implementation -- it seem like picking a single value from a list array can be implemented by calculating the appropriate offsets and then calling take

I plan for examine the difference between array.slice() and take tomorrow.

@izveigor
Copy link
Contributor Author

izveigor commented Aug 1, 2023

@alamb I think we should remain array.slice() instead of take kernel. There are some reasons:

  1. I didn't find the optimal decision for saving the structure for array_slice and array_value (use common parts of the code for both expressions) (because array_value returns NULL statement instead of empty array, and if we don't want to inflate code, we should remain array.slice() statement) (anyway, array_value would use the same way like the current implementation)
  2. I don't think the current implementation is more slowly than take analog.
  3. if take analogue could theoretically be faster, we could fix it at any time.

I understood my mistake about index column support. Now, in my opinion I should direct all forces to fix the bug with struct (which is very important for IOx). I think it is better development path than comparing take and slice.

The main problem with bug related to struct is difference between circumstances for List and Struct cases. For example, DuckDB can accept column support for List statement (like: select make_array(1, 2, 3, 4, 5)[a] from index)), but can not for Struct. So, I can solve this problem by using union struct (struct with two optional options):

#[derive(Clone, PartialEq, Eq, Hash, Debug)]
pub struct GetIndexedFieldKey {
    ///
    pub list_key: Option<Expr>,
    /// 
    pub struct_key: Option<ScalarValue>,
}

impl GetIndexedFieldKey {
    /// Create a new GetIndexedFieldKey expression
    pub fn new(list_key: Option<Expr>, struct_key: Option<ScalarValue>) -> Self {
        // value must be either `list_key` or `struct_key`
        assert_ne!(list_key.is_some(), struct_key.is_some());
        assert_ne!(list_key.is_none(), struct_key.is_none());

        Self {
            list_key,
            struct_key,
        }
    }
}
/// expression to get a field of a struct array.
#[derive(Debug, Hash)]
pub struct GetIndexedFieldExpr {
    /// The expression to find
    arg: Arc<dyn PhysicalExpr>,
    /// The key statement
    key: GetIndexedFieldKey,
    /// The extra key (it can be used only with `DataType::List`)
    extra_key: Option<Arc<dyn PhysicalExpr>>,
}

What do you think about that approach, @alamb?

P.S. I really want that DataFusion index system matches with DuckDB and ClickHouse.

@izveigor
Copy link
Contributor Author

izveigor commented Aug 2, 2023

@alamb @jayzhan211 can you review it again if you have free time.
I fixed all regressions: now there is no corrupted test.
Also I remind you that the function array_trim has been deprecated.

@alamb
Copy link
Contributor

alamb commented Aug 3, 2023

I plan to review this PR later today

@alamb alamb changed the title feat: reading array elements Implement array_slice and array_element, remove array_trim Aug 3, 2023
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you very much @izveigor -- I think this looks good to go -- it is well tested and documented. I do think there is

I plan to run this PR through the IOx tests as one last verification, but then I think we should merge it.

Thank you for pushing this through 🦾

cc @vincev @smiklos who I think are also interested in this area


# struct[i]
query III
select struct(1, 3.14, 'h')['c0'], struct(3, 2.55, 'b')['c1'], struct(2, 6.43, 'a')['c2'];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The c0 is hardcoded and part of the output of the struct function:

❯ select struct(1,2);
+---------------------------+
| struct(Int64(1),Int64(2)) |
+---------------------------+
| {c0: 1, c1: 2}            |
+---------------------------+

| array_to_string(array, delimeter) | Converts each element to its text representation. `array_to_string([1, 2, 3, 4], ',') -> 1,2,3,4` |
| cardinality(array) | Returns the total number of elements in the array. `cardinality([[1, 2, 3], [4, 5, 6]]) -> 6` |
| make_array(value1, [value2 [, ...]]) | Returns an Arrow array using the specified input expressions. `make_array(1, 2, 3) -> [1, 2, 3]` |
| trim_array(array, n) | Removes the last n elements from the array. |
| trim_array(array, n) | Deprecated |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@@ -2550,69 +3060,6 @@ mod tests {
assert_eq!("1-*-3-*-*-6-7-*", result.value(0));
}

#[test]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update: the answer is that trim_array was removed

/// If we use index with `DataType::Struct`, then we use the `struct_key` argument with `list_key` equal to `None`.
/// `list_key` can be any expression, unlike `struct_key` which can only be `ScalarValue::Utf8`.
#[derive(Clone, PartialEq, Eq, Hash, Debug)]
pub struct GetIndexedFieldKey {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found this API to be awkward -- I think we can do something to make it better. I am working on a PR now...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

filed #7193 to track improving this API

@@ -1952,6 +2104,364 @@ mod tests {
)
}

#[test]
fn test_array_element() {
Copy link
Contributor

@jayzhan211 jayzhan211 Aug 4, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: I think this kind of test is better to write in array.slt


macro_rules! slice {
($ARRAY:expr, $KEY:expr, $EXTRA_KEY:expr, $RETURN_ELEMENT:expr, $ARRAY_TYPE:ident) => {{
let sliced_array: Vec<Arc<dyn Array>> = $ARRAY
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: ArrayRef

Copy link
Contributor

@jayzhan211 jayzhan211 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM :)

@alamb alamb added the api change Changes the API exposed to users of the crate label Aug 4, 2023
@alamb
Copy link
Contributor

alamb commented Aug 4, 2023

I tested this with IOx test suite and everything passed: https://github.com/influxdata/influxdb_iox/pull/8419

While I do think the logical plan node to access fields can and should be cleaned up I see no reason that should gate this PR.

Nicely done @izveigor and thank you for all the help reviewing @jayzhan211

@alamb alamb merged commit 2354758 into apache:main Aug 4, 2023
23 checks passed
This pull request was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api change Changes the API exposed to users of the crate core Core DataFusion crate logical-expr Logical plan and expressions physical-expr Physical Expressions sql SQL Planner sqllogictest SQL Logic Tests (.slt)
Projects
None yet
3 participants