feat: New functions and operations for working with arrays #6384

Merged
merged 14 commits into from
Jun 6, 2023

Conversation

izveigor
Contributor

@izveigor izveigor commented May 18, 2023

Which issue does this PR close?

Closes #6119
Closes #6075.

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Yes

Are there any user-facing changes?

Yes

@izveigor izveigor marked this pull request as draft May 18, 2023 19:24
@github-actions github-actions bot added core Core DataFusion crate logical-expr Logical plan and expressions physical-expr Physical Expressions sqllogictest SQL Logic Tests (.slt) labels May 18, 2023
array_expressions::SUPPORTED_ARRAY_TYPES.to_vec(),
fun.volatility(),
),
BuiltinScalarFunction::ArrayAppend => Signature::any(2, fun.volatility()),
Contributor Author

Are there ways to use List and ARRAY_DATATYPES?

Contributor

🤔 Given that the element type of the list is part of its DataType, you probably can't use the existing Signatures.

Perhaps you could add a new Signature::any_list or something that would only check that the datatype matched DataType::List 🤔
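
A minimal sketch of the idea behind the proposed list-only signature, assuming a simplified stand-in for arrow's `DataType`. The helpers `is_any_list` and `check_array_append_args` are hypothetical names used only for illustration; `Signature::any_list` is merely proposed in the comment above, and nothing here is DataFusion's actual API.

```rust
// Simplified stand-in for arrow's DataType (assumption: only the variants
// needed for this illustration are modelled).
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Int64,
    Utf8,
    // The element type is part of the list type itself, which is why a
    // signature listing exact types cannot cover "any list".
    List(Box<DataType>),
}

// Hypothetical "any list" check: accepts a list with any element type.
fn is_any_list(data_type: &DataType) -> bool {
    matches!(data_type, DataType::List(_))
}

// Hypothetical argument check for array_append(list, element):
// the first argument must be a list, the second may be anything.
fn check_array_append_args(arg_types: &[DataType]) -> Result<(), String> {
    match arg_types {
        [first, _element] if is_any_list(first) => Ok(()),
        [first, _element] => Err(format!(
            "expected a list as the first argument, got {:?}",
            first
        )),
        _ => Err(format!("expected 2 arguments, got {}", arg_types.len())),
    }
}
```

The check only looks at the outer `List` variant, so `List(Int64)` and `List(Utf8)` are both accepted without enumerating element types.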

}

/// Array_append SQL function
pub fn array_append(args: &[ColumnarValue]) -> Result<ColumnarValue> {
Contributor Author

Should each function accept &[ColumnarValue] or ArrayRef? Is there a difference in these approaches?

Contributor

The difference is that if you take ColumnarValue we could specialize the kernels to do something faster with scalar (single) values rather than expanding them out to arrays (aka making copies).

For the initial implementation I think converting them all to arrays is the best approach as it is simplest
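
To illustrate the trade-off described above, here is a toy sketch. It assumes a simplified `ColumnarValue` that holds a `Vec<i64>` instead of an Arrow array; `add_via_arrays` and `add_specialized` are hypothetical names, not DataFusion functions.

```rust
// Toy stand-in for DataFusion's ColumnarValue (assumption: Vec<i64>
// replaces an Arrow array and i64 replaces ScalarValue).
#[derive(Debug, Clone, PartialEq)]
enum ColumnarValue {
    Array(Vec<i64>),
    Scalar(i64),
}

impl ColumnarValue {
    // Expand a scalar into `num_rows` copies; arrays pass through unchanged.
    fn into_array(self, num_rows: usize) -> Vec<i64> {
        match self {
            ColumnarValue::Array(values) => values,
            ColumnarValue::Scalar(value) => vec![value; num_rows],
        }
    }
}

// Simple approach: convert every argument to an array, then run one kernel.
fn add_via_arrays(left: ColumnarValue, right: ColumnarValue, num_rows: usize) -> Vec<i64> {
    let left = left.into_array(num_rows);
    let right = right.into_array(num_rows);
    left.iter().zip(right.iter()).map(|(l, r)| l + r).collect()
}

// Specialized approach: an array plus a scalar is handled without
// materializing `num_rows` copies of the scalar.
fn add_specialized(left: ColumnarValue, right: ColumnarValue, num_rows: usize) -> Vec<i64> {
    match (left, right) {
        (ColumnarValue::Array(values), ColumnarValue::Scalar(scalar)) => {
            values.iter().map(|v| v + scalar).collect()
        }
        // Every other combination falls back to the simple path.
        (left, right) => add_via_arrays(left, right, num_rows),
    }
}
```

Both functions produce the same result; the specialized one only avoids the temporary copy of the scalar, which is the optimization the comment defers to a later change.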

};

let element = match &args[1] {
ColumnarValue::Scalar(scalar) => scalar.to_array().clone(),
Contributor Author

Would ColumnarValue::Array also make sense in this situation?

let res = match args[0].data_type() {
let data_type = args[0].data_type();
let res = match data_type {
DataType::List(..) => {
Contributor Author

I don't know how to implement FixedSizeList in all the functions, so I preferred to use List. I don't think it affects anything.

Contributor

As FixedSizeList and List are different data types, if people have data (from a Parquet file or some other source) that is a FixedSizeList, these functions likely won't work.

However, perhaps eventually we can add coercion rules to coerce (automatically cast) FixedSizeList to List.
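
A rough sketch of what such a coercion rule might look like at the type level, assuming a simplified stand-in for arrow's `DataType`. The function `coerce_to_list` is a hypothetical name; the PR does not implement this, and a real rule would also need to cast the underlying data.

```rust
// Simplified stand-in for arrow's DataType (assumption: the field
// metadata that the real List types carry is omitted).
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Int32,
    List(Box<DataType>),
    FixedSizeList(Box<DataType>, usize),
}

// Hypothetical coercion rule: rewrite FixedSizeList(T, n) as List(T),
// recursing so that nested fixed-size lists are rewritten too.
fn coerce_to_list(data_type: &DataType) -> DataType {
    match data_type {
        DataType::FixedSizeList(element, _size) => {
            DataType::List(Box::new(coerce_to_list(element)))
        }
        DataType::List(element) => DataType::List(Box::new(coerce_to_list(element))),
        other => other.clone(),
    }
}
```

Because the two variants compare as different types, a function that matches only on `List` rejects `FixedSizeList` input unless a step like this runs first.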

@@ -2785,73 +2807,6 @@ mod tests {
Ok(())
}

fn generic_test_array(
Contributor Author

It does not work (with FixedSizeList replaced by List). What could this be related to?
Error:

left: `List(Field { name: "item", data_type: UInt32, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} })`,
right: `List(Field { name: "item", data_type: UInt64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} })`'

Contributor

I am not sure

@izveigor
Contributor Author

@alamb I wonder if you have time to review this PR.
I left comments on which I would be pleased to hear your opinion.

@izveigor izveigor marked this pull request as ready for review May 29, 2023 20:50
@alamb
Contributor

alamb commented May 30, 2023

Thank you @izveigor -- I have put this on my review list but I likely won't have a chance to review until tomorrow

@alamb
Contributor

alamb commented May 31, 2023

I didn't make it to this today, but I plan to review it tomorrow.

Contributor

@alamb alamb left a comment

This PR looks really nice @izveigor -- thank you so much!

I haven't had a chance to review all the function implementations yet, but the overall structure looks great to me. I am hoping to get @tustvold or someone else who is more of an expert in the arrow-rs structures to offer an opinion on the structure of the kernels.

I'll try and complete my review soon

## Array expressions Tests
#############

# array scalar function #1
Contributor

These are great @izveigor -- thank you so much

The only thing I recommend is adding some additional tests that have nulls in the lists.

let data_type = args[0].data_type();
let res = match data_type {
DataType::List(..) => {
let arrays =
Contributor

@tustvold can you offer some suggestions on using the arrow-rs API to build list arrays? Is this the best way to use that API?

)));
}

let arr = match &args[0] {
let data_type = arrays[0].data_type();
match data_type {
DataType::List(..) => {
let list_arrays =
Contributor

I think you could just call to_data(); I'm not sure this needs to downcast to ListArray.

downcast_vec!(arrays, ListArray).collect::<Result<Vec<&ListArray>>>()?;
let len: usize = list_arrays.iter().map(|a| a.values().len()).sum();
let capacity = Capacities::Array(
list_arrays.iter().map(|a| a.get_buffer_memory_size()).sum(),
Contributor

@tustvold tustvold Jun 1, 2023

Suggested change
- list_arrays.iter().map(|a| a.get_buffer_memory_size()).sum(),
+ list_arrays.iter().map(|a| a.len()).sum(),

The buffer memory size is a fairly significant overestimate.
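
A toy sketch of why the two sums differ, assuming a simplified list array built from plain `Vec`s. `ToyListArray`, `capacity_by_rows`, and `capacity_by_bytes` are hypothetical names, not arrow-rs types or functions: `len()` counts rows, while a buffer memory size counts bytes reserved by the backing buffers.

```rust
// Toy list array (assumption: `offsets` delimits each list inside the
// flat `values` buffer, similar in spirit to an Arrow ListArray).
struct ToyListArray {
    offsets: Vec<usize>,
    values: Vec<i64>,
}

impl ToyListArray {
    // Number of lists (rows) in the array.
    fn len(&self) -> usize {
        self.offsets.len().saturating_sub(1)
    }

    // Bytes reserved by the backing buffers.
    fn buffer_memory_size(&self) -> usize {
        self.offsets.capacity() * std::mem::size_of::<usize>()
            + self.values.capacity() * std::mem::size_of::<i64>()
    }
}

// Capacity estimate from row counts, as in the suggested change.
fn capacity_by_rows(arrays: &[ToyListArray]) -> usize {
    arrays.iter().map(|a| a.len()).sum()
}

// Capacity estimate from buffer sizes, as in the original line.
fn capacity_by_bytes(arrays: &[ToyListArray]) -> usize {
    arrays.iter().map(|a| a.buffer_memory_size()).sum()
}
```

When the capacity is interpreted as a number of rows to reserve, passing a byte count reserves far more than the concatenated result needs.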

}

/// Array_concat/Array_cat SQL function
pub fn array_concat(args: &[ColumnarValue]) -> Result<ColumnarValue> {
@izveigor
Contributor Author

izveigor commented Jun 2, 2023

Hello, @alamb!
I have seen your and @tustvold's comments; thanks for your work!

I analyzed all the comments and concluded that it would be better to implement the remaining changes in subsequent PRs, provided the current changes do not contain critical errors (it will be easier to analyze the changes and implement them that way).
So, I have made a list of issues for possible improvements to arrays:

arrow-rs:

  1. Should some of the features be implemented in arrow-rs (for example, position)?

arrow-datafusion:

  2. [Important] Implement an unnest function (it would allow arrays to be used with aggregate functions): SELECT sum(a) AS total FROM (SELECT unnest(make_array(3, 5, 6)) AS a) AS b;
  3. Support NULLs in arrays (not only NullArray). (I think it would be nice to rewrite the make_array function using the try_new method.)
  4. array_contains function (like array[1, 2, 3] @> array[1, 1, 2, 3]).
  5. Write a Signature method for list datatypes.
  6. Cast between array elements.
  7. Support empty arrays?
  8. Maybe refactor some functions if anyone finds a better solution.
  9. FixedSizeList to List
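
As an illustration of item 4, here is a minimal sketch of a containment check. It assumes PostgreSQL-style semantics for `@>`, where duplicates are not treated specially; the comment above does not state which semantics `array_contains` should have, and `array_contains_all` is a hypothetical name.

```rust
// Returns true when every element of `needle` occurs somewhere in
// `haystack`. Duplicates in `needle` need only one match in `haystack`
// (assumed PostgreSQL-like `@>` behaviour).
fn array_contains_all(haystack: &[i64], needle: &[i64]) -> bool {
    needle.iter().all(|item| haystack.contains(item))
}
```

Under this assumption, `[1, 2, 3]` contains `[1, 1, 2, 3]` even though the right-hand array is longer.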

What do you think, @alamb?

@alamb
Contributor

alamb commented Jun 5, 2023

What do you think, @alamb?

I think this would be ok -- especially as you have a history of continued contribution. However, there are a few instances where engaged committers committed the start of promising features (such as the analysis framework from @isidentical) and then were not able to finish the work for whatever reason. While this is fine, I think it would be better for DataFusion to avoid it.

Thus I would like to suggest an alternate approach, which is to break this PR down into several smaller ones (perhaps one for each new function?). That way we can give each function the attention it deserves during review (and maybe even parallelize the work).

We have a much better track record of being able to review and merge smaller PRs quickly than single large PRs. So when the functionality can be split up I think that is the best plan.

What do you think @izveigor ?

@izveigor
Contributor Author

izveigor commented Jun 5, 2023

In my opinion, it would be better to merge this PR. I have some arguments:

  1. This PR is completed, and the next PRs will only improve it.
  2. If we break this PR down, I think it will increase production time.

The main reason why I don't want to continue this PR is that I want to take a closer look at some issues, but they mostly relate to functions that are already implemented.
So, I understand your concerns, but I think this way is better.
What do you think, @alamb?

@alamb
Contributor

alamb commented Jun 5, 2023

This PR is completed, and the next PRs will only improve it.

I agree this PR is complete (with tests) and is not missing anything major

If we break this PR down, I think it will increase production time.

Yes, I agree breaking the PR down will require more effort on the author's (your) part. However, I do think if you have the time the effort would improve the overall quality of the DataFusion codebase. Finding bandwidth to maintain the code is the primary thing I think we struggle with as a community.

What do you think, @alamb?

I think we can merge this PR as long as the work you have planned is tracked by some tickets (so that if you don't have a chance to get to them at least we will have some institutional knowledge)

Is that acceptable?

@izveigor
Contributor Author

izveigor commented Jun 5, 2023

I think this option will suit me.
Tomorrow I will try to describe in detail all the ideas in the tickets.

@izveigor
Contributor Author

izveigor commented Jun 6, 2023

Hello, @alamb!

I have created issues regarding further improvements for working with arrays:
#6555
#6556
#6557
#6558
#6559
#6560
#6561

@alamb
Contributor

alamb commented Jun 6, 2023

I have created issues regarding further improvements for working with arrays:

Thanks @izveigor -- I added them to #2326 as well!

@alamb alamb merged commit 44b83a1 into apache:main Jun 6, 2023
@jackwener
Member

This PR has a bug related to #6596.

Development

Successfully merging this pull request may close these issues.

New functions for working with arrays Add array expressions
4 participants