Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Epic] General ticket for the concept of the practical implementation of ARRAY #6980

Closed
21 of 27 tasks
izveigor opened this issue Jul 15, 2023 · 19 comments
Closed
21 of 27 tasks
Labels
enhancement New feature or request

Comments

@izveigor
Copy link
Contributor

izveigor commented Jul 15, 2023

Is your feature request related to a problem or challenge?

It was decided that further development of array functions will be carried out according to the following concept: #6855.
According to the concept, any changes may be subject to rational criticism, so some aspects may change.
It should be noted that this list is incomplete and will be updated in the future.
The ticket contains many tickets. Each ticket is divided into respective categories:

List of features

A complete list of features, the implementation of which is desirable from the point of view of this concept 🎯:

Cleanups

Managing array elements:

By index:

By borders:

Working with Array Characteristics

Dimension

Set language:

Basic operations on sets:

Checking if elements belong:

Form an array:

Array aggregate functions:

Array operators:

Handling NULL and empty array:

Refactoring:

Other functions:

Bugs

Describe the solution you'd like

No response

Describe alternatives you've considered

No response

Additional context

No response

@izveigor izveigor added the enhancement New feature or request label Jul 15, 2023
@izveigor
Copy link
Contributor Author

cc @alamb @jayzhan211
I also send to people who have recently been interested in the topic of the array: @parkma99 @maxburke @bubbajoe

@jayzhan211
Copy link
Contributor

jayzhan211 commented Jul 16, 2023

@izveigor, since there are array_intersect and array_except, did you find any function similar to array_union, the function that returns a list of all elements that exist in one of the l1 and l2 without duplicates? list_intersect([1,2,3], [3,4,5]) -> [1,2,3,4,5]. It seems to me a helpful set operation, but not sure why Duckdb does not have this one.

@jayzhan211
Copy link
Contributor

jayzhan211 commented Jul 16, 2023

It makes sense to allow nested array as the lhs value of three of the array_has function. We just need to check the element one by one.

The difference of array_has and array_has_all is the rhs value. The former is element (non-array), and the latter is array (one-dimension). Should we consider them as two different functions as Duckdb does?

I found Azure has just array_contains and array_overlap, I prefer their version.

array_contains(nested array, element or one-dimension array): Returns true if all element or element in the array exists in left-hand side array.
array_overlap(nested array, one-dimension array) the same as array_has_any but a better naming.

@izveigor
Copy link
Contributor Author

@izveigor, since there are array_intersect and array_except, did you find any function similar to array_union, the function that returns a list of all elements that exist in one of the l1 and l2 without duplicates? list_intersect([1,2,3], [3,4,5]) -> [1,2,3,4,5]. It seems to me a helpful set operation, but not sure why Duckdb does not have this one.

I think we can use Apache Spark SQL and Azure DataBricks function array_union (See: https://sparkbyexamples.com/spark/spark-sql-array-functions/ and https://learn.microsoft.com/en-us/azure/databricks/sql/language-manual/functions/array_union). Yes, I agree that the full set of set operations is a very powerful tool.

It makes sense to allow nested array as the lhs value of three of the array_has function. We just need to check the element one by one.

Absolutely agree! I think this is the main mistake with PostgreSQL's array function set. Each element, including the array/list itself, must be processed by any function.

The difference of array_has and array_has_all is the rhs value. The former is element (non-array), and the latter is array (one-dimension). Should we consider them as two different functions as Duckdb does?

I prefer to use ClickHouse/DuckDB version of these functions, because there are a lot of possibilities for making full use of the set language (specifically we should check whether an element belongs to a set and check whether a set is a subset of a certain set).

@jayzhan211
Copy link
Contributor

I will work on three of array_has

@alamb alamb changed the title General ticket for the concept of the practical implementation of the array [Epic] General ticket for the concept of the practical implementation of ARRAY Aug 5, 2023
@jayzhan211
Copy link
Contributor

I would like to try on array_aggregate. https://duckdb.org/docs/sql/functions/nested#list-aggregates
There are many kinds of rewrite functions, but most of them are not supported yet. I think we can start with the easy one or an existing function, like sum, min, or max.

@jayzhan211
Copy link
Contributor

@izveigor , I dont fully understand about the difference between list_xxx and array_xxx in duckdb. As you said, list is for mutable, and array is for immutable. But like append a mutable function, why do they have array_append? contains a immutable function, but they have list_contains. For aggregate rewrite functions like list_min, list_sum, they don't alias with array, no array_min or array_sum, but we can do the same things with array_aggregate().

How do we differentiate list, and array in datafusion?

@izveigor
Copy link
Contributor Author

izveigor commented Aug 7, 2023

I would like to try on array_aggregate. https://duckdb.org/docs/sql/functions/nested#list-aggregates There are many kinds of rewrite functions, but most of them are not supported yet. I think we can start with the easy one or an existing function, like sum, min, or max.

It sounds great! Thank you, @jayzhan211! I created the tickets about this topic: #7213 and #7214.

@izveigor , I dont fully understand about the difference between list_xxx and array_xxx in duckdb. As you said, list is for mutable, and array is for immutable. But like append a mutable function, why do they have array_append? contains a immutable function, but they have list_contains. For aggregate rewrite functions like list_min, list_sum, they don't alias with array, no array_min or array_sum, but we can do the same things with array_aggregate().

How do we differentiate list, and array in datafusion?

I don't fully understand DuckDB principles myself. In my opinion every function should have list_ and array_ prefix due to the habit of using a single word. About mutability, this is purely my assumption of separation of functions, we may not adhere to it.

@jayzhan211
Copy link
Contributor

jayzhan211 commented Aug 7, 2023

I would like to try on array_aggregate. https://duckdb.org/docs/sql/functions/nested#list-aggregates There are many kinds of rewrite functions, but most of them are not supported yet. I think we can start with the easy one or an existing function, like sum, min, or max.

It sounds great! Thank you, @jayzhan211! I created the tickets about this topic: #7213 and #7214.

@izveigor , I dont fully understand about the difference between list_xxx and array_xxx in duckdb. As you said, list is for mutable, and array is for immutable. But like append a mutable function, why do they have array_append? contains a immutable function, but they have list_contains. For aggregate rewrite functions like list_min, list_sum, they don't alias with array, no array_min or array_sum, but we can do the same things with array_aggregate().
How do we differentiate list, and array in datafusion?

I don't fully understand DuckDB principles myself. In my opinion every function should have list_ and array_ prefix due to the habit of using a single word. About mutability, this is purely my assumption of separation of functions, we may not adhere to it.

I asked someone that is familiar with duckdb, he says the reason that they have list_xxx and array_xxx is due to compatibility, postgres for array_xxx, and others for list_xxx. In this case, since we heavily adopted from duckdb, clickhouse and others, we should also use similar naming from them, but I don't think mutability is the thing that differentiates list_xxx and array_xxx, it might be misleading if we said that.

If someone asks what is the difference between array_xxx and list_xxx in datafusion, I think we should answer they are the same thing, we just use the name from duckdb, clickhouse, etc!

@alamb
Copy link
Contributor

alamb commented Aug 7, 2023

I asked someone that is familiar with duckdb, he says the reason that they have list_xxx and array_xxx is due to compatibility, postgres for array_xxx, and others for list_xxx.

This would be great to add to the documentation if you had a chance -- thank you for the research @jayzhan211

@izveigor
Copy link
Contributor Author

izveigor commented Aug 7, 2023

@jayzhan211, thank you for the research. I agree with you and we should create the subpage of the documentation that represents our syntax (we diccussed it in #6855 (comment)). I will deal with it soon.

@edmondop
Copy link
Contributor

edmondop commented Sep 7, 2023

Can I pick something @jiangzhx that nobody else is looking to?

@izveigor
Copy link
Contributor Author

izveigor commented Sep 9, 2023

Hello, @edmondop!
Yes, you can pick all tickets if there are no assignments.

@izveigor
Copy link
Contributor Author

Hello, @alamb and @jayzhan211.
I would like to inform you that, unfortunately, due to life circumstances, I will no longer be able to take part in the project, or at least in the near future.
I hope that my ideas helped with the development of the project. It was a pleasure working with you and this experience really improved my programming skills.
Finally, I’ll leave my thoughts on the last problem, I hope someone can implement it: #7580.
Good luck!

@alamb
Copy link
Contributor

alamb commented Sep 17, 2023

@izveigor -- thank you for letting us know and for all your work so far in this project. I think it has driven DataFusion significantly and we have all benefited from your work and inspiration.

Good luck with your next adventure.

@edmondop
Copy link
Contributor

I propose to merge #6978 and then refactor both that pr and this #6981 to make easy to implement #6979 .

All these functions traverse arrays and use some accumulator, so we can create an accumulator trait and three instances (union, intersection, except). @alamb what do yout hink?

@jayzhan211
Copy link
Contributor

jayzhan211 commented Dec 9, 2023

@Weijun-H
Copy link
Member

Weijun-H commented Jan 27, 2024

@alamb
Copy link
Contributor

alamb commented Jul 8, 2024

This looks pretty good to me now, I think we have done all the major items so closing this epic as good. We can handle small cleanup items as normal tickets. Please let me know if you disagree

@alamb alamb closed this as completed Jul 8, 2024
This issue was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

5 participants