Skip to content

Add array_contains_all function#23202

Open
jainavi17 wants to merge 1 commit intoprestodb:masterfrom
jainavi17:array_contains_array
Open

Add array_contains_all function#23202
jainavi17 wants to merge 1 commit intoprestodb:masterfrom
jainavi17:array_contains_array

Conversation

@jainavi17
Copy link
Contributor

@jainavi17 jainavi17 commented Jul 13, 2024

Core function to return whether all elements of a second array are present in the first array

Test plan

Added unit tests.
Build successfully using the following terminal command

  • mvn clean install -Dtest=TestArrayOperators -Dmaven.javadoc.skip=true -DskipUI -T1C -fn -pl presto-main
== RELEASE NOTES ==

General Changes
- Add function :func:`array_contains_array`

@jainavi17 jainavi17 requested review from a team, elharo and steveburnett as code owners July 13, 2024 07:01
@jainavi17 jainavi17 requested a review from presto-oss July 13, 2024 07:01
Copy link
Contributor

@elharo elharo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Implementation looks good. I think this needs an issue, and perhaps an RFC.

Also if it's in the code is it a UDF or just a function?

@kaikalur
Copy link
Contributor

I suggest array_contains_all as the name

Copy link
Contributor

@steveburnett steveburnett left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the doc! Nit only.

@jainavi17 jainavi17 force-pushed the array_contains_array branch from 50d6889 to 62668f4 Compare August 15, 2024 09:33
steveburnett
steveburnett previously approved these changes Aug 15, 2024
Copy link
Contributor

@steveburnett steveburnett left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! (docs)

Pull updated branch, new local doc build, docs look good. Thanks!

@elharo elharo changed the title Add array_contains_array function Add array_contains_all function Aug 15, 2024
@jainavi17 jainavi17 force-pushed the array_contains_array branch 2 times, most recently from 468b26a to f86b0f8 Compare August 22, 2024 08:08
elharo
elharo previously approved these changes Aug 22, 2024
@SqlType("array(T)") Block firstArray,
@SqlType("array(T)") Block secondArray)
{
TypedSet firstSet = new TypedSet(elementType, firstArray.getPositionCount(), "arrayContainsAll");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you benchmark this version vs. using OptimizedTypeSet and checking the cardinality of the intersection of the two sets? I expect that using OptimizedTypedSet would be faster except for the case where a large second array is able to short circuit very early.

Copy link
Contributor

@kaikalur kaikalur Aug 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually curious - how about CONTAINS? @rschlussel did you fix that as well to be IS DISTINCT FROM? This function should behave like contains for that part

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

contains fails for all null values. If we want this to fail for all nulls, we can do that as well, but we shouldn't have failures only for nulls nested in complex types, but otherwise compare nulls as equal (which is what typedset does by default)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jainavi17 did you have a chance to benchmark vs. a version using OptimizedTypeSet?

@elharo
Copy link
Contributor

elharo commented Aug 22, 2024

FWIW, if anything, this function does not exist in Spark SQL, so adding this function increases the divergence between Presto SQL and Spark SQL. I don't know how much weight that carries here, though I do know some people care about this

@kaikalur
Copy link
Contributor

FWIW, if anything, this function does not exist in Spark SQL, so adding this function increases the divergence between Presto SQL and Spark SQL. I don't know how much weight that carries here, though I do know some people care about this

That's not a concern especially because adding udfs to Spark is easier

@aditi-pandit
Copy link
Contributor

aditi-pandit commented Aug 23, 2024

@jainavi17 : Please can you create an issue to add this function for the Native engine.

@jainavi17
Copy link
Contributor Author

jainavi17 commented Sep 4, 2024

@jainavi17 : Please can you create an issue to add this function for the Native engine.

Native engine as in?

@jainavi17 jainavi17 force-pushed the array_contains_array branch 2 times, most recently from d209068 to a411156 Compare September 4, 2024 17:02
Copy link
Contributor

@steveburnett steveburnett left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a nit of formatting.

Scalar function that takes two arrays an input and checks it all elements of right array are present in left array
@jainavi17 jainavi17 force-pushed the array_contains_array branch from a411156 to 8311dd3 Compare September 4, 2024 17:17
@jainavi17
Copy link
Contributor Author

Hi @NikhilCollooru , if you could review the PR please.

Copy link
Contributor

@steveburnett steveburnett left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! (docs)

Pull updated branch, new local doc build. Thanks!

@jainavi17
Copy link
Contributor Author

@elharo if you could take a look at the changes please.

@SqlType("array(T)") Block firstArray,
@SqlType("array(T)") Block secondArray)
{
TypedSet firstSet = new TypedSet(elementType, firstArray.getPositionCount(), "arrayContainsAll");
Copy link
Contributor

@rschlussel rschlussel Sep 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you use distinct semantics here so that we don't throw an exception for arrays of arrays with nulls? You pass in the "distinct" function handle to the typedSet. You can follow how it's done for array_distinct here: You can use "distinct comparison" for the typedset following what we do for array_distinct see

TypedSet typedSet = new TypedSet(type, elementIsDistinctFrom, array.getPositionCount(), "array_distinct");

@SqlType("array(T)") Block firstArray,
@SqlType("array(T)") Block secondArray)
{
TypedSet firstSet = new TypedSet(elementType, firstArray.getPositionCount(), "arrayContainsAll");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jainavi17 did you have a chance to benchmark vs. a version using OptimizedTypeSet?

}

@Test
public void testNulls()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add a test for array of arrays/rows containing nulls

@rschlussel
Copy link
Contributor

@jainavi17 : Please can you create an issue to add this function for the Native engine.

Native engine as in?

native engine means that we need a corresponding c++ function for this as we are in the process of migrating from java workers to c++ workers.

@jainavi17
Copy link
Contributor Author

Hi @rschlussel I tried to do benchmark test but somehow my IDE is breaking and I'm unable to set up the benchmark test. Is there any way I can do it on command line? Thanks

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants