Skip to content

Add SQL functions ARRAY_DUPES and ARRAY_HAS_DUPES#16317

Merged
rongrong merged 1 commit intoprestodb:masterfrom
pranjalssh:add_dupes
Jul 8, 2021
Merged

Add SQL functions ARRAY_DUPES and ARRAY_HAS_DUPES#16317
rongrong merged 1 commit intoprestodb:masterfrom
pranjalssh:add_dupes

Conversation

@pranjalssh
Copy link
Copy Markdown
Contributor

@pranjalssh pranjalssh commented Jun 22, 2021

Added SQL functions:

ARRAY_DUPES: Returns elements that are duplicated in input
ARRAY_HAS_DUPES: Retruns boolean whether array has any duplicates
null is also considered a valid element and accounted for. This follows what array_distinct does.

Tracking Issue: #15656

== RELEASE NOTES ==

General Changes
* Add SQL functions :func:`array_dupes` and :func:`array_as_dupes`.

@rongrong rongrong requested a review from kaikalur June 22, 2021 17:39
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

'repeated only once' is confusing. Just say the result is a set of elements that occur more than once in the original array.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You don't need to repeat for every type. Just make it ARRAY like we do for other functions.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

because that's the only 2 versions of the functions we spell out. It's probably nice to support generic types for these builtin SQL functions at least.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here

Copy link
Copy Markdown
Contributor

@kaikalur kaikalur Jun 22, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we already have array_frequency. If so just use that and do something like:

map_keys(map_filter(array_frequency(input), (k, v) -> v > 1))

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please double check, we might not be allowing using sql functions inside sql functions. We probably don't need to be this strict. It should be more like not allowing dynamic functions in other dynamic functions (I don't remember what we were actually checking)

@kaikalur
Copy link
Copy Markdown
Contributor

Also I wonder if we should rename this file to ArrayFunctions or something, it's not just arithmetic anymore.

@pranjalssh pranjalssh force-pushed the add_dupes branch 2 times, most recently from 036f58e to c931df0 Compare June 23, 2021 12:42
@pranjalssh
Copy link
Copy Markdown
Contributor Author

Addressed comments

Copy link
Copy Markdown
Contributor

@rongrong rongrong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As @kaikalur suggested, let's change the class ArrayArithmeticFunctions to ArraySqlFunctions maybe.

@pranjalssh pranjalssh requested a review from kaikalur June 24, 2021 08:32
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's say that element type of x has to be coercible to bigint or varchar and the return type will be bigint / varchar? Please refer to the documentation of array_sum.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's say that element type of x has to be coercible to bigint or varchar and the return type will be bigint / varchar? Please refer to the documentation of array_sum.

Hmm, why only those two types? It should be applicable to any comparable type right?

@pranjalssh pranjalssh force-pushed the add_dupes branch 3 times, most recently from 101e881 to 4c74c1a Compare June 30, 2021 12:04
@rongrong rongrong merged commit ea8d267 into prestodb:master Jul 8, 2021
@mbasmanova
Copy link
Copy Markdown
Contributor

Folks, array_dupes name is inconsistent with Presto function names. A more consistent name would be array_duplicates. Let's fix it before it is too late.

CC: @rongrong @kaikalur

@mbasmanova
Copy link
Copy Markdown
Contributor

CC: @tdcmeehan @highker

@rongrong
Copy link
Copy Markdown
Contributor

These functions are already used in production. Do you want to add aliases for them?

@mbasmanova
Copy link
Copy Markdown
Contributor

These functions are already used in production. Do you want to add aliases for them?

Perhaps, we can add aliases, then update prod queries, then delete original names. Are there a lot of these?

@biswapesh
Copy link
Copy Markdown

Do we have other examples of Presto functions where we use abbreviated names? Or other SQL engines that use these names? Would be nice to be consistent.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants