Skip to content

Use distinct semantics for distincting array functions#22938

Merged
rschlussel merged 6 commits intoprestodb:masterfrom
rschlussel:distinct-comparisons
Jun 25, 2024
Merged

Use distinct semantics for distincting array functions#22938
rschlussel merged 6 commits intoprestodb:masterfrom
rschlussel:distinct-comparisons

Conversation

@rschlussel
Copy link
Contributor

@rschlussel rschlussel commented Jun 6, 2024

Description

change the following functions to use distinct semantics for comparison rather than equal to

  • array_distinct
  • array_intersect
  • array_except
  • array_union
  • set_agg
  • set_union

I did not change arrays_overlap as part of this PR because it has its own semantics where it returns null for any null input (but gives an error for null inside of rows), and so we probably need to think more about what behavior we want from that function.

Motivation and Context

Previously queries would fail with an error " ROW comparison not supported for fields with null elements" when there was an array of rows where any field was null or an array of arrays where any element was null. However, there would be no failure if nulls were included more directly (e.g. an array with nulls rather than an array of arrays with nulls). These functions are distincting functions, and so should be using full distinct semantics.

Fixes #22900

Impact

The above functions will no longer throw an error for nulls anywhere in the input. The functions will be using distinct semantics for comparison

Test Plan

unit tests

Contributor checklist

  • Please make sure your submission complies with our development, formatting, commit message, and attribution guidelines.
  • PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced.
  • Documented new properties (with its default value), SQL syntax, functions, or other functionality.
  • If release notes are required, they follow the release notes guidelines.
  • Adequate tests were added if applicable.
  • CI passed.

Release Notes

Please follow release notes guidelines and fill in the release notes below.

== RELEASE NOTES ==

General Changes
* Fix an issue where ``array_distinct``,``array_except``, ``array_intersect``, ``array_union``,  ``set_union``, and ``set_agg`` would return an error if the input was an array of rows with null fields or array of arrays with null elements.  These functions now use full "IS DISTINCT FROM" semantics for comparison.

@rschlussel rschlussel changed the title Distinct comparisons Use distinct semantics for distincting array functions Jun 6, 2024
@rschlussel rschlussel force-pushed the distinct-comparisons branch 2 times, most recently from 48599aa to 6f300e5 Compare June 6, 2024 20:24
@rschlussel rschlussel marked this pull request as ready for review June 7, 2024 14:15
@rschlussel rschlussel requested review from a team, feilong-liu and jaystarshot as code owners June 7, 2024 14:15
@rschlussel rschlussel requested a review from presto-oss June 7, 2024 14:15
Copy link
Contributor

@mbasmanova mbasmanova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice to update documentation to clarify these semantics.

@rschlussel rschlussel force-pushed the distinct-comparisons branch from 6f300e5 to f616b1f Compare June 11, 2024 16:31
@rschlussel rschlussel requested a review from steveburnett as a code owner June 11, 2024 16:31
Copy link
Contributor

@steveburnett steveburnett left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the doc! Only a few nits, looks good.

@rschlussel rschlussel force-pushed the distinct-comparisons branch from f616b1f to 8ee28ba Compare June 11, 2024 18:29
@rschlussel
Copy link
Contributor Author

Thanks for the review @steveburnett. I've addressed your comments.

Copy link
Contributor

@steveburnett steveburnett left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the doc update! Two, I think final, nits.

@rschlussel rschlussel force-pushed the distinct-comparisons branch from 8ee28ba to ae1f483 Compare June 11, 2024 19:12
steveburnett
steveburnett previously approved these changes Jun 11, 2024
Copy link
Contributor

@steveburnett steveburnett left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! (docs)

Pull updated branch, new local docs build, everything looks good. Thanks!

arhimondr
arhimondr previously approved these changes Jun 12, 2024
Copy link
Member

@arhimondr arhimondr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM % questions, nits

this.blockBuilders[channelIndex] = type.createBlockBuilder(null, EXPECTED_BLOCK_BUILDER_SIZE);
this.valueSets[channelIndex] = new TypedSet(
type,
Optional.empty(),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we ever collect dynamic filters for complex types (e.g.: rows?)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure, but dynamic filter needs to match join semantics, and joins fail on indeterminate arrays/rows, so dynamic filter should do the same.

examples:

presto> with input as (SELECT * FROM (VALUES (ARRAY[null], 2), (ARRAY[null], 3))t(x, y)) SELECT * from input i1 join input i2 on i1.x = i2.x;
                    

Query 20240614_131341_75681_9w2fv failed: ARRAY comparison not supported for arrays with null elements

presto> with input as (SELECT * FROM (VALUES (ROW(null), 2), (ROW(null), 3))t(x, y)) SELECT * from input i1 join input i2 on i1.x = i2.x;

Query 20240614_131414_75748_9w2fv failed: ROW comparison not supported for fields with null elements

private static final SelectedPositions EMPTY_SELECTED_POSITIONS = positionsList(new int[0], 0, 0);

private final Type elementType;
private final Optional<MethodHandle> elementIsDistinctFrom;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How difficult would it be to make it required and always use isDistinctFrom?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't do it because this is also used for MapConcatFunction, and maps don't support indeterminate types, but it's pretty easy to change, and the maps should fail later anyway. I can make this change.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually, map_concat doesn't otherwise check that map keys are not indeterminate or null (caused failing tests because of this), and it's a bit challenging to get it to do that because it's a varargs function. I'm going to put this back to the way it was, and look into improving it as a follow up.

MethodHandle elementIsDistinctFrom)
{
this.arrayType = new ArrayType(elementType);
this.elementIsDistinctFrom = elementIsDistinctFrom;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: requireNonNull

private final MethodHandle elementIsDistinctFrom;

public SetAggregationStateSerializer(@TypeParameter("T") Type elementType)
public SetAggregationStateSerializer(Type elementType,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: new line

@rschlussel rschlussel dismissed stale reviews from arhimondr and steveburnett via 4d967ca June 14, 2024 14:08
@rschlussel rschlussel force-pushed the distinct-comparisons branch from ae1f483 to 4d967ca Compare June 14, 2024 14:08
arhimondr
arhimondr previously approved these changes Jun 14, 2024
arhimondr
arhimondr previously approved these changes Jun 14, 2024
@kaikalur
Copy link
Contributor

We need a session param - this change is extensive and subtle NULL semantics that people might be depending on. So we need a way to use old semantics.

@rschlussel
Copy link
Contributor Author

We need a session param - this change is extensive and subtle NULL semantics that people might be depending on. So we need a way to use old semantics.

Makes sense. I'll add a property The previous semantics were pretty weird, so I'll make the property true by default. (Previously, regular null values in the input were fine and would be deduplicated, but nulls nested inside an array or row would give an error)

@rschlussel
Copy link
Contributor Author

After further discussion with Sreeni, we agreed that a session property should not be necessary here, as the previous behavior was pretty strange and unlikely to be desired (the difference is previously we were failing for some cases of nulls in the input, and now we will handle those according to expected distinct semantics). there is nothing that was succeeding that will have different results.

rschlussel and others added 6 commits June 24, 2024 11:58
Use distinct semantics rather than equal to semantics for
array_distinct. This fixes an issue where arrays of rows or arrays that had
null fields/elements would fail with a message that comparison of rows
with null elements was not allowed.

Based on: trinodb/trino@d1a50dc
co-authored-by: Piotr Findeisen <piotr.findeisen@gmail.com>
Consolidate array_intersect tests into TestArrayIntersectFunction. Split
up the giant test into smaller tests. There are no changes to the test cases
themselves.
Makes it align with array_intersect and array_except so it's easy to
find where the tests are.
Use distinct semantics rather than equal to semantics for
array_intersect, array_except, and array_union. This fixes failures
where Arrays of rows that had null fields would fail with errors that
comparison of rows with null elements were not allowed
Use distinct semantics rather than equal to semantics for set_agg and
set_union.  This fixes an issue where arrays of rows or arrays that had
null fields/elements would fail with a message that comparison of rows
with null elements was not allowed.

This change required generating the set_agg and set_union function
directly rather than using annotation based generation because we needed
to add an argument to SetAggregationStateSerializer
Add information about how some functions handle inputs from rows with
null fields.  Also fix a couple formatting issues.
@rschlussel rschlussel force-pushed the distinct-comparisons branch from d09549d to 600d474 Compare June 24, 2024 15:59
@rschlussel rschlussel merged commit 435e177 into prestodb:master Jun 25, 2024
@tdcmeehan tdcmeehan mentioned this pull request Aug 23, 2024
34 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Use distinct semantics for functions that do distincting

5 participants