Skip to content

Make array_distinct compare elements with "is distinct"#11979

Closed
findepi wants to merge 1 commit intoprestodb:masterfrom
findepi:findepi/array-distinct
Closed

Make array_distinct compare elements with "is distinct"#11979
findepi wants to merge 1 commit intoprestodb:masterfrom
findepi:findepi/array-distinct

Conversation

@findepi
Copy link
Contributor

@findepi findepi commented Nov 26, 2018

Fixes #11963, fixes #11977

@dain
Copy link
Contributor

dain commented Nov 26, 2018

@findepi, I'd like to see @haozhun review this. Specifically the method handle bits, and how they relate to the new calling conventions.

Related to that, I am wondering if we can adapt the TypeSet code to alway work on an "equalitator" function based on block and position. I think we can adapt both version to that style.

@findepi
Copy link
Contributor Author

findepi commented Nov 26, 2018

I think, but i may be wrong, that @OperatorDependency doesn't support elasticity around calling convention. @haozhun am i right that operators provided this way must have single calling convention?

Related to that, I am wondering if we can adapt the TypeSet code to alway work on an "equalitator" function based on block and position.

That should be possible. Currently the problem is that equality is like == and "is distinct" is like !=. Thus one would need to wrap the MethodHandles in some further BiPredicate or something.
Before going this direction, it would be good to know if this is needed. TypedSet is currently used in:

  • array_distinct
  • array_except
  • array_union
  • map_from_entries
  • map_concat
  • cosine_similarity
  • multimap_agg
  • multimap_from_entries
  • MapToMapCast

Do you know which of these should use "equals" and which should use "distinctness"?

@dain
Copy link
Contributor

dain commented Nov 26, 2018

@findepi I am no expert on these functions, but to me I would guess all of them should be using is not distinct from over =.

@haozhun
Copy link
Contributor

haozhun commented Nov 26, 2018

@gerlou did the majority of work for support of elasticity around calling convention for @OperatorDependency. @cemcayiroglu completed a few additional pieces.

I'll take over #11263 and wrap it up. With that, distinct from operator will have two conventions implemented, and when annotating a parameter with @OperatorDependency, one can declare which one is needed. (PR up: #11983)

@findepi
Copy link
Contributor Author

findepi commented Nov 27, 2018

@haozhun i will wait for #11983 then. Your PR is definitely bigger and harder to rebase.

@dain i will then extend this PR to cover all usages of TypedSet (affecting said functions).

assertFunction("ARRAY_DISTINCT(ARRAY [NULL, NULL])", new ArrayType(UNKNOWN), asList((Object) null));
assertFunction("ARRAY_DISTINCT(ARRAY [NULL, NULL, NULL])", new ArrayType(UNKNOWN), asList((Object) null));

// Indeterminate values
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add the test cases mentioned in #11977 as well?

@yingsu00
Copy link
Contributor

@dain @haozhun @findepi correct me if I'm wrong. I don't quite know the history of TypedSet and the isDistinctFrom operator, but it looks to me that TypedSet was trying to support the "distinctness" semantics. This can be seen from how it treats null: It has special slot in hashtable for NULL, and a flag indicating the existence of NULL, and also added guard to check NULL values before calling type.equalTo() in positionEqualsPosition() (this guard takes the distinctness semantics instead of the equal to semantics). So I would think replacing the type.equalTo with isDistinctFrom (and remove the equalTo() call) is sufficient enough, unless there is future use case that requires the "equalTo" semantics.

Another thought is: the wrong results issue (#11977) was caused by the (array.getPositionCount() == 2) optimization only. It called type.equalTo() without checking on null in front. The callsite of positionEqualsPosition() (which calls type.equalTo()) inside of TypedSet.getHashPositionOfElement() shouldn't be causing correctness issues, because it would throw "not supported" error if it sees NULL in composite data types, and for primitive types it has guard to check null in front. So if we want to fix the correctness issue before #11983 is merged, we can just add null check for that special case, or simply delete the if (array.getPositionCount() == 2) block. I can send PR for it if people think we want a quicker fix on the correctness issue.

@findepi
Copy link
Contributor Author

findepi commented Nov 28, 2018

it looks to me that TypedSet was trying to support the "distinctness" semantics.

@yingsu00 it looks so. I listed all usages above (#11979 (comment)) and the plan is to make the class work in "distinctness" semantics always.

or simply delete the if (array.getPositionCount() == 2) block. I can send PR for it if people think we want a quicker fix on the correctness issue.

As you observed, this solves one known issue but not the other. (And would probably fix array_distinct only.)
Fortunately we have fix ready: #11983 + this. If there is a time pressure, we can defer extending the fix to other function types.

@findepi
Copy link
Contributor Author

findepi commented Mar 28, 2019

Superseded by trinodb/trino#559

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

ARRAY_DISTINCT returns wrong results Make Array_distinct to support rows with NULL

5 participants