Skip to content

Fix find_first function for NULL value#18952

Merged
feilong-liu merged 2 commits intoprestodb:masterfrom
feilong-liu:fix_find_first_for_null
Feb 11, 2023
Merged

Fix find_first function for NULL value#18952
feilong-liu merged 2 commits intoprestodb:masterfrom
feilong-liu:fix_find_first_for_null

Conversation

@feilong-liu
Copy link
Contributor

@feilong-liu feilong-liu commented Jan 20, 2023

What's the change

Fixes #18899

  • Ambiguous NULL return value for find_first function.

The find_first function returns NULL if no match found. However, it cannot distinguish from the NULL returned as values.

For example, both SELECT FIND_FIRST(ARRAY[NULL, 1], x->x is NULL); and SELECT FIND_FIRST(ARRAY[1], x->x is NULL); return NULL.

In this PR, the find_first function will throw exception if the returned match value is NULL.

Test plan - (Please fill in how you tested your changes)

Add unit tests

== RELEASE NOTES ==

General Changes
* Fix output of find_first function for NULL
   Now it will throw exception if the found matched value is NULL

@feilong-liu feilong-liu requested a review from a team as a code owner January 20, 2023 01:48
@feilong-liu feilong-liu marked this pull request as draft January 20, 2023 01:48
@feilong-liu feilong-liu force-pushed the fix_find_first_for_null branch 4 times, most recently from 7458199 to b805b0f Compare January 20, 2023 07:53
@feilong-liu feilong-liu requested a review from kaikalur January 20, 2023 16:09
@feilong-liu feilong-liu marked this pull request as ready for review January 20, 2023 16:09
@v-jizhang
Copy link
Contributor

@prestobot kick off tests

@feilong-liu feilong-liu requested review from kaikalur and removed request for kaikalur January 23, 2023 18:48
@kaikalur
Copy link
Contributor

I think we should have 2 separate PRs - one for the bug fix and the other for the new udf?

@feilong-liu
Copy link
Contributor Author

feilong-liu commented Jan 24, 2023

I think we should have 2 separate PRs - one for the bug fix and the other for the new udf?

Because I wanted to encourage people to use the new UDF find_first_index in the exception message thrown in find_first when NULL value is return as matched. The only concern I have is that if two PRs somehow end up not in the same release, it will cause some confusion.

But I guess that the probability of this happening is quite low? I can split it into two PRs for ease of review if this is not a problem.

@feilong-liu feilong-liu force-pushed the fix_find_first_for_null branch from b805b0f to 750cf3f Compare January 24, 2023 18:54
@feilong-liu feilong-liu changed the title Fix find_first function for NULL value and add find_first_index function Fix find_first function for NULL value Jan 24, 2023
Copy link
Contributor

@rschlussel rschlussel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is null not supported?

@feilong-liu
Copy link
Contributor Author

why is null not supported?

It's because returned NULL can be either 1) no match is found, and NULL is returned (for example FIND_FIRST(ARRAY[1], x->x is NULL)) or 2) match found, whose value is NULL (for example FIND_FIRST(ARRAY[NULL, 1], x->x is NULL))
In order to avoid this confusion, we throw exception in the second case, and encourage user to use the other UDF find_first_index added in #18962

@feilong-liu feilong-liu force-pushed the fix_find_first_for_null branch from 750cf3f to 5cdd6e5 Compare January 27, 2023 00:14
@rschlussel
Copy link
Contributor

why is null not supported?

It's because returned NULL can be either 1) no match is found, and NULL is returned (for example FIND_FIRST(ARRAY[1], x->x is NULL)) or 2) match found, whose value is NULL (for example FIND_FIRST(ARRAY[NULL, 1], x->x is NULL)) In order to avoid this confusion, we throw exception in the second case, and encourage user to use the other UDF find_first_index added in #18962

Is that what we want? As long as it's documented, i think having null represent both is okay and not unusual or surprising (e.g. map.get() in java is the same way). And people may prefer their queries not break when there are null values in the map.

@kaikalur
Copy link
Contributor

kaikalur commented Jan 27, 2023

why is null not supported?

It's because returned NULL can be either 1) no match is found, and NULL is returned (for example FIND_FIRST(ARRAY[1], x->x is NULL)) or 2) match found, whose value is NULL (for example FIND_FIRST(ARRAY[NULL, 1], x->x is NULL)) In order to avoid this confusion, we throw exception in the second case, and encourage user to use the other UDF find_first_index added in #18962

Is that what we want? As long as it's documented, i think having null represent both is okay and not unusual or surprising (e.g. map.get() in java is the same way). And people may prefer their queries not break when there are null values in the map.

Main issue is this is using a predicate to get the first element that satisfies the predicate. It's not the just default exists or not exists. So for example if I want to find the first element that's is null, it's actually wrong result (not just confusing). And most our arrays dont have null so I think throwing this error is OK

@rschlussel
Copy link
Contributor

Main issue is this is using a predicate to get the first element that satisfies the predicate. It's not the just default exists or not exists. So for example if I want to find the first element that's is null, it's actually wrong result (not just confusing). And most our arrays dont have null so I think throwing this error is OK

Then why not skip nulls for this function and have it return the first non-null?

@kaikalur
Copy link
Contributor

Main issue is this is using a predicate to get the first element that satisfies the predicate. It's not the just default exists or not exists. So for example if I want to find the first element that's is null, it's actually wrong result (not just confusing). And most our arrays dont have null so I think throwing this error is OK

Then why not skip nulls for this function and have it return the first non-null?

Then x->x is NULL will return false which is wrong.

@rschlussel
Copy link
Contributor

I guess i'm confused how find_first_index solves the problem of people wanting to find first non-null. Why don't we have a find_first_non_null function if that's needed?

@kaikalur
Copy link
Contributor

I guess i'm confused how find_first_index solves the problem of people wanting to find first non-null. Why don't we have a find_first_non_null function if that's needed?

if (find_first_index(arr, x->x IS NULL)

will return NULL if no element is NULL or the index of the first NULL element

@rschlussel
Copy link
Contributor

Are we looking for the first null element or the first non-null element? I don't have a problem with the find_first_index function, but I'm not sure we should be failing for this function. Can we take a step back and list the different use cases we're trying to cover with these functions.

@kaikalur
Copy link
Contributor

kaikalur commented Jan 27, 2023

Are we looking for the first null element or the first non-null element? I don't have a problem with the find_first_index function, but I'm not sure we should be failing for this function. Can we take a step back and list the different use cases we're trying to cover with these functions.

In general if the predicate is sensitive to NULL (like IS NULL), it's a problem. The problem we are trying to solve is that find_fist returning NULL can mean the element satisfying the predicate is NULL or no element satisfies the predicate. Since we can't distinguish this, we will now add a new function that simply returns the index of the first element satisying the predicate (or NULL if no such element exists).

FIND_FIRST(ARRAY[1], x->x is NULL))  -- match found, whose value is NULL (
FIND_FIRST(ARRAY[NULL, 1], x->x is NULL)) Also returns null but it's ambiguous

So we now throw error for second example. The right way to do this is:

FIND_FIRST_INDEX(my_arr, x->x IS NULL)

Will be NULL if no such element exists. Or the index (for example 1 in the second call above) where a NULL element exists.

This is no different from the runtime errors we give for things like comparing ROW elements that have NULLs etc. Also this is better because it avoids confusion.

Also we now have ARRAY_REMOVE_NULLS function which the user can use to filter the input if they want to apply the predicate to only non-NULL elements.

@rschlussel
Copy link
Contributor

why is find_first_index + then getting that index from the array better than contains() and find_first() used in combo if you're worried about nulls?
I guess performance-wise it's slightly better because you only have to evaluate the function/search the array once, and after that you access directly by index.

@kaikalur
Copy link
Contributor

It's more about usability and not falling into traps. I have had this experience when I implemented SET_AGG. So I guess one could argue the find_first_index is probably not needed but it will be nice to have. But I think throwing error for NULL is the right thing

@feilong-liu
Copy link
Contributor Author

feilong-liu commented Jan 30, 2023

@rschlussel

Are we looking for the first null element or the first non-null element? I don't have a problem with the find_first_index function, but I'm not sure we should be failing for this function. Can we take a step back and list the different use cases we're trying to cover with these functions.

Let me compare three versions of the first_first function, and use the following naming for ease of comparison:

  • find_first_original: the current find_first function without fix
  • find_first_fix: the find_first function with the fix in this PR
  • find_first_non_null: find first non-null element

And let's compare the following cases

  1. Input array has NO NULL elements
    No difference
  2. Input array has NULL elements and NULL does NOT match the predicate
    No difference
  3. Input array has NULL elements and NULL does match the predicate, and NULL is NOT the first matching element
    No difference, all three versions will return the first match elements
  4. Input array has NULL elements and NULL does match the predicate, and NULL is the first matching element
    Is different. find_first_original will return NULL, find_first_fix will throw exception, find_first_non_null will return the next matching element (if any).
    For example, find_first_original(array[NULL, 1], x -> true) returns NULL, find_first_fix(array[NULL, 1], x -> true) throws exception, find_first_non_null(array[NULL, 1], x -> true) returns 1.

The first thing we should agree on is that, the current version, i.e. find_first_original should be fixed, as the returned match value NULL is not differentiable from the returned NULL which means no match element is found.

The following question is then, how do we fix this problem. I've been thinking on different fixes, and ends up that throwing exceptions is the desirable way. This is because 1) it makes minimal change to the semantics of the existing find_first function. 2) throwing exception and pointing people to the alternative of find_first_index (whether to educate users on the possible solution and which solution to use can be a separate problem, and let me update the error message) is a safe way to avoid silent correctness issues.

@feilong-liu
Copy link
Contributor Author

I guess i'm confused how find_first_index solves the problem of people wanting to find first non-null. Why don't we have a find_first_non_null function if that's needed?

This PR is not to "solves the problem of people wanting to find first non-null" (and it can already be solved by existing find_first function, e.g. find_first(array, x->x is not null)). It's about to solve the ambiguous case where NULL value is returned both when no match found and NULL matching value is found.

Copy link
Contributor

@rschlussel rschlussel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I agree that it's a problem for the null return value to be ambiguous. But I don't feel strongly about it, so approving since others think it makes sense.

Copy link
Contributor

@rschlussel rschlussel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fix the test failure first. requesting changes so it doesn't accidentally get merged.

@feilong-liu feilong-liu force-pushed the fix_find_first_for_null branch 3 times, most recently from d14ca92 to 6ca1617 Compare January 31, 2023 01:46
@feilong-liu feilong-liu force-pushed the fix_find_first_for_null branch from 6ca1617 to 2e95c57 Compare February 2, 2023 21:17
@feilong-liu feilong-liu force-pushed the fix_find_first_for_null branch from 2e95c57 to 5a8fcb1 Compare February 7, 2023 17:32
Remane from ArrayFindFirstFirstFunction.java to ArrayFindFirstFunction.java
Currently find_first return NULL when no matched element is found.
However it can also return NULL elements when the element satisfy the
provided condition. In order to distinguish these two cases, the
find_first function will throw exception if the returned matched value
is NULL.
@feilong-liu feilong-liu force-pushed the fix_find_first_for_null branch from 5a8fcb1 to 7bf7037 Compare February 11, 2023 00:18
@feilong-liu feilong-liu merged commit ea617d1 into prestodb:master Feb 11, 2023
@wanglinsong wanglinsong mentioned this pull request Feb 25, 2023
12 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

find_first UDF cannot distinguish between NULL returned as a value and NULL returned because of no match

4 participants