Fix find_first function for NULL value #18952
7458199 to b805b0f
@prestobot kick off tests
I think we should have two separate PRs: one for the bug fix and the other for the new UDF?
Because I wanted to encourage people to use the new UDF. But I guess the probability of this happening is quite low? I can split it into two PRs for ease of review if this is not a problem.
b805b0f to 750cf3f
rschlussel
left a comment
why is null not supported?
...main/src/main/java/com/facebook/presto/operator/scalar/ArrayFindFirstWithOffsetFunction.java
It's because a returned NULL can mean either 1) no match is found, and NULL is returned (for example
750cf3f to 5cdd6e5
Is that what we want? As long as it's documented, I think having NULL represent both is okay and not unusual or surprising (e.g. map.get() in Java works the same way). And people may prefer that their queries not break when there are NULL values in the map.
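For readers less familiar with the Java analogy, here is a minimal sketch of it. The class and helper names are made up for illustration; only `HashMap.get` and `containsKey` are real API. It shows that `get` returns null both for a missing key and for a key mapped to null, and that a second call is needed to tell the two apart.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical demo class, not part of Presto.
class MapGetAmbiguity {
    // Builds a map that holds one key with an explicit null value.
    static Map<String, Integer> sample() {
        Map<String, Integer> m = new HashMap<>();
        m.put("present", null);
        return m;
    }

    public static void main(String[] args) {
        Map<String, Integer> m = sample();
        // Both lookups return null, so get() alone cannot tell the cases apart.
        System.out.println(m.get("present"));          // null
        System.out.println(m.get("absent"));           // null
        // containsKey() is what disambiguates them.
        System.out.println(m.containsKey("present"));  // true
        System.out.println(m.containsKey("absent"));   // false
    }
}
```

The same trade-off applies to the discussion here: an ambiguous null is tolerable only if callers have, and remember to use, a second way to check.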
The main issue is that this uses a predicate to get the first element that satisfies the predicate. It's not just the default exists or not-exists check. So, for example, if I want to find the first element that is NULL, it actually gives a wrong result (not just a confusing one). And most of our arrays don't have NULLs, so I think throwing this error is OK.
Then why not skip nulls for this function and have it return the first non-null?
Then x -> x IS NULL will return false, which is wrong.
I guess I'm confused about how find_first_index solves the problem of people wanting to find the first non-null. Why don't we have a find_first_non_null function if that's needed?
find_first_index(arr, x -> x IS NULL) will return NULL if no element is NULL, or the index of the first NULL element.
Are we looking for the first null element or the first non-null element? I don't have a problem with the find_first_index function, but I'm not sure we should be failing for this function. Can we take a step back and list the different use cases we're trying to cover with these functions?
In general, if the predicate is sensitive to NULL (like IS NULL), it's a problem. The problem we are trying to solve is that find_first returning NULL can mean either that the element satisfying the predicate is NULL or that no element satisfies the predicate. Since we can't distinguish these, we will now add a new function that simply returns the index of the first element satisfying the predicate (or NULL if no such element exists). So we now throw an error for the second example. The right way to do this is: Will be NULL if no such element exists. Or the index (for example 1 in the second call above) where a NULL element exists. This is no different from the runtime errors we give for things like comparing ROW elements that have NULLs etc. Also, this is better because it avoids confusion. Also, we now have the ARRAY_REMOVE_NULLS function, which the user can use to filter the input if they want to apply the predicate to only non-NULL elements.
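To make the index-based semantics concrete, here is a rough Java model of what is described above. It is only a sketch: `FindFirstIndexSketch` and `findFirstIndex` are hypothetical names, not the actual Presto implementation, and the 1-based indexing is assumed from the "index 1" example in this comment.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical model of the find_first_index semantics; not Presto code.
class FindFirstIndexSketch {
    // Returns the 1-based index of the first element matching the predicate,
    // or null when no element matches.
    static <T> Integer findFirstIndex(List<T> values, Predicate<? super T> predicate) {
        for (int i = 0; i < values.size(); i++) {
            if (predicate.test(values.get(i))) {
                return i + 1; // assumed 1-based, like SQL array subscripts
            }
        }
        return null;
    }

    public static void main(String[] args) {
        List<Integer> withNull = Arrays.asList(null, 1);
        List<Integer> noNull = Arrays.asList(1);
        // A NULL element that matches still yields a non-null index.
        System.out.println(findFirstIndex(withNull, x -> x == null)); // 1
        // No match yields null, so the two cases are now distinguishable.
        System.out.println(findFirstIndex(noNull, x -> x == null));   // null
    }
}
```

Because the result is an index rather than the element itself, a null result can only mean "no match".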
Why is find_first_index, followed by getting that index from the array, better than contains() and find_first() used in combination if you're worried about nulls?
It's more about usability and not falling into traps. I had this experience when I implemented SET_AGG. So I guess one could argue that find_first_index is probably not needed, but it would be nice to have. But I think throwing an error for NULL is the right thing.
Let me compare three versions of the find_first function, and use the following naming for ease of comparison:
And let's compare the following cases
The first thing we should agree on is that the current version, i.e. The following question is then how to fix this problem. I've been thinking about different fixes, and it ends up that throwing an exception is the desirable way. This is because 1) it makes a minimal change to the semantics of the existing find_first function, and 2) throwing exception
This PR is not meant to "solve the problem of people wanting to find first non-null" (and that can already be solved by the existing find_first function, e.g.
rschlussel
left a comment
I'm not sure I agree that it's a problem for the null return value to be ambiguous. But I don't feel strongly about it, so approving since others think it makes sense.
rschlussel
left a comment
Fix the test failure first. Requesting changes so it doesn't accidentally get merged.
d14ca92 to 6ca1617
6ca1617 to 2e95c57
2e95c57 to 5a8fcb1
Rename from ArrayFindFirstFirstFunction.java to ArrayFindFirstFunction.java
Currently find_first returns NULL when no matching element is found. However, it can also return a NULL element when that element satisfies the provided condition. In order to distinguish these two cases, the find_first function will throw an exception if the matched value is NULL.
5a8fcb1 to 7bf7037
What's the change
Fixes #18899
The find_first function returns NULL if no match is found. However, this cannot be distinguished from a NULL returned as the matched value.
For example, both
SELECT FIND_FIRST(ARRAY[NULL, 1], x->x is NULL);
and
SELECT FIND_FIRST(ARRAY[1], x->x is NULL);
return NULL.
In this PR, the find_first function will throw an exception if the returned match value is NULL.
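As a rough illustration of the new behavior, the sketch below models it in plain Java. It is not the code in ArrayFindFirstFunction.java: the class name, the exception type, and the message are assumptions chosen only to show the control flow.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical model of the fixed find_first semantics; not Presto code.
class FindFirstSketch {
    // Returns the first element matching the predicate, or null if none matches.
    // Throws if the matching element is itself null, so that a null result
    // can only ever mean "no match".
    static <T> T findFirst(List<T> values, Predicate<? super T> predicate) {
        for (T value : values) {
            if (predicate.test(value)) {
                if (value == null) {
                    // Exception type and message are placeholders for illustration.
                    throw new IllegalArgumentException("find_first got NULL as the matched value");
                }
                return value;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // No element is null, so nothing matches and null is returned.
        System.out.println(findFirst(Arrays.asList(1), x -> x == null)); // null
        try {
            List<Integer> withNull = Arrays.asList(null, 1);
            findFirst(withNull, x -> x == null);
        } catch (IllegalArgumentException e) {
            // The first match is a null element, so the call fails instead of returning null.
            System.out.println("error: " + e.getMessage());
        }
    }
}
```

With this shape, the two queries above no longer produce the same result: the second returns NULL and the first fails.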
Test plan
Add unit tests