-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-25417][SQL] ArrayContains function may return incorrect result when right expression is implicitly down casted #22408
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Test build #96009 has finished for PR 22408 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: can you revert this change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ueshin Sure.. sorry.. didn't notice it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why does this fail?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also could you check the error message as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ueshin This fails since 1.23 is parsed as decimal(3, 2). And findTightestCommonType can not find a common type between integer and decimal(3,2). I specifically added this case to draw it to your attention to see if we are okay with this behaviour :-). If we had it as 1.23D then we would have been able to find a common type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ueshin Did you want me to check the error message for all the cases ? or just a couple of them ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As for the behavior, we might be better to use findWiderTypeWithoutStringPromotion instead of findTightestCommonType.
And as for checking the error message, just those you added.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ueshin Thank you. I also thought of using findWiderTypeWithoutStringPromotion but later changed it to findTightestCommonType. One question i had was, can findWiderTypeWithoutStringPromotion do a lossy conversion. From the class description it seems only findTightestCommonType does a absolute safe casting ? Please let me know what you think ..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ueshin FYI - i just pushed the changes addressing your comments. I have kept findTightestCommonType as is for now. I will change it based on your input and adjust the tests accordingly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. Yes, I can do a lossy conversion.
Seems like BinaryComparison uses wider DecimalType, so we could follow the behavior to use findWiderTypeWithoutStringPromotion.
cc @gatorsmile @cloud-fan
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, if we can array_contains(array(1), '1') in 2.3 as #22408 (comment), we should use findWiderCommonType?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ueshin Sure. We could use findWiderCommonType. My thinking was, since we are injecting this cast implicitly, we should pick the safest cast so we don't see data dependent surprises. Users could always specify an explicit cast and take the the responsibility of the result :-)
However, i don't have a strong opinion. I will change it use findWiderCommonType
|
Test build #96017 has finished for PR 22408 at commit
|
|
Test build #96036 has finished for PR 22408 at commit
|
|
Test build #96031 has finished for PR 22408 at commit
|
|
retest this please |
|
Test build #96044 has finished for PR 22408 at commit
|
|
we need to update the migration guide? e.g., |
|
@maropu I thought we added this function in 2.4 :-). Yeah.. i will update the migration guide. |
|
How about this decimal case? |
|
@maropu This is the case we were discussing for which @ueshin suggested using |
|
What is the corresponding behavior in Presto? |
|
My general idea is to avoid risky implicit type casting at the beginning. We can relax it in the future, if needed. After all, users can manually cast the types after seeing the reasonable error message. This should not be a big deal. However, returning a confusing result due to implicit type casting is not good in general. |
|
@ushin @gatorsmile Here are the results from presto. Please let me know if you want me try any case in particular. One thing to note is that presto allows comparison between int and decimal. In our presto:default> select contains(array[1,2,3], '1');
Query 20180914_053612_00006_pru6h failed: line 1:8: Unexpected parameters (array(integer), varchar(1)) for function contains. Expected: contains(array(T), T) T:comparable
select contains(array[1,2,3], '1')
presto:default> select contains(array[1,2,3], 'foo');
Query 20180914_053729_00007_pru6h failed: line 1:8: Unexpected parameters (array(integer), varchar(3)) for function contains. Expected: contains(array(T), T) T:comparable
select contains(array[1,2,3], 'foo')
presto:default> select contains(array['1','2','3'], 1);
Query 20180914_053850_00008_pru6h failed: line 1:8: Unexpected parameters (array(varchar(1)), integer) for function contains. Expected: contains(array(T), T) T:comparable
select contains(array['1','2','3'], 1)
presto:default> select contains(array[1,2,3], cast(1.0 as decimal(10,2)));
_col0
-------
true
(1 row)
presto:default> select contains(array[1,2,3], cast(1.0 as double));
_col0
-------
true
(1 row) |
|
what does presto return for |
|
@cloud-fan It returns false. |
|
@cloud-fan For the above case, from the plan, it seems like presto convert both sides to decimal(12,2) presto:default> explain select contains(array[1], 1.34);
Query Plan
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
- Output[_col0] => [contains:boolean]
Cost: {rows: 1 (10B), cpu: 10.00, memory: 0.00, network: 0.00}
_col0 := contains
- Project[] => [contains:boolean]
Cost: {rows: 1 (10B), cpu: 10.00, memory: 0.00, network: 0.00}
contains := "contains"(CAST("$literal$array(integer)"("from_base64"('CQAAAElOVF9BUlJBWQEAAAAAAQAAAA==')) AS array(decimal(12,2))), CAST(DECIMAL '1.34' AS decimal
- LocalExchange[ROUND_ROBIN] () =>
Cost: {rows: 1 (0B), cpu: 0.00, memory: 0.00, network: 0.00}
- Values => []
Cost: {rows: 1 (0B), cpu: 0.00, memory: 0.00, network: 0.00}
() |
|
LGTM. I think the last piece is the migration guide, to explain what changed from 2.3 to 2.4 |
|
@cloud-fan Thanks a lot. Let me take a stab at the migration guide changes. |
|
@ueshin @cloud-fan @gatorsmile Can you please look at the doc changes when you get a chance.. Thanks. |
|
Test build #96126 has finished for PR 22408 at commit
|
docs/sql-programming-guide.md
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hhmm, I think we can convert int to DecimalType(10, 0) without losing precision?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cloud-fan Yes..Actually, i think, the logic to handle int and decimal promotion can be improved. I like presto's behaviour here -
int, decimal(3, 2) => decimal(12, 2)
int, decimal(4, 3) => decimal(13, 3)
int, decimal(11, 3) => decimal(14, 3)
Let me know what you think..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cloud-fan I have taken a stab at fixing this at https://github.com/dilipbiswal/spark/tree/find_tightest
May i request you to please take a look ? I have run all the tests and they look ok. If you are okay, i will open a PR for this and once that gets in, we can look at this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left a few comments. Please send a PR, thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we have a bug in the findTightestCommonType. For an int and a decimal, findTightestCommonType can return a value if the decimal's precision is bigger than int. But it can't return a value if the decimal's precision is small.
docs/sql-programming-guide.md
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we remove this example? or pick a different one that can work. Basically it should work if we fix the bug with #22448
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cloud-fan OK.
… right expression is implicitly down casted
d2d8afe to
df5ea47
Compare
docs/sql-programming-guide.md
Outdated
| <b>false</b> | ||
| </th> | ||
| <th> | ||
| <b>In Spark 2.4, both left and right parameters are promoted to array(double) and double type respectively.</b> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove both.
| <b>true</b> | ||
| </th> | ||
| <th> | ||
| <b>AnalysisException is thrown since integer type can not be promoted to string type in a loss-less manner.</b> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can promote int to string, but I'm not sure that's a common behavior in other databases
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If presto doesn't do it, we should follow it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cloud-fan Yeah, presto gives error. Please refer to my earlier comment showing the presto output. Did you want anything to change in the description ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah then it's fine, we don't need to change anything here.
| ) | ||
|
|
||
| checkAnswer( | ||
| df.selectExpr("array_contains(array(1), 1.23D)"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this query doesn't read any data from df, so the 2 result rows are always same. Can we use OneRowRelation here?
| ) | ||
|
|
||
| checkAnswer( | ||
| df.selectExpr("array_contains(array(array(1)), array(1.23))"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm? shouldn't this fail because of the bug?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cloud-fan Yes. it should :-) I think i had changed this test case to verify the fix to tighestCommonType.. and pushed it by mistake. Sorry about it.
|
LGTM |
|
Test build #96322 has finished for PR 22408 at commit
|
|
Test build #96329 has finished for PR 22408 at commit
|
|
retest this please |
|
Test build #96333 has finished for PR 22408 at commit
|
|
thanks, merging to master/2.4! |
… when right expression is implicitly down casted ## What changes were proposed in this pull request? In ArrayContains, we currently cast the right hand side expression to match the element type of the left hand side Array. This may result in down casting and may return wrong result or questionable result. Example : ```SQL spark-sql> select array_contains(array(1), 1.34); true ``` ```SQL spark-sql> select array_contains(array(1), 'foo'); null ``` We should safely coerce both left and right hand side expressions. ## How was this patch tested? Added tests in DataFrameFunctionsSuite Closes #22408 from dilipbiswal/SPARK-25417. Authored-by: Dilip Biswal <[email protected]> Signed-off-by: Wenchen Fan <[email protected]> (cherry picked from commit 67f2c6a) Signed-off-by: Wenchen Fan <[email protected]>
What changes were proposed in this pull request?
In ArrayContains, we currently cast the right hand side expression to match the element type of the left hand side Array. This may result in down casting and may return wrong result or questionable result.
Example :
We should safely coerce both left and right hand side expressions.
How was this patch tested?
Added tests in DataFrameFunctionsSuite