Conversation

@dilipbiswal
Contributor

What changes were proposed in this pull request?

In ArrayContains, we currently cast the right-hand-side expression to match the element type of the left-hand-side array. This may result in down-casting and may return wrong or questionable results.

Example :

spark-sql> select array_contains(array(1), 1.34);
true
spark-sql> select array_contains(array(1), 'foo');
null

We should safely coerce both left and right hand side expressions.
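The down-cast problem can be sketched with a small Python model (an illustration of the behavior, not Spark's actual implementation):

```python
# Minimal model (not Spark's code) of why down-casting the right-hand
# side to the array element type gives wrong answers.
arr = [1]
needle = 1.34

# Spark 2.3 behavior: cast the needle down to the element type (int) first.
down_cast = int(needle)          # 1.34 becomes 1; precision is silently lost
buggy = down_cast in arr         # True, even though 1.34 is not in the array

# Safe coercion: widen both sides to a common type (double) instead.
safe = needle in [float(x) for x in arr]   # False, as expected
```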

How was this patch tested?

Added tests in DataFrameFunctionsSuite

@SparkQA

SparkQA commented Sep 13, 2018

Test build #96009 has finished for PR 22408 at commit 2c6e8dc.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

nit: can you revert this change?

Contributor Author

@ueshin Sure.. sorry.. didn't notice it.

Member

Why does this fail?

Member

Also could you check the error message as well?

Contributor Author

@ueshin This fails since 1.23 is parsed as decimal(3, 2), and findTightestCommonType cannot find a common type between integer and decimal(3, 2). I specifically added this case to draw it to your attention, to see if we are okay with this behaviour :-). If we had it as 1.23D, we would have been able to find a common type.
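A hypothetical mini-model (not Spark's real implementation) of the findTightestCommonType behavior described here:

```python
# Model of findTightestCommonType for the cases discussed in this thread.
INT, DOUBLE = "int", "double"

def decimal(p, s):
    return ("decimal", p, s)

def find_tightest_common_type(a, b):
    if a == b:
        return a
    if {a, b} == {INT, DOUBLE}:
        return DOUBLE        # int widens losslessly to double
    return None              # e.g. int vs decimal(3, 2): no lossless common type

# 1.23 parses as decimal(3, 2): no tightest common type with int, so analysis fails.
lit_1_23 = find_tightest_common_type(INT, decimal(3, 2))   # None
# 1.23D parses as double: int -> double is lossless, so this resolves.
lit_1_23d = find_tightest_common_type(INT, DOUBLE)         # "double"
```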

Contributor Author

@ueshin Did you want me to check the error message for all the cases, or just a couple of them?

Member

As for the behavior, it might be better to use findWiderTypeWithoutStringPromotion instead of findTightestCommonType.
And as for checking the error message, just those you added.

Contributor Author

@ueshin Thank you. I also thought of using findWiderTypeWithoutStringPromotion but later changed it to findTightestCommonType. One question I had was: can findWiderTypeWithoutStringPromotion do a lossy conversion? From the class description it seems only findTightestCommonType does an absolutely safe cast. Please let me know what you think.

Contributor Author

@ueshin FYI - I just pushed the changes addressing your comments. I have kept findTightestCommonType as is for now. I will change it based on your input and adjust the tests accordingly.

Member

Good point. Yes, it can do a lossy conversion.
It seems like BinaryComparison uses the wider DecimalType, so we could follow that behavior and use findWiderTypeWithoutStringPromotion.
cc @gatorsmile @cloud-fan

Member

Hmm, if we can run array_contains(array(1), '1') in 2.3 as shown in #22408 (comment), should we use findWiderCommonType?

Contributor Author

@ueshin Sure. We could use findWiderCommonType. My thinking was, since we are injecting this cast implicitly, we should pick the safest cast so we don't see data-dependent surprises. Users could always specify an explicit cast and take responsibility for the result :-)

However, I don't have a strong opinion. I will change it to use findWiderCommonType.

@SparkQA

SparkQA commented Sep 13, 2018

Test build #96017 has finished for PR 22408 at commit 4fb4c1a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 13, 2018

Test build #96036 has finished for PR 22408 at commit 04915fd.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 13, 2018

Test build #96031 has finished for PR 22408 at commit 63f6ce5.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dilipbiswal
Contributor Author

retest this please

@SparkQA

SparkQA commented Sep 13, 2018

Test build #96044 has finished for PR 22408 at commit 04915fd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Member

maropu commented Sep 14, 2018

we need to update the migration guide? e.g.,

// this patch
scala> sql("select array_contains(array(1), '1')").show
org.apache.spark.sql.AnalysisException: cannot resolve 'array_contains(array(1), '1')' due to data type mismatch: Input to function array_contains should have been array followed by a value with same element type, but it's [array<int>, string].; line 1 pos 7;
'Project [unresolvedalias(array_contains(array(1), 1), None)]
+- OneRowRelation

// v2.3.1
scala> sql("select array_contains(array(1), '1')").show
+----------------------------------------+
|array_contains(array(1), CAST(1 AS INT))|
+----------------------------------------+
|                                    true|
+----------------------------------------+

@dilipbiswal
Contributor Author

@maropu I thought we added this function in 2.4 :-). Yeah, I will update the migration guide.

@maropu
Member

maropu commented Sep 14, 2018

How about this decimal case?

// v2.3.1
scala> spark.range(10).selectExpr("cast(id AS decimal(9, 0)) as value").selectExpr("array_contains(array(1, 2, 3), value)").show
+--------------------------------------------------+
|array_contains(array(1, 2, 3), CAST(value AS INT))|
+--------------------------------------------------+
|                                             false|
|                                              true|
|                                              true|
|                                              true|
|                                             false|
|                                             false|
|                                             false|
|                                             false|
|                                             false|
|                                             false|
+--------------------------------------------------+

// this patch
scala> spark.range(10).selectExpr("cast(id AS decimal(9, 0)) as value").selectExpr("array_contains(array(1, 2, 3), value)").show
org.apache.spark.sql.AnalysisException: cannot resolve 'array_contains(array(1, 2, 3), `value`)' due to data type mismatch: Input to function array_contains should have been array followed by a value with same element type, but it's [array<int>, decimal(9,0)].; line 1 pos 0;
'Project [unresolvedalias(array_contains(array(1, 2, 3), value#2), Some(<function1>))]
+- Project [cast(id#0L as decimal(9,0)) AS value#2]
   +- Range (0, 10, step=1, splits=Some(4))

@dilipbiswal
Contributor Author

@maropu This is the case we were discussing, for which @ueshin suggested using findWiderTypeWithoutStringPromotion. Let's see what @cloud-fan and @gatorsmile think and we will do accordingly. We are picking a more restrictive cast (since we are injecting it implicitly).

@gatorsmile
Member

What is the corresponding behavior in Presto?

@gatorsmile
Member

My general idea is to avoid risky implicit type casting at the beginning. We can relax it in the future, if needed. After all, users can manually cast the types after seeing the reasonable error message. This should not be a big deal. However, returning a confusing result due to implicit type casting is not good in general.

@dilipbiswal
Contributor Author

@ueshin @gatorsmile Here are the results from Presto. Please let me know if you want me to try any case in particular. One thing to note is that Presto allows comparison between int and decimal. In our findTightestCommonType we don't do that promotion.

presto:default> select contains(array[1,2,3], '1');
Query 20180914_053612_00006_pru6h failed: line 1:8: Unexpected parameters (array(integer), varchar(1)) for function contains. Expected: contains(array(T), T) T:comparable
select contains(array[1,2,3], '1')

presto:default> select contains(array[1,2,3], 'foo');
Query 20180914_053729_00007_pru6h failed: line 1:8: Unexpected parameters (array(integer), varchar(3)) for function contains. Expected: contains(array(T), T) T:comparable
select contains(array[1,2,3], 'foo')

presto:default> select contains(array['1','2','3'], 1);
Query 20180914_053850_00008_pru6h failed: line 1:8: Unexpected parameters (array(varchar(1)), integer) for function contains. Expected: contains(array(T), T) T:comparable
select contains(array['1','2','3'], 1)

presto:default> select contains(array[1,2,3], cast(1.0 as decimal(10,2)));
 _col0 
-------
 true  
(1 row)

presto:default> select contains(array[1,2,3], cast(1.0 as double));
 _col0 
-------
 true  
(1 row)

@cloud-fan
Contributor

what does presto return for select array_contains(array(1), 1.34);?

@dilipbiswal
Contributor Author

dilipbiswal commented Sep 14, 2018

@cloud-fan It returns false.

presto:default> select contains(array[1], 1.34)
             -> ;
 _col0 
-------
 false 
(1 row)

@dilipbiswal
Contributor Author

@cloud-fan For the above case, from the plan, it seems like Presto converts both sides to decimal(12,2)

presto:default> explain select contains(array[1], 1.34);
                                                                                      Query Plan                                                                              
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 - Output[_col0] => [contains:boolean]                                                                                                                                        
         Cost: {rows: 1 (10B), cpu: 10.00, memory: 0.00, network: 0.00}                                                                                                       
         _col0 := contains                                                                                                                                                    
     - Project[] => [contains:boolean]                                                                                                                                        
             Cost: {rows: 1 (10B), cpu: 10.00, memory: 0.00, network: 0.00}                                                                                                   
             contains := "contains"(CAST("$literal$array(integer)"("from_base64"('CQAAAElOVF9BUlJBWQEAAAAAAQAAAA==')) AS array(decimal(12,2))), CAST(DECIMAL '1.34' AS decimal
         - LocalExchange[ROUND_ROBIN] () =>                                                                                                                                   
                 Cost: {rows: 1 (0B), cpu: 0.00, memory: 0.00, network: 0.00}                                                                                                 
             - Values => []                                                                                                                                                   
                     Cost: {rows: 1 (0B), cpu: 0.00, memory: 0.00, network: 0.00}                                                                                             
                     ()     

@cloud-fan
Contributor

LGTM. I think the last piece is the migration guide, to explain what changed from 2.3 to 2.4

@dilipbiswal
Contributor Author

@cloud-fan Thanks a lot. Let me take a stab at the migration guide changes.
cc @ueshin

@dilipbiswal
Contributor Author

@ueshin @cloud-fan @gatorsmile Can you please look at the doc changes when you get a chance? Thanks.

@SparkQA

SparkQA commented Sep 17, 2018

Test build #96126 has finished for PR 22408 at commit d2d8afe.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

Hmm, I think we can convert int to DecimalType(10, 0) without losing precision?

Contributor Author

@dilipbiswal dilipbiswal Sep 17, 2018

@cloud-fan Yes. Actually, I think the logic to handle int and decimal promotion can be improved. I like Presto's behaviour here:
int, decimal(3, 2) => decimal(12, 2)
int, decimal(4, 3) => decimal(13, 3)
int, decimal(11, 3) => decimal(14, 3)

Let me know what you think..
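One common union rule for decimal types is `scale = max(s1, s2)` and `precision = max(p1 - s1, p2 - s2) + scale`, treating int as decimal(10, 0). A hedged sketch (whether Presto uses exactly this rule is an assumption; it reproduces the first two promotions listed above):

```python
# Model of a standard decimal-union rule; not taken from Presto's source.
INT_AS_DECIMAL = (10, 0)  # digits needed to hold any 32-bit int

def widen(a, b):
    (p1, s1), (p2, s2) = a, b
    scale = max(s1, s2)
    return (max(p1 - s1, p2 - s2) + scale, scale)

r1 = widen(INT_AS_DECIMAL, (3, 2))   # (12, 2)
r2 = widen(INT_AS_DECIMAL, (4, 3))   # (13, 3)
```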

Contributor Author

@cloud-fan I have taken a stab at fixing this at https://github.com/dilipbiswal/spark/tree/find_tightest
Could you please take a look? I have run all the tests and they look OK. If you are okay with it, I will open a PR, and once that gets in we can come back to this PR.

Contributor

I left a few comments. Please send a PR, thanks!

Contributor

I think we have a bug in findTightestCommonType. For an int and a decimal, findTightestCommonType can return a value if the decimal's precision is big enough to hold an int, but it can't return a value if the decimal's precision is smaller.
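The asymmetry can be sketched as follows (a model, not Spark's code): an int needs decimal(10, 0) to be represented exactly, so a tightest common type exists only when the decimal side has at least 10 digits of integral range.

```python
INT_DIGITS = 10  # digits needed to hold any 32-bit int exactly

def tightest_with_int(precision, scale):
    # Return the decimal type itself when it can losslessly hold any int.
    if precision - scale >= INT_DIGITS:
        return ("decimal", precision, scale)
    return None

wide   = tightest_with_int(13, 0)   # decimal(13, 0) can hold any int
narrow = tightest_with_int(9, 0)    # None: ints can overflow decimal(9, 0)
```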

@cloud-fan
Copy link
Contributor

Since it's a little risky to backport #22448 , I'd like to merge this PR without #22448 . Can you fix the conflict? thanks!

Contributor

Can we remove this example, or pick a different one that works? Basically it should work once we fix the bug with #22448.

Contributor Author

<b>false</b>
</th>
<th>
<b>In Spark 2.4, both left and right parameters are promoted to array(double) and double type respectively.</b>
Contributor

remove both.

<b>true</b>
</th>
<th>
<b>AnalysisException is thrown since integer type can not be promoted to string type in a loss-less manner.</b>
Contributor

We can promote int to string, but I'm not sure that's a common behavior in other databases

Contributor

If presto doesn't do it, we should follow it.

Contributor Author

@cloud-fan Yeah, Presto gives an error. Please refer to my earlier comment showing the Presto output. Did you want anything changed in the description?

Contributor

Ah then it's fine, we don't need to change anything here.

)

checkAnswer(
df.selectExpr("array_contains(array(1), 1.23D)"),
Contributor

this query doesn't read any data from df, so the 2 result rows are always the same. Can we use OneRowRelation here?

)

checkAnswer(
df.selectExpr("array_contains(array(array(1)), array(1.23))"),
Contributor

hmm? shouldn't this fail because of the bug?

Contributor Author

@cloud-fan Yes, it should :-) I think I had changed this test case to verify the fix to tightestCommonType and pushed it by mistake. Sorry about that.

@cloud-fan
Contributor

LGTM

@SparkQA

SparkQA commented Sep 20, 2018

Test build #96322 has finished for PR 22408 at commit df5ea47.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 20, 2018

Test build #96329 has finished for PR 22408 at commit d79e9d4.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dilipbiswal
Contributor Author

retest this please

@SparkQA

SparkQA commented Sep 20, 2018

Test build #96333 has finished for PR 22408 at commit d79e9d4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

thanks, merging to master/2.4!

asfgit pushed a commit that referenced this pull request Sep 20, 2018
… when right expression is implicitly down casted

## What changes were proposed in this pull request?
In ArrayContains, we currently cast the right hand side expression to match the element type of the left hand side Array. This may result in down casting and may return wrong result or questionable result.

Example :
```SQL
spark-sql> select array_contains(array(1), 1.34);
true
```
```SQL
spark-sql> select array_contains(array(1), 'foo');
null
```

We should safely coerce both left and right hand side expressions.
## How was this patch tested?
Added tests in DataFrameFunctionsSuite

Closes #22408 from dilipbiswal/SPARK-25417.

Authored-by: Dilip Biswal <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit 67f2c6a)
Signed-off-by: Wenchen Fan <[email protected]>
@asfgit asfgit closed this in 67f2c6a Sep 20, 2018