Conversation

@dilipbiswal
Contributor

What changes were proposed in this pull request?

In ArrayContains, we currently cast the right-hand-side expression to match the element type of the left-hand-side array. This may result in down-casting and may return wrong or questionable results.

Example :

spark-sql> select array_contains(array(1), 1.34);
true
spark-sql> select array_contains(array(1), 'foo');
null

We should safely coerce both left and right hand side expressions.
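The down-cast problem can be sketched with a small Python model (an illustration of the behavior, not Spark's actual implementation):

```python
# Minimal model (not Spark's code) of why down-casting the right-hand
# side to the array element type gives wrong answers.
arr = [1]
needle = 1.34

# Spark 2.3 behavior: cast the needle down to the element type (int) first.
down_cast = int(needle)          # 1.34 becomes 1; precision is silently lost
buggy = down_cast in arr         # True, even though 1.34 is not in the array

# Safe coercion: widen both sides to a common type (double) instead.
safe = needle in [float(x) for x in arr]   # False, as expected
```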

How was this patch tested?

Added tests in DataFrameFunctionsSuite

@SparkQA

SparkQA commented Sep 13, 2018

Test build #96009 has finished for PR 22408 at commit 2c6e8dc.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

nit: can you revert this change?

Contributor Author

@ueshin Sure.. sorry.. didn't notice it.

Member

Why does this fail?

Member

Also could you check the error message as well?

Contributor Author

@ueshin This fails since 1.23 is parsed as decimal(3, 2), and findTightestCommonType cannot find a common type between integer and decimal(3, 2). I specifically added this case to draw it to your attention, to see if we are okay with this behaviour :-). If we had it as 1.23D, we would have been able to find a common type.
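A hypothetical mini-model (not Spark's real implementation) of the findTightestCommonType behavior described here:

```python
# Model of findTightestCommonType for the cases discussed in this thread.
INT, DOUBLE = "int", "double"

def decimal(p, s):
    return ("decimal", p, s)

def find_tightest_common_type(a, b):
    if a == b:
        return a
    if {a, b} == {INT, DOUBLE}:
        return DOUBLE        # int widens losslessly to double
    return None              # e.g. int vs decimal(3, 2): no lossless common type

# 1.23 parses as decimal(3, 2): no tightest common type with int, so analysis fails.
lit_1_23 = find_tightest_common_type(INT, decimal(3, 2))   # None
# 1.23D parses as double: int -> double is lossless, so this resolves.
lit_1_23d = find_tightest_common_type(INT, DOUBLE)         # "double"
```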

Contributor Author

@ueshin Did you want me to check the error message for all the cases, or just a couple of them?

Member

As for the behavior, it might be better to use findWiderTypeWithoutStringPromotion instead of findTightestCommonType.
And as for checking the error message, just those you added.

Contributor Author

@ueshin Thank you. I also thought of using findWiderTypeWithoutStringPromotion but later changed it to findTightestCommonType. One question I had was: can findWiderTypeWithoutStringPromotion do a lossy conversion? From the class description it seems only findTightestCommonType does an absolutely safe cast. Please let me know what you think.

Contributor Author

@ueshin FYI - I just pushed the changes addressing your comments. I have kept findTightestCommonType as is for now. I will change it based on your input and adjust the tests accordingly.

Member

Good point. Yes, it can do a lossy conversion.
It seems like BinaryComparison uses the wider DecimalType, so we could follow that behavior and use findWiderTypeWithoutStringPromotion.
cc @gatorsmile @cloud-fan

Member

Hmm, if we can run array_contains(array(1), '1') in 2.3 as shown in #22408 (comment), should we use findWiderCommonType?

Contributor Author

@ueshin Sure. We could use findWiderCommonType. My thinking was, since we are injecting this cast implicitly, we should pick the safest cast so we don't see data-dependent surprises. Users could always specify an explicit cast and take responsibility for the result :-)

However, I don't have a strong opinion. I will change it to use findWiderCommonType.

@SparkQA

SparkQA commented Sep 13, 2018

Test build #96017 has finished for PR 22408 at commit 4fb4c1a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 13, 2018

Test build #96036 has finished for PR 22408 at commit 04915fd.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 13, 2018

Test build #96031 has finished for PR 22408 at commit 63f6ce5.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dilipbiswal
Contributor Author

retest this please

@SparkQA

SparkQA commented Sep 13, 2018

Test build #96044 has finished for PR 22408 at commit 04915fd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Member

maropu commented Sep 14, 2018

we need to update the migration guide? e.g.,

// this patch
scala> sql("select array_contains(array(1), '1')").show
org.apache.spark.sql.AnalysisException: cannot resolve 'array_contains(array(1), '1')' due to data type mismatch: Input to function array_contains should have been array followed by a value with same element type, but it's [array<int>, string].; line 1 pos 7;
'Project [unresolvedalias(array_contains(array(1), 1), None)]
+- OneRowRelation

// v2.3.1
scala> sql("select array_contains(array(1), '1')").show
+----------------------------------------+
|array_contains(array(1), CAST(1 AS INT))|
+----------------------------------------+
|                                    true|
+----------------------------------------+

@dilipbiswal
Contributor Author

@maropu I thought we added this function in 2.4 :-). Yeah, I will update the migration guide.

@maropu
Member

maropu commented Sep 14, 2018

How about this decimal case?

// v2.3.1
scala> spark.range(10).selectExpr("cast(id AS decimal(9, 0)) as value").selectExpr("array_contains(array(1, 2, 3), value)").show
+--------------------------------------------------+
|array_contains(array(1, 2, 3), CAST(value AS INT))|
+--------------------------------------------------+
|                                             false|
|                                              true|
|                                              true|
|                                              true|
|                                             false|
|                                             false|
|                                             false|
|                                             false|
|                                             false|
|                                             false|
+--------------------------------------------------+

// this patch
scala> spark.range(10).selectExpr("cast(id AS decimal(9, 0)) as value").selectExpr("array_contains(array(1, 2, 3), value)").show
org.apache.spark.sql.AnalysisException: cannot resolve 'array_contains(array(1, 2, 3), `value`)' due to data type mismatch: Input to function array_contains should have been array followed by a value with same element type, but it's [array<int>, decimal(9,0)].; line 1 pos 0;
'Project [unresolvedalias(array_contains(array(1, 2, 3), value#2), Some(<function1>))]
+- Project [cast(id#0L as decimal(9,0)) AS value#2]
   +- Range (0, 10, step=1, splits=Some(4))

@dilipbiswal
Contributor Author

@maropu This is the case we were discussing, for which @ueshin suggested using findWiderTypeWithoutStringPromotion. Let's see what @cloud-fan and @gatorsmile think and we will do accordingly. We are picking a more restrictive cast (since we are injecting it implicitly).

@gatorsmile
Member

What is the corresponding behavior in Presto?

@gatorsmile
Member

My general idea is to avoid risky implicit type casting at the beginning. We can relax it in the future, if needed. After all, users can manually cast the types after seeing the reasonable error message. This should not be a big deal. However, returning a confusing result due to implicit type casting is not good in general.

@dilipbiswal
Contributor Author

@ueshin @gatorsmile Here are the results from Presto. Please let me know if you want me to try any case in particular. One thing to note is that Presto allows comparison between int and decimal. In our findTightestCommonType we don't do that promotion.

presto:default> select contains(array[1,2,3], '1');
Query 20180914_053612_00006_pru6h failed: line 1:8: Unexpected parameters (array(integer), varchar(1)) for function contains. Expected: contains(array(T), T) T:comparable
select contains(array[1,2,3], '1')

presto:default> select contains(array[1,2,3], 'foo');
Query 20180914_053729_00007_pru6h failed: line 1:8: Unexpected parameters (array(integer), varchar(3)) for function contains. Expected: contains(array(T), T) T:comparable
select contains(array[1,2,3], 'foo')

presto:default> select contains(array['1','2','3'], 1);
Query 20180914_053850_00008_pru6h failed: line 1:8: Unexpected parameters (array(varchar(1)), integer) for function contains. Expected: contains(array(T), T) T:comparable
select contains(array['1','2','3'], 1)

presto:default> select contains(array[1,2,3], cast(1.0 as decimal(10,2)));
 _col0 
-------
 true  
(1 row)

presto:default> select contains(array[1,2,3], cast(1.0 as double));
 _col0 
-------
 true  
(1 row)

@cloud-fan
Contributor

what does presto return for select array_contains(array(1), 1.34);?

@dilipbiswal
Contributor Author

dilipbiswal commented Sep 14, 2018

@cloud-fan It returns false.

presto:default> select contains(array[1], 1.34)
             -> ;
 _col0 
-------
 false 
(1 row)

@dilipbiswal
Contributor Author

@cloud-fan For the above case, from the plan, it seems like Presto converts both sides to decimal(12,2)

presto:default> explain select contains(array[1], 1.34);
                                                                                      Query Plan                                                                              
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 - Output[_col0] => [contains:boolean]                                                                                                                                        
         Cost: {rows: 1 (10B), cpu: 10.00, memory: 0.00, network: 0.00}                                                                                                       
         _col0 := contains                                                                                                                                                    
     - Project[] => [contains:boolean]                                                                                                                                        
             Cost: {rows: 1 (10B), cpu: 10.00, memory: 0.00, network: 0.00}                                                                                                   
             contains := "contains"(CAST("$literal$array(integer)"("from_base64"('CQAAAElOVF9BUlJBWQEAAAAAAQAAAA==')) AS array(decimal(12,2))), CAST(DECIMAL '1.34' AS decimal
         - LocalExchange[ROUND_ROBIN] () =>                                                                                                                                   
                 Cost: {rows: 1 (0B), cpu: 0.00, memory: 0.00, network: 0.00}                                                                                                 
             - Values => []                                                                                                                                                   
                     Cost: {rows: 1 (0B), cpu: 0.00, memory: 0.00, network: 0.00}                                                                                             
                     ()     

@cloud-fan
Contributor

LGTM. I think the last piece is the migration guide, to explain what changed from 2.3 to 2.4

@dilipbiswal
Contributor Author

@cloud-fan Thanks a lot. Let me take a stab at the migration guide changes.
cc @ueshin

@dilipbiswal
Contributor Author

@ueshin @cloud-fan @gatorsmile Can you please look at the doc changes when you get a chance? Thanks.

@SparkQA

SparkQA commented Sep 17, 2018

Test build #96126 has finished for PR 22408 at commit d2d8afe.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

Hmm, I think we can convert int to DecimalType(10, 0) without losing precision?

Contributor Author

@dilipbiswal dilipbiswal Sep 17, 2018

@cloud-fan Yes. Actually, I think the logic to handle int and decimal promotion can be improved. I like Presto's behaviour here:
int, decimal(3, 2) => decimal(12, 2)
int, decimal(4, 3) => decimal(13, 3)
int, decimal(11, 3) => decimal(14, 3)

Let me know what you think..
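One common union rule for decimal types is `scale = max(s1, s2)` and `precision = max(p1 - s1, p2 - s2) + scale`, treating int as decimal(10, 0). A hedged sketch (whether Presto uses exactly this rule is an assumption; it reproduces the first two promotions listed above):

```python
# Model of a standard decimal-union rule; not taken from Presto's source.
INT_AS_DECIMAL = (10, 0)  # digits needed to hold any 32-bit int

def widen(a, b):
    (p1, s1), (p2, s2) = a, b
    scale = max(s1, s2)
    return (max(p1 - s1, p2 - s2) + scale, scale)

r1 = widen(INT_AS_DECIMAL, (3, 2))   # (12, 2)
r2 = widen(INT_AS_DECIMAL, (4, 3))   # (13, 3)
```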

Contributor Author

@cloud-fan I have taken a stab at fixing this at https://github.com/dilipbiswal/spark/tree/find_tightest
Could you please take a look? I have run all the tests and they look OK. If you are okay with it, I will open a PR, and once that gets in we can come back to this PR.

Contributor

I left a few comments. Please send a PR, thanks!

Contributor

I think we have a bug in findTightestCommonType. For an int and a decimal, findTightestCommonType can return a value if the decimal's precision is big enough to hold an int, but it can't return a value if the decimal's precision is smaller.
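The asymmetry can be sketched as follows (a model, not Spark's code): an int needs decimal(10, 0) to be represented exactly, so a tightest common type exists only when the decimal side has at least 10 digits of integral range.

```python
INT_DIGITS = 10  # digits needed to hold any 32-bit int exactly

def tightest_with_int(precision, scale):
    # Return the decimal type itself when it can losslessly hold any int.
    if precision - scale >= INT_DIGITS:
        return ("decimal", precision, scale)
    return None

wide   = tightest_with_int(13, 0)   # decimal(13, 0) can hold any int
narrow = tightest_with_int(9, 0)    # None: ints can overflow decimal(9, 0)
```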

@cloud-fan
Copy link
Contributor

Since it's a little risky to backport #22448 , I'd like to merge this PR without #22448 . Can you fix the conflict? thanks!

Contributor

Can we remove this example, or pick a different one that works? Basically it should work once we fix the bug with #22448.

Contributor Author

<b>false</b>
</th>
<th>
<b>In Spark 2.4, both left and right parameters are promoted to array(double) and double type respectively.</b>
Contributor

remove both.

<b>true</b>
</th>
<th>
<b>AnalysisException is thrown since integer type can not be promoted to string type in a loss-less manner.</b>
Contributor

We can promote int to string, but I'm not sure that's a common behavior in other databases

Contributor

If presto doesn't do it, we should follow it.

Contributor Author

@cloud-fan Yeah, Presto gives an error. Please refer to my earlier comment showing the Presto output. Did you want anything changed in the description?

Contributor

Ah then it's fine, we don't need to change anything here.

)

checkAnswer(
df.selectExpr("array_contains(array(1), 1.23D)"),
Contributor

this query doesn't read any data from df, so the 2 result rows are always the same. Can we use OneRowRelation here?

)

checkAnswer(
df.selectExpr("array_contains(array(array(1)), array(1.23))"),
Contributor

hmm? shouldn't this fail because of the bug?

Contributor Author

@cloud-fan Yes, it should :-) I think I had changed this test case to verify the fix to tightestCommonType and pushed it by mistake. Sorry about that.

@cloud-fan
Contributor

LGTM

@SparkQA

SparkQA commented Sep 20, 2018

Test build #96322 has finished for PR 22408 at commit df5ea47.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 20, 2018

Test build #96329 has finished for PR 22408 at commit d79e9d4.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dilipbiswal
Contributor Author

retest this please

@SparkQA

SparkQA commented Sep 20, 2018

Test build #96333 has finished for PR 22408 at commit d79e9d4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

thanks, merging to master/2.4!

asfgit pushed a commit that referenced this pull request Sep 20, 2018
… when right expression is implicitly down casted

## What changes were proposed in this pull request?
In ArrayContains, we currently cast the right hand side expression to match the element type of the left hand side Array. This may result in down casting and may return wrong result or questionable result.

Example :
```SQL
spark-sql> select array_contains(array(1), 1.34);
true
```
```SQL
spark-sql> select array_contains(array(1), 'foo');
null
```

We should safely coerce both left and right hand side expressions.
## How was this patch tested?
Added tests in DataFrameFunctionsSuite

Closes #22408 from dilipbiswal/SPARK-25417.

Authored-by: Dilip Biswal <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit 67f2c6a)
Signed-off-by: Wenchen Fan <[email protected]>
@asfgit asfgit closed this in 67f2c6a Sep 20, 2018