[SPARK-25417][SQL] Improve findTightestCommonType to coerce Integral and decimal types #22448
Conversation
cc @cloud-fan
@cloud-fan Would we need a guard here to be safe? Something like:
case (t1: IntegralType, t2: DecimalType) if findWiderDecimalType(DecimalType.forType(t1), t2).isDefined =>
I don't think so. We don't need to handle integral and decimal again after it.
@cloud-fan ok... thanks !!
Test build #96169 has finished for PR 22448 at commit
retest this please.
Is there a reference for this implementation? I'm worried about corner cases like negative scale.
@cloud-fan Actually the bounded version is in DecimalPrecision.widerDecimalType. That's the function I looked at as a reference.
ok. Then can we add some more tests with negative scale?
@cloud-fan Added tests with negative scale. Thanks!
I think it's a bug fix instead of an improvement.
I'm just wondering whether we should care about the case like
@ueshin Can you please explain a bit?
Please correct me on this one. I think for normal queries like
ueshin left a comment
@dilipbiswal I was thinking whether or not we should handle the case like widenTest(DecimalType(3, 2), DecimalType(5, 1), Some(DecimalType(...))), which is currently None?
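For reference, under the range/scale formula used in this PR, that pair would widen losslessly to DecimalType(6, 2). A sketch of the arithmetic in plain Scala (illustrative only, not the Spark classes):

```scala
// Worked example of the widening formula from this PR for
// DecimalType(3, 2) vs DecimalType(5, 1). Plain Scala, no Spark classes.
val (p1, s1) = (3, 2) // can hold values shaped like x.xx
val (p2, s2) = (5, 1) // can hold values shaped like xxxx.x

val scale = math.max(s1, s2)           // keep all fractional digits: 2
val range = math.max(p1 - s1, p2 - s2) // keep all integral digits: max(1, 4) = 4

// Holding both losslessly needs 4 integral + 2 fractional digits.
println(s"DecimalType(${range + scale}, $scale)") // prints DecimalType(6,2)
```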
nit: revert this?
Will do.
Thank you!! I think we should. Let's get the integral/decimal thing right first and then take on the (decimal, decimal) case. We never handled (decimal, decimal) in findTightestCommonType .. hopefully there are no repercussions :-)
super nit: unnecessary space in DecimalType.MAX_SCALE )
Is scale <= DecimalType.MAX_SCALE necessary? If d1 and d2 are valid, their scales already do not exceed DecimalType.MAX_SCALE, so the same bound holds for max(d1.scale, d2.scale).
@MaxGekk You are right. I was not sure if we could come here with an invalid decimal, i.e. scale > MAX_SCALE. Basically I looked at the bound method, which does a min(scale, MAX_SCALE), and modelled it like that here to be defensive.
Test build #96170 has finished for PR 22448 at commit

Test build #96195 has finished for PR 22448 at commit
Force-pushed from 8104b1f to 0de3328
val range = max(d1.precision - d1.scale, d2.precision - d2.scale)
// Check the resultant decimal type does not exceed the allowable limits.
if (range + scale <= DecimalType.MAX_PRECISION && scale <= DecimalType.MAX_SCALE) {
Do we need scale <= DecimalType.MAX_SCALE? Hasn't DecimalType.scale already been validated?
@maropu OK, I will remove this check.
Test build #96204 has finished for PR 22448 at commit

Test build #96222 has finished for PR 22448 at commit
retest this please
Just in case, there is similar code in spark/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVInferSchema.scala (lines 210 to 218 at commit 5264164).
Maybe it makes sense to move checking to a common place like
Test build #96232 has finished for PR 22448 at commit
|
|
@MaxGekk Thank you. I was looking at CSVInferSchema. It seems like there is a copy of
It is better to ask @HyukjinKwon
Definitely it can be done in a separate PR. Please do that if you have time (if not, I can do it).
@cloud-fan does this look okay now?
findWiderDecimalType(t1, DecimalType.forType(t2))
// Promote numeric types to the highest of the two
case (t1: NumericType, t2: NumericType)
shall we handle 2 decimals as well?
@cloud-fan Yeah. Do you think it may conflict with the DecimalConversion rule in any way? Let me run the tests first and see how it goes.
 * Finds a wider decimal type between the two supplied decimal types without
 * any loss of precision.
 */
def findWiderDecimalType(d1: DecimalType, d2: DecimalType): Option[DecimalType] = {
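Putting the pieces discussed in this thread together, the method presumably reads roughly like the following standalone sketch. DecimalType here is a stand-in for Spark's class, with MAX_PRECISION assumed to be 38 as in Spark; this is a model, not the actual committed code:

```scala
// Model of findWiderDecimalType as discussed in this PR; DecimalType is a
// stand-in case class, maxPrecision plays the role of DecimalType.MAX_PRECISION.
case class DecimalType(precision: Int, scale: Int)
val maxPrecision = 38

def findWiderDecimalType(d1: DecimalType, d2: DecimalType): Option[DecimalType] = {
  // Keep every fractional digit of both inputs...
  val scale = math.max(d1.scale, d2.scale)
  // ...and every integral digit of both inputs.
  val range = math.max(d1.precision - d1.scale, d2.precision - d2.scale)
  // Check the resultant decimal type does not exceed the allowable limits.
  if (range + scale <= maxPrecision) Some(DecimalType(range + scale, scale))
  else None
}

println(findWiderDecimalType(DecimalType(3, 0), DecimalType(10, 2)))
```

Note the result is None (no tightest common type) whenever holding both inputs losslessly would need more than 38 digits.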
I'd like to rename it to findTightestDecimalType, and add documentation to say what the difference is between this and findWiderTypeForDecimal.
@cloud-fan Sure.. will do.
widenTest(DecimalType(2, 1), DecimalType(3, 2), None)
widenTest(DecimalType(2, 1), DoubleType, None)
widenTest(DecimalType(2, 1), IntegerType, None)
widenTest(DecimalType(2, 1), IntegerType, Some(DecimalType(11, 1)))
We should have one and only one positive and negative test case for each integral type (byte, short, int, long), and another positive and negative test case for negative scale with the int type.
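A sketch of what that requested matrix could look like, using a standalone model rather than the real TypeCoercionSuite. The forType precisions below assume Spark's usual exact decimal representations of integral types (byte -> (3, 0), short -> (5, 0), int -> (10, 0), long -> (20, 0)); the widen helper mimics the PR's formula:

```scala
// Standalone sketch of the requested test matrix (not the real Spark suite).
case class DecimalType(precision: Int, scale: Int)
val maxPrecision = 38

// Assumed exact decimal forms of the integral types, per DecimalType.forType.
val forIntegral = Map(
  "byte"  -> DecimalType(3, 0),
  "short" -> DecimalType(5, 0),
  "int"   -> DecimalType(10, 0),
  "long"  -> DecimalType(20, 0))

def widen(d1: DecimalType, d2: DecimalType): Option[DecimalType] = {
  val scale = math.max(d1.scale, d2.scale)
  val range = math.max(d1.precision - d1.scale, d2.precision - d2.scale)
  if (range + scale <= maxPrecision) Some(DecimalType(range + scale, scale)) else None
}

// One positive case per integral type.
assert(widen(forIntegral("byte"),  DecimalType(2, 1)) == Some(DecimalType(4, 1)))
assert(widen(forIntegral("short"), DecimalType(2, 1)) == Some(DecimalType(6, 1)))
assert(widen(forIntegral("int"),   DecimalType(2, 1)) == Some(DecimalType(11, 1)))
assert(widen(forIntegral("long"),  DecimalType(2, 1)) == Some(DecimalType(21, 1)))
// A negative case: int + decimal(38, 38) would need 48 digits.
assert(widen(forIntegral("int"), DecimalType(38, 38)) == None)
// Negative scale with int: one positive and one negative case.
assert(widen(forIntegral("int"), DecimalType(5, -2))  == Some(DecimalType(10, 0)))
assert(widen(forIntegral("int"), DecimalType(38, -1)) == None)
```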
@cloud-fan OK.
Because there was a behaviour change IIRC when I looked into that code before.
Looks like we are able to deduplicate it now.
@HyukjinKwon Thanks for checking it out.
@dilipbiswal, can you file another JIRA instead of SPARK-25417 specifically for type coercion?
@HyukjinKwon OK, will do.
We're closing this PR because it hasn't been updated in a while. If you'd like to revive this PR, please reopen it!
What changes were proposed in this pull request?
Currently findTightestCommonType is not able to coerce between integral and decimal types properly. For example, while trying to find a common type between (int, decimal), it is able to find a common type only when the number of digits to the left of the decimal point of the decimal number is >= 10. This PR enhances the logic to correctly find a wider decimal type between the integral and decimal types. Here are some examples of the result of findTightestCommonType.
How was this patch tested?
Added tests to TypeCoercionSuite.
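To illustrate the fixed behaviour end to end, here is a standalone model of the coercion path this PR adds (simplified stand-ins for Spark's types, with the int -> DecimalType(10, 0) mapping assumed from DecimalType.forType; not the actual committed code):

```scala
// Standalone model of the integral/decimal coercion added in this PR.
sealed trait DataType
case class DecimalType(precision: Int, scale: Int) extends DataType
case object IntegerType extends DataType
val maxPrecision = 38

// Assumed exact decimal form of an int, mirroring DecimalType.forType.
def forType(t: DataType): DecimalType = t match {
  case IntegerType    => DecimalType(10, 0)
  case d: DecimalType => d
}

def findWiderDecimalType(d1: DecimalType, d2: DecimalType): Option[DecimalType] = {
  val scale = math.max(d1.scale, d2.scale)
  val range = math.max(d1.precision - d1.scale, d2.precision - d2.scale)
  if (range + scale <= maxPrecision) Some(DecimalType(range + scale, scale)) else None
}

// With the fix, (int, decimal(2, 1)) coerces to decimal(11, 1):
// 10 integral digits from the int plus the decimal's 1-digit scale.
println(findWiderDecimalType(forType(IntegerType), DecimalType(2, 1)))
```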