
Conversation

@dilipbiswal
Contributor

What changes were proposed in this pull request?

Currently, findTightestCommonType cannot properly coerce between integral and decimal types. For example, when trying to find a common type between (int, decimal), it only succeeds when the decimal has at least 10 digits to the left of the decimal point. This PR enhances the logic to correctly find a wider decimal type between integral and decimal types.

Here are some examples of the result of findTightestCommonType (a sketch of the underlying arithmetic follows the examples):

int, decimal(3, 2) => decimal(12, 2)
int, decimal(4, 3) => decimal(13, 3)
int, decimal(11, 3) => decimal(14, 3)
int, decimal(38, 18) => decimal(38, 18)
int, decimal(38, 29) => None 
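
For reference, here is a minimal sketch of the widening arithmetic behind these results (an illustration, not the patch itself; the names Dec, forIntegral and widerDecimal are made up for the example):

object DecimalWideningSketch {
  // The integral side is modelled as a decimal with zero scale; the result
  // keeps the larger scale and the larger count of integer digits. None means
  // no decimal within the 38-digit limit can hold both inputs exactly.
  case class Dec(precision: Int, scale: Int)

  val MaxPrecision = 38

  // decimal shapes that can hold each integral type exactly:
  // byte -> Dec(3, 0), short -> Dec(5, 0), int -> Dec(10, 0), long -> Dec(20, 0)
  def forIntegral(bits: Int): Dec = bits match {
    case 8  => Dec(3, 0)
    case 16 => Dec(5, 0)
    case 32 => Dec(10, 0)
    case 64 => Dec(20, 0)
  }

  def widerDecimal(d1: Dec, d2: Dec): Option[Dec] = {
    val scale = math.max(d1.scale, d2.scale)
    val integerDigits = math.max(d1.precision - d1.scale, d2.precision - d2.scale)
    if (integerDigits + scale <= MaxPrecision) Some(Dec(integerDigits + scale, scale)) else None
  }

  // widerDecimal(forIntegral(32), Dec(3, 2))   == Some(Dec(12, 2))
  // widerDecimal(forIntegral(32), Dec(38, 29)) == None   (10 + 29 > 38)
}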

How was this patch tested?

Added tests to TypeCoercionSuite.

@dilipbiswal
Contributor Author

cc @cloud-fan

Contributor Author

@cloud-fan Would we need a guard here to be safe? Something like:
case (t1: IntegralType, t2: DecimalType) if findWiderDecimalType(DecimalType.forType(t1), t2).isDefined =>

Contributor

I don't think so. We don't need to handle the (integral, decimal) combination again after this case.

Contributor Author

@cloud-fan OK... thanks!!

@SparkQA

SparkQA commented Sep 18, 2018

Test build #96169 has finished for PR 22448 at commit 8946034.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dilipbiswal
Contributor Author

retest this please.

Contributor

Is there a reference for this implementation? I'm worried about corner cases like negative scale.

Contributor Author

@cloud-fan Actually, the bounded version is in DecimalPrecision::widerDecimalType. That's the function I looked at as a reference.
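
For readers following the thread, the bounded variant referred to here uses the same range/scale arithmetic but clamps to the system limits instead of returning None. A rough sketch (not a copy of the Spark source):

import org.apache.spark.sql.types.DecimalType

// Bounded widening in the style of DecimalPrecision.widerDecimalType: keep the
// larger scale and the larger count of integer digits, then clamp the result
// to the maximum precision/scale rather than rejecting it.
def boundedWiderDecimalType(d1: DecimalType, d2: DecimalType): DecimalType = {
  val scale = math.max(d1.scale, d2.scale)
  val integerDigits = math.max(d1.precision - d1.scale, d2.precision - d2.scale)
  DecimalType(
    math.min(integerDigits + scale, DecimalType.MAX_PRECISION),
    math.min(scale, DecimalType.MAX_SCALE))
}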

Contributor

ok. Then can we add some more tests with negative scale?

Contributor Author

@cloud-fan Added tests with negative scale. Thanks!!

@cloud-fan
Contributor

I think it's a bug fix instead of an improvement. findTightestCommonType is used for binary operators and it should be easy to write some end-to-end test cases to verify the bug.

@ueshin
Member

ueshin commented Sep 18, 2018

I'm just wondering whether we should care about cases like decimal(3, 2) vs. decimal(5, 1)?

@dilipbiswal
Contributor Author

@ueshin Can you please explain a bit?

@dilipbiswal
Contributor Author

@cloud-fan

I think it's a bug fix instead of an improvement. findTightestCommonType is used for binary operators and it should be easy to write some end-to-end test cases to verify the bug.

Please correct me on this one. I think for normal queries like select * from foo where c1 = 1.23 (where c1 is an integral type), casts are set up correctly through TypeCoercion::DecimalPrecision. Is that the kind of end-to-end test you had in mind?

Member

@ueshin ueshin left a comment

@dilipbiswal I was thinking whether or not we should handle the case like widenTest(DecimalType(3, 2), DecimalType(5, 1), Some(DecimalType(...))), which is currently None?

Member

nit: revert this?

Contributor Author

Will do.

@dilipbiswal
Contributor Author

@ueshin

I was thinking whether or not we should handle the case like widenTest(DecimalType(3, 2), DecimalType(5, 1), Some(DecimalType(...))), which is currently None?

Thank you!! I think we should. Let's get the (integral, decimal) case right first and then take on (decimal, decimal). We never handled (decimal, decimal) in findTightestCommonType ... hopefully there are no repercussions :-).
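
For what it's worth, under the same range/scale arithmetic sketched after the PR description (an assumption about how a (decimal, decimal) widening would work, not something this patch implements), decimal(3, 2) vs. decimal(5, 1) would widen to decimal(6, 2): scale = max(2, 1) = 2 and integer digits = max(3 - 2, 5 - 1) = 4, giving precision 6.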

Member

super nit: unnecessary space DecimalType.MAX_SCALE )

Member

Is scale <= DecimalType.MAX_SCALE necessary? If d1 and d2 are valid, their scales do not exceed DecimalType.MAX_SCALE, so the same condition holds for max(d1.scale, d2.scale).

Contributor Author

@MaxGekk You are right. I was not sure whether we could get here with an invalid decimal, i.e. scale > MAX_SCALE. Basically, I looked at the bounded method, which does min(scale, MAX_SCALE), and modelled it like that here to be defensive.

@SparkQA

SparkQA commented Sep 18, 2018

Test build #96170 has finished for PR 22448 at commit 8946034.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 18, 2018

Test build #96195 has finished for PR 22448 at commit 8104b1f.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

val range = max(d1.precision - d1.scale, d2.precision - d2.scale)

// Check the resultant decimal type does not exceed the allowable limits.
if (range + scale <= DecimalType.MAX_PRECISION && scale <= DecimalType.MAX_SCALE) {
Member

Do we need scale <= DecimalType.MAX_SCALE? DecimalType.scale has already been validated?

Contributor Author

@maropu OK... I will remove this check.

@SparkQA

SparkQA commented Sep 19, 2018

Test build #96204 has finished for PR 22448 at commit 0de3328.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 19, 2018

Test build #96222 has finished for PR 22448 at commit 2304c7d.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dilipbiswal
Contributor Author

retest this please

@MaxGekk
Member

MaxGekk commented Sep 19, 2018

Just in case, there is similar code in CSVInferSchema (and in JSON probably too):

case (t1: DecimalType, t2: DecimalType) =>
  val scale = math.max(t1.scale, t2.scale)
  val range = math.max(t1.precision - t1.scale, t2.precision - t2.scale)
  if (range + scale > 38) {
    // DecimalType can't support precision > 38
    Some(DoubleType)
  } else {
    Some(DecimalType(range + scale, scale))
  }

Maybe it makes sense to move this check to a common place, like object DecimalType, and reuse the method.
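
A rough shape of that suggestion could be the following (purely illustrative; the helper name, its placement, and the DoubleType fallback shown for the caller are assumptions, not the final design):

import org.apache.spark.sql.types.{DataType, DecimalType, DoubleType}

// Hypothetical shared helper: one place computes the exact wider decimal (or
// None when it would exceed MAX_PRECISION), and each caller picks its own
// fallback, e.g. schema inference falling back to DoubleType.
def widerDecimalTypeExact(t1: DecimalType, t2: DecimalType): Option[DecimalType] = {
  val scale = math.max(t1.scale, t2.scale)
  val integerDigits = math.max(t1.precision - t1.scale, t2.precision - t2.scale)
  if (integerDigits + scale <= DecimalType.MAX_PRECISION) {
    Some(DecimalType(integerDigits + scale, scale))
  } else {
    None
  }
}

// Caller side, in the spirit of the CSV inference code quoted above:
def inferCommonDecimalType(t1: DecimalType, t2: DecimalType): DataType =
  widerDecimalTypeExact(t1, t2).getOrElse(DoubleType)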

@SparkQA

SparkQA commented Sep 19, 2018

Test build #96232 has finished for PR 22448 at commit 2304c7d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dilipbiswal
Contributor Author

dilipbiswal commented Sep 19, 2018

@MaxGekk Thank you. I was looking at CSVInferSchema. It seems like there is a copy of findTightestCommonType in this file? Do you know the reason for it? It seems we may need to refactor to avoid duplicating findTightestCommonType here. Can we take this up in a follow-up? If you think we should only focus on refactoring findWiderDecimalType and not worry about findTightestCommonType, please let me know and I can do it in this PR.

@MaxGekk
Member

MaxGekk commented Sep 19, 2018

Do you know the reason for it?

It is better to ask @HyukjinKwon

Seems like we may need to refactor to see if we can avoid duplicating findTightestCommonType here ? Can we take this in a follow-up ?

It can definitely be done in a separate PR. Please do that if you have time (if not, I can do it).

@dilipbiswal
Contributor Author

@cloud-fan Does this look okay now?

findWiderDecimalType(t1, DecimalType.forType(t2))

// Promote numeric types to the highest of the two
case (t1: NumericType, t2: NumericType)
Contributor

shall we handle 2 decimals as well?

Contributor Author

@cloud-fan Yeah. Do you think it may conflict with the DecimalConversion rule in any way? Let me run the tests first and see how it goes.

* Finds a wider decimal type between the two supplied decimal types without
* any loss of precision.
*/
def findWiderDecimalType(d1: DecimalType, d2: DecimalType): Option[DecimalType] = {
Contributor

I'd like to rename it to findTightestDecimalType, and add documentation to explain the difference between this and findWiderTypeForDecimal.
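
A possible wording for that documentation (a sketch only; the description of findWiderTypeForDecimal is from memory and should be checked against the source):

/**
 * Finds the tightest decimal type that can hold both inputs without any loss
 * of precision, returning None when no such type fits within the system
 * limits. This differs from findWiderTypeForDecimal, which always produces a
 * result but may lose precision by bounding the decimal to the maximum
 * precision/scale (or by widening to a floating-point type).
 */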

Contributor Author

@cloud-fan Sure.. will do.

widenTest(DecimalType(2, 1), DecimalType(3, 2), None)
widenTest(DecimalType(2, 1), DoubleType, None)
widenTest(DecimalType(2, 1), IntegerType, None)
widenTest(DecimalType(2, 1), IntegerType, Some(DecimalType(11, 1)))
Contributor

We should have one and only one positive and one negative test case for each integral type (byte, short, int, long), and another positive and negative test case for negative scale with the int type.
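
Concretely, the suggested coverage might look something like this (values are illustrative only; the expected results follow the widening arithmetic from the PR description, assuming byte/short/int/long map to decimal(3,0)/(5,0)/(10,0)/(20,0), and would need to be verified against the implementation):

// one positive and one negative case per integral type
widenTest(ByteType, DecimalType(2, 1), Some(DecimalType(4, 1)))
widenTest(ByteType, DecimalType(38, 36), None)
widenTest(ShortType, DecimalType(2, 1), Some(DecimalType(6, 1)))
widenTest(ShortType, DecimalType(38, 35), None)
widenTest(IntegerType, DecimalType(2, 1), Some(DecimalType(11, 1)))
widenTest(IntegerType, DecimalType(38, 30), None)
widenTest(LongType, DecimalType(2, 1), Some(DecimalType(21, 1)))
widenTest(LongType, DecimalType(38, 20), None)
// negative scale with int
widenTest(IntegerType, DecimalType(5, -2), Some(DecimalType(10, 0)))
widenTest(IntegerType, DecimalType(38, -5), None)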


@HyukjinKwon
Member

Do you know the reason for it?

Because there was a behaviour change IIRC when I looked into that code before.

@HyukjinKwon
Member

Looks like we are able to deduplicate it now.

@dilipbiswal
Contributor Author

@HyukjinKwon Thanks for checking it out.

@HyukjinKwon
Member

@dilipbiswal, can you file another JIRA instead of SPARK-25417 specifically for type coercion?

@dilipbiswal
Contributor Author

@HyukjinKwon OK.. will do.

@github-actions

github-actions bot commented Jan 6, 2020

We're closing this PR because it hasn't been updated in a while.
This isn't a judgement on the merit of the PR in any way. It's just
a way of keeping the PR queue manageable.

If you'd like to revive this PR, please reopen it!

@github-actions github-actions bot added the Stale label Jan 6, 2020
@github-actions github-actions bot closed this Jan 7, 2020