[SPARK-25417][SQL] Improve findTightestCommonType to coerce Integral and decimal types #22448
Conversation
cc @cloud-fan
@cloud-fan Would we need a guard here to be safe? Something like:
case (t1: IntegralType, t2: DecimalType) if findWiderDecimalType(DecimalType.forType(t1), t2).isDefined =>
I don't think so. We don't need to handle integral and decimal again after it.
@cloud-fan ok... thanks !!
Test build #96169 has finished for PR 22448 at commit
retest this please.
Is there a reference for this implementation? I'm worried about corner cases like negative scale.
@cloud-fan Actually the bounded version is in DecimalPrecision.widerDecimalType. That's the function I looked at as a reference.
ok. Then can we add some more tests with negative scale?
@cloud-fan Added tests with negative scale. Thanks!
I think it's a bug fix instead of an improvement.
I'm just wondering whether we should care about the case like
@ueshin Can you please explain a bit?
Please correct me on this one. I think for normal queries like
ueshin left a comment
@dilipbiswal I was thinking whether or not we should handle the case like widenTest(DecimalType(3, 2), DecimalType(5, 1), Some(DecimalType(...))), which is currently None?
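For reference, under the range/scale formula used in this PR, that pair would widen losslessly to DecimalType(6, 2). A sketch of the arithmetic in plain Scala (illustrative only, not the Spark classes):

```scala
// Worked example of the widening formula from this PR for
// DecimalType(3, 2) vs DecimalType(5, 1). Plain Scala, no Spark classes.
val (p1, s1) = (3, 2) // can hold values shaped like x.xx
val (p2, s2) = (5, 1) // can hold values shaped like xxxx.x

val scale = math.max(s1, s2)           // keep all fractional digits: 2
val range = math.max(p1 - s1, p2 - s2) // keep all integral digits: max(1, 4) = 4

// Holding both losslessly needs 4 integral + 2 fractional digits.
println(s"DecimalType(${range + scale}, $scale)") // prints DecimalType(6,2)
```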
nit: revert this?
Will do.
Thank you!! I think we should. Let's get the integral/decimal thing right first and then take on the (decimal, decimal) case. We never handled (decimal, decimal) in findTightestCommonType .. hopefully there are no repercussions :-)
super nit: unnecessary space in DecimalType.MAX_SCALE )
Is scale <= DecimalType.MAX_SCALE necessary? If d1 and d2 are valid, their scales already do not exceed DecimalType.MAX_SCALE, so the same bound holds for max(d1.scale, d2.scale).
@MaxGekk You are right. I was not sure if we could come here with an invalid decimal, i.e. scale > MAX_SCALE. Basically I looked at the bound method, which does a min(scale, MAX_SCALE), and modelled it like that here to be defensive.
Test build #96170 has finished for PR 22448 at commit

Test build #96195 has finished for PR 22448 at commit
Force-pushed from 8104b1f to 0de3328
val range = max(d1.precision - d1.scale, d2.precision - d2.scale)
// Check the resultant decimal type does not exceed the allowable limits.
if (range + scale <= DecimalType.MAX_PRECISION && scale <= DecimalType.MAX_SCALE) {
Do we need scale <= DecimalType.MAX_SCALE? Hasn't DecimalType.scale already been validated?
@maropu OK, I will remove this check.
Test build #96204 has finished for PR 22448 at commit

Test build #96222 has finished for PR 22448 at commit
retest this please
Just in case, there is similar code in spark/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVInferSchema.scala (lines 210 to 218 at commit 5264164).
Maybe it makes sense to move checking to a common place like
Test build #96232 has finished for PR 22448 at commit
|
|
@MaxGekk Thank you. I was looking at CSVInferSchema. It seems like there is a copy of
It is better to ask @HyukjinKwon
Definitely it can be done in a separate PR. Please do that if you have time (if not, I can do it).
@cloud-fan does this look okay now?
findWiderDecimalType(t1, DecimalType.forType(t2))
// Promote numeric types to the highest of the two
case (t1: NumericType, t2: NumericType)
shall we handle 2 decimals as well?
@cloud-fan Yeah. Do you think it may conflict with the DecimalConversion rule in any way? Let me run the tests first and see how it goes.
 * Finds a wider decimal type between the two supplied decimal types without
 * any loss of precision.
 */
def findWiderDecimalType(d1: DecimalType, d2: DecimalType): Option[DecimalType] = {
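Putting the pieces discussed in this thread together, the method presumably reads roughly like the following standalone sketch. DecimalType here is a stand-in for Spark's class, with MAX_PRECISION assumed to be 38 as in Spark; this is a model, not the actual committed code:

```scala
// Model of findWiderDecimalType as discussed in this PR; DecimalType is a
// stand-in case class, maxPrecision plays the role of DecimalType.MAX_PRECISION.
case class DecimalType(precision: Int, scale: Int)
val maxPrecision = 38

def findWiderDecimalType(d1: DecimalType, d2: DecimalType): Option[DecimalType] = {
  // Keep every fractional digit of both inputs...
  val scale = math.max(d1.scale, d2.scale)
  // ...and every integral digit of both inputs.
  val range = math.max(d1.precision - d1.scale, d2.precision - d2.scale)
  // Check the resultant decimal type does not exceed the allowable limits.
  if (range + scale <= maxPrecision) Some(DecimalType(range + scale, scale))
  else None
}

println(findWiderDecimalType(DecimalType(3, 0), DecimalType(10, 2)))
```

Note the result is None (no tightest common type) whenever holding both inputs losslessly would need more than 38 digits.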
I'd like to rename it to findTightestDecimalType, and add documentation to say what the difference is between this and findWiderTypeForDecimal.
@cloud-fan Sure.. will do.
widenTest(DecimalType(2, 1), DecimalType(3, 2), None)
widenTest(DecimalType(2, 1), DoubleType, None)
widenTest(DecimalType(2, 1), IntegerType, None)
widenTest(DecimalType(2, 1), IntegerType, Some(DecimalType(11, 1)))
We should have one and only one positive and negative test case for each integral type (byte, short, int, long), and another positive and negative test case for negative scale with the int type.
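A sketch of what that requested matrix could look like, using a standalone model rather than the real TypeCoercionSuite. The forType precisions below assume Spark's usual exact decimal representations of integral types (byte -> (3, 0), short -> (5, 0), int -> (10, 0), long -> (20, 0)); the widen helper mimics the PR's formula:

```scala
// Standalone sketch of the requested test matrix (not the real Spark suite).
case class DecimalType(precision: Int, scale: Int)
val maxPrecision = 38

// Assumed exact decimal forms of the integral types, per DecimalType.forType.
val forIntegral = Map(
  "byte"  -> DecimalType(3, 0),
  "short" -> DecimalType(5, 0),
  "int"   -> DecimalType(10, 0),
  "long"  -> DecimalType(20, 0))

def widen(d1: DecimalType, d2: DecimalType): Option[DecimalType] = {
  val scale = math.max(d1.scale, d2.scale)
  val range = math.max(d1.precision - d1.scale, d2.precision - d2.scale)
  if (range + scale <= maxPrecision) Some(DecimalType(range + scale, scale)) else None
}

// One positive case per integral type.
assert(widen(forIntegral("byte"),  DecimalType(2, 1)) == Some(DecimalType(4, 1)))
assert(widen(forIntegral("short"), DecimalType(2, 1)) == Some(DecimalType(6, 1)))
assert(widen(forIntegral("int"),   DecimalType(2, 1)) == Some(DecimalType(11, 1)))
assert(widen(forIntegral("long"),  DecimalType(2, 1)) == Some(DecimalType(21, 1)))
// A negative case: int + decimal(38, 38) would need 48 digits.
assert(widen(forIntegral("int"), DecimalType(38, 38)) == None)
// Negative scale with int: one positive and one negative case.
assert(widen(forIntegral("int"), DecimalType(5, -2))  == Some(DecimalType(10, 0)))
assert(widen(forIntegral("int"), DecimalType(38, -1)) == None)
```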
@cloud-fan OK.
Because there was a behaviour change IIRC when I looked into that code before.
Looks like we are able to deduplicate it now.
@HyukjinKwon Thanks for checking it out.
@dilipbiswal, can you file another JIRA instead of SPARK-25417 specifically for type coercion?
@HyukjinKwon OK, will do.
We're closing this PR because it hasn't been updated in a while. If you'd like to revive this PR, please reopen it!
What changes were proposed in this pull request?
Currently findTightestCommonType is not able to coerce between integral and decimal types properly. For example, while trying to find a common type between (int, decimal), it is able to find a common type only when the number of digits to the left of the decimal point of the decimal number is >= 10. This PR enhances the logic to correctly find a wider decimal type between the integral and decimal types. Here are some examples of the result of findTightestCommonType.
How was this patch tested?
Added tests to TypeCoercionSuite.
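To illustrate the fixed behaviour end to end, here is a standalone model of the coercion path this PR adds (simplified stand-ins for Spark's types, with the int -> DecimalType(10, 0) mapping assumed from DecimalType.forType; not the actual committed code):

```scala
// Standalone model of the integral/decimal coercion added in this PR.
sealed trait DataType
case class DecimalType(precision: Int, scale: Int) extends DataType
case object IntegerType extends DataType
val maxPrecision = 38

// Assumed exact decimal form of an int, mirroring DecimalType.forType.
def forType(t: DataType): DecimalType = t match {
  case IntegerType    => DecimalType(10, 0)
  case d: DecimalType => d
}

def findWiderDecimalType(d1: DecimalType, d2: DecimalType): Option[DecimalType] = {
  val scale = math.max(d1.scale, d2.scale)
  val range = math.max(d1.precision - d1.scale, d2.precision - d2.scale)
  if (range + scale <= maxPrecision) Some(DecimalType(range + scale, scale)) else None
}

// With the fix, (int, decimal(2, 1)) coerces to decimal(11, 1):
// 10 integral digits from the int plus the decimal's 1-digit scale.
println(findWiderDecimalType(forType(IntegerType), DecimalType(2, 1)))
```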