[HUDI-6033] Fix rounding exception when to decimal casting #8380
Conversation
@xiarixiaoyao Can you please help to review this? Thank you!
@voonhous
For conversions that lose accuracy, this really depends on the data that's in the table. I do not think a check on all existing data is feasible, i.e. checking all existing data to ensure that there is no loss in accuracy, and throwing an exception if there is one to prevent it. IIUC, the only time we can perform this check is when we are doing the ALTER TABLE DDL, as a DECIMAL -> float/double schema change (reversing whatever is done) is not allowed. As such, I believe we will need to document this potential loss in accuracy to remind users of such a behaviour.
doc +1
@hudi-bot run azure
@hudi-bot run azure
@hudi-bot run azure
@danny0405 Fixed the CI here; I forgot to remove a test case that I was using for debugging. Can we merge this in soon? Thank you! After this is fixed, we will need to align the related changes as well; I will raise another PR for it.
Let's give a summary about the changes:
float -> decimal: HALF_EVEN
double/string -> decimal: HALF_UP
Can you show us a code snippet of how Spark does this then? Is the Spark behavior a standard that we need to follow?
This is the code that I referred to. In general, the default rounding mode is HALF_UP. On top of that, I did a simple verification for fixed types, which I included in the JIRA ticket. I'll paste the results over here too:
-- test HALF_UP rounding (verify that it does not use HALF_EVEN)
> SELECT CAST(CAST("10.024" AS DOUBLE) AS DECIMAL(4, 2));
10.02
> SELECT CAST(CAST("10.025" AS DOUBLE) AS DECIMAL(4, 2));
10.03
> SELECT CAST(CAST("10.026" AS DOUBLE) AS DECIMAL(4, 2));
10.03
-- test negative HALF_UP rounding (verify that it does not use HALF_EVEN)
> SELECT CAST(CAST("-10.024" AS DOUBLE) AS DECIMAL(4, 2));
-10.02
> SELECT CAST(CAST("-10.025" AS DOUBLE) AS DECIMAL(4, 2));
-10.03
> SELECT CAST(CAST("-10.026" AS DOUBLE) AS DECIMAL(4, 2));
-10.03
-- test HALF_UP rounding (will return same result as HALF_EVEN)
> SELECT CAST(CAST("10.034" AS DOUBLE) AS DECIMAL(4, 2));
10.03
> SELECT CAST(CAST("10.035" AS DOUBLE) AS DECIMAL(4, 2));
10.04
> SELECT CAST(CAST("10.036" AS DOUBLE) AS DECIMAL(4, 2));
10.04
-- test negative HALF_UP rounding (will return same result as HALF_EVEN)
> SELECT CAST(CAST("-10.034" AS DOUBLE) AS DECIMAL(4, 2));
-10.03
> SELECT CAST(CAST("-10.035" AS DOUBLE) AS DECIMAL(4, 2));
-10.04
> SELECT CAST(CAST("-10.036" AS DOUBLE) AS DECIMAL(4, 2));
-10.04
As can be seen from the results I pasted above, general casting rules where scale is lost will use a rounding mode of HALF_UP.
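For reference, here is a minimal Java sketch (my own illustration, not the Hudi or Spark code itself) that reproduces the same behaviour with java.math.BigDecimal. The sample values are the ones from the queries above; BigDecimal.valueOf is assumed here as a stand-in for how an engine obtains the decimal form of a double:

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

public class RoundingModeCheck {
  public static void main(String[] args) {
    double[] samples = {10.024, 10.025, 10.026, -10.024, -10.025, -10.026};
    for (double d : samples) {
      // BigDecimal.valueOf uses Double.toString, i.e. the shortest decimal
      // representation of the double (e.g. "10.025"), not its full binary expansion.
      BigDecimal bd = BigDecimal.valueOf(d);
      System.out.printf("%s  HALF_UP=%s  HALF_EVEN=%s%n",
          bd,
          bd.setScale(2, RoundingMode.HALF_UP),    // 10.025 -> 10.03
          bd.setScale(2, RoundingMode.HALF_EVEN)); // 10.025 -> 10.02
    }
  }
}
```

The two modes only differ on values that sit exactly halfway at the rounding position; everything else rounds identically.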
It seems that Spark always uses the HALF_UP rounding mode; the HALF_EVEN mode is only used in tests. So the main reason we use HALF_EVEN for the float -> decimal conversion is to avoid errors being thrown, right?
HALF_EVEN is used OUTSIDE of tests, for example in the line of code that I am modifying. That line was introduced here: #7769. We'll have to ask @alexeykudinkin as to why HALF_EVEN was used; I am not sure why... But if I were to hazard a guess, it is to avoid bias: values that are exactly halfway are always rounded away from zero if HALF_UP is used, which shifts the average value of the set of numerics. If HALF_EVEN is used, roughly half of those halfway values round up and half round down to the nearest even digit, so the original average value does not shift as much. Not really sure... Just throwing my assumptions out there.
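To make the bias argument concrete, a small Java example (my own illustration, not code from either project): rounding a run of exact .x5 values to one decimal place with HALF_UP always moves away from zero, while HALF_EVEN alternates, so the sum (and hence the average) drifts less:

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

public class RoundingBiasDemo {
  public static void main(String[] args) {
    // Values that sit exactly halfway at the rounding position.
    String[] halfway = {"0.05", "0.15", "0.25", "0.35", "0.45"};
    BigDecimal sumUp = BigDecimal.ZERO;
    BigDecimal sumEven = BigDecimal.ZERO;
    for (String s : halfway) {
      BigDecimal v = new BigDecimal(s);
      sumUp = sumUp.add(v.setScale(1, RoundingMode.HALF_UP));       // 0.1, 0.2, 0.3, 0.4, 0.5
      sumEven = sumEven.add(v.setScale(1, RoundingMode.HALF_EVEN)); // 0.0, 0.2, 0.2, 0.4, 0.4
    }
    // Exact sum is 1.25; HALF_UP gives 1.5, HALF_EVEN gives 1.2.
    System.out.println("HALF_UP sum = " + sumUp + ", HALF_EVEN sum = " + sumEven);
  }
}
```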
I'm talking about Apache Spark, not Hudi.
Not very sure about this, but I have yet to encounter a use-case where anything other than HALF_UP was used.
Force-pushed from f8183f5 to 09c2c89
 * <ul>
 * <li>double/string -> decimal: HALF_UP</li>
 * <li>float -> decimal: HALF_EVEN</li>
 * </ul>
I checked the code of some computing engines.
Spark always uses HALF_UP to cast FractionalType (float/double) to DecimalType.
Why do we use HALF_EVEN to cast float -> decimal?
BTW, Hive/Presto also use HALF_UP.
@xiarixiaoyao The original code that I am modifying uses HALF_EVEN.
According to the git-blame, the HALF_EVEN rounding mode was introduced by @alexeykudinkin.
This line of code was introduced here:
#7769
I think we should unify all the rounding modes to HALF_UP to avoid ambiguity. Another confusion is #7769: all the numeric and string types use HALF_EVEN there, so why does this patch only fix the DOUBLE and STRING types? The strategies are out of sync.
My initial consideration for only applying this to DOUBLE and STRING was that these are exact representations of fractional numbers,
i.e. 1.5 = 1.5.
For any other types, like float, these are not exact,
i.e. 1.5 is usually represented as something like 1.49918237.
As such, given that HALF_EVEN was introduced in #7769, I wanted to limit the change to the exact types.
Since we want to standardize all rounding modes to HALF_UP, this method is no longer required; I will standardise everything to HALF_UP.
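As an aside, a quick Java illustration of the float point being made here (my own example, not from the PR; it uses 1.1f because that value is not exactly representable in binary): widening a float to double before building a BigDecimal exposes the binary representation error, whereas going through the string form does not:

```java
import java.math.BigDecimal;

public class FloatExactnessDemo {
  public static void main(String[] args) {
    float f = 1.1f;
    // Built from the binary value: shows the representation error of the float.
    System.out.println(new BigDecimal((double) f));         // 1.10000002384185791015625
    // Built from the decimal string form: exactly 1.1.
    System.out.println(new BigDecimal(Float.toString(f)));  // 1.1
  }
}
```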
Done, I've standardized the RoundingMode here: 8bfaea8
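For readers following along, a hypothetical sketch of what the standardised conversion looks like conceptually; the helper name toFixedDecimal and its overflow handling are made up for illustration and are not the actual Hudi method:

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

public final class DecimalCastSketch {
  // Hypothetical helper: coerce a numeric/string value into DECIMAL(precision, scale),
  // always rounding with HALF_UP as standardised in this PR.
  static BigDecimal toFixedDecimal(Object value, int precision, int scale) {
    BigDecimal bd = (value instanceof BigDecimal)
        ? (BigDecimal) value
        : new BigDecimal(value.toString());
    BigDecimal rounded = bd.setScale(scale, RoundingMode.HALF_UP);
    if (rounded.precision() > precision) {
      // Value does not fit DECIMAL(precision, scale) even after rounding.
      throw new ArithmeticException(value + " overflows DECIMAL(" + precision + "," + scale + ")");
    }
    return rounded;
  }

  public static void main(String[] args) {
    System.out.println(toFixedDecimal(10.025d, 4, 2));   // 10.03
    System.out.println(toFixedDecimal("10.026", 4, 2));  // 10.03
    System.out.println(toFixedDecimal(-10.025d, 4, 2));  // -10.03
  }
}
```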
@danny0405 Can you please help to review this PR again? I resolved the merge conflicts caused by changes in the javadoc.
@hudi-bot run azure
danny0405 left a comment
+1
Standardise the rounding mode to HALF_UP; engines like Spark, Presto and Hive also use HALF_UP as the default rounding mode.
Change Logs
Fix the rounding exception thrown when there is a loss of precision while performing an unsafe XXX -> DECIMAL(p, s) cast that reduces the scale.
For the stacktrace and a detailed explanation of why HALF_EVEN was used, and of the special cases where the HALF_UP rounding scheme was used, please refer to the JIRA ticket here: HUDI-6033
TLDR:
Add a rounding mode to fix the rounding exceptions being thrown (see the sketch below).
float, double -> decimal: HALF_UP
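Assuming the failure is the usual java.math behaviour (a minimal sketch of the failure class, not the actual stacktrace from the ticket): reducing the scale of a BigDecimal without supplying a rounding mode throws ArithmeticException, and passing an explicit mode such as HALF_UP is what avoids it:

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

public class RoundingExceptionDemo {
  public static void main(String[] args) {
    BigDecimal d = BigDecimal.valueOf(10.026);
    try {
      // No rounding mode given: equivalent to RoundingMode.UNNECESSARY,
      // so dropping a digit of scale fails.
      d.setScale(2);
    } catch (ArithmeticException e) {
      System.out.println("Without a rounding mode: " + e.getMessage());
    }
    // With an explicit rounding mode the cast succeeds.
    System.out.println(d.setScale(2, RoundingMode.HALF_UP)); // 10.03
  }
}
```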
Impact
None
Describe any public API or user-facing feature change or any performance impact.
Risk level (write none, low, medium or high below)
None
If medium or high, explain what verification was done to mitigate the risks.
Documentation Update
Describe any necessary documentation update if there is any new feature, config, or user-facing change
If a website update is needed, create a JIRA ticket, attach the ticket number here and follow the instructions to make changes to the website.
Contributor's checklist