
Conversation

@voonhous
Member

@voonhous voonhous commented Apr 4, 2023

Change Logs

Fix the rounding exception thrown when there is a loss in precision while performing an unsafe XXX to DECIMAL(p, s) cast that reduces the scale.

For the stacktrace and a detailed explanation of why HALF_EVEN was used, as well as the special cases where the HALF_UP rounding scheme was used, please refer to the JIRA ticket here: HUDI-6033

TLDR:
Add a rounding mode to fix the rounding exceptions being thrown:

float, double -> decimal: HALF_UP
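As a rough illustration of the failure class being fixed (a minimal sketch using plain java.math.BigDecimal, not the exact Hudi code path): rewriting a value to a smaller scale without an explicit rounding mode throws, while supplying a rounding mode succeeds.

import java.math.{BigDecimal => JBigDecimal, RoundingMode}

val value = JBigDecimal.valueOf(10.025)

// Without a rounding mode, reducing the scale throws:
// value.setScale(2)  // java.lang.ArithmeticException: Rounding necessary

// With an explicit rounding mode the cast succeeds:
val rounded = value.setScale(2, RoundingMode.HALF_UP)  // 10.03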

Impact

None

Describe any public API or user-facing feature change or any performance impact.

Risk level (write none, low, medium or high below)

None

If medium or high, explain what verification was done to mitigate the risks.

Documentation Update

Describe any necessary documentation update if there is any new feature, config, or user-facing change

  • The config description must be updated if new configs are added or the default value of a config is changed
  • Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here, and follow the instructions to make changes to the website.

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@voonhous
Member Author

voonhous commented Apr 4, 2023

@xiarixiaoyao Can you please help to review this?

Thank you!

@danny0405 danny0405 added engine:spark Spark integration area:schema Schema evolution and data types labels Apr 5, 2023
@voonhous voonhous changed the title [HUDI-6033] Fix rounding exception when performing a float to decimal… [HUDI-6033] Fix rounding exception when to decimal casting Apr 5, 2023
@xiarixiaoyao
Contributor

@voonhous
Thanks for your contribution.
Looks good. Only one question: in this scenario where the conversion loses accuracy, should we force the conversion or just throw an exception to remind the user?

@voonhous
Member Author

voonhous commented Apr 6, 2023

@xiarixiaoyao

For conversions that lose accuracy, this really depends on the data that's in the table.

I do not think a check on all existing data is feasible, i.e. checking all existing data to ensure there is no loss in accuracy and, if there is, throwing an exception so that the ALTER TABLE CHANGE COLUMN TYPE DDL fails.

IIUC, the only time we can perform this check is when executing the ALTER TABLE DDL, as a DECIMAL -> float/double schema change (reversing whatever was done) is not allowed.

As such, I believe we will need to document this potential loss in accuracy to remind users of this behaviour.

@xiarixiaoyao
Contributor

@xiarixiaoyao

For conversions that lose accuracy, this really depends on the data that's in the table.

I do not think a check on all existing data is feasible, i.e. checking all existing data to ensure there is no loss in accuracy and, if there is, throwing an exception so that the ALTER TABLE CHANGE COLUMN TYPE DDL fails.

IIUC, the only time we can perform this check is when executing the ALTER TABLE DDL, as a DECIMAL -> float/double schema change (reversing whatever was done) is not allowed.

As such, I believe we will need to document this potential loss in accuracy to remind users of this behaviour.

doc +1

@xiarixiaoyao
Contributor

@hudi-bot run azure

@voonhous
Member Author

voonhous commented Apr 7, 2023

@hudi-bot run azure

@voonhous
Member Author

@hudi-bot run azure

@voonhous
Member Author

voonhous commented Apr 10, 2023

@danny0405 Fixed the CI here; I forgot to remove a test case that I was using for debugging.

Can we merge this in soon? Thank you!

After this is fixed, we will need to align the HoodieSparkRecordMerger to ensure that the correct rounding mode is used, i.e. that the output after rewriting to a new schema produces consistent results across the different record merger implementations.

I will raise another PR for it.

@danny0405
Contributor

Let's give a summary about the changes:

float -> decimal: HALF_EVEN
double -> decimal: HALF_UP

Since Spark uses HALF_UP rounding when performing a FIX_SCALE_TYPE to DECIMAL cast when there is a loss in scale

Can you show us the code snippet how Spark does this then? Is the Spark behavior a standard that we need to follow?

@voonhous
Member Author

Let's give a summary about the changes:

float -> decimal: HALF_EVEN
double -> decimal: HALF_UP

Since Spark uses HALF_UP rounding when performing a FIX_SCALE_TYPE to DECIMAL cast when there is a loss in scale

Can you show us the code snippet how Spark does this then? Is the Spark behavior a standard that we need to follow?

This is the code that I referred to. In general, the default rounding mode is HALF_UP.

https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/types/Decimal.scala

On top of that, I did a simple verification for fixed types, which I included in the JIRA ticket. I'll paste the results here too:

-- test  HALF_UP rounding (verify that it does not use HALF_EVEN)
> SELECT CAST(CAST("10.024" AS DOUBLE) AS DECIMAL(4, 2));
10.02

> SELECT CAST(CAST("10.025" AS DOUBLE) AS DECIMAL(4, 2));
10.03

> SELECT CAST(CAST("10.026" AS DOUBLE) AS DECIMAL(4, 2));
10.03


-- test negative HALF_UP rounding (verify that it does not use HALF_EVEN)
> SELECT CAST(CAST("-10.024" AS DOUBLE) AS DECIMAL(4, 2));
-10.02

> SELECT CAST(CAST("-10.025" AS DOUBLE) AS DECIMAL(4, 2));
-10.03

> SELECT CAST(CAST("-10.026" AS DOUBLE) AS DECIMAL(4, 2));
-10.03


-- test HALF_UP rounding (will return same result as HALF_EVEN)
> SELECT CAST(CAST("10.034" AS DOUBLE) AS DECIMAL(4, 2));
10.03

> SELECT CAST(CAST("10.035" AS DOUBLE) AS DECIMAL(4, 2));
10.04

> SELECT CAST(CAST("10.036" AS DOUBLE) AS DECIMAL(4, 2));
10.04


-- test negative HALF_UP rounding (will return same result as HALF_EVEN)
> SELECT CAST(CAST("-10.034" AS DOUBLE) AS DECIMAL(4, 2));
-10.03

> SELECT CAST(CAST("-10.035" AS DOUBLE) AS DECIMAL(4, 2));
-10.04

> SELECT CAST(CAST("-10.036" AS DOUBLE) AS DECIMAL(4, 2));
-10.04

As can be seen from the results pasted above, general casting rules where scale is lost use a rounding mode of HALF_UP.
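Assuming Spark backs these casts with java.math.BigDecimal, the same behaviour can be reproduced outside of Spark with HALF_UP (a minimal Scala sketch, not Hudi or Spark code):

import java.math.{BigDecimal => JBigDecimal, RoundingMode}

// Mirror the SQL casts above: DOUBLE values rounded to DECIMAL(4, 2).
Seq(10.024, 10.025, 10.026, -10.024, -10.025, -10.026).foreach { d =>
  val rounded = JBigDecimal.valueOf(d).setScale(2, RoundingMode.HALF_UP)
  println(s"$d -> $rounded")
}
// 10.024 -> 10.02, 10.025 -> 10.03, 10.026 -> 10.03
// -10.024 -> -10.02, -10.025 -> -10.03, -10.026 -> -10.03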

@danny0405
Contributor

It seems that Spark always uses the HALF_UP rounding mode; the HALF_EVEN mode is only used in tests. So the main reason we use HALF_EVEN for the float -> decimal conversion is to avoid throwing errors, right?

@voonhous
Member Author

voonhous commented Apr 11, 2023

It seems that Spark always uses the HALF_UP rounding mode; the HALF_EVEN mode is only used in tests. So the main reason we use HALF_EVEN for the float -> decimal conversion is to avoid throwing errors, right?

HALF_EVEN is used OUTSIDE of tests. For example:

https://github.com/apache/hudi/blob/d6ff3d6ba46b51f58ebb1db58f26423a211741f5/hudi-client/hudi-spark-client/src/main/scala/org/apache/spark/sql/HoodieInternalRowUtils.scala

This line of code was introduced here:

#7769

We'll have to ask @alexeykudinkin

As to why HALF_EVEN was used, I am not sure...

But if I were to hazard a guess, it is to avoid bias: with HALF_UP, midpoint values are always rounded up, which would shift the average value of the set of numerics.

If HALF_EVEN were used, midpoint values would be rounded towards the nearest even digit, so roughly equal numbers of them are rounded in the UP and DOWN directions, and the original average value would not shift as much.
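As a hedged illustration of this bias argument (plain BigDecimal, not Hudi code), midpoint values always move away from zero under HALF_UP, while HALF_EVEN alternates towards the even digit, keeping the mean of the rounded values closer to the original mean:

import java.math.{BigDecimal => JBigDecimal, RoundingMode}

Seq("10.015", "10.025", "10.035", "10.045").foreach { s =>
  val v = new JBigDecimal(s)
  val up   = v.setScale(2, RoundingMode.HALF_UP)
  val even = v.setScale(2, RoundingMode.HALF_EVEN)
  println(s"$s -> HALF_UP: $up, HALF_EVEN: $even")
}
// HALF_UP:   10.02, 10.03, 10.04, 10.05
// HALF_EVEN: 10.02, 10.02, 10.04, 10.04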

Not really sure... Just throwing my assumptions out there.

@danny0405
Contributor

HALF_EVEN is used OUTSIDE of tests. For example:

I'm talking about Apache Spark, not Hudi.

@voonhous
Member Author

I'm talking about Apache Spark, not Hudi.

Not very sure about this, but I have yet to encounter a use-case where anything other than HALF_UP was used.

@voonhous voonhous force-pushed the HUDI-6033 branch 2 times, most recently from f8183f5 to 09c2c89 on April 13, 2023 11:32
* <ul>
* <li>double/string -> decimal: HALF_UP</li>
* <li>float -> decimal: HALF_EVEN</li>
* </ul>
Contributor

@xiarixiaoyao xiarixiaoyao Apr 14, 2023


I checked the code of some computing engines.

Spark always uses HALF_UP to cast FractionalType (float/double) to DecimalType, so why do we use HALF_EVEN to cast float -> decimal?
BTW, Hive/Presto also use HALF_UP.

Member Author

@voonhous voonhous Apr 14, 2023


@xiarixiaoyao The original code that I am modifying uses HALF_EVEN.

According to the git-blame, the HALF_EVEN rounding mode was introduced by @alexeykudinkin.

This line of code was introduced here:
#7769

https://github.com/onehouseinc/hudi/blob/24020a964671b35fb9aa7b86748771fd71512495/hudi-client/hudi-spark-client/src/main/scala/org/apache/spark/sql/HoodieInternalRowUtils.scala#L289

Contributor


I think we should unify all the rounding modes to HALF_UP to avoid ambiguity. Another point of confusion is #7769: all the numeric and string types use HALF_EVEN there, so why does this patch only fix the DOUBLE and STRING types? The strategies are out of sync.

Member Author


My initial consideration for only applying this to DOUBLE and STRING was that these are exact representations of the fractional numbers,
i.e. 1.5 = 1.5.

For other types, like float, the representation may not be exact,
i.e. a value written as 1.1 is actually stored as roughly 1.10000002.

As such, given that HALF_EVEN was introduced in #7769, I wanted to limit the change to the exact types.
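To make the exact-vs-inexact point concrete (a small sketch, with 1.1 used as the illustrative value because it has no exact binary representation):

// A float carries less precision than a double, so widening a float and then
// casting to decimal exposes representation error that a double literal does not.
val f: Float = 1.1f
println(f.toDouble)                            // 1.100000023841858
println(new java.math.BigDecimal(f.toDouble))  // full binary expansion of the float value
println(java.math.BigDecimal.valueOf(1.1))     // 1.1 for the double literal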

Since we want to standardize all rounding modes to HALF_UP, this method is no longer required. I will standardise everything to HALF_UP.

Member Author


Done, I've standardized the RoundingMode here 8bfaea8
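For reference, a minimal sketch of what a single HALF_UP conversion path could look like (a hypothetical helper for illustration, not the actual change in 8bfaea8):

import java.math.{BigDecimal => JBigDecimal, RoundingMode}

// Hypothetical helper: route float, double and string sources through one
// HALF_UP path so they all round identically when the target scale is smaller.
def toDecimal(value: Any, scale: Int): JBigDecimal = value match {
  case f: Float  => new JBigDecimal(f.toString).setScale(scale, RoundingMode.HALF_UP)
  case d: Double => JBigDecimal.valueOf(d).setScale(scale, RoundingMode.HALF_UP)
  case s: String => new JBigDecimal(s).setScale(scale, RoundingMode.HALF_UP)
  case other     => throw new IllegalArgumentException(s"Unsupported type: ${other.getClass}")
}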

@voonhous
Member Author

@danny0405 Can you please help review this PR again? I resolved the merge conflicts caused by changes in the javadoc.

@voonhous
Member Author

@hudi-bot run azure

Contributor

@danny0405 danny0405 left a comment


+1

@hudi-bot
Collaborator

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

@danny0405 danny0405 merged commit bdb50dd into apache:master Apr 19, 2023
stayrascal pushed a commit to stayrascal/hudi that referenced this pull request Apr 20, 2023
Standardise rounding mode to HALF_UP; engines like Spark, Presto and Hive also use HALF_UP as the default rounding mode.
yihua pushed a commit to yihua/hudi that referenced this pull request May 15, 2023
Standardise rounding mode to HALF_UP; engines like Spark, Presto and Hive also use HALF_UP as the default rounding mode.
@voonhous voonhous deleted the HUDI-6033 branch September 8, 2023 07:11