
Conversation

@voonhous
Member

@voonhous voonhous commented Apr 4, 2023

Change Logs

Fix the rounding exception thrown when there is a loss in precision while performing an unsafe XXX to DECIMAL(p, s) cast that reduces the scale.

For the stacktrace and a detailed explanation of why HALF_EVEN was used, as well as the special cases where the HALF_UP rounding scheme was used, please refer to the JIRA ticket here: HUDI-6033

TLDR:
Add a rounding mode to fix the rounding exceptions being thrown:

float, double -> decimal: HALF_UP
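As a rough illustration of the failure class being fixed (a minimal sketch using plain java.math.BigDecimal, not the exact Hudi code path): rewriting a value to a smaller scale without an explicit rounding mode throws, while supplying a rounding mode succeeds.

import java.math.{BigDecimal => JBigDecimal, RoundingMode}

val value = JBigDecimal.valueOf(10.025)

// Without a rounding mode, reducing the scale throws:
// value.setScale(2)  // java.lang.ArithmeticException: Rounding necessary

// With an explicit rounding mode the cast succeeds:
val rounded = value.setScale(2, RoundingMode.HALF_UP)  // 10.03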

Impact

None

Describe any public API or user-facing feature change or any performance impact.

Risk level (write none, low, medium or high below)

None

If medium or high, explain what verification was done to mitigate the risks.

Documentation Update

Describe any necessary documentation update if there is any new feature, config, or user-facing change

  • The config description must be updated if new configs are added or the default value of a config is changed
  • Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here, and follow the instructions to make changes to the website.

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@voonhous
Member Author

voonhous commented Apr 4, 2023

@xiarixiaoyao Can you please help to review this?

Thank you!

@danny0405 danny0405 added engine:spark Spark integration area:schema Schema evolution and data types labels Apr 5, 2023
@voonhous voonhous changed the title [HUDI-6033] Fix rounding exception when performing a float to decimal… [HUDI-6033] Fix rounding exception when to decimal casting Apr 5, 2023
@xiarixiaoyao
Contributor

@voonhous
Thanks for your contribution.
Looks good. Only one question: in this scenario where the conversion loses accuracy, should we force the conversion or just throw an exception to remind the user?

@voonhous
Member Author

voonhous commented Apr 6, 2023

@xiarixiaoyao

For conversions that lose accuracy, this really depends on the data that's in the table.

I do not think a check on all existing data is feasible, i.e. checking all existing data to ensure there is no loss in accuracy and, if there is, throwing an exception so that the ALTER TABLE CHANGE COLUMN TYPE DDL fails.

IIUC, the only time we can perform this check is when executing the ALTER TABLE DDL, as a DECIMAL -> float/double schema change (reversing whatever was done) is not allowed.

As such, I believe we will need to document this potential loss in accuracy to remind users of this behaviour.

@xiarixiaoyao
Contributor

@xiarixiaoyao

For conversions that lose accuracy, this really depends on the data that's in the table.

I do not think a check on all existing data is feasible, i.e. checking all existing data to ensure there is no loss in accuracy and, if there is, throwing an exception so that the ALTER TABLE CHANGE COLUMN TYPE DDL fails.

IIUC, the only time we can perform this check is when executing the ALTER TABLE DDL, as a DECIMAL -> float/double schema change (reversing whatever was done) is not allowed.

As such, I believe we will need to document this potential loss in accuracy to remind users of this behaviour.

doc +1

@xiarixiaoyao
Contributor

@hudi-bot run azure

@voonhous
Member Author

voonhous commented Apr 7, 2023

@hudi-bot run azure

@voonhous
Member Author

@hudi-bot run azure

@voonhous
Member Author

voonhous commented Apr 10, 2023

@danny0405 Fixed the CI here; I forgot to remove a test case that I was using for debugging.

Can we merge this in soon? Thank you!

After this is fixed, we will need to align the HoodieSparkRecordMerger to ensure that the correct rounding mode is used, i.e. that the output after rewriting to a new schema produces consistent results across the different record merger implementations.

I will raise another PR for it.

@danny0405
Contributor

Let's give a summary about the changes:

float -> decimal: HALF_EVEN
double -> decimal: HALF_UP

Since Spark uses HALF_UP rounding when performing a FIX_SCALE_TYPE to DECIMAL cast when there is a loss in scale

Can you show us the code snippet how Spark does this then? Is the Spark behavior a standard that we need to follow?

@voonhous
Member Author

Let's give a summary about the changes:

float -> decimal: HALF_EVEN
double -> decimal: HALF_UP

Since Spark uses HALF_UP rounding when performing a FIX_SCALE_TYPE to DECIMAL cast when there is a loss in scale

Can you show us the code snippet how Spark does this then? Is the Spark behavior a standard that we need to follow?

This is the code that I referred to. In general, the default rounding mode is HALF_UP.

https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/types/Decimal.scala

On top of that, I did a simple verification for fixed types, which I included in the JIRA ticket. I'll paste the results here too:

-- test  HALF_UP rounding (verify that it does not use HALF_EVEN)
> SELECT CAST(CAST("10.024" AS DOUBLE) AS DECIMAL(4, 2));
10.02

> SELECT CAST(CAST("10.025" AS DOUBLE) AS DECIMAL(4, 2));
10.03

> SELECT CAST(CAST("10.026" AS DOUBLE) AS DECIMAL(4, 2));
10.03


-- test negative HALF_UP rounding (verify that it does not use HALF_EVEN)
> SELECT CAST(CAST("-10.024" AS DOUBLE) AS DECIMAL(4, 2));
-10.02

> SELECT CAST(CAST("-10.025" AS DOUBLE) AS DECIMAL(4, 2));
-10.03

> SELECT CAST(CAST("-10.026" AS DOUBLE) AS DECIMAL(4, 2));
-10.03


-- test HALF_UP rounding (will return same result as HALF_EVEN)
> SELECT CAST(CAST("10.034" AS DOUBLE) AS DECIMAL(4, 2));
10.03

> SELECT CAST(CAST("10.035" AS DOUBLE) AS DECIMAL(4, 2));
10.04

> SELECT CAST(CAST("10.036" AS DOUBLE) AS DECIMAL(4, 2));
10.04


-- test negative HALF_UP rounding (will return same result as HALF_EVEN)
> SELECT CAST(CAST("-10.034" AS DOUBLE) AS DECIMAL(4, 2));
-10.03

> SELECT CAST(CAST("-10.035" AS DOUBLE) AS DECIMAL(4, 2));
-10.04

> SELECT CAST(CAST("-10.036" AS DOUBLE) AS DECIMAL(4, 2));
-10.04

As can be seen from the results pasted above, general casting rules where scale is lost use a rounding mode of HALF_UP.
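Assuming Spark backs these casts with java.math.BigDecimal, the same behaviour can be reproduced outside of Spark with HALF_UP (a minimal Scala sketch, not Hudi or Spark code):

import java.math.{BigDecimal => JBigDecimal, RoundingMode}

// Mirror the SQL casts above: DOUBLE values rounded to DECIMAL(4, 2).
Seq(10.024, 10.025, 10.026, -10.024, -10.025, -10.026).foreach { d =>
  val rounded = JBigDecimal.valueOf(d).setScale(2, RoundingMode.HALF_UP)
  println(s"$d -> $rounded")
}
// 10.024 -> 10.02, 10.025 -> 10.03, 10.026 -> 10.03
// -10.024 -> -10.02, -10.025 -> -10.03, -10.026 -> -10.03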

@danny0405
Contributor

It seems that Spark always uses the HALF_UP rounding mode; the HALF_EVEN mode is only used in tests. So the main reason we use HALF_EVEN for the float -> decimal conversion is to avoid throwing errors, right?

@voonhous
Member Author

voonhous commented Apr 11, 2023

It seems that Spark always uses the HALF_UP rounding mode; the HALF_EVEN mode is only used in tests. So the main reason we use HALF_EVEN for the float -> decimal conversion is to avoid throwing errors, right?

HALF_EVEN is used OUTSIDE of tests. For example:

https://github.com/apache/hudi/blob/d6ff3d6ba46b51f58ebb1db58f26423a211741f5/hudi-client/hudi-spark-client/src/main/scala/org/apache/spark/sql/HoodieInternalRowUtils.scala

This line of code was introduced here:

#7769

We'll have to ask @alexeykudinkin

As to why HALF_EVEN was used, I am not sure...

But if I were to hazard a guess, it is to avoid bias: with HALF_UP, midpoint values are always rounded up, which would shift the average value of the set of numerics.

If HALF_EVEN were used, midpoint values would be rounded towards the nearest even digit, so roughly equal numbers of them are rounded in the UP and DOWN directions, and the original average value would not shift as much.
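As a hedged illustration of this bias argument (plain BigDecimal, not Hudi code), midpoint values always move away from zero under HALF_UP, while HALF_EVEN alternates towards the even digit, keeping the mean of the rounded values closer to the original mean:

import java.math.{BigDecimal => JBigDecimal, RoundingMode}

Seq("10.015", "10.025", "10.035", "10.045").foreach { s =>
  val v = new JBigDecimal(s)
  val up   = v.setScale(2, RoundingMode.HALF_UP)
  val even = v.setScale(2, RoundingMode.HALF_EVEN)
  println(s"$s -> HALF_UP: $up, HALF_EVEN: $even")
}
// HALF_UP:   10.02, 10.03, 10.04, 10.05
// HALF_EVEN: 10.02, 10.02, 10.04, 10.04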

Not really sure... Just throwing my assumptions out there.

@danny0405
Contributor

HALF_EVEN is used OUTSIDE of tests. For example:

I'm talking about Apache Spark, not Hudi.

@voonhous
Member Author

I'm talking about Apache Spark, not Hudi.

Not very sure about this, but I have yet to encounter a use-case where anything other than HALF_UP was used.

@voonhous voonhous force-pushed the HUDI-6033 branch 2 times, most recently from f8183f5 to 09c2c89 on April 13, 2023 11:32
* <ul>
* <li>double/string -> decimal: HALF_UP</li>
* <li>float -> decimal: HALF_EVEN</li>
* </ul>
Contributor

@xiarixiaoyao xiarixiaoyao Apr 14, 2023


I checked the code of some computing engines.

Spark always uses HALF_UP to cast FractionalType (float/double) to DecimalType, so why do we use HALF_EVEN to cast float -> decimal?
BTW, Hive/Presto also use HALF_UP.

Member Author

@voonhous voonhous Apr 14, 2023


@xiarixiaoyao The original code that I am modifying uses HALF_EVEN.

According to the git-blame, the HALF_EVEN rounding mode was introduced by @alexeykudinkin.

This line of code was introduced here:
#7769

https://github.com/onehouseinc/hudi/blob/24020a964671b35fb9aa7b86748771fd71512495/hudi-client/hudi-spark-client/src/main/scala/org/apache/spark/sql/HoodieInternalRowUtils.scala#L289

Contributor


I think we should unify all the rounding modes to HALF_UP to avoid ambiguity. Another point of confusion is #7769: all the numeric and string types use HALF_EVEN there, so why does this patch only fix the DOUBLE and STRING types? The strategies are out of sync.

Member Author


My initial consideration for only applying this to DOUBLE and STRING was that these are exact representations of the fractional numbers,
i.e. 1.5 = 1.5.

For other types, like float, the representation may not be exact,
i.e. a value written as 1.1 is actually stored as roughly 1.10000002.

As such, given that HALF_EVEN was introduced in #7769, I wanted to limit the change to the exact types.
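To make the exact-vs-inexact point concrete (a small sketch, with 1.1 used as the illustrative value because it has no exact binary representation):

// A float carries less precision than a double, so widening a float and then
// casting to decimal exposes representation error that a double literal does not.
val f: Float = 1.1f
println(f.toDouble)                            // 1.100000023841858
println(new java.math.BigDecimal(f.toDouble))  // full binary expansion of the float value
println(java.math.BigDecimal.valueOf(1.1))     // 1.1 for the double literal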

Since we want to standardize all rounding modes to HALF_UP, this method is no longer required. I will standardise everything to HALF_UP.

Member Author


Done, I've standardized the RoundingMode here 8bfaea8
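For reference, a minimal sketch of what a single HALF_UP conversion path could look like (a hypothetical helper for illustration, not the actual change in 8bfaea8):

import java.math.{BigDecimal => JBigDecimal, RoundingMode}

// Hypothetical helper: route float, double and string sources through one
// HALF_UP path so they all round identically when the target scale is smaller.
def toDecimal(value: Any, scale: Int): JBigDecimal = value match {
  case f: Float  => new JBigDecimal(f.toString).setScale(scale, RoundingMode.HALF_UP)
  case d: Double => JBigDecimal.valueOf(d).setScale(scale, RoundingMode.HALF_UP)
  case s: String => new JBigDecimal(s).setScale(scale, RoundingMode.HALF_UP)
  case other     => throw new IllegalArgumentException(s"Unsupported type: ${other.getClass}")
}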

@voonhous
Member Author

@danny0405 Can you please help review this PR again? I resolved the merge conflicts caused by changes in the javadoc.

@voonhous
Member Author

@hudi-bot run azure

Contributor

@danny0405 danny0405 left a comment


+1

@hudi-bot
Collaborator

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

@danny0405 danny0405 merged commit bdb50dd into apache:master Apr 19, 2023
stayrascal pushed a commit to stayrascal/hudi that referenced this pull request Apr 20, 2023
Standardise rounding mode to HALF_UP; engines like Spark, Presto and Hive also use HALF_UP as the default rounding mode.
yihua pushed a commit to yihua/hudi that referenced this pull request May 15, 2023
Standardise rounding mode to HALF_UP; engines like Spark, Presto and Hive also use HALF_UP as the default rounding mode.
@voonhous voonhous deleted the HUDI-6033 branch September 8, 2023 07:11