
Conversation

@chenghao-intel
Contributor

val d = Decimal(1.12321)
val df = Seq((d, 1)).toDF("a", "b")
df.selectExpr("b * a / b").collect()  // => Array(Row(null))

It seems we respect the scale part of a decimal more than the integral part, which causes incorrect overflow checking (the query returns null).

@chenghao-intel
Contributor Author

Marking it as WIP; not sure if @davies or @rxin has a better idea for fixing it.

@yhuai
Contributor

yhuai commented Aug 25, 2015

What is the data type of a?

@yhuai
Contributor

yhuai commented Aug 25, 2015

OK, the type of a is Decimal(38, 18). The plan of df.selectExpr("b * a / b") is:

scala> df.selectExpr("b * a / b").explain(true)
== Parsed Logical Plan ==
'Project [(('b * 'a) / 'b) AS ((b * a) / b)#7]
 Project [_1#0 AS a#2,_2#1 AS b#3]
  LocalRelation [_1#0,_2#1], [[1.123210000000000000,1]]

== Analyzed Logical Plan ==
((b * a) / b): decimal(38,38)
Project [CheckOverflow((promote_precision(cast(CheckOverflow((promote_precision(cast(cast(b#3 as decimal(10,0)) as decimal(38,18))) * promote_precision(cast(a#2 as decimal(38,18)))), DecimalType(38,18)) as decimal(38,18))) / promote_precision(cast(cast(b#3 as decimal(38,18)) as decimal(38,18)))), DecimalType(38,38)) AS ((b * a) / b)#7]
 Project [_1#0 AS a#2,_2#1 AS b#3]
  LocalRelation [_1#0,_2#1], [[1.123210000000000000,1]]

@chenghao-intel
Contributor Author

Yes, the CheckOverflow(xxx, DecimalType(38,38)) will yield null.
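For illustration, a decimal(38, 38) type spends all 38 digits after the decimal point, so there is no room for the leading 1 of the value. A quick check in the shell (assuming Decimal.changePrecision has the (precision, scale) => Boolean shape it has in this Spark version):

import org.apache.spark.sql.types.Decimal

val d = Decimal(1.12321)
// changePrecision returns false when the value cannot be represented with the
// requested precision/scale; CheckOverflow then produces null.
d.changePrecision(38, 38)  // false: no integral digit fits in decimal(38, 38)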

@SparkQA

SparkQA commented Aug 25, 2015

Test build #41512 has finished for PR 8409 at commit 73e8544.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yhuai
Contributor

yhuai commented Aug 25, 2015

promote_precision(cast(cast(b#3 as decimal(38,18)) as decimal(38,18))) looks weird.

The precision and scale of e1 / e2 are calculated as: result precision = p1 - s1 + s2 + max(6, s1 + p2 + 1), result scale = max(6, s1 + p2 + 1). For this case, even if we use (38, 18) as the precision and scale of b, we will have 38 - 18 + 0 + max(6, 18 + 38 + 1) = 77 as the precision and max(6, 18 + 38 + 1) = 57 as the scale. To prevent the integral part from being truncated, we should adjust the precision and scale to (38, 18), right? It looks like https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/types/DecimalType.scala#L143 is not right.
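To make that concrete, here is a small illustration (not Spark's actual code) of the e1 / e2 rule with the same cap-at-38 behavior as DecimalType.bounded, applied to the promoted operand types decimal(38, 18) / decimal(38, 18) from the analyzed plan:

// Illustration only: the e1 / e2 rule, with precision and scale simply capped
// at 38 the way DecimalType.bounded caps them today.
def divisionType(p1: Int, s1: Int, p2: Int, s2: Int): (Int, Int) = {
  val scale = math.max(6, s1 + p2 + 1)
  val precision = p1 - s1 + s2 + scale
  (math.min(precision, 38), math.min(scale, 38))
}

// For the promoted operands in the plan, the unbounded precision and scale both
// far exceed 38, so both get capped and the result type is decimal(38, 38).
// A decimal(38, 38) has no integral digits at all, hence the null result.
divisionType(38, 18, 38, 18)  // (38, 38)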

@chenghao-intel
Contributor Author

Yes, that's why I changed the MAX_SCALE from 38 to 18.

@yhuai
Contributor

yhuai commented Aug 25, 2015

If you change MAX_SCALE, we are changing the maximum scale we can support, right? I feel the right solution is fixing https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/types/DecimalType.scala#L143 (if the calculated precision is greater than 38, we set it to 38 and then adjust the scale accordingly to prevent the integral part of the number from being truncated). @davies what do you think?
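A rough sketch of that adjustment (a hypothetical helper, not the current code in DecimalType.scala): cap the precision at 38 and give whatever digits remain to the scale, so the integral part is kept intact.

// Hypothetical sketch of the adjustment described above.
def adjust(precision: Int, scale: Int): (Int, Int) = {
  val maxPrecision = 38
  if (precision <= maxPrecision) {
    (precision, scale)
  } else {
    val intDigits = precision - scale  // digits before the decimal point
    (maxPrecision, math.max(0, maxPrecision - intDigits))
  }
}

// For the (77, 57) case above: the 20 integral digits are kept, leaving a scale of 18.
adjust(77, 57)  // (38, 18)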

@chenghao-intel
Contributor Author

I think the maximum scale we can support should be 18, not 38, as you can see in SYSTEM_DEFAULT. I am not sure if that was the original design, but it seems Hive doesn't support a scale as large as 38.

@chenghao-intel
Contributor Author

hive> select 10.30000000000000000000001 / 1.420000000000000400000001;
OK
7.253521126760562

@davies
Contributor

davies commented Aug 25, 2015

@yhuai For Add/Sub/Multiply, we should not lose precision (we can't reduce the scale of the result; overflow is preferable), but for division we can't keep all the digits after the decimal point anyway, so we could either impose a maximum scale or preserve the integral range first.

From MS SQL Server: https://msdn.microsoft.com/en-us/library/ms190476.aspx

* The result precision and scale have an absolute maximum of 38. When a result precision is greater than 38, the corresponding scale is reduced to prevent the integral part of a result from being truncated.

To be safe, I think we should only do this for division, or we will silently lose precision for add/sub/multiply (which overflow in Hive instead).

@chenghao-intel
Contributor Author

@davies, are you suggesting we should change the code in HiveTypeCoercion.DecimalPrecision for Div only, and not anywhere else?

@chenghao-intel
Contributor Author

@davies When we perform a division of two decimals of type Decimal(38, 18), the result can be Decimal(77, 57), as @yhuai described above. This causes a problem when we try to pick the scale: if we try to support the maximum integral value of a decimal, the best scale would be 0, but that's not what we want for Decimal(1.234) / Decimal(1.235).

That's why I came up with the idea of limiting the maximum scale to 18. Not sure if you have a better idea.

@davies
Contributor

davies commented Aug 25, 2015

@chenghao-intel Yes.

This is how the precision and scale of a division result are determined in Hive: https://github.com/apache/hive/blob/ac755ebe26361a4647d53db2a28500f71697b276/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFOPDivide.java#L113

  /**
   * A balanced way to determine the precision/scale of decimal division result. Integer digits and
   * decimal digits are computed independently. However, when the precision from above reaches above
   * HiveDecimal.MAX_PRECISION, interger digit and decimal digits are shrunk equally to fit.
   */
  @Override
  protected DecimalTypeInfo deriveResultDecimalTypeInfo(int prec1, int scale1, int prec2, int scale2) {
    int intDig = Math.min(HiveDecimal.MAX_SCALE, prec1 - scale1 + scale2);
    int decDig = Math.min(HiveDecimal.MAX_SCALE, Math.max(6, scale1 + prec2 + 1));
    int diff = intDig + decDig -  HiveDecimal.MAX_SCALE;
    if (diff > 0) {
      decDig -= diff/2 + 1; // Slight negative bias.
      intDig = HiveDecimal.MAX_SCALE - decDig;
    }
    return TypeInfoFactory.getDecimalTypeInfo(intDig + decDig, decDig);
  }
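Roughly the same logic transcribed into Scala and applied to decimal(38, 18) / decimal(10, 0) (a sketch; it assumes HiveDecimal.MAX_PRECISION and HiveDecimal.MAX_SCALE are both 38):

def hiveDivisionType(p1: Int, s1: Int, p2: Int, s2: Int): (Int, Int) = {
  val max = 38
  var intDig = math.min(max, p1 - s1 + s2)
  var decDig = math.min(max, math.max(6, s1 + p2 + 1))
  val diff = intDig + decDig - max
  if (diff > 0) {
    decDig -= diff / 2 + 1  // slight negative bias, as in Hive
    intDig = max - decDig
  }
  (intDig + decDig, decDig)
}

// decimal(38, 18) / decimal(10, 0): intDig = 20, decDig = 29, diff = 11,
// rebalanced to intDig = 15, decDig = 23 => decimal(38, 23).
hiveDivisionType(38, 18, 10, 0)  // (38, 23)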

@davies
Contributor

davies commented Aug 25, 2015

There is another problem here: we promote the precision and scale of the integer from (10, 0) to (38, 18). I think we shouldn't do this for add/sub/multiply/division, because we will promote them together in a better way.

@davies
Contributor

davies commented Aug 25, 2015

@chenghao-intel We could have a small change like this:

-          val resultType = DecimalType.bounded(p1 - s1 + s2 + max(6, s1 + p2 + 1),
-            max(6, s1 + p2 + 1))
+          val scale = min(max(6, s1 + p2 + 1), 18)
+          val resultType = DecimalType.bounded(p1 - s1 + s2 + scale, scale)
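Applied to the original example (both operands promoted to decimal(38, 18)), that change would work out roughly like this:

// Sketch of the proposed rule for the b * a / b example.
val (p1, s1, p2, s2) = (38, 18, 38, 18)
val scale = math.min(math.max(6, s1 + p2 + 1), 18)  // min(57, 18) = 18
val precision = p1 - s1 + s2 + scale                // 56, bounded down to 38
// Result type: decimal(38, 18), so 1.12321 fits and CheckOverflow no longer
// returns null.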

@chenghao-intel
Contributor Author

Thank you all for the replies. I am closing this since #8415 basically solves all of the open questions.

@chenghao-intel chenghao-intel deleted the decimal_div branch August 25, 2015 08:15