
Conversation

@chenghao-intel
Contributor

val d = Decimal(1.12321)
val df = Seq((d, 1)).toDF("a", "b")
df.selectExpr("b * a / b").collect()  // => Array(Row(null))

It seems we respect the scale part of a decimal more than the integral part, which causes incorrect overflow checking (the query returns null).

@chenghao-intel
Contributor Author

Marking it as WIP; not sure if @davies or @rxin has a better idea for fixing it.

@yhuai
Contributor

yhuai commented Aug 25, 2015

What is the data type of a?

@yhuai
Contributor

yhuai commented Aug 25, 2015

OK, the type of a is Decimal(38, 18). The plan of df.selectExpr("b * a / b") is:

scala> df.selectExpr("b * a / b").explain(true)
== Parsed Logical Plan ==
'Project [(('b * 'a) / 'b) AS ((b * a) / b)#7]
 Project [_1#0 AS a#2,_2#1 AS b#3]
  LocalRelation [_1#0,_2#1], [[1.123210000000000000,1]]

== Analyzed Logical Plan ==
((b * a) / b): decimal(38,38)
Project [CheckOverflow((promote_precision(cast(CheckOverflow((promote_precision(cast(cast(b#3 as decimal(10,0)) as decimal(38,18))) * promote_precision(cast(a#2 as decimal(38,18)))), DecimalType(38,18)) as decimal(38,18))) / promote_precision(cast(cast(b#3 as decimal(38,18)) as decimal(38,18)))), DecimalType(38,38)) AS ((b * a) / b)#7]
 Project [_1#0 AS a#2,_2#1 AS b#3]
  LocalRelation [_1#0,_2#1], [[1.123210000000000000,1]]

@chenghao-intel
Contributor Author

Yes, the CheckOverflow(xxx, DecimalType(38,38)) will yield null.
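For illustration, a decimal(38, 38) type spends all 38 digits after the decimal point, so there is no room for the leading 1 of the value. A quick check in the shell (assuming Decimal.changePrecision has the (precision, scale) => Boolean shape it has in this Spark version):

import org.apache.spark.sql.types.Decimal

val d = Decimal(1.12321)
// changePrecision returns false when the value cannot be represented with the
// requested precision/scale; CheckOverflow then produces null.
d.changePrecision(38, 38)  // false: no integral digit fits in decimal(38, 38)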

@SparkQA

SparkQA commented Aug 25, 2015

Test build #41512 has finished for PR 8409 at commit 73e8544.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yhuai
Contributor

yhuai commented Aug 25, 2015

promote_precision(cast(cast(b#3 as decimal(38,18)) as decimal(38,18))) looks weird.

The precision and scale of e1 / e2 are calculated as: result precision = p1 - s1 + s2 + max(6, s1 + p2 + 1), result scale = max(6, s1 + p2 + 1). For this case, even if we use (38, 18) as the precision and scale of b, we will have 38 - 18 + 0 + max(6, 18 + 38 + 1) = 77 as the precision and max(6, 18 + 38 + 1) = 57 as the scale. To prevent the integral part from being truncated, we should adjust the precision and scale to (38, 18), right? It looks like https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/types/DecimalType.scala#L143 is not right.
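To make that concrete, here is a small illustration (not Spark's actual code) of the e1 / e2 rule with the same cap-at-38 behavior as DecimalType.bounded, applied to the promoted operand types decimal(38, 18) / decimal(38, 18) from the analyzed plan:

// Illustration only: the e1 / e2 rule, with precision and scale simply capped
// at 38 the way DecimalType.bounded caps them today.
def divisionType(p1: Int, s1: Int, p2: Int, s2: Int): (Int, Int) = {
  val scale = math.max(6, s1 + p2 + 1)
  val precision = p1 - s1 + s2 + scale
  (math.min(precision, 38), math.min(scale, 38))
}

// For the promoted operands in the plan, the unbounded precision and scale both
// far exceed 38, so both get capped and the result type is decimal(38, 38).
// A decimal(38, 38) has no integral digits at all, hence the null result.
divisionType(38, 18, 38, 18)  // (38, 38)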

@chenghao-intel
Contributor Author

Yes, that's why I changed the MAX_SCALE from 38 to 18.

@yhuai
Contributor

yhuai commented Aug 25, 2015

If you change MAX_SCALE, we are changing the maximum scale we can support, right? I feel the right solution is fixing https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/types/DecimalType.scala#L143 (if the calculated precision is greater than 38, we set it to 38 and then adjust the scale accordingly to prevent the integral part of the number from being truncated). @davies what do you think?
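A rough sketch of that adjustment (a hypothetical helper, not the current code in DecimalType.scala): cap the precision at 38 and give whatever digits remain to the scale, so the integral part is kept intact.

// Hypothetical sketch of the adjustment described above.
def adjust(precision: Int, scale: Int): (Int, Int) = {
  val maxPrecision = 38
  if (precision <= maxPrecision) {
    (precision, scale)
  } else {
    val intDigits = precision - scale  // digits before the decimal point
    (maxPrecision, math.max(0, maxPrecision - intDigits))
  }
}

// For the (77, 57) case above: the 20 integral digits are kept, leaving a scale of 18.
adjust(77, 57)  // (38, 18)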

@chenghao-intel
Contributor Author

I think the maximum scale we can support should be 18, not 38, as you can see in SYSTEM_DEFAULT. I am not sure if that was the original design, but it seems Hive doesn't support a scale as large as 38.

@chenghao-intel
Contributor Author

hive> select 10.30000000000000000000001 / 1.420000000000000400000001;
OK
7.253521126760562

@davies
Contributor

davies commented Aug 25, 2015

@yhuai For Add/Sub/Multiply, we should not lose precision (we can't reduce the scale of the result; overflow is preferable), but for division we can't keep all the digits after the decimal point anyway, so we could either impose a maximum scale or preserve the integral range first.

From MS SQL Server: https://msdn.microsoft.com/en-us/library/ms190476.aspx

* The result precision and scale have an absolute maximum of 38. When a result precision is greater than 38, the corresponding scale is reduced to prevent the integral part of a result from being truncated.

To be safe, I think we should only do this for division, or we will silently lose precision for add/sub/multiply (which overflow in Hive instead).

@chenghao-intel
Contributor Author

@davies, are you suggesting we should change the code in HiveTypeCoercion.DecimalPrecision for Div only, and not anywhere else?

@chenghao-intel
Contributor Author

@davies When we perform a division of two decimals of type Decimal(38, 18), the result can be Decimal(77, 57), as @yhuai described above. This causes a problem when we try to pick the scale: if we try to support the maximum integral value of a decimal, the best scale would be 0, but that's not what we want for Decimal(1.234) / Decimal(1.235).

That's why I came up with the idea of limiting the maximum scale to 18. Not sure if you have a better idea.

@davies
Contributor

davies commented Aug 25, 2015

@chenghao-intel Yes.

This is how the precision and scale of a division result are determined in Hive: https://github.com/apache/hive/blob/ac755ebe26361a4647d53db2a28500f71697b276/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFOPDivide.java#L113

  /**
   * A balanced way to determine the precision/scale of decimal division result. Integer digits and
   * decimal digits are computed independently. However, when the precision from above reaches above
   * HiveDecimal.MAX_PRECISION, interger digit and decimal digits are shrunk equally to fit.
   */
  @Override
  protected DecimalTypeInfo deriveResultDecimalTypeInfo(int prec1, int scale1, int prec2, int scale2) {
    int intDig = Math.min(HiveDecimal.MAX_SCALE, prec1 - scale1 + scale2);
    int decDig = Math.min(HiveDecimal.MAX_SCALE, Math.max(6, scale1 + prec2 + 1));
    int diff = intDig + decDig -  HiveDecimal.MAX_SCALE;
    if (diff > 0) {
      decDig -= diff/2 + 1; // Slight negative bias.
      intDig = HiveDecimal.MAX_SCALE - decDig;
    }
    return TypeInfoFactory.getDecimalTypeInfo(intDig + decDig, decDig);
  }
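Roughly the same logic transcribed into Scala and applied to decimal(38, 18) / decimal(10, 0) (a sketch; it assumes HiveDecimal.MAX_PRECISION and HiveDecimal.MAX_SCALE are both 38):

def hiveDivisionType(p1: Int, s1: Int, p2: Int, s2: Int): (Int, Int) = {
  val max = 38
  var intDig = math.min(max, p1 - s1 + s2)
  var decDig = math.min(max, math.max(6, s1 + p2 + 1))
  val diff = intDig + decDig - max
  if (diff > 0) {
    decDig -= diff / 2 + 1  // slight negative bias, as in Hive
    intDig = max - decDig
  }
  (intDig + decDig, decDig)
}

// decimal(38, 18) / decimal(10, 0): intDig = 20, decDig = 29, diff = 11,
// rebalanced to intDig = 15, decDig = 23 => decimal(38, 23).
hiveDivisionType(38, 18, 10, 0)  // (38, 23)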

@davies
Contributor

davies commented Aug 25, 2015

There is another problem here: we promote the precision and scale of the integer from (10, 0) to (38, 18). I think we shouldn't do this for add/sub/multiply/division, because we will promote them together in a better way.

@davies
Contributor

davies commented Aug 25, 2015

@chenghao-intel We could have a small change like this:

-          val resultType = DecimalType.bounded(p1 - s1 + s2 + max(6, s1 + p2 + 1),
-            max(6, s1 + p2 + 1))
+          val scale = min(max(6, s1 + p2 + 1), 18)
+          val resultType = DecimalType.bounded(p1 - s1 + s2 + scale, scale)
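Applied to the original example (both operands promoted to decimal(38, 18)), that change would work out roughly like this:

// Sketch of the proposed rule for the b * a / b example.
val (p1, s1, p2, s2) = (38, 18, 38, 18)
val scale = math.min(math.max(6, s1 + p2 + 1), 18)  // min(57, 18) = 18
val precision = p1 - s1 + s2 + scale                // 56, bounded down to 38
// Result type: decimal(38, 18), so 1.12321 fits and CheckOverflow no longer
// returns null.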

@chenghao-intel
Contributor Author

Thank you all for the replies. I am closing this since #8415 basically solves all of the open questions.

@chenghao-intel chenghao-intel deleted the decimal_div branch August 25, 2015 08:15