Conversation

@Ngone51 (Member) commented May 18, 2020

What changes were proposed in this pull request?

Eliminate the `UpCast` if its child's data type is already a decimal type.
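In code terms, this is a new case in the analyzer's `ResolveUpCast` rule (the rule visible in the stack trace below). Here is a minimal sketch of its shape, assembled from the diff hunks quoted in the review comments later in this thread; the guard on `target` is an assumption (the reviewer's nit elides it as `...`), not a quote of the merged code:

```scala
// Sketch, not the verbatim Spark source: when upcasting to the generic
// `object DecimalType` and the child is already some DecimalType(p, s),
// drop the UpCast and keep the child's precision/scale as-is.
case u @ UpCast(child, target, walkedTypePath) if target == DecimalType
    && child.dataType.isInstanceOf[DecimalType] =>
  assert(walkedTypePath.nonEmpty,
    "object DecimalType should only be used inside ExpressionEncoder")
  child
```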

Why are the changes needed?

While deserializing an internal `Decimal` value to an external `BigDecimal` (Java/Scala) value, Spark should also respect the `Decimal`'s precision and scale; otherwise it causes precision loss and behaves unexpectedly in some cases, e.g.:

sql("select cast(11111111111111111111111111111111111111 as decimal(38, 0)) as d")
  .write.mode("overwrite")
  .parquet(f.getAbsolutePath)

// can fail
spark.read.parquet(f.getAbsolutePath).as[BigDecimal]
[info]   org.apache.spark.sql.AnalysisException: Cannot up cast `d` from decimal(38,0) to decimal(38,18).
[info] The type path of the target object is:
[info] - root class: "scala.math.BigDecimal"
[info] You can either add an explicit cast to the input data or choose a higher precision type of the field in the target object;
[info]   at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast$.org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveUpCast$$fail(Analyzer.scala:3060)
[info]   at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast$$anonfun$apply$33$$anonfun$applyOrElse$174.applyOrElse(Analyzer.scala:3087)
[info]   at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast$$anonfun$apply$33$$anonfun$applyOrElse$174.applyOrElse(Analyzer.scala:3071)
[info]   at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$1(TreeNode.scala:309)
[info]   at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:72)
[info]   at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:309)
[info]   at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$3(TreeNode.scala:314)

Does this PR introduce any user-facing change?

Yes. The cases mentioned above (which hit precision loss) failed before this change but run successfully after it.

How was this patch tested?

Added tests.

@Ngone51 (Author) commented May 18, 2020

cc @cloud-fan @HyukjinKwon

@SparkQA commented May 18, 2020

Test build #122812 has finished for PR 28572 at commit b137ec4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented May 19, 2020

Test build #122833 has finished for PR 28572 at commit b4eb291.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member) commented:

Looks good to me

&& child.dataType.isInstanceOf[DecimalType] =>
assert(walkedTypePath.nonEmpty,
"object DecimalType should only be used inside ExpressionEncoder")
// SPARK-31750: for the case where data type is explicitly known, e.g, spark.read
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit:

SPARK-31750: if we want to upcast to the general decimal type, and the `child` is already
decimal type, we can remove the `Upcast` and accept any precision/scale.
This can happen for cases like `spark.read.parquet("/tmp/file").as[BigDecimal]`.

// eliminate the UpCast here to avoid precision lost.
child

Review comment on the pattern match:

```scala
case u @ UpCast(child, _, _)
```

Contributor commented:

nit: `case Upcast(child, target: AtomicType, _) if ...`

Review comment on the `UpCast` scaladoc:

```scala
 * Cast the child expression to the target data type, but will throw error if the cast might
 * truncate, e.g. long -> int, timestamp -> date.
 *
 * Note that UpCast will be eliminated if the child's dataType is already DecimalType and
```

Contributor commented:

We can simplify the doc:

> Note: `target` is `AbstractDataType`, so that we can put `object DecimalType`, which means
> we accept `DecimalType` with any valid precision/scale.
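For context, a sketch of the `UpCast` node the suggested doc describes; the shape follows this thread, but the exact supertraits here are assumptions, not quotes of the Spark source:

```scala
// Sketch: `target` is AbstractDataType rather than DataType, so callers can pass
// either a concrete type (e.g. IntegerType) or the generic `object DecimalType`,
// which means "accept DecimalType with any valid precision/scale".
case class UpCast(child: Expression, target: AbstractDataType, walkedTypePath: Seq[String] = Nil)
  extends UnaryExpression with Unevaluable {
  // UpCast exists only during analysis; ResolveUpCast replaces it with a real Cast
  // (or, after this PR, removes it when the child is already a decimal).
  override lazy val resolved = false
}
```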

test("SPARK-31750: eliminate UpCast if child's dataType is DecimalType") {
val encoder = ExpressionEncoder[Seq[BigDecimal]]
val attr = Seq(AttributeReference("a", ArrayType(DecimalType(38, 0)))())
// previously, it will fail because Decimal(38, 0) can not be casted to Decimal(38, 18)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

previously -> Before SPARK-31750
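A plausible completion of that truncated test, for readers following the diff; the `resolveAndBind` call is an assumption about how the test drives analysis, not a quote from the PR:

```scala
test("SPARK-31750: eliminate UpCast if child's dataType is DecimalType") {
  val encoder = ExpressionEncoder[Seq[BigDecimal]]
  // Input column is array<decimal(38, 0)>, while the encoder's deserializer targets
  // scala.math.BigDecimal, whose default Catalyst mapping is decimal(38, 18).
  val attr = Seq(AttributeReference("a", ArrayType(DecimalType(38, 0)))())
  // Before SPARK-31750, resolution failed because decimal(38, 0)
  // could not be up-cast to decimal(38, 18).
  encoder.resolveAndBind(attr)
}
```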


test("SPARK-31750: eliminate UpCast if child's dataType is DecimalType") {
withTempPath { f =>
sql("select cast(11111111111111111111111111111111111111 as decimal(38, 0)) as d")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this test can still reproduce the bug even if we use 1 instead of 1111...?

@Ngone51 (Author) replied:

Yes. It depends on the precision/scale rather than the value itself.

@Ngone51 (Author) added:

I can make it shorter.

test("SPARK-31750: eliminate UpCast if child's dataType is DecimalType") {
val encoder = ExpressionEncoder[Seq[BigDecimal]]
val attr = Seq(AttributeReference("a", ArrayType(DecimalType(38, 0)))())
// previously, it will fail because Decimal(38, 0) can not be casted to Decimal(38, 18)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

previously -> Before SPARK-31750


test("SPARK-31750: eliminate UpCast if child's dataType is DecimalType") {
withTempPath { f =>
sql("select cast(11111111111111111111111111111111111111 as decimal(38, 0)) as d")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this test can still reproduce the bug even if we use 1 instead of 1111...?

@Ngone51 (Author) replied:

Yes, I've changed it to 1 to simplify the test.
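For reference, a sketch of how the simplified end-to-end test could look after that change; `checkDataset` is assumed from Spark's QueryTest helpers, and the exact assertion is an illustration rather than a quote of the merged test:

```scala
test("SPARK-31750: eliminate UpCast if child's dataType is DecimalType") {
  withTempPath { f =>
    // The literal value no longer matters: the old failure was driven purely by the
    // precision/scale mismatch, decimal(38, 0) vs BigDecimal's default decimal(38, 18).
    sql("select cast(1 as decimal(38, 0)) as d")
      .write.mode("overwrite")
      .parquet(f.getAbsolutePath)
    checkDataset(spark.read.parquet(f.getAbsolutePath).as[BigDecimal], BigDecimal(1))
  }
}
```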

@SparkQA commented May 19, 2020

Test build #122838 has finished for PR 28572 at commit 8fe0490.

  • This patch fails from timeout after a configured wait of 400m.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented May 19, 2020

Test build #122842 has finished for PR 28572 at commit 6b70e77.

  • This patch fails from timeout after a configured wait of 400m.
  • This patch merges cleanly.
  • This patch adds no public classes.

@Ngone51 (Author) commented May 19, 2020

retest this please

@SparkQA commented May 19, 2020

Test build #122844 has finished for PR 28572 at commit bc0bbec.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented May 19, 2020

Test build #122850 has finished for PR 28572 at commit e7664a1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented May 19, 2020

Test build #122852 has finished for PR 28572 at commit e7664a1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member) commented May 20, 2020

Merged to master.

I think we can backport this to branch-3.0 if RC2 officially fails.

@HyukjinKwon (Member) commented:

Okay, seems already failed. I merged to branch-3.0 as well.

HyukjinKwon pushed a commit that referenced this pull request May 20, 2020
Closes #28572 from Ngone51/fix_encoder.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
@Ngone51 (Author) commented May 21, 2020

thanks all!
