Conversation

@Ngone51 (Member) commented May 18, 2020

What changes were proposed in this pull request?

Eliminate the `UpCast` if its child's data type is already a decimal type.
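In code terms, this is a new case in the analyzer's `ResolveUpCast` rule (the rule visible in the stack trace below). Here is a minimal sketch of its shape, assembled from the diff hunks quoted in the review comments later in this thread; the guard on `target` is an assumption (the reviewer's nit elides it as `...`), not a quote of the merged code:

```scala
// Sketch, not the verbatim Spark source: when upcasting to the generic
// `object DecimalType` and the child is already some DecimalType(p, s),
// drop the UpCast and keep the child's precision/scale as-is.
case u @ UpCast(child, target, walkedTypePath) if target == DecimalType
    && child.dataType.isInstanceOf[DecimalType] =>
  assert(walkedTypePath.nonEmpty,
    "object DecimalType should only be used inside ExpressionEncoder")
  child
```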

Why are the changes needed?

While deserializing an internal `Decimal` value to an external `BigDecimal` (Java/Scala) value, Spark should also respect the `Decimal`'s precision and scale; otherwise it causes precision loss and behaves unexpectedly in some cases, e.g.:

sql("select cast(11111111111111111111111111111111111111 as decimal(38, 0)) as d")
  .write.mode("overwrite")
  .parquet(f.getAbsolutePath)

// can fail
spark.read.parquet(f.getAbsolutePath).as[BigDecimal]
[info]   org.apache.spark.sql.AnalysisException: Cannot up cast `d` from decimal(38,0) to decimal(38,18).
[info] The type path of the target object is:
[info] - root class: "scala.math.BigDecimal"
[info] You can either add an explicit cast to the input data or choose a higher precision type of the field in the target object;
[info]   at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast$.org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveUpCast$$fail(Analyzer.scala:3060)
[info]   at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast$$anonfun$apply$33$$anonfun$applyOrElse$174.applyOrElse(Analyzer.scala:3087)
[info]   at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast$$anonfun$apply$33$$anonfun$applyOrElse$174.applyOrElse(Analyzer.scala:3071)
[info]   at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$1(TreeNode.scala:309)
[info]   at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:72)
[info]   at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:309)
[info]   at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$3(TreeNode.scala:314)

Does this PR introduce any user-facing change?

Yes. The cases mentioned above (which hit precision loss) failed before this change but run successfully after it.

How was this patch tested?

Added tests.

@Ngone51 (Author) commented May 18, 2020

cc @cloud-fan @HyukjinKwon

@SparkQA commented May 18, 2020

Test build #122812 has finished for PR 28572 at commit b137ec4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented May 19, 2020

Test build #122833 has finished for PR 28572 at commit b4eb291.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member) commented:

Looks good to me

&& child.dataType.isInstanceOf[DecimalType] =>
assert(walkedTypePath.nonEmpty,
"object DecimalType should only be used inside ExpressionEncoder")
// SPARK-31750: for the case where data type is explicitly known, e.g, spark.read
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit:

SPARK-31750: if we want to upcast to the general decimal type, and the `child` is already
decimal type, we can remove the `Upcast` and accept any precision/scale.
This can happen for cases like `spark.read.parquet("/tmp/file").as[BigDecimal]`.

// eliminate the UpCast here to avoid precision lost.
child

Review comment on the pattern match:

```scala
case u @ UpCast(child, _, _)
```

Contributor commented:

nit: `case Upcast(child, target: AtomicType, _) if ...`

Review comment on the `UpCast` scaladoc:

```scala
 * Cast the child expression to the target data type, but will throw error if the cast might
 * truncate, e.g. long -> int, timestamp -> date.
 *
 * Note that UpCast will be eliminated if the child's dataType is already DecimalType and
```

Contributor commented:

We can simplify the doc:

> Note: `target` is `AbstractDataType`, so that we can put `object DecimalType`, which means
> we accept `DecimalType` with any valid precision/scale.
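For context, a sketch of the `UpCast` node the suggested doc describes; the shape follows this thread, but the exact supertraits here are assumptions, not quotes of the Spark source:

```scala
// Sketch: `target` is AbstractDataType rather than DataType, so callers can pass
// either a concrete type (e.g. IntegerType) or the generic `object DecimalType`,
// which means "accept DecimalType with any valid precision/scale".
case class UpCast(child: Expression, target: AbstractDataType, walkedTypePath: Seq[String] = Nil)
  extends UnaryExpression with Unevaluable {
  // UpCast exists only during analysis; ResolveUpCast replaces it with a real Cast
  // (or, after this PR, removes it when the child is already a decimal).
  override lazy val resolved = false
}
```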

test("SPARK-31750: eliminate UpCast if child's dataType is DecimalType") {
val encoder = ExpressionEncoder[Seq[BigDecimal]]
val attr = Seq(AttributeReference("a", ArrayType(DecimalType(38, 0)))())
// previously, it will fail because Decimal(38, 0) can not be casted to Decimal(38, 18)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

previously -> Before SPARK-31750
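A plausible completion of that truncated test, for readers following the diff; the `resolveAndBind` call is an assumption about how the test drives analysis, not a quote from the PR:

```scala
test("SPARK-31750: eliminate UpCast if child's dataType is DecimalType") {
  val encoder = ExpressionEncoder[Seq[BigDecimal]]
  // Input column is array<decimal(38, 0)>, while the encoder's deserializer targets
  // scala.math.BigDecimal, whose default Catalyst mapping is decimal(38, 18).
  val attr = Seq(AttributeReference("a", ArrayType(DecimalType(38, 0)))())
  // Before SPARK-31750, resolution failed because decimal(38, 0)
  // could not be up-cast to decimal(38, 18).
  encoder.resolveAndBind(attr)
}
```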


test("SPARK-31750: eliminate UpCast if child's dataType is DecimalType") {
withTempPath { f =>
sql("select cast(11111111111111111111111111111111111111 as decimal(38, 0)) as d")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this test can still reproduce the bug even if we use 1 instead of 1111...?

@Ngone51 (Author) replied:

Yes. It depends on the precision/scale rather than the value itself.

@Ngone51 (Author) added:

I can make it shorter.

test("SPARK-31750: eliminate UpCast if child's dataType is DecimalType") {
val encoder = ExpressionEncoder[Seq[BigDecimal]]
val attr = Seq(AttributeReference("a", ArrayType(DecimalType(38, 0)))())
// previously, it will fail because Decimal(38, 0) can not be casted to Decimal(38, 18)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

previously -> Before SPARK-31750


test("SPARK-31750: eliminate UpCast if child's dataType is DecimalType") {
withTempPath { f =>
sql("select cast(11111111111111111111111111111111111111 as decimal(38, 0)) as d")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this test can still reproduce the bug even if we use 1 instead of 1111...?

@Ngone51 (Author) replied:

Yes, I've changed it to 1 to simplify the test.
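For reference, a sketch of how the simplified end-to-end test could look after that change; `checkDataset` is assumed from Spark's QueryTest helpers, and the exact assertion is an illustration rather than a quote of the merged test:

```scala
test("SPARK-31750: eliminate UpCast if child's dataType is DecimalType") {
  withTempPath { f =>
    // The literal value no longer matters: the old failure was driven purely by the
    // precision/scale mismatch, decimal(38, 0) vs BigDecimal's default decimal(38, 18).
    sql("select cast(1 as decimal(38, 0)) as d")
      .write.mode("overwrite")
      .parquet(f.getAbsolutePath)
    checkDataset(spark.read.parquet(f.getAbsolutePath).as[BigDecimal], BigDecimal(1))
  }
}
```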

@SparkQA commented May 19, 2020

Test build #122838 has finished for PR 28572 at commit 8fe0490.

  • This patch fails from timeout after a configured wait of 400m.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented May 19, 2020

Test build #122842 has finished for PR 28572 at commit 6b70e77.

  • This patch fails from timeout after a configured wait of 400m.
  • This patch merges cleanly.
  • This patch adds no public classes.

@Ngone51 (Author) commented May 19, 2020

retest this please

@SparkQA commented May 19, 2020

Test build #122844 has finished for PR 28572 at commit bc0bbec.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented May 19, 2020

Test build #122850 has finished for PR 28572 at commit e7664a1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented May 19, 2020

Test build #122852 has finished for PR 28572 at commit e7664a1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member) commented May 20, 2020

Merged to master.

I think we can backport this to branch-3.0 if RC2 officially fails.

@HyukjinKwon (Member) commented:

Okay, seems already failed. I merged to branch-3.0 as well.

HyukjinKwon pushed a commit that referenced this pull request May 20, 2020
Closes #28572 from Ngone51/fix_encoder.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
@Ngone51 (Author) commented May 21, 2020

thanks all!
