[SPARK-31750][SQL] Eliminate UpCast if child's dataType is DecimalType #28572
Conversation
Test build #122812 has finished for PR 28572 at commit
Test build #122833 has finished for PR 28572 at commit
Looks good to me
```
    && child.dataType.isInstanceOf[DecimalType] =>
  assert(walkedTypePath.nonEmpty,
    "object DecimalType should only be used inside ExpressionEncoder")
  // SPARK-31750: for the case where data type is explicitly known, e.g, spark.read
```
nit:
SPARK-31750: if we want to upcast to the general decimal type, and the `child` is already
decimal type, we can remove the `Upcast` and accept any precision/scale.
This can happen for cases like `spark.read.parquet("/tmp/file").as[BigDecimal]`.
```
  // eliminate the UpCast here to avoid precision lost.
  child
```
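Put together with the reviewer's wording, the whole match arm might read as follows. This is a hedged sketch based only on the hunks quoted above, not the exact merged code:

```
case UpCast(child, target, walkedTypePath) if target == DecimalType &&
    child.dataType.isInstanceOf[DecimalType] =>
  // `object DecimalType` as the target means "any valid precision/scale",
  // which only the ExpressionEncoder deserializer path should request.
  assert(walkedTypePath.nonEmpty,
    "object DecimalType should only be used inside ExpressionEncoder")
  // SPARK-31750: if we want to upcast to the general decimal type and the
  // child is already a decimal type, remove the UpCast and accept any
  // precision/scale, e.g. spark.read.parquet("/tmp/file").as[BigDecimal].
  child
```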
```
case u @ UpCast(child, _, _)
```
nit: `case UpCast(child, target: AtomicType, _) if ...`
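Spelled out, the suggested pattern narrows this arm to atomic target types, so the general `object DecimalType` target is handled by the decimal arm above. A hedged sketch with an illustrative guard (the PR's actual predicate isn't shown in this thread):

```
case UpCast(child, target: AtomicType, _) if child.dataType == StringType =>
  // Only atomic targets reach this arm; casting to the nullable variant
  // avoids spurious nullability mismatches.
  Cast(child, target.asNullable)
```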
```
 * Cast the child expression to the target data type, but will throw error if the cast might
 * truncate, e.g. long -> int, timestamp -> date.
 *
 * Note that UpCast will be eliminated if the child's dataType is already DecimalType and
```
We can simplify the doc:
Note: `target` is `AbstractDataType`, so that we can put `object DecimalType`, which means we accept
`DecimalType` with any valid precision/scale.
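Concretely, the widened signature might look like this; a hedged sketch assuming the Spark 3.0-era shape of `UpCast` in Cast.scala:

```
case class UpCast(child: Expression, target: AbstractDataType,
    walkedTypePath: Seq[String] = Nil)
  extends UnaryExpression with Unevaluable {

  // UpCast never survives analysis: ResolveUpCast either turns it into a
  // real Cast or, after this PR, eliminates it for decimal children.
  override lazy val resolved = false

  override def dataType: DataType = target match {
    // The general decimal target stands for "any precision/scale"; report
    // the system default until the analyzer resolves the concrete type.
    case DecimalType => DecimalType.SYSTEM_DEFAULT
    case dt: DataType => dt
  }
}
```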
| test("SPARK-31750: eliminate UpCast if child's dataType is DecimalType") { | ||
| val encoder = ExpressionEncoder[Seq[BigDecimal]] | ||
| val attr = Seq(AttributeReference("a", ArrayType(DecimalType(38, 0)))()) | ||
| // previously, it will fail because Decimal(38, 0) can not be casted to Decimal(38, 18) |
previously -> Before SPARK-31750
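A hedged completion of the truncated hunk, assuming the test body simply checks that `resolveAndBind` no longer throws:

```
test("SPARK-31750: eliminate UpCast if child's dataType is DecimalType") {
  val encoder = ExpressionEncoder[Seq[BigDecimal]]
  val attr = Seq(AttributeReference("a", ArrayType(DecimalType(38, 0)))())
  // Before SPARK-31750, this threw "Cannot up cast ..." because the encoder
  // asked to upcast Decimal(38, 0) to the default Decimal(38, 18).
  encoder.resolveAndBind(attr)
}
```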
| test("SPARK-31750: eliminate UpCast if child's dataType is DecimalType") { | ||
| withTempPath { f => | ||
| sql("select cast(11111111111111111111111111111111111111 as decimal(38, 0)) as d") |
Can this test still reproduce the bug even if we use 1 instead of 1111...?
Yes. It depends on the precision/scale rather than the value itself.
I can make it shorter.
| test("SPARK-31750: eliminate UpCast if child's dataType is DecimalType") { | ||
| val encoder = ExpressionEncoder[Seq[BigDecimal]] | ||
| val attr = Seq(AttributeReference("a", ArrayType(DecimalType(38, 0)))()) | ||
| // previously, it will fail because Decimal(38, 0) can not be casted to Decimal(38, 18) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
previously -> Before SPARK-31750
|
|
||
| test("SPARK-31750: eliminate UpCast if child's dataType is DecimalType") { | ||
| withTempPath { f => | ||
| sql("select cast(11111111111111111111111111111111111111 as decimal(38, 0)) as d") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this test can still reproduce the bug even if we use 1 instead of 1111...?
Yes, I've changed it to 1 to simplify the test.
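A hedged sketch of the simplified round trip with the literal `1` (only the declared `decimal(38, 0)` schema matters, not the magnitude of the value):

```
test("SPARK-31750: eliminate UpCast if child's dataType is DecimalType") {
  withTempPath { f =>
    // decimal(38, 0) in the file schema is what used to trigger the
    // "Cannot up cast" analysis error, regardless of the stored value.
    sql("select cast(1 as decimal(38, 0)) as d")
      .write.mode("overwrite")
      .parquet(f.getAbsolutePath)

    val ds = spark.read.parquet(f.getAbsolutePath).as[BigDecimal]
    assert(ds.collect().toSeq == Seq(BigDecimal(1)))
  }
}
```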
Test build #122838 has finished for PR 28572 at commit

Test build #122842 has finished for PR 28572 at commit

retest this please

Test build #122844 has finished for PR 28572 at commit

Test build #122850 has finished for PR 28572 at commit

Test build #122852 has finished for PR 28572 at commit

Merged to master. I think we can backport this to branch-3.0 if RC2 officially fails.

Okay, it seems RC2 already failed. I merged to branch-3.0 as well.
### What changes were proposed in this pull request?
Eliminate the `UpCast` if its child's data type is already a decimal type.
### Why are the changes needed?
While deserializing an internal `Decimal` value to an external (Java/Scala) `BigDecimal` value, Spark should also respect the `Decimal`'s precision and scale; otherwise precision is lost and results look weird in some cases, e.g.:
```
sql("select cast(11111111111111111111111111111111111111 as decimal(38, 0)) as d")
.write.mode("overwrite")
.parquet(f.getAbsolutePath)
// can fail
spark.read.parquet(f.getAbsolutePath).as[BigDecimal]
```
```
[info] org.apache.spark.sql.AnalysisException: Cannot up cast `d` from decimal(38,0) to decimal(38,18).
[info] The type path of the target object is:
[info] - root class: "scala.math.BigDecimal"
[info] You can either add an explicit cast to the input data or choose a higher precision type of the field in the target object;
[info] at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast$.org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveUpCast$$fail(Analyzer.scala:3060)
[info] at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast$$anonfun$apply$33$$anonfun$applyOrElse$174.applyOrElse(Analyzer.scala:3087)
[info] at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast$$anonfun$apply$33$$anonfun$applyOrElse$174.applyOrElse(Analyzer.scala:3071)
[info] at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$1(TreeNode.scala:309)
[info] at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:72)
[info] at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:309)
[info] at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$3(TreeNode.scala:314)
```
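After this change the same round trip succeeds; a hedged continuation of the snippet above (expected behavior, not output captured from the PR):

```
// The analyzer now drops the UpCast-to-general-decimal, so this resolves
// and keeps the original decimal(38, 0) value intact.
val ds = spark.read.parquet(f.getAbsolutePath).as[BigDecimal]
ds.head  // 11111111111111111111111111111111111111 (no precision loss)
```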
### Does this PR introduce _any_ user-facing change?
Yes. The cases mentioned above (which suffered precision loss) failed before this change and run successfully after it.
### How was this patch tested?
Added tests.
Closes #28572 from Ngone51/fix_encoder.
Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
thanks all!