[SPARK-23054][SQL] Fix incorrect results of casting UserDefinedType to String #20246
Conversation
@maropu I guess you should elaborate on the problem of the result string in the description.

sure!

Test build #86027 has finished for PR 20246 at commit

retest this please

Test build #86033 has finished for PR 20246 at commit

retest this please

Test build #86032 has finished for PR 20246 at commit

retest this please
```
    builder.append("]")
    builder.build()
  })
case udt: UserDefinedType[_] =>
```
Is this the only place we miss UDT in Cast?
what about cast UDT to other types?
ok, I'll check
I think we can strip UDT at L608 and fix this problem entirely.
Here?

spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala (Line 608 in cd9f49a):

```
protected override def nullSafeEval(input: Any): Any = cast(input)
```

Since we can cast UDTs only to other UDTs or to strings (checked by `Cast.canCast`), IIUC there is only one place to fix?

spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala (Line 36 in cd9f49a):

```
def canCast(from: DataType, to: DataType): Boolean = (from, to) match {
```
i see, makes sense
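The rule discussed above can be sketched in a few lines. This is a hedged Python stand-in for the Scala pattern match in `Cast.canCast`, not the actual Catalyst code; the class and type names here are illustrative only. The idea: a UDT may be cast to a string, or to a UDT with a matching user-facing class, and nothing else.

```python
# Hypothetical sketch of a canCast-style check for UDTs (illustrative
# names; the real logic lives in Scala in Cast.canCast).
class StringType:
    pass

class UserDefinedType:
    def __init__(self, user_class, sql_type):
        self.user_class = user_class  # the user-facing class, e.g. a vector
        self.sql_type = sql_type      # the underlying storage type

def can_cast(from_type, to_type):
    # UDT -> string is allowed; UDT -> UDT only when the user-facing
    # classes line up. Other UDT targets are rejected here.
    if isinstance(from_type, UserDefinedType):
        if isinstance(to_type, StringType):
            return True
        if isinstance(to_type, UserDefinedType):
            return from_type.user_class == to_type.user_class
        return False
    return True  # non-UDT cases would fall through to the other rules

vector_udt = UserDefinedType("DenseVector", "ArrayType(DoubleType)")
matrix_udt = UserDefinedType("Matrix", "ArrayType(DoubleType)")

print(can_cast(vector_udt, StringType()))  # True
print(can_cast(vector_udt, vector_udt))    # True
print(can_cast(vector_udt, matrix_udt))    # False
```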
Test build #86034 has finished for PR 20246 at commit

Test build #86036 has finished for PR 20246 at commit

retest this please

Test build #86070 has finished for PR 20246 at commit

Retest this please.

Test build #86126 has finished for PR 20246 at commit

thanks, merging to master/2.3
…o String
## What changes were proposed in this pull request?
This PR fixes incorrect results when casting `UserDefinedType`s to strings;
```
>>> from pyspark.ml.classification import MultilayerPerceptronClassifier
>>> from pyspark.ml.linalg import Vectors
>>> df = spark.createDataFrame([(0.0, Vectors.dense([0.0, 0.0])), (1.0, Vectors.dense([0.0, 1.0]))], ["label", "features"])
>>> df.selectExpr("CAST(features AS STRING)").show(truncate=False)
+-------------------------------------------+
|features |
+-------------------------------------------+
|[6,1,0,0,2800000020,2,0,0,0] |
|[6,1,0,0,2800000020,2,0,0,3ff0000000000000]|
+-------------------------------------------+
```
The root cause is that `Cast` handles the input data as `UserDefinedType.sqlType` (the underlying storage type), so we should pass the data into `UserDefinedType.deserialize` and then call `toString`.
This PR modifies the result into;
```
+---------+
|features |
+---------+
|[0.0,0.0]|
|[0.0,1.0]|
+---------+
```
## How was this patch tested?
Added tests in `UserDefinedTypeSuite`.
Author: Takeshi Yamamuro <[email protected]>
Closes #20246 from maropu/SPARK-23054.
(cherry picked from commit b98ffa4)
Signed-off-by: Wenchen Fan <[email protected]>
…g PythonUserDefinedType to String.

## What changes were proposed in this pull request?
This is a follow-up of #20246. If a UDT in Python doesn't have its corresponding Scala UDT, cast to string will be the raw string of the internal value, e.g. `"org.apache.spark.sql.catalyst.expressions.UnsafeArrayDataxxxxxxxx"` if the internal type is `ArrayType`. This PR fixes it by casting via its `sqlType` instead.

## How was this patch tested?
Added a test and existing tests.

Author: Takuya UESHIN <[email protected]>
Closes #20306 from ueshin/issues/SPARK-23054/fup1.
(cherry picked from commit 568055d)
Signed-off-by: Wenchen Fan <[email protected]>
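The follow-up's idea can be sketched as follows. This is a hedged Python stand-in, not the actual Catalyst code; `InternalArrayData` and `cast_udt_to_string` are hypothetical names. The point: when a Python-only UDT has no Scala counterpart to deserialize with, format the value via its `sqlType` representation rather than printing the raw internal object.

```python
# Hypothetical sketch of the follow-up fix: with no Scala deserializer
# available, stringify the value as its sqlType (an array here) instead
# of the raw internal object's repr.
class InternalArrayData:
    """Stand-in for UnsafeArrayData: an opaque internal container."""

    def __init__(self, values):
        self._values = list(values)

    def to_list(self):
        return list(self._values)

def cast_udt_to_string(value, scala_udt=None):
    if scala_udt is not None:
        # Scala UDT available: deserialize, then stringify (PR #20246).
        return str(scala_udt.deserialize(value))
    # No Scala UDT: format via the sqlType representation (PR #20306),
    # not the raw internal object, whose repr would look like
    # "<__main__.InternalArrayData object at 0x...>".
    return "[" + ",".join(str(v) for v in value.to_list()) + "]"

data = InternalArrayData([0.0, 1.0])
print(cast_udt_to_string(data))  # "[0.0,1.0]"
```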