Skip to content

Conversation

@maropu
Copy link
Member

@maropu maropu commented Jan 12, 2018

What changes were proposed in this pull request?

This pr fixed the issue when casting UserDefinedTypes into strings;

>>> from pyspark.ml.classification import MultilayerPerceptronClassifier
>>> from pyspark.ml.linalg import Vectors
>>> df = spark.createDataFrame([(0.0, Vectors.dense([0.0, 0.0])), (1.0, Vectors.dense([0.0, 1.0]))], ["label", "features"])
>>> df.selectExpr("CAST(features AS STRING)").show(truncate = False)
+-------------------------------------------+
|features                                   |
+-------------------------------------------+
|[6,1,0,0,2800000020,2,0,0,0]               |
|[6,1,0,0,2800000020,2,0,0,3ff0000000000000]|
+-------------------------------------------+

The root cause is that Cast handles input data as UserDefinedType.sqlType(this is underlying storage type), so we should pass data into UserDefinedType.deserialize then toString.
This pr modified the result into;

+---------+                                                                     
|features |
+---------+
|[0.0,0.0]|
|[0.0,1.0]|
+---------+

How was this patch tested?

Added tests in UserDefinedTypeSuite .

@ueshin
Copy link
Member

ueshin commented Jan 12, 2018

@maropu I guess you should elaborate the problem of the result string in the description.
The changes LGTM pending Jenkins, btw.

@maropu
Copy link
Member Author

maropu commented Jan 12, 2018

sure!

@SparkQA
Copy link

SparkQA commented Jan 12, 2018

Test build #86027 has finished for PR 20246 at commit 137d85f.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Copy link
Member Author

maropu commented Jan 12, 2018

retest this please

@SparkQA
Copy link

SparkQA commented Jan 12, 2018

Test build #86033 has finished for PR 20246 at commit 137d85f.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Copy link
Member Author

maropu commented Jan 12, 2018

retest this please

@SparkQA
Copy link

SparkQA commented Jan 12, 2018

Test build #86032 has finished for PR 20246 at commit 137d85f.

  • This patch fails from timeout after a configured wait of `250m`.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Copy link
Member Author

maropu commented Jan 12, 2018

retest this please

builder.append("]")
builder.build()
})
case udt: UserDefinedType[_] =>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this the only place we miss UDT in Cast?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what about cast UDT to other types?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, I'll check

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can strip UDT at L608 and fix this problem entirely.

Copy link
Member Author

@maropu maropu Jan 13, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here?

protected override def nullSafeEval(input: Any): Any = cast(input)

Since we can cast UDTs to other UDTs or strings only (by checking Cast.canCast), IIUC there is only a place to fix?

def canCast(from: DataType, to: DataType): Boolean = (from, to) match {

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i see, makes sense

@SparkQA
Copy link

SparkQA commented Jan 12, 2018

Test build #86034 has finished for PR 20246 at commit 137d85f.

  • This patch fails from timeout after a configured wait of `250m`.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jan 12, 2018

Test build #86036 has finished for PR 20246 at commit 137d85f.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Copy link
Member Author

maropu commented Jan 13, 2018

retest this please

@SparkQA
Copy link

SparkQA commented Jan 13, 2018

Test build #86070 has finished for PR 20246 at commit 137d85f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Copy link
Member

Retest this please.

@SparkQA
Copy link

SparkQA commented Jan 14, 2018

Test build #86126 has finished for PR 20246 at commit 137d85f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor

thanks, merging to master/2.3

asfgit pushed a commit that referenced this pull request Jan 15, 2018
…o String

## What changes were proposed in this pull request?
This pr fixed the issue when casting `UserDefinedType`s into strings;
```
>>> from pyspark.ml.classification import MultilayerPerceptronClassifier
>>> from pyspark.ml.linalg import Vectors
>>> df = spark.createDataFrame([(0.0, Vectors.dense([0.0, 0.0])), (1.0, Vectors.dense([0.0, 1.0]))], ["label", "features"])
>>> df.selectExpr("CAST(features AS STRING)").show(truncate = False)
+-------------------------------------------+
|features                                   |
+-------------------------------------------+
|[6,1,0,0,2800000020,2,0,0,0]               |
|[6,1,0,0,2800000020,2,0,0,3ff0000000000000]|
+-------------------------------------------+
```
The root cause is that `Cast` handles input data as `UserDefinedType.sqlType`(this is underlying storage type), so we should pass data into `UserDefinedType.deserialize` then `toString`.
This pr modified the result into;
```
+---------+
|features |
+---------+
|[0.0,0.0]|
|[0.0,1.0]|
+---------+
```

## How was this patch tested?
Added tests in `UserDefinedTypeSuite `.

Author: Takeshi Yamamuro <[email protected]>

Closes #20246 from maropu/SPARK-23054.

(cherry picked from commit b98ffa4)
Signed-off-by: Wenchen Fan <[email protected]>
@asfgit asfgit closed this in b98ffa4 Jan 15, 2018
asfgit pushed a commit that referenced this pull request Jan 19, 2018
…g PythonUserDefinedType to String.

## What changes were proposed in this pull request?

This is a follow-up of #20246.

If a UDT in Python doesn't have its corresponding Scala UDT, cast to string will be the raw string of the internal value, e.g. `"org.apache.spark.sql.catalyst.expressions.UnsafeArrayDataxxxxxxxx"` if the internal type is `ArrayType`.

This pr fixes it by using its `sqlType` casting.

## How was this patch tested?

Added a test and existing tests.

Author: Takuya UESHIN <[email protected]>

Closes #20306 from ueshin/issues/SPARK-23054/fup1.

(cherry picked from commit 568055d)
Signed-off-by: Wenchen Fan <[email protected]>
asfgit pushed a commit that referenced this pull request Jan 19, 2018
…g PythonUserDefinedType to String.

## What changes were proposed in this pull request?

This is a follow-up of #20246.

If a UDT in Python doesn't have its corresponding Scala UDT, cast to string will be the raw string of the internal value, e.g. `"org.apache.spark.sql.catalyst.expressions.UnsafeArrayDataxxxxxxxx"` if the internal type is `ArrayType`.

This pr fixes it by using its `sqlType` casting.

## How was this patch tested?

Added a test and existing tests.

Author: Takuya UESHIN <[email protected]>

Closes #20306 from ueshin/issues/SPARK-23054/fup1.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants