[SPARK-41209][PYTHON] Improve PySpark type inference in _merge_type method #38731
Conversation
@HyukjinKwon @xinrong-meng Could you review this PR? Thanks.

@HyukjinKwon I noticed that NullType in PySpark is on the list of atomic types even though it is not one; in fact, this is noted in the type's docstring. However, when I tried to remove it, I ran into test failures in …
Shall we elaborate …?

@xinrong-meng I have updated the PR description to clarify the user-facing change.

Merged to master.

Thanks @sadikovi!
### What changes were proposed in this pull request?

This PR updates the `_merge_type` method to allow upcasting from any `AtomicType` to `StringType`, similar to Cast.scala (https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala#L297). This allows us to avoid `TypeError` in cases where it is okay to merge types. For instance, the examples below used to fail with TypeError "Can not merge type ... and ..." but pass with this patch.

```python
spark.createDataFrame([[1.33, 1], ["2.1", 1]])
spark.createDataFrame([({1: True}, 1), ({"2": False}, 3)])
```

It also seems to be okay to merge map keys with different types, but I would like to call that out explicitly.
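The effect can also be seen directly on the merge helper itself. This is a minimal sketch assuming the internal `_merge_type` function in `pyspark.sql.types`; it is private API, so the import path and signature may differ across versions.

```python
# Illustrative sketch, not part of the PR: _merge_type is the private helper
# used by pyspark.sql.types during schema inference.
from pyspark.sql.types import DoubleType, StringType, _merge_type

# With this patch, merging an atomic type with StringType yields StringType
# instead of raising TypeError("Can not merge type ...").
merged = _merge_type(DoubleType(), StringType())
assert isinstance(merged, StringType)
```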
### Why are the changes needed?

This makes the behaviour between PySpark and Arrow execution more consistent. For example, Arrow can handle type upcasts while PySpark currently cannot.
### Does this PR introduce _any_ user-facing change?

Users may notice that examples with schema inference in PySpark DataFrames that used to fail with TypeError now run successfully. This is due to extended type merge handling when one of the values is of StringType.

When merging AtomicType values with StringType values, the final merged type will be StringType. For example, a combination of double and string values in a column would be cast to StringType:

```python
# Both values 1.33 and 2.1 will be cast to strings
spark.createDataFrame([[1.33, 1], ["2.1", 1]])
# Result:
# ["1.33", 1]
# ["2.1", 1]
```

Note that this also applies to nested types. For example,

Before:

```python
# Throws TypeError "Can not merge type"
spark.createDataFrame([({1: True}, 1), ({"2": False}, 3)])
```

After:

```python
spark.createDataFrame([({1: True}, 1), ({"2": False}, 3)])
# Result:
# {"1": true} 1
# {"2": false} 3
```
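If the upcast to strings is not what you want, you can bypass inference by passing an explicit schema. A small sketch using the standard `createDataFrame` API (existing PySpark behaviour, not introduced by this PR; the column names here are illustrative):

```python
# Pinning the schema keeps the first column as double instead of string;
# the data must then actually conform to the declared types.
df = spark.createDataFrame([[1.33, 1], [2.1, 1]], schema="a double, b long")
df.printSchema()
# root
#  |-- a: double (nullable = true)
#  |-- b: long (nullable = true)
```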
### How was this patch tested?
I updated the existing unit tests and added a couple of new ones to check that we can upcast to StringType.
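As a rough illustration of what such a test checks, here is a minimal sketch; the actual tests live in the PySpark test suite, and the names and assertions below are only representative (an active SparkSession named `spark` is assumed).

```python
# Mixed double/string input should now infer StringType instead of raising,
# and the double value should arrive as its string representation.
from pyspark.sql.types import StringType

df = spark.createDataFrame([[1.33, 1], ["2.1", 1]])
assert isinstance(df.schema.fields[0].dataType, StringType)
assert df.collect()[0][0] == "1.33"
```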