[SPARK-41209][PYTHON] Improve PySpark type inference in _merge_type method #38731
Conversation
@HyukjinKwon @xinrong-meng Could you review this PR? Thanks.

@HyukjinKwon I noticed that NullType in PySpark is on the list of atomic types even though it is not one; in fact, this is noted in the type's docstring. However, when I tried to remove it, I ran into test failures in …
Shall we elaborate …?

@xinrong-meng I have updated the PR description to clarify the user-facing change.

Merged to master.

Thanks @sadikovi!
### What changes were proposed in this pull request?

This PR updates the `_merge_type` method to allow upcasting from any `AtomicType` to `StringType`, similar to Cast.scala (https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala#L297). This allows us to avoid `TypeError` in cases where it is okay to merge types. For instance, the examples below used to fail with TypeError "Can not merge type ... and ..." but pass with this patch.

```python
spark.createDataFrame([[1.33, 1], ["2.1", 1]])
spark.createDataFrame([({1: True}, 1), ({"2": False}, 3)])
```

It also seems to be okay to merge map keys with different types, but I would like to call that out explicitly.
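The effect can also be seen directly on the merge helper itself. This is a minimal sketch assuming the internal `_merge_type` function in `pyspark.sql.types`; it is private API, so the import path and signature may differ across versions.

```python
# Illustrative sketch, not part of the PR: _merge_type is the private helper
# used by pyspark.sql.types during schema inference.
from pyspark.sql.types import DoubleType, StringType, _merge_type

# With this patch, merging an atomic type with StringType yields StringType
# instead of raising TypeError("Can not merge type ...").
merged = _merge_type(DoubleType(), StringType())
assert isinstance(merged, StringType)
```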
### Why are the changes needed?

This makes the behaviour between PySpark and Arrow execution more consistent. For example, Arrow can handle type upcasts while PySpark currently cannot.
### Does this PR introduce _any_ user-facing change?

Users may notice that examples with schema inference in PySpark DataFrames that used to fail with TypeError now run successfully. This is due to extended type merge handling when one of the values is of StringType.

When merging AtomicType values with StringType values, the final merged type will be StringType. For example, a combination of double and string values in a column would be cast to StringType:

```python
# Both values 1.33 and 2.1 will be cast to strings
spark.createDataFrame([[1.33, 1], ["2.1", 1]])
# Result:
# ["1.33", 1]
# ["2.1", 1]
```

Note that this also applies to nested types. For example,

Before:

```python
# Throws TypeError "Can not merge type"
spark.createDataFrame([({1: True}, 1), ({"2": False}, 3)])
```

After:

```python
spark.createDataFrame([({1: True}, 1), ({"2": False}, 3)])
# Result:
# {"1": true} 1
# {"2": false} 3
```
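If the upcast to strings is not what you want, you can bypass inference by passing an explicit schema. A small sketch using the standard `createDataFrame` API (existing PySpark behaviour, not introduced by this PR; the column names here are illustrative):

```python
# Pinning the schema keeps the first column as double instead of string;
# the data must then actually conform to the declared types.
df = spark.createDataFrame([[1.33, 1], [2.1, 1]], schema="a double, b long")
df.printSchema()
# root
#  |-- a: double (nullable = true)
#  |-- b: long (nullable = true)
```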
### How was this patch tested?
I updated the existing unit tests and added a couple of new ones to check that we can upcast to StringType.
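As a rough illustration of what such a test checks, here is a minimal sketch; the actual tests live in the PySpark test suite, and the names and assertions below are only representative (an active SparkSession named `spark` is assumed).

```python
# Mixed double/string input should now infer StringType instead of raising,
# and the double value should arrive as its string representation.
from pyspark.sql.types import StringType

df = spark.createDataFrame([[1.33, 1], ["2.1", 1]])
assert isinstance(df.schema.fields[0].dataType, StringType)
assert df.collect()[0][0] == "1.33"
```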