Conversation

@sadikovi (Contributor) commented Nov 21, 2022

### What changes were proposed in this pull request?

This PR updates the `_merge_type` method to allow upcasting from any `AtomicType` to `StringType`, similar to `Cast.scala` (https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala#L297).

This allows us to avoid `TypeError` in cases where it is safe to merge types.
For instance, the examples below used to fail with `TypeError: Can not merge type ... and ...` but pass with this patch.

```python
spark.createDataFrame([[1.33, 1], ["2.1", 1]])

spark.createDataFrame([({1: True}, 1), ({"2": False}, 3)])
```

Note that this also allows merging map keys of different types, which seems acceptable, but I would like to call it out explicitly.
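To make the rule concrete, here is a simplified sketch of the upcast behaviour described above. It is illustrative only, not PySpark's actual `_merge_type` implementation: types are modeled as plain strings, and the function names are my own.

```python
# Illustrative model of the merge rule: any two equal types merge to
# themselves, and any atomic type merged with a string upcasts to string,
# mirroring Cast.scala's implicit cast to StringType.
ATOMIC_TYPES = {"boolean", "bigint", "double", "string"}

def merge_types(a: str, b: str) -> str:
    if a == b:
        return a
    if a in ATOMIC_TYPES and b in ATOMIC_TYPES and "string" in (a, b):
        # New rule introduced by this patch: upcast to string.
        return "string"
    raise TypeError(f"Can not merge type {a} and {b}")
```

Under this model, `merge_types("double", "string")` yields `"string"`, while merging two incompatible non-string atomics (e.g. `"double"` and `"boolean"`) still raises `TypeError`, which matches the behaviour described in this PR.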

### Why are the changes needed?

This makes the behaviour of PySpark and Arrow execution more consistent. For example, Arrow can handle such type upcasts while PySpark currently cannot.

### Does this PR introduce any user-facing change?

Users may notice that examples relying on schema inference in PySpark DataFrames, which used to fail with `TypeError`, now run successfully. This is due to the extended type merge handling when one of the values is of `StringType`.

When merging `AtomicType` values with `StringType` values, the final merged type will be `StringType`. For example, a combination of double and string values in a column will be cast to `StringType`:

```python
# Both values 1.33 and 2.1 will be cast to strings
spark.createDataFrame([[1.33, 1], ["2.1", 1]])

# Result:
# ["1.33", 1]
# ["2.1", 1]
```
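The value-level effect of the merge can be sketched with a hypothetical helper (the name and shape are my own, not PySpark's): once a column's merged type is `StringType`, non-string atomic values are rendered as strings. Note that Spark's own string rendering of some types may differ in detail.

```python
def coerce_to_merged_string(value):
    """Render a value as a string when the column's merged type is StringType.

    Hypothetical helper for illustration only; not part of PySpark's API.
    """
    return value if isinstance(value, str) else str(value)

# First column of the example rows, where double and string were merged:
column = [1.33, "2.1"]
coerced = [coerce_to_merged_string(v) for v in column]  # ["1.33", "2.1"]
```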

Note that this also applies to nested types. For example,

Before:

```python
# Throws TypeError "Can not merge type"
spark.createDataFrame([({1: True}, 1), ({"2": False}, 3)])
```

After:

```python
spark.createDataFrame([({1: True}, 1), ({"2": False}, 3)])

# Result:
# {"1": true}  1
# {"2": false}  3
```
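How the upcast propagates into nested types can be sketched with a small recursive merge (again illustrative, not PySpark's code): types are modeled as tuples, `("atomic", name)` or `("map", key_type, value_type)`, and map key and value types are merged component-wise.

```python
def merge(a, b):
    """Merge two modeled types, upcasting atomics to string and recursing
    into map key and value types. Illustrative sketch only."""
    if a == b:
        return a
    if a[0] == "atomic" and b[0] == "atomic" and ("atomic", "string") in (a, b):
        return ("atomic", "string")
    if a[0] == "map" and b[0] == "map":
        return ("map", merge(a[1], b[1]), merge(a[2], b[2]))
    raise TypeError(f"Can not merge type {a} and {b}")
```

In this model, merging `map<bigint, boolean>` with `map<string, boolean>` yields `map<string, boolean>`, matching the `{1: True}` / `{"2": False}` example above where the integer map key is upcast to a string.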

### How was this patch tested?

I updated the existing unit tests and added a couple of new ones to check that we can upcast to StringType.

@sadikovi (Contributor, Author) commented:
@HyukjinKwon @xinrong-meng Could you review this PR? Thanks.

@sadikovi sadikovi changed the title [SPARK-41209] Improve PySpark type inference in _merge_type method [SPARK-41209][PYSPARK] Improve PySpark type inference in _merge_type method Nov 21, 2022
@sadikovi (Contributor, Author) commented:
@HyukjinKwon I noticed that `NullType` in PySpark is on the list of atomic types even though it is not atomic; in fact, this is mentioned in the type's docstring. However, when I tried to remove it, I encountered test failures in `pyspark.pandas.tests.indexes.test_base`. I am going to keep it as is, but could you comment on why `NullType` was added to the `_atomic_types` list? Thanks.
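For context on why `NullType` participates in merging at all: when inferring a merged type, a `NullType` (e.g. from `None` values) is typically absorbed by any concrete type, since it does not constrain the result. A minimal sketch of that behaviour, using string-modeled types as an assumption rather than PySpark's actual code:

```python
def merge_with_null(a: str, b: str) -> str:
    """Merge two modeled types, treating "null" as a placeholder that is
    absorbed by any concrete type. Illustrative sketch only."""
    if a == "null":
        return b
    if b == "null":
        return a
    if a == b:
        return a
    raise TypeError(f"Can not merge type {a} and {b}")
```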

@xinrong-meng (Member) commented Nov 22, 2022

Shall we elaborate on *Does this PR introduce any user-facing change?* (with an example)? The change might be included in the 3.4 release notes.

@xinrong-meng xinrong-meng changed the title [SPARK-41209][PYSPARK] Improve PySpark type inference in _merge_type method [SPARK-41209][PYTHON] Improve PySpark type inference in _merge_type method Nov 22, 2022
@sadikovi (Contributor, Author) commented:
@xinrong-meng I have updated the PR description to clarify the user-facing change.

@HyukjinKwon (Member) commented:
Merged to master.

@xinrong-meng (Member) commented:
Thanks @sadikovi !

SandishKumarHN pushed a commit to SandishKumarHN/spark that referenced this pull request Dec 12, 2022
…ethod


Closes apache#38731 from sadikovi/pyspark-type-inference.

Authored-by: Ivan Sadikov <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
beliefer pushed a commit to beliefer/spark that referenced this pull request Dec 15, 2022
…ethod

beliefer pushed a commit to beliefer/spark that referenced this pull request Dec 18, 2022
…ethod
