Skip to content

Conversation

@ueshin
Copy link
Member

@ueshin ueshin commented May 4, 2023

What changes were proposed in this pull request?

Makes to call astype to the category type only when the arrow type is not provided.

Why are the changes needed?

Now that the minimum version of pyarrow is 1.0.0, a workaround for pandas' categorical type for pyarrow can be removed if the arrow type is provided.

Note: This can be removed once minimum pyarrow version is >= 0.16.1

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Existing tests.

Copy link
Member

@HyukjinKwon HyukjinKwon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome!

Copy link
Member

@BryanCutler BryanCutler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great, thanks for cleaning this up!

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, LGTM.

s = _convert_dict_to_map_items(s)
elif is_categorical_dtype(s.dtype):
# Note: This can be removed once minimum pyarrow version is >= 0.16.1
s = s.astype(s.dtypes.categories.dtype)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@BryanCutler Seems like if t is None, pa.Array.from_pandas(s, mask=mask, type=t, safe=self._safecheck) handles the categorical type as integer (tinyint as its code) instead of some type of s.dtypes.categories.dtype. Is that an expected behavior?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah, it will be dictionary<values=string, indices=int8, ordered=0> if we don't specify type.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so we can't remove it if t is None. or we need to support dictionary type.

Copy link
Member

@BryanCutler BryanCutler May 4, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From what I remember, for versions >= 0.16.1 pyarrow would automatically cast a categorical type to what the requested type of s was and the result was the correct type without this elif block. This was the comment where it was brought up #26585 (comment)

I remember testing it out locally without this, but that was quite a while ago so something might have changed. That could have also only been the case for when t is not None. So it makes sense to keep it for that case.

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you update the PR title and description accordingly, @ueshin ?

@ueshin ueshin changed the title [SPARK-43363][SQL][PYTHON] Remove a workaround for pandas categorical type for pyarrow [SPARK-43363][SQL][PYTHON] Make to call astype to the category type only when the arrow type is not provided May 4, 2023
@HyukjinKwon
Copy link
Member

Merged to master.

LuciferYang pushed a commit to LuciferYang/spark that referenced this pull request May 10, 2023
… only when the arrow type is not provided

### What changes were proposed in this pull request?

Makes to call `astype` to the category type only when the arrow type is not provided.

### Why are the changes needed?

Now that the minimum version of pyarrow is `1.0.0`, a workaround for pandas' categorical type for pyarrow can be removed if the arrow type is provided.

> Note: This can be removed once minimum pyarrow version is >= 0.16.1

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests.

Closes apache#41041 from ueshin/issues/SPARK-43363/categorical_type.

Authored-by: Takuya UESHIN <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants