[SPARK-43363][SQL][PYTHON] Make to call astype to the category type only when the arrow type is not provided
#41041
Conversation
HyukjinKwon
left a comment
Awesome!
BryanCutler
left a comment
Great, thanks for cleaning this up!
dongjoon-hyun
left a comment
+1, LGTM.
        s = _convert_dict_to_map_items(s)
    elif is_categorical_dtype(s.dtype):
        # Note: This can be removed once minimum pyarrow version is >= 0.16.1
        s = s.astype(s.dtypes.categories.dtype)
@BryanCutler It seems that if `t` is `None`, `pa.Array.from_pandas(s, mask=mask, type=t, safe=self._safecheck)` treats the categorical type as an integer type (tinyint, the type of its codes) instead of the type given by `s.dtypes.categories.dtype`. Is that the expected behavior?
Ah, it will be `dictionary<values=string, indices=int8, ordered=0>` if we don't specify the type.
So we can't remove it if `t` is `None`, or we need to support the dictionary type.
From what I remember, for versions >= 0.16.1 pyarrow would automatically cast a categorical type to what the requested type of s was and the result was the correct type without this elif block. This was the comment where it was brought up #26585 (comment)
I remember testing it out locally without this, but that was quite a while ago so something might have changed. That could have also only been the case for when t is not None. So it makes sense to keep it for that case.
dongjoon-hyun
left a comment
Could you update the PR title and description accordingly, @ueshin ?
Merged to master.
… only when the arrow type is not provided

### What changes were proposed in this pull request?

Makes to call `astype` to the category type only when the arrow type is not provided.

### Why are the changes needed?

Now that the minimum version of pyarrow is `1.0.0`, a workaround for pandas' categorical type for pyarrow can be removed if the arrow type is provided.

> Note: This can be removed once minimum pyarrow version is >= 0.16.1

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests.

Closes apache#41041 from ueshin/issues/SPARK-43363/categorical_type.

Authored-by: Takuya UESHIN <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
What changes were proposed in this pull request?
Calls `astype` to the category type only when the arrow type is not provided.
Why are the changes needed?
Now that the minimum version of pyarrow is `1.0.0`, the workaround for pandas' categorical type can be removed when the arrow type is provided.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Existing tests.
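The change described above can be sketched roughly as follows. The helper name `prepare_series` and its signature are hypothetical and only illustrate the condition; this is not the actual Spark serializer code, which is not shown in full here:

```python
import pandas as pd


def prepare_series(s, arrow_type=None):
    """Hypothetical helper: cast a categorical series to its categories'
    dtype only when no Arrow type was requested."""
    if arrow_type is None and isinstance(s.dtype, pd.CategoricalDtype):
        # Without a target type pyarrow would infer a dictionary type,
        # so fall back to the plain category values.
        return s.astype(s.dtype.categories.dtype)
    # With an explicit Arrow type, pyarrow is left to do the conversion.
    return s


# Illustrative usage: the second argument stands in for any non-None type.
cat = pd.Series(["a", "b", "a"], dtype="category")
without_type = prepare_series(cat)
with_type = prepare_series(cat, arrow_type="string")
```

Here `without_type` is no longer categorical, while `with_type` is returned unchanged.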