-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-31964][PYTHON] Use Pandas is_categorical on Arrow category type conversion #28793
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-31964][PYTHON] Use Pandas is_categorical on Arrow category type conversion #28793
Conversation
|
@HyukjinKwon I though we could avoid any categorical imports by comparing the dtype with strings, e.g. |
HyukjinKwon
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I was thinking this way at SPARK-31963. LGTM
| from pandas import CategoricalDtype | ||
| except ImportError: | ||
| from pandas.api.types import CategoricalDtype | ||
| from pandas.api.types import is_categorical |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, nice!
dongjoon-hyun
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, LGTM. Thanks.
|
BTW, @BryanCutler and @HyukjinKwon . |
|
cc @williamhyun , too. |
|
Test build #123800 has finished for PR 28793 at commit
|
|
@dongjoon-hyun, do you mean you tested ➜ ~ pip list | grep pandas
pandas 0.23.2 /usr/local/lib/python3.7/site-packages➜ python git:(d2030cf4629) ./run-tests --python-executable=python3 --modules=pyspark-sql
...What kind of test failure did you meet? |
|
Let me merge this one first anyway :-). |
|
Merged to master. |
|
I meant |
|
Interesting .. other modules don't use pandas .. let me test it out. |
|
Thanks. Yes. I agree that it should not affect the other modules~ So, I didn't investigate further. |
|
Got it. I just double checked and all tests passed. If there's an issue with pandas 0.23.2 in other modules, it might be related to numpy which is a dependency in ML/Mllib in PySpark. |
|
Thank you so much for confirming~ Then, it's okay. I'll check my environment again. I didn't run pyspark test for a long time. :) |
…deprecated is_categorical ### What changes were proposed in this pull request? This PR is a small followup of #28793 and proposes to use `is_categorical_dtype` instead of deprecated `is_categorical`. `is_categorical_dtype` exists from minimum pandas version we support (https://github.com/pandas-dev/pandas/blob/v0.23.2/pandas/core/dtypes/api.py), and `is_categorical` was deprecated from pandas 1.1.0 (pandas-dev/pandas@87a1cc2). ### Why are the changes needed? To avoid using deprecated APIs, and remove warnings. ### Does this PR introduce _any_ user-facing change? Yes, it will remove warnings that says `is_categorical` is deprecated. ### How was this patch tested? By running any pandas UDF with pandas 1.1.0+: ```python import pandas as pd from pyspark.sql.functions import pandas_udf def func(x: pd.Series) -> pd.Series: return x spark.range(10).select(pandas_udf(func, "long")("id")).show() ``` Before: ``` /.../python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py:151: FutureWarning: is_categorical is deprecated and will be removed in a future version. Use is_categorical_dtype instead ... ``` After: ``` ... ``` Closes #30114 from HyukjinKwon/replace-deprecated-is_categorical. Authored-by: HyukjinKwon <[email protected]> Signed-off-by: Bryan Cutler <[email protected]>
What changes were proposed in this pull request?
When using pyarrow to convert a Pandas categorical column, use
is_categoricalinstead of trying to importCategoricalDtypeWhy are the changes needed?
The import for
CategoricalDtypehad changed from Pandas 0.23 to 1.0 and pyspark currently tries both locations. Usingis_categoricalis a more stable API.Does this PR introduce any user-facing change?
No
How was this patch tested?
Existing tests