
Conversation

@BryanCutler
Member

What changes were proposed in this pull request?

When using pyarrow to convert a Pandas categorical column, use `is_categorical` instead of trying to import `CategoricalDtype`.

Why are the changes needed?

The import location for `CategoricalDtype` changed between Pandas 0.23 and 1.0, and PySpark currently tries both locations. Using `is_categorical` is a more stable API.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Existing tests
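
For illustration, here is a minimal sketch (not the actual `pyspark.sql.pandas.serializers` code) of the kind of check this PR switches to; the sample Series and the conversion step are assumptions for demonstration:

```python
# Minimal sketch, assuming a plain pandas environment: detect a categorical
# Series via pandas.api.types.is_categorical instead of importing
# CategoricalDtype from a version-dependent location.
import pandas as pd
from pandas.api.types import is_categorical  # later deprecated in pandas 1.1.0

s = pd.Series(["a", "b", "a"], dtype="category")

if is_categorical(s.dtype):
    # Convert back to the dtype of the underlying categories before handing
    # the data to pyarrow (string categories yield an object-dtype Series).
    s = s.astype(s.cat.categories.dtype)

print(s.dtype)  # object
```

(The follow-up commit quoted at the end of this thread later swaps `is_categorical` for `is_categorical_dtype`.)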

@BryanCutler
Member Author

@HyukjinKwon I thought we could avoid any categorical imports by comparing the dtype with strings, e.g. `s.dtype == 'category'`, but that seems to fail for any type other than category. It seems `is_categorical` is the preferred stable API. This change is not that different from what is there currently, so it's not a big deal if you think this isn't better.
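
A hedged illustration of the comparison mentioned above (not code from this PR); the exact behavior of the string comparison varies across numpy/pandas versions:

```python
# Illustrative only: why the string comparison was avoided in favor of
# is_categorical. Assumes pandas < 1.1, where is_categorical is not yet deprecated.
import pandas as pd
from pandas.api.types import is_categorical

cat = pd.Series(["a", "b"], dtype="category")
num = pd.Series([1, 2, 3])

print(cat.dtype == "category")    # True: CategoricalDtype compares equal to the string
print(is_categorical(num.dtype))  # False, without any string comparison
# num.dtype == "category" asks numpy to interpret "category" as a dtype, which
# is not reliable across versions (it can warn or raise instead of returning
# False), matching the failure described above.
```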

Member

@HyukjinKwon left a comment


Yeah, I was thinking this way at SPARK-31963. LGTM

```diff
-try:
-    from pandas import CategoricalDtype
-except ImportError:
-    from pandas.api.types import CategoricalDtype
+from pandas.api.types import is_categorical
```
Member


Oh, nice!

Member

@dongjoon-hyun left a comment


+1, LGTM. Thanks.

@dongjoon-hyun
Member

dongjoon-hyun commented Jun 11, 2020

BTW, @BryanCutler and @HyukjinKwon .
In the previous PR, I tested pyspark-sql only. The other test suites seem to have similar issues. When I use Pandas 0.23.2, the tests fail in my local environment. Could you confirm whether the master branch is healthy with Pandas 0.23.2, please?

@dongjoon-hyun
Member

cc @williamhyun, too.

@SparkQA

SparkQA commented Jun 11, 2020

Test build #123800 has finished for PR 28793 at commit d2030cf.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

@dongjoon-hyun, do you mean you tested `./run-tests --modules=pyspark-sql`, I guess? Pandas is a SQL-only dependency in Spark. The tests look fine in my local environment:

~ pip list | grep pandas
pandas                        0.23.2      /usr/local/lib/python3.7/site-packages
➜  python git:(d2030cf4629) ./run-tests --python-executable=python3 --modules=pyspark-sql
...

What kind of test failure did you hit?

@HyukjinKwon
Member

Let me merge this one first anyway :-).

@HyukjinKwon
Member

Merged to master.

@dongjoon-hyun
Member

dongjoon-hyun commented Jun 11, 2020

I meant the pyspark-sql tests were okay, but the other modules failed when using Pandas 0.23.2 and running `python/run-tests.py --python-executables python`.

@HyukjinKwon
Member

Interesting... other modules don't use pandas. Let me test it out.

@dongjoon-hyun
Member

dongjoon-hyun commented Jun 11, 2020

Thanks. Yes. I agree that it should not affect the other modules~ So, I didn't investigate further.

@HyukjinKwon
Member

Got it. I just double-checked and all tests passed. If there's an issue with Pandas 0.23.2 in other modules, it might be related to NumPy, which is a dependency of ML/MLlib in PySpark.

@dongjoon-hyun
Member

Thank you so much for confirming~ Then, it's okay. I'll check my environment again. I haven't run the PySpark tests in a long time. :)

BryanCutler pushed a commit that referenced this pull request on Oct 21, 2020: …deprecated is_categorical

### What changes were proposed in this pull request?

This PR is a small follow-up of #28793 and proposes to use `is_categorical_dtype` instead of the deprecated `is_categorical`.

`is_categorical_dtype` has existed since the minimum pandas version we support (https://github.com/pandas-dev/pandas/blob/v0.23.2/pandas/core/dtypes/api.py), and `is_categorical` was deprecated in pandas 1.1.0 (pandas-dev/pandas@87a1cc2).
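
As a rough sketch of the swap described here (not the exact `serializers.py` diff), assuming pandas 1.1.x:

```python
# Sketch only: the stable replacement for the deprecated check.
import pandas as pd
from pandas.api.types import is_categorical_dtype

s = pd.Series(["x", "y", "x"], dtype="category")

# is_categorical(s.dtype) emits a FutureWarning on pandas 1.1.0+, while
# is_categorical_dtype accepts the same argument without warning there.
print(is_categorical_dtype(s.dtype))  # True
```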

### Why are the changes needed?

To avoid using deprecated APIs and to remove warnings.

### Does this PR introduce _any_ user-facing change?

Yes, it will remove the warning that says `is_categorical` is deprecated.

### How was this patch tested?

By running any pandas UDF with pandas 1.1.0+:

```python
import pandas as pd
from pyspark.sql.functions import pandas_udf

def func(x: pd.Series) -> pd.Series:
    return x

spark.range(10).select(pandas_udf(func, "long")("id")).show()
```

Before:

```
/.../python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py:151: FutureWarning: is_categorical is deprecated and will be removed in a future version.  Use is_categorical_dtype instead
...
```

After:

```
...
```

Closes #30114 from HyukjinKwon/replace-deprecated-is_categorical.

Authored-by: HyukjinKwon <[email protected]>
Signed-off-by: Bryan Cutler <[email protected]>
