
Conversation

@BryanCutler
Member

What changes were proposed in this pull request?

When using pyarrow to convert a Pandas categorical column, use `is_categorical` instead of trying to import `CategoricalDtype`.

Why are the changes needed?

The import location for `CategoricalDtype` changed between Pandas 0.23 and 1.0, and PySpark currently tries both locations. Using `is_categorical` is a more stable API.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Existing tests
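
For illustration, here is a minimal sketch (not the actual `pyspark.sql.pandas.serializers` code) of the kind of check this PR switches to; the sample Series and the conversion step are assumptions for demonstration:

```python
# Minimal sketch, assuming a plain pandas environment: detect a categorical
# Series via pandas.api.types.is_categorical instead of importing
# CategoricalDtype from a version-dependent location.
import pandas as pd
from pandas.api.types import is_categorical  # later deprecated in pandas 1.1.0

s = pd.Series(["a", "b", "a"], dtype="category")

if is_categorical(s.dtype):
    # Convert back to the dtype of the underlying categories before handing
    # the data to pyarrow (string categories yield an object-dtype Series).
    s = s.astype(s.cat.categories.dtype)

print(s.dtype)  # object
```

(The follow-up commit quoted at the end of this thread later swaps `is_categorical` for `is_categorical_dtype`.)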

@BryanCutler
Member Author

@HyukjinKwon I thought we could avoid any categorical imports by comparing the dtype with strings, e.g. `s.dtype == 'category'`, but that seems to fail for any type other than category. It seems `is_categorical` is the preferred stable API. This change is not that different from what is there currently, so it's not a big deal if you think this isn't better.
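
A hedged illustration of the comparison mentioned above (not code from this PR); the exact behavior of the string comparison varies across numpy/pandas versions:

```python
# Illustrative only: why the string comparison was avoided in favor of
# is_categorical. Assumes pandas < 1.1, where is_categorical is not yet deprecated.
import pandas as pd
from pandas.api.types import is_categorical

cat = pd.Series(["a", "b"], dtype="category")
num = pd.Series([1, 2, 3])

print(cat.dtype == "category")    # True: CategoricalDtype compares equal to the string
print(is_categorical(num.dtype))  # False, without any string comparison
# num.dtype == "category" asks numpy to interpret "category" as a dtype, which
# is not reliable across versions (it can warn or raise instead of returning
# False), matching the failure described above.
```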

Member

@HyukjinKwon left a comment


Yeah, I was thinking this way at SPARK-31963. LGTM

```diff
-try:
-    from pandas import CategoricalDtype
-except ImportError:
-    from pandas.api.types import CategoricalDtype
+from pandas.api.types import is_categorical
```
Member


Oh, nice!

Member

@dongjoon-hyun left a comment


+1, LGTM. Thanks.

@dongjoon-hyun
Member

dongjoon-hyun commented Jun 11, 2020

BTW, @BryanCutler and @HyukjinKwon .
In the previous PR, I tested pyspark-sql only. The other test suites seem to have similar issues. When I use Pandas 0.23.2, the tests fail in my local environment. Could you confirm whether the master branch is healthy with Pandas 0.23.2, please?

@dongjoon-hyun
Member

cc @williamhyun, too.

@SparkQA

SparkQA commented Jun 11, 2020

Test build #123800 has finished for PR 28793 at commit d2030cf.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

@dongjoon-hyun, do you mean you tested `./run-tests --modules=pyspark-sql`, I guess? Pandas is a SQL-only dependency in Spark. The tests look fine in my local environment:

~ pip list | grep pandas
pandas                        0.23.2      /usr/local/lib/python3.7/site-packages
➜  python git:(d2030cf4629) ./run-tests --python-executable=python3 --modules=pyspark-sql
...

What kind of test failure did you hit?

@HyukjinKwon
Member

Let me merge this one first anyway :-).

@HyukjinKwon
Member

Merged to master.

@dongjoon-hyun
Member

dongjoon-hyun commented Jun 11, 2020

I meant the pyspark-sql tests were okay, but the other modules failed when using Pandas 0.23.2 and running `python/run-tests.py --python-executables python`.

@HyukjinKwon
Member

Interesting... other modules don't use pandas. Let me test it out.

@dongjoon-hyun
Member

dongjoon-hyun commented Jun 11, 2020

Thanks. Yes. I agree that it should not affect the other modules~ So, I didn't investigate further.

@HyukjinKwon
Member

Got it. I just double-checked and all tests passed. If there's an issue with Pandas 0.23.2 in other modules, it might be related to NumPy, which is a dependency of ML/MLlib in PySpark.

@dongjoon-hyun
Member

Thank you so much for confirming~ Then, it's okay. I'll check my environment again. I haven't run the PySpark tests in a long time. :)

BryanCutler pushed a commit that referenced this pull request on Oct 21, 2020: …deprecated is_categorical

### What changes were proposed in this pull request?

This PR is a small follow-up of #28793 and proposes to use `is_categorical_dtype` instead of the deprecated `is_categorical`.

`is_categorical_dtype` has existed since the minimum pandas version we support (https://github.com/pandas-dev/pandas/blob/v0.23.2/pandas/core/dtypes/api.py), and `is_categorical` was deprecated in pandas 1.1.0 (pandas-dev/pandas@87a1cc2).
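
As a rough sketch of the swap described here (not the exact `serializers.py` diff), assuming pandas 1.1.x:

```python
# Sketch only: the stable replacement for the deprecated check.
import pandas as pd
from pandas.api.types import is_categorical_dtype

s = pd.Series(["x", "y", "x"], dtype="category")

# is_categorical(s.dtype) emits a FutureWarning on pandas 1.1.0+, while
# is_categorical_dtype accepts the same argument without warning there.
print(is_categorical_dtype(s.dtype))  # True
```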

### Why are the changes needed?

To avoid using deprecated APIs and to remove warnings.

### Does this PR introduce _any_ user-facing change?

Yes, it will remove the warning that says `is_categorical` is deprecated.

### How was this patch tested?

By running any pandas UDF with pandas 1.1.0+:

```python
import pandas as pd
from pyspark.sql.functions import pandas_udf

def func(x: pd.Series) -> pd.Series:
    return x

spark.range(10).select(pandas_udf(func, "long")("id")).show()
```

Before:

```
/.../python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py:151: FutureWarning: is_categorical is deprecated and will be removed in a future version.  Use is_categorical_dtype instead
...
```

After:

```
...
```

Closes #30114 from HyukjinKwon/replace-deprecated-is_categorical.

Authored-by: HyukjinKwon <[email protected]>
Signed-off-by: Bryan Cutler <[email protected]>
