[SPARK-25351][SQL][Python] Handle Pandas category type when converting from Python with Arrow #26585
Conversation
BryanCutler left a comment:
Thanks for doing this @jalpan-randeri, apologies for the delay. I think you will need to add a test for this as well.
ok to test

Test build #117141 has finished for PR 26585 at commit

Done, added a test for the category type in the existing Arrow test suite.

Test build #117689 has finished for PR 26585 at commit

Test build #117688 has finished for PR 26585 at commit

Test build #117704 has finished for PR 26585 at commit

Test build #117755 has finished for PR 26585 at commit

Test build #117788 has finished for PR 26585 at commit

gentle ping.
BryanCutler left a comment:
Apologies for the delay @jalpan-randeri, things have been busy lately. Could you also add a test in test_pandas_udf_scalar for the case when the user has a pandas_udf with return type 'string' and then returns a categorical string type?
    s = _check_series_convert_timestamps_internal(s, self._timezone)
    try:
        array = pa.Array.from_pandas(s, mask=mask, type=t, safe=self._safecheck)
        if type(s.dtype) == CategoricalDtype:
I think you can just use pd.CategoricalDtype and avoid the above import
Done.
        array = pa.Array.from_pandas(s, mask=mask, type=t, safe=self._safecheck)
        if type(s.dtype) == CategoricalDtype:
            s = s.astype(s.dtypes.categories.dtype)
            array = pa.array(s)
This doesn't need to be in the try block. Let's follow the above syntax where it checks if it is a timestamp type and do the following:

    elif type(s.dtype) == pd.CategoricalDtype:
        s = s.astype(s.dtypes.categories.dtype)

Also, it looks like this isn't even needed for pyarrow >= 0.16.1 as it will automatically cast to the right type. Could you add a note like "# This can be removed once minimum pyarrow version is >= 0.16.1"?
Fixed
    df = self.spark.createDataFrame(pdf)
    result_spark = df.collect()

    assert result_arrow == result_spark
Instead of collect, can you do df.toPandas() and then use assert_frame_equal to test for equality?
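A small sketch of the suggested assertion style; the names `result_arrow` and `result_spark` here are stand-ins for the two `df.toPandas()` results being compared in the test.

```python
import pandas as pd
from pandas.testing import assert_frame_equal

# Stand-ins for df.toPandas() with Arrow enabled and disabled
result_arrow = pd.DataFrame({"A": ["a", "b", "c", "a"]})
result_spark = pd.DataFrame({"A": ["a", "b", "c", "a"]})

# Unlike comparing Row lists, this reports dtype and value mismatches clearly
assert_frame_equal(result_spark, result_arrow)
print("frames equal")
```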
Fixed
Test build #122243 has finished for PR 26585 at commit

Test build #122277 has finished for PR 26585 at commit

gentle ping.
BryanCutler left a comment:
@jalpan-randeri this is looking good, only some minor cleanup in the tests before I can merge, thanks for sticking with this!
| """ | ||
| import pandas as pd | ||
| import pyarrow as pa | ||
|
|
nit: remove newline
Done
    if t is not None and pa.types.is_timestamp(t):
        s = _check_series_convert_timestamps_internal(s, self._timezone)
    elif type(s.dtype) == pd.CategoricalDtype:
        # FIXME: This can be removed once minimum pyarrow version is >= 0.16.1
please change FIXME -> NOTE. It sounds like we are adding broken code, which isn't the case. It's just not needed after a certain version.
Done
    result_spark = df.toPandas()

    assert_frame_equal(result_spark, result_arrow)
could you add an assert that the Spark DataFrame has column "B" as a string type?
Done, moved the other test checks here too.
    # spark dataframe and arrow execution mode enabled dataframe type must match pandas
    assert spark_type == arrow_type == 'string'
    assert isinstance(arrow_first_category_element, str)
    assert isinstance(spark_first_category_element, str)
Oh yeah, move these to the other test please.
    result = df.withColumn('time', foo_udf(df.time))
    self.assertEquals(df.collect(), result.collect())

    def test_createDateFrame_with_category_type(self):
This test module is for pandas_udfs, not for createDataFrame. We do need to add a pandas_udf that tests this. The user would specify a return type of string and then return a categorical pandas.Series that has string categories. For example:

    @pandas_udf('string')
    def f(x):
        return x.astype('category')

    pdf = pd.DataFrame({"A": [u"a", u"b", u"c", u"a"]})
    df = spark.createDataFrame(pdf).withColumn("B", f(col("A")))
    result = df.toPandas()
    # Check result "B" is equal to "A"
aha, got it. fixed
Test build #123037 has finished for PR 26585 at commit

Test build #123038 has finished for PR 26585 at commit
BryanCutler left a comment:
LGTM. I think the test asserts should probably be from the unittest module and not the built-in assert, to be consistent, but it's not a big deal; I can fix that up in a follow-up PR.
merged to master, thanks for all the patience to follow through with this @jalpan-randeri !

wow! Thank you @BryanCutler.

Hi, Guys.

Please see #28789.

That was an oversight on my part using the
Handle Pandas category type while converting from python with Arrow enabled. The category column will be converted to whatever type the category elements are as is the case with Arrow disabled.
Does this PR introduce any user-facing change?
No
How was this patch tested?
New unit tests were added for createDataFrame and scalar pandas_udf.