Conversation

@jalpan-randeri

@jalpan-randeri jalpan-randeri commented Nov 18, 2019

Handle Pandas category type while converting from python with Arrow enabled. The category column will be converted to whatever type the category elements are as is the case with Arrow disabled.
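The described behavior can be sketched in plain pandas (names here are illustrative, not taken from the patch): a categorical column is cast to the dtype of its category elements before conversion.

```python
import pandas as pd

# Illustrative only: a DataFrame with a categorical column "B".
pdf = pd.DataFrame({"A": [1, 2, 3], "B": pd.Categorical(["x", "y", "x"])})

# The conversion amounts to casting the categorical column to the dtype
# of its category elements (object/string here) before handing it off.
converted = pdf["B"].astype(pdf["B"].dtype.categories.dtype)
print(converted.tolist())
```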

Does this PR introduce any user-facing change?

No

How was this patch tested?

New unit tests were added for createDataFrame and scalar pandas_udf

Member

@BryanCutler BryanCutler left a comment

Thanks for doing this, @jalpan-randeri; apologies for the delay. I think you will need to add a test for this as well.

@BryanCutler
Member

ok to test

@SparkQA

SparkQA commented Jan 20, 2020

Test build #117141 has finished for PR 26585 at commit 6b409e6.

  • This patch fails build dependency tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@jalpan-randeri
Author

Done, added a test for the category type to the existing Arrow test suite.

@SparkQA

SparkQA commented Jan 31, 2020

Test build #117689 has finished for PR 26585 at commit aa8c5f6.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 1, 2020

Test build #117688 has finished for PR 26585 at commit d74389c.

  • This patch fails PySpark unit tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 1, 2020

Test build #117704 has finished for PR 26585 at commit a8d3173.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 3, 2020

Test build #117755 has finished for PR 26585 at commit 333f37c.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 3, 2020

Test build #117788 has finished for PR 26585 at commit 1a2447b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jalpan-randeri
Author

gentle ping.

Member

@BryanCutler BryanCutler left a comment

Apologies for the delay @jalpan-randeri, things have been busy lately. Could you also add a test in test_pandas_udf_scalar for the case when the user has a pandas_udf with return type 'string' and then returns a categorical string type?

s = _check_series_convert_timestamps_internal(s, self._timezone)
try:
array = pa.Array.from_pandas(s, mask=mask, type=t, safe=self._safecheck)
if type(s.dtype) == CategoricalDtype:
Member

I think you can just use pd.CategoricalDtype and avoid the above import
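For illustration, CategoricalDtype is reachable from the top-level pandas namespace, so the check works without a separate import:

```python
import pandas as pd

s = pd.Series(pd.Categorical(["a", "b", "a"]))
# The dtype class is available as pd.CategoricalDtype; no extra import needed.
print(type(s.dtype) == pd.CategoricalDtype)
```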

Author

Done.

array = pa.Array.from_pandas(s, mask=mask, type=t, safe=self._safecheck)
if type(s.dtype) == CategoricalDtype:
s = s.astype(s.dtypes.categories.dtype)
array = pa.array(s)
Member

This doesn't need to be in the try block. Let's follow the syntax above, where it checks whether it is a timestamp type, and do the following:

elif type(s.dtype) == pd.CategoricalDtype:
  s = s.astype(s.dtypes.categories.dtype)

Also, it looks like this isn't even needed for pyarrow >= 0.16.1, as it will automatically cast to the right type. Could you add a note like "# This can be removed once minimum pyarrow version is >= 0.16.1"?

Author

Fixed

df = self.spark.createDataFrame(pdf)
result_spark = df.collect()

assert result_arrow == result_spark
Member

Instead of collect, can you do df.toPandas() and then use assert_frame_equal to test for equality?
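For reference, assert_frame_equal lives in pandas.testing and reports a descriptive diff on mismatch, unlike a bare ==. The DataFrames below are hypothetical stand-ins for the two toPandas() results being compared:

```python
import pandas as pd
from pandas.testing import assert_frame_equal

# Hypothetical stand-ins for the Arrow-enabled and Arrow-disabled results.
result_arrow = pd.DataFrame({"A": ["a", "b"], "B": [1, 2]})
result_spark = pd.DataFrame({"A": ["a", "b"], "B": [1, 2]})
assert_frame_equal(result_spark, result_arrow)  # raises with a diff on mismatch
```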

Author

Fixed

@SparkQA

SparkQA commented May 4, 2020

Test build #122243 has finished for PR 26585 at commit 99d5026.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 4, 2020

Test build #122277 has finished for PR 26585 at commit e3246ad.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jalpan-randeri
Author

gentle ping.

Member

@BryanCutler BryanCutler left a comment

@jalpan-randeri this is looking good, only some minor cleanup in the tests before I can merge, thanks for sticking with this!

"""
import pandas as pd
import pyarrow as pa

Member

nit: remove newline

Author

Done

if t is not None and pa.types.is_timestamp(t):
s = _check_series_convert_timestamps_internal(s, self._timezone)
elif type(s.dtype) == pd.CategoricalDtype:
# FIXME: This can be removed once minimum pyarrow version is >= 0.16.1
Member

please change FIXME -> NOTE. It sounds like we are adding broken code, which isn't the case. It's just not needed after a certain version.

Author

Done

result_spark = df.toPandas()

assert_frame_equal(result_spark, result_arrow)

Member

Could you add an assert that the Spark DataFrame has column "B" as a string type?

Author

Done, moved the other test checks here too.

# The Spark DataFrame type, with and without Arrow enabled, must match the pandas type
assert spark_type == arrow_type == 'string'
assert isinstance(arrow_first_category_element, str)
assert isinstance(spark_first_category_element, str)
Member

Oh yeah, move these to the other test please.

result = df.withColumn('time', foo_udf(df.time))
self.assertEquals(df.collect(), result.collect())

def test_createDataFrame_with_category_type(self):
Member

This test module is for pandas_udfs, not for createDataFrame. We do need to add a pandas_udf that tests this. The user would specify a return type of string and then return a categorical pandas.Series that has string categories. For example:

from pyspark.sql.functions import col, pandas_udf

@pandas_udf('string')
def f(x):
    return x.astype('category')

pdf = pd.DataFrame({"A": [u"a", u"b", u"c", u"a"]})
df = spark.createDataFrame(pdf).withColumn("B", f(col("A")))
result = df.toPandas()
# Check that result "B" is equal to "A"

Author

Aha, got it. Fixed.

@SparkQA

SparkQA commented May 23, 2020

Test build #123037 has finished for PR 26585 at commit 9f04f1b.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 23, 2020

Test build #123038 has finished for PR 26585 at commit 7aa9fcd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

@BryanCutler BryanCutler left a comment

LGTM. I think the test asserts should probably come from the unittest module rather than the built-in assert, for consistency, but it's not a big deal; I can fix that up in a follow-up PR.

@BryanCutler BryanCutler changed the title [WIP][SPARK-25351][SQL][Python] Handle Pandas category type when converting from Python with Arrow [SPARK-25351][SQL][Python] Handle Pandas category type when converting from Python with Arrow May 28, 2020
@BryanCutler
Member

Merged to master. Thanks for all the patience in following through with this, @jalpan-randeri!

@jalpan-randeri
Author

wow! Thank you @BryanCutler.

@dongjoon-hyun
Member

Hi, guys.
Our minimum pandas is 0.23.2, and this PR seems to require a higher version of pandas.
Could you check this PR on 0.23.2?

@dongjoon-hyun
Member

Please see #28789 .

@BryanCutler
Member

That was an oversight on my part using CategoricalDtype. Given that, I think comparing the dtype as a string would be better; I could open a PR to fix it up. See https://pandas.pydata.org/pandas-docs/version/0.23.4/categorical.html#equality-semantics
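A sketch of the string-comparison approach, which sidesteps CategoricalDtype equality semantics on older pandas (0.23.x); the helper name here is illustrative:

```python
import pandas as pd

def is_category(series):
    # Compare the dtype's string name rather than the dtype object itself;
    # this behaves consistently across pandas versions back to 0.23.x.
    return str(series.dtype) == "category"

print(is_category(pd.Series(pd.Categorical(["a", "b"]))))
print(is_category(pd.Series([1, 2])))
```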
