[SPARK-25351][SQL][Python] Handle Pandas category type when converting from Python with Arrow #26585
Conversation
BryanCutler left a comment:
Thanks for doing this @jalpan-randeri, apologies for the delay. I think you will need to add a test for this as well.
ok to test

Test build #117141 has finished for PR 26585 at commit

Done, added a test for the category type in the existing Arrow test suite.

Test build #117689 has finished for PR 26585 at commit

Test build #117688 has finished for PR 26585 at commit

Test build #117704 has finished for PR 26585 at commit

Test build #117755 has finished for PR 26585 at commit

Test build #117788 has finished for PR 26585 at commit

gentle ping.
BryanCutler left a comment:
Apologies for the delay @jalpan-randeri, things have been busy lately. Could you also add a test in test_pandas_udf_scalar for the case when the user has a pandas_udf with return type 'string' and then returns a categorical string type?
    s = _check_series_convert_timestamps_internal(s, self._timezone)
    try:
        array = pa.Array.from_pandas(s, mask=mask, type=t, safe=self._safecheck)
        if type(s.dtype) == CategoricalDtype:
I think you can just use pd.CategoricalDtype and avoid the above import
Done.
        array = pa.Array.from_pandas(s, mask=mask, type=t, safe=self._safecheck)
        if type(s.dtype) == CategoricalDtype:
            s = s.astype(s.dtypes.categories.dtype)
            array = pa.array(s)
This doesn't need to be in the try block. Let's follow the above syntax where it checks if it is a timestamp type and do the following:

    elif type(s.dtype) == pd.CategoricalDtype:
        s = s.astype(s.dtypes.categories.dtype)

Also, it looks like this isn't even needed for pyarrow >= 0.16.1 as it will automatically cast to the right type. Could you add a note like "# This can be removed once minimum pyarrow version is >= 0.16.1"?
Fixed
    df = self.spark.createDataFrame(pdf)
    result_spark = df.collect()

    assert result_arrow == result_spark
Instead of collect, can you do df.toPandas() and then use assert_frame_equal to test for equality?
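A small sketch of the suggested assertion style; the names `result_arrow` and `result_spark` here are stand-ins for the two `df.toPandas()` results being compared in the test.

```python
import pandas as pd
from pandas.testing import assert_frame_equal

# Stand-ins for df.toPandas() with Arrow enabled and disabled
result_arrow = pd.DataFrame({"A": ["a", "b", "c", "a"]})
result_spark = pd.DataFrame({"A": ["a", "b", "c", "a"]})

# Unlike comparing Row lists, this reports dtype and value mismatches clearly
assert_frame_equal(result_spark, result_arrow)
print("frames equal")
```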
Fixed
Test build #122243 has finished for PR 26585 at commit

Test build #122277 has finished for PR 26585 at commit

gentle ping.
BryanCutler left a comment:
@jalpan-randeri this is looking good, only some minor cleanup in the tests before I can merge, thanks for sticking with this!
| """ | ||
| import pandas as pd | ||
| import pyarrow as pa | ||
|
|
nit: remove newline
Done
    if t is not None and pa.types.is_timestamp(t):
        s = _check_series_convert_timestamps_internal(s, self._timezone)
    elif type(s.dtype) == pd.CategoricalDtype:
        # FIXME: This can be removed once minimum pyarrow version is >= 0.16.1
please change FIXME -> NOTE. It sounds like we are adding broken code, which isn't the case. It's just not needed after a certain version.
Done
    result_spark = df.toPandas()

    assert_frame_equal(result_spark, result_arrow)
could you add an assert that the Spark DataFrame has column "B" as a string type?
Done, moved the other test checks here too.
    # spark dataframe and arrow execution mode enabled dataframe type must match pandas
    assert spark_type == arrow_type == 'string'
    assert isinstance(arrow_first_category_element, str)
    assert isinstance(spark_first_category_element, str)
Oh yeah, move these to the other test please.
    result = df.withColumn('time', foo_udf(df.time))
    self.assertEquals(df.collect(), result.collect())

    def test_createDateFrame_with_category_type(self):
This test module is for pandas_udfs, not for createDataFrame. We do need to add a pandas_udf that tests this. The user would specify a return type of string and then return a categorical pandas.Series that has string categories. For example:

    @pandas_udf('string')
    def f(x):
        return x.astype('category')

    pdf = pd.DataFrame({"A": [u"a", u"b", u"c", u"a"]})
    df = spark.createDataFrame(pdf).withColumn("B", f(col("A")))
    result = df.toPandas()
    # Check result "B" is equal to "A"
aha, got it. fixed
Test build #123037 has finished for PR 26585 at commit

Test build #123038 has finished for PR 26585 at commit
BryanCutler left a comment:
LGTM. I think the test asserts should probably be from the unittest module and not the built-in assert, to be consistent, but it's not a big deal; I can fix that up in a follow-up PR.
merged to master, thanks for all the patience to follow through with this @jalpan-randeri !

wow! Thank you @BryanCutler.

Hi, Guys.

Please see #28789.

That was an oversight on my part using the
Handle Pandas category type while converting from python with Arrow enabled. The category column will be converted to whatever type the category elements are as is the case with Arrow disabled.
Does this PR introduce any user-facing change?
No
How was this patch tested?
New unit tests were added for createDataFrame and scalar pandas_udf.