DOC/TST: Update the parquet (pyarrow >= 0.15) docs and tests regarding Categorical support #28018

galuhsahid · 2019-08-19T18:22:38Z

closes DOC/TST: update the parquet (pyarrow >= 0.15) docs and tests regarding Categorical support #27955
tests added / passed
passes black pandas
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

…gorical support Fixes pandas-dev#27955

Fixes pandas-dev#27955

…cal support Fixes pandas-dev#27955

galuhsahid · 2019-08-19T18:28:29Z

Several questions/thoughts:

Should I also add a test in this part:

pandas/pandas/tests/io/test_parquet.py

Line 409 in 802f670

# additional supported types for pyarrow

as well? I thought adding a test there would be a duplicate of what I'm doing in

pandas/pandas/tests/io/test_parquet.py

Line 451 in 802f670

def test_categorical(self, pa):

, but maybe I'm missing something
In ARROW-5480: [Python] Add unit test asserting specifically that pandas.Categorical roundtrips to Parquet format without special options apache/arrow#5110 the test in pyarrow demonstrates how category values will be preserved regardless whether they occur in the data or not; should the test in pandas demonstrate this as well?

jorisvandenbossche

Thanks a lot for the PR!

Looks very good. One suggestion I have is to add a categorical column in the first example in the docs (the df with columns of all different types that gets created a bit below the place you edited in io.rst).

jorisvandenbossche · 2019-08-19T19:37:52Z

Should I also add a test in this part:

I think your analysis is correct that it would be duplicative.
I think once we don't have the version check anymore (in the future if we would only support pyarrow >= 0.15), then we can move the test_categorical into the test_basic. But for now I would keep it as is.

In apache/arrow#5110 the test in pyarrow demonstrates how category values will be preserved regardless whether they occur in the data or not; should the test in pandas demonstrate this as well?

Yes, I think that is nice to test. In addition, you could also test if the ordered flag is preserved as well. You can add additional columns to the df with different cases in test_categorical.

Closes pandas-dev#27955

galuhsahid · 2019-08-20T18:09:57Z

Thanks for the review! I've added another test for null, out-of-order values & unobserved category and updated the doc as well.

However when I tried to add another test to check if order is preserved, it seems like it's not preserved. I'm not really sure if this is expected or if I'm missing something:

import pandas as pd
from pandas.io.parquet import read_parquet, to_parquet

df = pd.DataFrame()
df["a"] = pd.Categorical(["a", "b", "c", "a"], categories=["b", "c", "d"], ordered=True)

df.to_parquet(<path>)
actual = read_parquet(<path>)

df["a"]    
0    NaN
1      b
2      c
3    NaN
Name: a, dtype: category
Categories (3, object): [b < c < d]

actual["a"]
0    NaN
1      b
2      c
3    NaN
Name: a, dtype: category
Categories (3, object): [b, c, d]

If it's expected then maybe I can add another test that expects it to still return a categorical data type without order, if needed.

jorisvandenbossche · 2019-08-20T18:23:12Z

However when I tried to add another test to check if order is preserved, it seems like it's not preserved. I'm not really sure if this is expected or if I'm missing something:

Indeed, I see the same trying your code. That is certainly not expected I think. Can you open an issue for this? (Arrow uses a JIRA tracker: https://issues.apache.org/jira/projects/ARROW/issues; I can also open an issue if you want)

jorisvandenbossche

The additional changes look good to me, thanks!

jorisvandenbossche · 2019-08-20T18:43:46Z

Hmm, the failure on Azure is actually related:

_____________________ TestParquetPyArrow.test_categorical ______________________
[gw0] darwin -- Python 3.5.6 /Users/vsts/miniconda3/envs/pandas-dev/bin/python

self = <pandas.tests.io.test_parquet.TestParquetPyArrow object at 0x146354a90>
pa = 'pyarrow'

    def test_categorical(self, pa):
    
        # supported in >= 0.7.0
        df = pd.DataFrame()
        df["a"] = pd.Categorical(list("abcdef"))
    
        # test for null, out-of-order values, and unobserved category
        dtype = pd.CategoricalDtype(["foo", "bar", "baz"])
        df["b"] = pd.Categorical.from_codes(codes=[1, 0, 0, 1, -1, 1], dtype=dtype)
    
        if LooseVersion(pyarrow.__version__) >= LooseVersion("0.15.0"):
            check_round_trip(df, pa)
        else:
            # de-serialized as object for pyarrow < 0.15
            expected = df.assign(a=df.a.astype(object))
>           check_round_trip(df, pa, expected=expected)

pandas/tests/io/test_parquet.py:468: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
pandas/tests/io/test_parquet.py:175: in check_round_trip
    compare(repeat)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

repeat = 2

    def compare(repeat):
        for _ in range(repeat):
            df.to_parquet(path, **write_kwargs)
            with catch_warnings(record=True):
                actual = read_parquet(path, **read_kwargs)
    
>           tm.assert_frame_equal(expected, actual, check_names=check_names)
E           AssertionError: Attributes are different
E           
E           Attribute "dtype" are different
E           [left]:  CategoricalDtype(categories=['foo', 'bar', 'baz'], ordered=False)
E           [right]: object

pandas/tests/io/test_parquet.py:171: AssertionError

It seems that the expected still has the categorical dtype (so the assign failed to overwrite the original column), but not sure why this only happens on the mac/windows builds ..

jorisvandenbossche · 2019-08-20T18:45:57Z

Ah, no, I already see ;). In the expected for the branch for older pyarrow versions, you still need to convert column 'b' to object dtype as well

Closes pandas-dev#27955

galuhsahid · 2019-08-20T18:58:00Z

Indeed, I see the same trying your code. That is certainly not expected I think. Can you open an issue for this? (Arrow uses a JIRA tracker: https://issues.apache.org/jira/projects/ARROW/issues; I can also open an issue if you want)

I've opened the issue here: https://issues.apache.org/jira/browse/ARROW-6302

jorisvandenbossche · 2019-08-20T19:00:38Z

Thanks!

* master: (40 commits) DOC: Fix GL01 and GL02 errors in the docstrings (pandas-dev#27988) Remove Encoding of values in char** For Labels (pandas-dev#27618) TYPING: more type hints for io.formats.printing (pandas-dev#27765) TST: fix compression tests when run without virtualenv/condaenv (pandas-dev#28051) DOC: Start 0.25.2 (pandas-dev#28111) DOC: Fix docstrings lack of punctuation (pandas-dev#28031) DOC: Remove alias for numpy.random.randn from the docs (pandas-dev#28082) DOC: update GroupBy.head()/tail() documentation (pandas-dev#27844) BUG: timedelta merge asof with tolerance (pandas-dev#27650) BUG: Series.rename raises error on values accepted by Series construc… (pandas-dev#27814) Preserve index when setting new column on empty dataframe. (pandas-dev#26471) BUG: Fixed groupby quantile for listlike q (pandas-dev#27827) BUG: iter with readonly values, closes pandas-dev#28055 (pandas-dev#28074) TST: non-strict xfail for period test (pandas-dev#28072) DOC: Update whatsnew (pandas-dev#28073) CI: disable codecov (pandas-dev#28065) CI: Set SHA for codecov upload (pandas-dev#28067) BUG: Correct the previous bug fixing on xlim for plotting (pandas-dev#28059) CI: Add pip dependence explicitly (pandas-dev#28008) DOC: Change document code prun in a row (pandas-dev#28029) ...

Closes pandas-dev#27955

galuhsahid · 2019-08-25T10:57:16Z

@jorisvandenbossche I've added the test for the ordered flag. Is there any other test that you think we should have for now?

TomAugspurger · 2019-08-26T19:55:14Z

doc/source/user_guide/io.rst

@@ -4702,7 +4702,7 @@ Several caveats.
  indexes. This extra column can cause problems for non-Pandas consumers that are not expecting it. You can
  force including or omitting indexes with the ``index`` argument, regardless of the underlying engine.
 * Index level names, if specified, must be strings.
-* Categorical dtypes can be serialized to parquet, but will de-serialize as ``object`` dtype.
+* Categorical dtypes for non-string types can be serialized to parquet, but will de-serialize as ``object`` dtype.


Is this true with both engines?

I've just tried categorical dtypes using integer and in pyarrow it gets de-serialized into float. Not sure if it's expected, if someone can confirm it'd be great.

>>> import pandas as pd >>> from pandas.io.parquet import read_parquet, to_parquet >>> df["a"] = pd.Categorical(["a", "b", "c", "a"], categories=["b", "c", "d"], ordered=True) >>> df["b"] = pd.Categorical([1, 2, 3, 1], categories=[2, 3, 4], ordered=True) >>> df["a"] 0 NaN 1 b 2 c 3 NaN Name: a, dtype: category Categories (3, object): [b < c < d] >>> df["b"] 0 NaN 1 2 2 3 3 NaN Name: b, dtype: category Categories (3, int64): [2 < 3 < 4] >>> df.to_parquet("test_pyarrow.parquet", engine="pyarrow") >>> actual_pyarrow = df.read_parquet("test_pyarrow.parquet", engine="fastparquet") >>> actual_pyarrow["a"] 0 NaN 1 b 2 c 3 NaN Name: a, dtype: category Categories (3, object): [b, c, d] >>> actual_pyarrow["b"] 0 NaN 1 2.0 2 3.0 3 NaN Name: b, dtype: float64

With fastparquet, both string and non-string types get de-serialized back as category, however it does not preserve order for both, while in pyarrow (for string) we do preserve order. Should we make this clear in the docs?

>>> import pandas as pd >>> from pandas.io.parquet import read_parquet, to_parquet >>> df["a"] = pd.Categorical(["a", "b", "c", "a"], categories=["b", "c", "d"], ordered=True) >>> df["b"] = pd.Categorical([1, 2, 3, 1], categories=[2, 3, 4], ordered=True) >>> df["a"] 0 NaN 1 b 2 c 3 NaN Name: a, dtype: category Categories (3, object): [b < c < d] >>> df["b"] 0 NaN 1 2 2 3 3 NaN Name: b, dtype: category Categories (3, int64): [2 < 3 < 4] >>> df.to_parquet("test_fastparquet.parquet", engine="fastparquet") >>> actual_fastparquet = df.read_parquet("test_fastparquet.parquet", engine="fastparquet") >>> actual_fastparquet["a"] 0 NaN 1 b 2 c 3 NaN Name: a, dtype: category Categories (3, object): [b, c, d] >>> actual_fastparquet["b"] 0 NaN 1 2 2 3 3 NaN Name: b, dtype: category Categories (3, int64): [2, 3, 4]

TomAugspurger · 2019-08-26T19:59:17Z

pandas/tests/io/test_parquet.py

+        df["a"] = pd.Categorical(list("abcdef"))
+
+        # test for null, out-of-order values, and unobserved category
+        dtype = pd.CategoricalDtype(["foo", "bar", "baz"])


Can you write this as

b = pd.Categorical(['bar', 'foo', 'foo', 'bar', None, 'bar'], dtype=pd.CategoricalDtype(['foo', 'bar', 'baz']))

rather than using from_codes?

TomAugspurger · 2019-08-26T19:59:37Z

pandas/tests/io/test_parquet.py

+            check_round_trip(df, pa)
+        else:
+            # de-serialized as object for pyarrow < 0.15
+            expected = df.assign(


expected = df.astype(object)?

TomAugspurger · 2019-08-26T19:59:56Z

doc/source/whatsnew/v1.0.0.rst

@@ -88,7 +88,7 @@ Categorical
 ^^^^^^^^^^^

 - Added test to assert the :func:`fillna` raises the correct ValueError message when the value isn't a value from categories (:issue:`13628`)
-
+- Added test to assert roundtripping to parquet with :func:`to_parquet` or :func:`read_parquet` will preserve Categorical dtypes for string types (:issue:`27955`)


DataFrame.to_parquet

Closes pandas-dev#27955

jorisvandenbossche · 2019-08-27T19:08:09Z

I've just tried categorical dtypes using integer and in pyarrow it gets de-serialized into float. Not sure if it's expected, if someone can confirm it'd be great.

For Arrow, that is expected, see https://issues.apache.org/jira/browse/ARROW-6140 (the fact that it is float and not int is because there are missing values)

With fastparquet, both string and non-string types get de-serialized back as category, however it does not preserve order for both, while in pyarrow (for string) we do preserve order. Should we make this clear in the docs?

That could be mentioned yes.

…rquet Closes pandas-dev#27955

galuhsahid · 2019-08-29T10:30:19Z

I've updated the doc to explain more about the differences regarding how the categorical support works between pyarrow and fastparquet

TomAugspurger · 2019-08-30T18:58:10Z

pandas/tests/io/test_parquet.py

-        # de-serialized as object
-        expected = df.assign(a=df.a.astype(object))
-        check_round_trip(df, pa, expected=expected)
+        if LooseVersion(pyarrow.__version__) >= LooseVersion("0.15.0"):


@jorisvandenbossche this isn't released yet, right? Should we wait to merge until 0.15.0 is released?

Yes, this isn't released yet. I can assure that it runs locally for me on Arrow master (if I change the version check to > LooseVersion("0.14.1.dev")), so I am OK to merge this, but also fine to wait a bit more (0.15.0 will normally happen somewhere end of next week)

Since Arrow 0.15.0 has been released, can we merge this now?

jreback

pls merge master

jreback · 2019-10-02T11:51:00Z

doc/source/whatsnew/v1.0.0.rst

@@ -174,6 +174,7 @@ Categorical
 - Added test to assert the :func:`fillna` raises the correct ValueError message when the value isn't a value from categories (:issue:`13628`)
 - Bug in :meth:`Categorical.astype` where ``NaN`` values were handled incorrectly when casting to int (:issue:`28406`)
 - :meth:`Categorical.searchsorted` and :meth:`CategoricalIndex.searchsorted` now work on unordered categoricals also (:issue:`21667`)
+- Added test to assert roundtripping to parquet with :func:`DataFrame.to_parquet` or :func:`read_parquet` will preserve Categorical dtypes for string types (:issue:`27955`)


this note is not really needed, but ok

jreback · 2019-10-02T11:51:14Z

doc/source/user_guide/io.rst

@@ -4734,7 +4735,8 @@ See the documentation for `pyarrow <https://arrow.apache.org/docs/python/>`__ an
                      'd': np.arange(4.0, 7.0, dtype='float64'),
                      'e': [True, False, True],
                      'f': pd.date_range('20130101', periods=3),
-                      'g': pd.date_range('20130101', periods=3, tz='US/Eastern')})
+                      'g': pd.date_range('20130101', periods=3, tz='US/Eastern'),
+                      'h': pd.Categorical(list('abc'))})


can we add an ordered categorical as well?

* master: (22 commits) DOC: fix PR09,PR08 errors for pandas.Timestamp (pandas-dev#28739) WEB: Add diversity note to team.md (pandas-dev#28630) DOC: Minor fixes in pandas/testing.py docstring. (pandas-dev#28752) TST: port maybe_promote tests from pandas-dev#23982 (pandas-dev#28764) Bugfix/groupby datetime issue (pandas-dev#28569) reenable codecov (pandas-dev#28750) CLN: Centralised _check_percentile (pandas-dev#27584) DEPR: Deprecate Index.set_value (pandas-dev#28621) CLN: Fix typo in contributing.rst (pandas-dev#28761) Fixed docstring errors in pandas.period range and pandas.PeriodIndex (pandas-dev#28756) BUG: Fix TypeError raised in libreduction (pandas-dev#28643) DOC: Pandas.Series.drop docstring PR02 (pandas-dev#27976) (pandas-dev#28742) DOC: Fixed doctring errors PR08, PR09 in pandas.io (pandas-dev#28748) TST: Fix broken test cases where Timedelta/Timestamp raise (pandas-dev#28729) REF: Consolidate alignment calls in DataFrame ops (pandas-dev#28638) BUG: Fix dep generation (pandas-dev#28734) Added doctstring to fixture (pandas-dev#28727) DOC: Fixed PR06 docstrings errors in pandas.timedelta_range (pandas-dev#28719) replaced safe_import with a corresponding test decorator (pandas-dev#28731) BUG: Fix RangeIndex.get_indexer for decreasing RangeIndex (pandas-dev#28680) ...

jorisvandenbossche · 2019-10-04T19:48:25Z

@galuhsahid Thanks a lot for this PR! (and for the fixes on the pyarrow side as well! :-))

…g Categorical support (pandas-dev#28018)

galuhsahid added 3 commits August 20, 2019 00:41

TST: Update pyarrow tests to test parquet faithful roundtrip for Cate…

61888b2

…gorical support Fixes pandas-dev#27955

DOC: Update parquet docs regarding Categorical support

0028293

Fixes pandas-dev#27955

DOC: Add pyarrow test to test parquet faithful roundtrip for Categori…

fdd2e3f

…cal support Fixes pandas-dev#27955

jorisvandenbossche reviewed Aug 19, 2019

View reviewed changes

jorisvandenbossche added Docs IO Parquet parquet, feather Testing pandas testing functions or related to the test suite labels Aug 19, 2019

jorisvandenbossche added this to the 1.0 milestone Aug 19, 2019

galuhsahid added 2 commits August 21, 2019 00:50

DOC: Add categorical column in example DataFrame

ab9b082

Closes pandas-dev#27955

TST: Add test for null, out-of-order values, and unobserved category

cfc53ae

Closes pandas-dev#27955

jorisvandenbossche approved these changes Aug 20, 2019

View reviewed changes

TST: Convert column b in expected to object for pyarrow < 0.15

aebfae7

Closes pandas-dev#27955

galuhsahid force-pushed the parquet-categorical branch from 6267fe0 to aebfae7 Compare August 25, 2019 03:03

galuhsahid added 3 commits August 25, 2019 10:07

TST: Add test for ordered flag

791f164

Closes pandas-dev#27955

TST: Add more rows for ordered flag test

6c6b09d

Closes pandas-dev#27955

TomAugspurger reviewed Aug 26, 2019

View reviewed changes

galuhsahid added 3 commits August 27, 2019 22:45

DOC: Change to_parquet to DataFrame.to_parquet

1b0a444

Closes pandas-dev#27955

TST: Replace from_codes

bf32a80

Closes pandas-dev#27955

TST: Simplify expected conversion to object

159b270

Closes pandas-dev#27955

DOC: Update doc to clarify the differences between pyarrow and fastpa…

4211071

…rquet Closes pandas-dev#27955

DOC: Remove whitespace

a7c414d

TomAugspurger reviewed Aug 30, 2019

View reviewed changes

jorisvandenbossche approved these changes Sep 11, 2019

View reviewed changes

Merge branch 'master' into parquet-categorical

4a6f6ad

jreback requested changes Oct 2, 2019

View reviewed changes

galuhsahid force-pushed the parquet-categorical branch from 5ebb491 to c02059d Compare October 4, 2019 17:45

DOC: Add ordered categorical

bd08e16

jorisvandenbossche merged commit 6056b38 into pandas-dev:master Oct 4, 2019

galuhsahid deleted the parquet-categorical branch October 5, 2019 00:20

galuhsahid restored the parquet-categorical branch October 5, 2019 00:20

galuhsahid deleted the parquet-categorical branch October 5, 2019 00:23

proost pushed a commit to proost/pandas that referenced this pull request Dec 19, 2019

DOC/TST: Update the parquet (pyarrow >= 0.15) docs and tests regardin…

c9783c5

…g Categorical support (pandas-dev#28018)

proost pushed a commit to proost/pandas that referenced this pull request Dec 19, 2019

DOC/TST: Update the parquet (pyarrow >= 0.15) docs and tests regardin…

4c1504c

…g Categorical support (pandas-dev#28018)

bongolegend pushed a commit to bongolegend/pandas that referenced this pull request Jan 1, 2020

DOC/TST: Update the parquet (pyarrow >= 0.15) docs and tests regardin…

68a4a6d

…g Categorical support (pandas-dev#28018)

alimcmaster1 mentioned this pull request Apr 23, 2020

DOC: update parquet documentation categorical support #21747

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

DOC/TST: Update the parquet (pyarrow >= 0.15) docs and tests regarding Categorical support #28018

DOC/TST: Update the parquet (pyarrow >= 0.15) docs and tests regarding Categorical support #28018

galuhsahid commented Aug 19, 2019

galuhsahid commented Aug 19, 2019

jorisvandenbossche left a comment

jorisvandenbossche commented Aug 19, 2019

galuhsahid commented Aug 20, 2019

jorisvandenbossche commented Aug 20, 2019

jorisvandenbossche left a comment

jorisvandenbossche commented Aug 20, 2019

jorisvandenbossche commented Aug 20, 2019

galuhsahid commented Aug 20, 2019

jorisvandenbossche commented Aug 20, 2019

galuhsahid commented Aug 25, 2019

TomAugspurger Aug 26, 2019

galuhsahid Aug 27, 2019

TomAugspurger Aug 26, 2019

TomAugspurger Aug 26, 2019

TomAugspurger Aug 26, 2019

jorisvandenbossche commented Aug 27, 2019

galuhsahid commented Aug 29, 2019

TomAugspurger Aug 30, 2019

jorisvandenbossche Sep 11, 2019

galuhsahid Oct 2, 2019

jreback left a comment

jreback Oct 2, 2019

jreback Oct 2, 2019

jorisvandenbossche commented Oct 4, 2019

DOC/TST: Update the parquet (pyarrow >= 0.15) docs and tests regarding Categorical support #28018

DOC/TST: Update the parquet (pyarrow >= 0.15) docs and tests regarding Categorical support #28018

Conversation

galuhsahid commented Aug 19, 2019

galuhsahid commented Aug 19, 2019

jorisvandenbossche left a comment

Choose a reason for hiding this comment

jorisvandenbossche commented Aug 19, 2019

galuhsahid commented Aug 20, 2019

jorisvandenbossche commented Aug 20, 2019

jorisvandenbossche left a comment

Choose a reason for hiding this comment

jorisvandenbossche commented Aug 20, 2019

jorisvandenbossche commented Aug 20, 2019

galuhsahid commented Aug 20, 2019

jorisvandenbossche commented Aug 20, 2019

galuhsahid commented Aug 25, 2019

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jorisvandenbossche commented Aug 27, 2019

galuhsahid commented Aug 29, 2019

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jreback left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jorisvandenbossche commented Oct 4, 2019