Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DOC/TST: Update the parquet (pyarrow >= 0.15) docs and tests regarding Categorical support #28018

Merged
merged 17 commits into from
Oct 4, 2019

Conversation

galuhsahid
Copy link
Contributor

@galuhsahid
Copy link
Contributor Author

Several questions/thoughts:

  1. Should I also add a test in this part:
    # additional supported types for pyarrow
    as well? I thought adding a test there would be a duplicate of what I'm doing in
    def test_categorical(self, pa):
    , but maybe I'm missing something
  2. In ARROW-5480: [Python] Add unit test asserting specifically that pandas.Categorical roundtrips to Parquet format without special options apache/arrow#5110 the test in pyarrow demonstrates how category values will be preserved regardless whether they occur in the data or not; should the test in pandas demonstrate this as well?

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for the PR!

Looks very good. One suggestion I have is to add a categorical column in the first example in the docs (the df with columns of all different types that gets created a bit below the place you edited in io.rst).

@jorisvandenbossche jorisvandenbossche added Docs IO Parquet parquet, feather Testing pandas testing functions or related to the test suite labels Aug 19, 2019
@jorisvandenbossche jorisvandenbossche added this to the 1.0 milestone Aug 19, 2019
@jorisvandenbossche
Copy link
Member

Should I also add a test in this part:

I think your analysis is correct that it would be duplicative.
I think once we don't have the version check anymore (in the future if we would only support pyarrow >= 0.15), then we can move the test_categorical into the test_basic. But for now I would keep it as is.

In apache/arrow#5110 the test in pyarrow demonstrates how category values will be preserved regardless whether they occur in the data or not; should the test in pandas demonstrate this as well?

Yes, I think that is nice to test. In addition, you could also test if the ordered flag is preserved as well. You can add additional columns to the df with different cases in test_categorical.

@galuhsahid
Copy link
Contributor Author

Thanks for the review! I've added another test for null, out-of-order values & unobserved category and updated the doc as well.

However when I tried to add another test to check if order is preserved, it seems like it's not preserved. I'm not really sure if this is expected or if I'm missing something:

import pandas as pd
from pandas.io.parquet import read_parquet, to_parquet

df = pd.DataFrame()
df["a"] = pd.Categorical(["a", "b", "c", "a"], categories=["b", "c", "d"], ordered=True)

df.to_parquet(<path>)
actual = read_parquet(<path>)

df["a"]    
0    NaN
1      b
2      c
3    NaN
Name: a, dtype: category
Categories (3, object): [b < c < d]

actual["a"]
0    NaN
1      b
2      c
3    NaN
Name: a, dtype: category
Categories (3, object): [b, c, d]

If it's expected then maybe I can add another test that expects it to still return a categorical data type without order, if needed.

@jorisvandenbossche
Copy link
Member

However when I tried to add another test to check if order is preserved, it seems like it's not preserved. I'm not really sure if this is expected or if I'm missing something:

Indeed, I see the same trying your code. That is certainly not expected I think. Can you open an issue for this? (Arrow uses a JIRA tracker: https://issues.apache.org/jira/projects/ARROW/issues; I can also open an issue if you want)

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The additional changes look good to me, thanks!

@jorisvandenbossche
Copy link
Member

Hmm, the failure on Azure is actually related:

_____________________ TestParquetPyArrow.test_categorical ______________________
[gw0] darwin -- Python 3.5.6 /Users/vsts/miniconda3/envs/pandas-dev/bin/python

self = <pandas.tests.io.test_parquet.TestParquetPyArrow object at 0x146354a90>
pa = 'pyarrow'

    def test_categorical(self, pa):
    
        # supported in >= 0.7.0
        df = pd.DataFrame()
        df["a"] = pd.Categorical(list("abcdef"))
    
        # test for null, out-of-order values, and unobserved category
        dtype = pd.CategoricalDtype(["foo", "bar", "baz"])
        df["b"] = pd.Categorical.from_codes(codes=[1, 0, 0, 1, -1, 1], dtype=dtype)
    
        if LooseVersion(pyarrow.__version__) >= LooseVersion("0.15.0"):
            check_round_trip(df, pa)
        else:
            # de-serialized as object for pyarrow < 0.15
            expected = df.assign(a=df.a.astype(object))
>           check_round_trip(df, pa, expected=expected)

pandas/tests/io/test_parquet.py:468: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
pandas/tests/io/test_parquet.py:175: in check_round_trip
    compare(repeat)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

repeat = 2

    def compare(repeat):
        for _ in range(repeat):
            df.to_parquet(path, **write_kwargs)
            with catch_warnings(record=True):
                actual = read_parquet(path, **read_kwargs)
    
>           tm.assert_frame_equal(expected, actual, check_names=check_names)
E           AssertionError: Attributes are different
E           
E           Attribute "dtype" are different
E           [left]:  CategoricalDtype(categories=['foo', 'bar', 'baz'], ordered=False)
E           [right]: object

pandas/tests/io/test_parquet.py:171: AssertionError

It seems that the expected still has the categorical dtype (so the assign failed to overwrite the original column), but not sure why this only happens on the mac/windows builds ..

@jorisvandenbossche
Copy link
Member

Ah, no, I already see ;). In the expected for the branch for older pyarrow versions, you still need to convert column 'b' to object dtype as well

@galuhsahid
Copy link
Contributor Author

Indeed, I see the same trying your code. That is certainly not expected I think. Can you open an issue for this? (Arrow uses a JIRA tracker: https://issues.apache.org/jira/projects/ARROW/issues; I can also open an issue if you want)

I've opened the issue here: https://issues.apache.org/jira/browse/ARROW-6302

@jorisvandenbossche
Copy link
Member

Thanks!

* master: (40 commits)
  DOC: Fix GL01 and GL02 errors in the docstrings (pandas-dev#27988)
  Remove Encoding of values in char** For Labels (pandas-dev#27618)
  TYPING: more type hints for io.formats.printing (pandas-dev#27765)
  TST: fix compression tests when run without virtualenv/condaenv (pandas-dev#28051)
  DOC: Start 0.25.2 (pandas-dev#28111)
  DOC: Fix docstrings lack of punctuation (pandas-dev#28031)
  DOC: Remove alias for numpy.random.randn from the docs (pandas-dev#28082)
  DOC: update GroupBy.head()/tail() documentation (pandas-dev#27844)
  BUG: timedelta merge asof with tolerance (pandas-dev#27650)
  BUG: Series.rename raises error on values accepted by Series construc… (pandas-dev#27814)
  Preserve index when setting new column on empty dataframe. (pandas-dev#26471)
  BUG: Fixed groupby quantile for listlike q (pandas-dev#27827)
  BUG: iter with readonly values, closes pandas-dev#28055 (pandas-dev#28074)
  TST: non-strict xfail for period test (pandas-dev#28072)
  DOC: Update whatsnew (pandas-dev#28073)
  CI: disable codecov (pandas-dev#28065)
  CI: Set SHA for codecov upload (pandas-dev#28067)
  BUG: Correct the previous bug fixing on xlim for plotting (pandas-dev#28059)
  CI: Add pip dependence explicitly (pandas-dev#28008)
  DOC: Change document code prun in a row (pandas-dev#28029)
  ...
@galuhsahid
Copy link
Contributor Author

@jorisvandenbossche I've added the test for the ordered flag. Is there any other test that you think we should have for now?

@@ -4702,7 +4702,7 @@ Several caveats.
indexes. This extra column can cause problems for non-Pandas consumers that are not expecting it. You can
force including or omitting indexes with the ``index`` argument, regardless of the underlying engine.
* Index level names, if specified, must be strings.
* Categorical dtypes can be serialized to parquet, but will de-serialize as ``object`` dtype.
* Categorical dtypes for non-string types can be serialized to parquet, but will de-serialize as ``object`` dtype.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this true with both engines?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • I've just tried categorical dtypes using integer and in pyarrow it gets de-serialized into float. Not sure if it's expected, if someone can confirm it'd be great.
>>> import pandas as pd
>>> from pandas.io.parquet import read_parquet, to_parquet

>>> df["a"] = pd.Categorical(["a", "b", "c", "a"], categories=["b", "c", "d"], ordered=True)
>>> df["b"] = pd.Categorical([1, 2, 3, 1], categories=[2, 3, 4], ordered=True)

>>> df["a"]
0    NaN
1      b
2      c
3    NaN
Name: a, dtype: category
Categories (3, object): [b < c < d]

>>> df["b"]
0    NaN
1      2
2      3
3    NaN
Name: b, dtype: category
Categories (3, int64): [2 < 3 < 4]

>>> df.to_parquet("test_pyarrow.parquet", engine="pyarrow")
>>> actual_pyarrow = df.read_parquet("test_pyarrow.parquet", engine="fastparquet")

>>> actual_pyarrow["a"]
0    NaN
1      b
2      c
3    NaN
Name: a, dtype: category
Categories (3, object): [b, c, d]

>>> actual_pyarrow["b"]
0    NaN
1    2.0
2    3.0
3    NaN
Name: b, dtype: float64
  • With fastparquet, both string and non-string types get de-serialized back as category, however it does not preserve order for both, while in pyarrow (for string) we do preserve order. Should we make this clear in the docs?
>>> import pandas as pd
>>> from pandas.io.parquet import read_parquet, to_parquet

>>> df["a"] = pd.Categorical(["a", "b", "c", "a"], categories=["b", "c", "d"], ordered=True)
>>> df["b"] = pd.Categorical([1, 2, 3, 1], categories=[2, 3, 4], ordered=True)

>>> df["a"]
0    NaN
1      b
2      c
3    NaN
Name: a, dtype: category
Categories (3, object): [b < c < d]

>>> df["b"]
0    NaN
1      2
2      3
3    NaN
Name: b, dtype: category
Categories (3, int64): [2 < 3 < 4]

>>> df.to_parquet("test_fastparquet.parquet", engine="fastparquet")
>>> actual_fastparquet = df.read_parquet("test_fastparquet.parquet", engine="fastparquet")

>>> actual_fastparquet["a"]
0    NaN
1      b
2      c
3    NaN
Name: a, dtype: category
Categories (3, object): [b, c, d]

>>> actual_fastparquet["b"]
0    NaN
1      2
2      3
3    NaN
Name: b, dtype: category
Categories (3, int64): [2, 3, 4]

df["a"] = pd.Categorical(list("abcdef"))

# test for null, out-of-order values, and unobserved category
dtype = pd.CategoricalDtype(["foo", "bar", "baz"])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you write this as

b = pd.Categorical(['bar', 'foo', 'foo', 'bar', None, 'bar'], dtype=pd.CategoricalDtype(['foo', 'bar', 'baz']))

rather than using from_codes?

check_round_trip(df, pa)
else:
# de-serialized as object for pyarrow < 0.15
expected = df.assign(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

expected = df.astype(object)?

@@ -88,7 +88,7 @@ Categorical
^^^^^^^^^^^

- Added test to assert the :func:`fillna` raises the correct ValueError message when the value isn't a value from categories (:issue:`13628`)
-
- Added test to assert roundtripping to parquet with :func:`to_parquet` or :func:`read_parquet` will preserve Categorical dtypes for string types (:issue:`27955`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DataFrame.to_parquet

@jorisvandenbossche
Copy link
Member

I've just tried categorical dtypes using integer and in pyarrow it gets de-serialized into float. Not sure if it's expected, if someone can confirm it'd be great.

For Arrow, that is expected, see https://issues.apache.org/jira/browse/ARROW-6140 (the fact that it is float and not int is because there are missing values)

With fastparquet, both string and non-string types get de-serialized back as category, however it does not preserve order for both, while in pyarrow (for string) we do preserve order. Should we make this clear in the docs?

That could be mentioned yes.

@galuhsahid
Copy link
Contributor Author

I've updated the doc to explain more about the differences regarding how the categorical support works between pyarrow and fastparquet

# de-serialized as object
expected = df.assign(a=df.a.astype(object))
check_round_trip(df, pa, expected=expected)
if LooseVersion(pyarrow.__version__) >= LooseVersion("0.15.0"):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jorisvandenbossche this isn't released yet, right? Should we wait to merge until 0.15.0 is released?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this isn't released yet. I can assure that it runs locally for me on Arrow master (if I change the version check to > LooseVersion("0.14.1.dev")), so I am OK to merge this, but also fine to wait a bit more (0.15.0 will normally happen somewhere end of next week)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since Arrow 0.15.0 has been released, can we merge this now?

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pls merge master

@@ -174,6 +174,7 @@ Categorical
- Added test to assert the :func:`fillna` raises the correct ValueError message when the value isn't a value from categories (:issue:`13628`)
- Bug in :meth:`Categorical.astype` where ``NaN`` values were handled incorrectly when casting to int (:issue:`28406`)
- :meth:`Categorical.searchsorted` and :meth:`CategoricalIndex.searchsorted` now work on unordered categoricals also (:issue:`21667`)
- Added test to assert roundtripping to parquet with :func:`DataFrame.to_parquet` or :func:`read_parquet` will preserve Categorical dtypes for string types (:issue:`27955`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this note is not really needed, but ok

@@ -4734,7 +4735,8 @@ See the documentation for `pyarrow <https://arrow.apache.org/docs/python/>`__ an
'd': np.arange(4.0, 7.0, dtype='float64'),
'e': [True, False, True],
'f': pd.date_range('20130101', periods=3),
'g': pd.date_range('20130101', periods=3, tz='US/Eastern')})
'g': pd.date_range('20130101', periods=3, tz='US/Eastern'),
'h': pd.Categorical(list('abc'))})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we add an ordered categorical as well?

* master: (22 commits)
  DOC: fix PR09,PR08 errors for pandas.Timestamp (pandas-dev#28739)
  WEB: Add diversity note to team.md (pandas-dev#28630)
  DOC: Minor fixes in pandas/testing.py docstring. (pandas-dev#28752)
  TST: port maybe_promote tests from pandas-dev#23982 (pandas-dev#28764)
  Bugfix/groupby datetime issue (pandas-dev#28569)
  reenable codecov (pandas-dev#28750)
  CLN: Centralised _check_percentile (pandas-dev#27584)
  DEPR: Deprecate Index.set_value (pandas-dev#28621)
  CLN: Fix typo in contributing.rst (pandas-dev#28761)
  Fixed docstring errors in pandas.period range and pandas.PeriodIndex (pandas-dev#28756)
  BUG: Fix TypeError raised in libreduction (pandas-dev#28643)
  DOC: Pandas.Series.drop docstring PR02 (pandas-dev#27976) (pandas-dev#28742)
  DOC: Fixed doctring errors PR08, PR09 in pandas.io (pandas-dev#28748)
  TST: Fix broken test cases where Timedelta/Timestamp raise (pandas-dev#28729)
  REF: Consolidate alignment calls in DataFrame ops (pandas-dev#28638)
  BUG: Fix dep generation (pandas-dev#28734)
  Added doctstring to fixture (pandas-dev#28727)
  DOC: Fixed PR06 docstrings errors in pandas.timedelta_range (pandas-dev#28719)
  replaced safe_import with a corresponding test decorator (pandas-dev#28731)
  BUG: Fix RangeIndex.get_indexer for decreasing RangeIndex (pandas-dev#28680)
  ...
@jorisvandenbossche jorisvandenbossche merged commit 6056b38 into pandas-dev:master Oct 4, 2019
@jorisvandenbossche
Copy link
Member

@galuhsahid Thanks a lot for this PR! (and for the fixes on the pyarrow side as well! :-))

@galuhsahid galuhsahid deleted the parquet-categorical branch October 5, 2019 00:20
@galuhsahid galuhsahid restored the parquet-categorical branch October 5, 2019 00:20
@galuhsahid galuhsahid deleted the parquet-categorical branch October 5, 2019 00:23
proost pushed a commit to proost/pandas that referenced this pull request Dec 19, 2019
proost pushed a commit to proost/pandas that referenced this pull request Dec 19, 2019
bongolegend pushed a commit to bongolegend/pandas that referenced this pull request Jan 1, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Docs IO Parquet parquet, feather Testing pandas testing functions or related to the test suite
Projects
None yet
Development

Successfully merging this pull request may close these issues.

DOC/TST: update the parquet (pyarrow >= 0.15) docs and tests regarding Categorical support
4 participants