DOC/TST: Update the parquet (pyarrow >= 0.15) docs and tests regarding Categorical support #28018

Merged: 17 commits, Oct 4, 2019
Changes from 14 commits
6 changes: 4 additions & 2 deletions doc/source/user_guide/io.rst
@@ -4702,7 +4702,8 @@ Several caveats.
   indexes. This extra column can cause problems for non-Pandas consumers that are not expecting it. You can
   force including or omitting indexes with the ``index`` argument, regardless of the underlying engine.
 * Index level names, if specified, must be strings.
-* Categorical dtypes can be serialized to parquet, but will de-serialize as ``object`` dtype.
+* In the ``pyarrow`` engine, categorical dtypes for non-string types can be serialized to parquet, but will de-serialize as their primitive dtype.
+* The ``pyarrow`` engine preserves the ``ordered`` flag of categorical dtypes with string types. ``fastparquet`` does not preserve the ``ordered`` flag.
 * Non supported types include ``Period`` and actual Python object types. These will raise a helpful error message
   on an attempt at serialization.
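
The categorical caveats in this list can be checked with a small round trip. The following is a minimal sketch and not part of the PR: the helper name `parquet_roundtrip_dtypes` and the column names `s` and `i` are hypothetical, and it assumes pandas with the `pyarrow` engine installed (the round trip is skipped otherwise). The dtypes that come back depend on the installed pyarrow version, so no expected output is shown.

```python
import io

import pandas as pd


def parquet_roundtrip_dtypes(df, engine="pyarrow"):
    """Write df to an in-memory parquet file and return the dtypes read back."""
    buf = io.BytesIO()
    df.to_parquet(buf, engine=engine)
    buf.seek(0)
    return pd.read_parquet(buf, engine=engine).dtypes


# Hypothetical example frame: one ordered string categorical, one integer categorical.
df = pd.DataFrame(
    {
        "s": pd.Categorical(["a", "b", "a"], categories=["a", "b"], ordered=True),
        "i": pd.Categorical([1, 2, 1]),
    }
)

try:
    import pyarrow  # noqa: F401  (only used to check that the engine is available)
except ImportError:
    restored = None  # no pyarrow engine available, nothing to compare
else:
    restored = parquet_roundtrip_dtypes(df)
    print(df.dtypes)
    print(restored)
```

Comparing `df.dtypes` with `restored` shows which columns kept their categorical dtype and `ordered` flag for the engine version in use.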

@@ -4726,7 +4727,8 @@ See the documentation for `pyarrow <https://arrow.apache.org/docs/python/>`__ an
 'd': np.arange(4.0, 7.0, dtype='float64'),
 'e': [True, False, True],
 'f': pd.date_range('20130101', periods=3),
-'g': pd.date_range('20130101', periods=3, tz='US/Eastern')})
+'g': pd.date_range('20130101', periods=3, tz='US/Eastern'),
+'h': pd.Categorical(list('abc'))})
Contributor
can we add an ordered categorical as well?


df
df.dtypes
2 changes: 1 addition & 1 deletion doc/source/whatsnew/v1.0.0.rst
@@ -88,7 +88,7 @@ Categorical
 ^^^^^^^^^^^

 - Added test to assert the :func:`fillna` raises the correct ValueError message when the value isn't a value from categories (:issue:`13628`)
--
+- Added test to assert roundtripping to parquet with :func:`DataFrame.to_parquet` or :func:`read_parquet` will preserve Categorical dtypes for string types (:issue:`27955`)
Contributor
this note is not really needed, but ok

-


25 changes: 21 additions & 4 deletions pandas/tests/io/test_parquet.py
@@ -1,5 +1,6 @@
 """ test parquet compat """
 import datetime
+from distutils.version import LooseVersion
 import os
 from warnings import catch_warnings
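
The test gates its expectation on the installed pyarrow version through `LooseVersion`. Because `distutils` was removed from the standard library in Python 3.12, the same kind of gate can be sketched with plain tuples. This is an illustrative, standard-library-only sketch, not the PR's code: `release_tuple` and `supports_categorical_roundtrip` are hypothetical helper names, and suffixes such as `.dev` or `rc1` are simply ignored, which is cruder than a real version parser.

```python
import re


def release_tuple(version):
    """Return the leading numeric components of a version string, padded to 3 ints."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break  # stop at the first non-numeric component, e.g. "dev"
        parts.append(int(match.group()))
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def supports_categorical_roundtrip(pyarrow_version):
    """True when the version is at least 0.15.0, the threshold used in the test."""
    return release_tuple(pyarrow_version) >= (0, 15, 0)


for candidate in ["0.14.1", "0.14.1.dev", "0.15.0", "1.0.0"]:
    print(candidate, supports_categorical_roundtrip(candidate))
```

Under this sketch a development build that still reports a `0.14.1.dev` version compares below `0.15.0`, so a `>= 0.15.0` gate stays off for it.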

@@ -166,6 +167,7 @@ def compare(repeat):
             df.to_parquet(path, **write_kwargs)
             with catch_warnings(record=True):
                 actual = read_parquet(path, **read_kwargs)
+
             tm.assert_frame_equal(expected, actual, check_names=check_names)

     if path is None:
@@ -451,11 +453,26 @@ def test_unsupported(self, pa):
     def test_categorical(self, pa):

         # supported in >= 0.7.0
-        df = pd.DataFrame({"a": pd.Categorical(list("abc"))})
+        df = pd.DataFrame()
+        df["a"] = pd.Categorical(list("abcdef"))
+
+        # test for null, out-of-order values, and unobserved category
+        df["b"] = pd.Categorical(
+            ["bar", "foo", "foo", "bar", None, "bar"],
+            dtype=pd.CategoricalDtype(["foo", "bar", "baz"]),
+        )
+
+        # test for ordered flag
+        df["c"] = pd.Categorical(
+            ["a", "b", "c", "a", "c", "b"], categories=["b", "c", "d"], ordered=True
+        )

-        # de-serialized as object
-        expected = df.assign(a=df.a.astype(object))
-        check_round_trip(df, pa, expected=expected)
+        if LooseVersion(pyarrow.__version__) >= LooseVersion("0.15.0"):
Contributor
@jorisvandenbossche this isn't released yet, right? Should we wait to merge until 0.15.0 is released?

Member
Yes, this isn't released yet. I can confirm that it runs locally for me on Arrow master (if I change the version check to > LooseVersion("0.14.1.dev")), so I am OK with merging this, but I am also fine with waiting a bit longer (0.15.0 should be released around the end of next week).

Contributor Author
Since Arrow 0.15.0 has been released, can we merge this now?

+            check_round_trip(df, pa)
+        else:
+            # de-serialized as object for pyarrow < 0.15
+            expected = df.astype(object)
+            check_round_trip(df, pa, expected=expected)
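
The `b` and `c` columns built in this test cover several categorical corner cases at once. As an illustrative sketch (pandas only, no parquet engine needed; the variable and helper names are chosen here and are not taken from the PR), the category codes make those cases visible. A code of -1 marks a missing value, which in the pandas versions this PR targeted is assigned both to `None` and to values that are not among the declared categories, such as `"a"` in column `c`.

```python
import pandas as pd

# Same inputs as columns "b" and "c" in test_categorical.
b = pd.Categorical(
    ["bar", "foo", "foo", "bar", None, "bar"],
    dtype=pd.CategoricalDtype(["foo", "bar", "baz"]),
)
c = pd.Categorical(
    ["a", "b", "c", "a", "c", "b"], categories=["b", "c", "d"], ordered=True
)


def unobserved(cat):
    """Categories that are declared on cat but never appear in its values."""
    used = cat.remove_unused_categories().categories
    return cat.categories.difference(used).tolist()


b_codes = b.codes.tolist()  # [1, 0, 0, 1, -1, 1]: out-of-order values plus one null
c_codes = c.codes.tolist()  # [-1, 0, 1, -1, 1, 0]: "a" is not a category, so it is missing

print(b_codes, unobserved(b), b.ordered)
print(c_codes, unobserved(c), c.ordered)

# The fallback expectation for pyarrow < 0.15: every column becomes object dtype.
object_dtypes = pd.DataFrame({"b": b, "c": c}).astype(object).dtypes
print(object_dtypes)
```

A round trip that preserves these columns therefore has to keep the category order, the unobserved categories (`"baz"` and `"d"`), the missing values, and the `ordered` flag, not just the visible values.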

     def test_s3_roundtrip(self, df_compat, s3_resource, pa):
         # GH #19134