
[Python] Pandas categorical type doesn't survive a round-trip through parquet #21930

Closed
asfimport opened this issue Jun 2, 2019 · 9 comments

asfimport (Collaborator) commented Jun 2, 2019

Writing a string categorical variable from pandas to parquet and reading it back yields a string (object dtype) column. I expected it to be read back as category.
The same thing happens if the category is numeric – a numeric category is read back as int64.

In the code below, I also tried an in-memory Arrow Table, which successfully translates categories back to pandas. However, when I write to a parquet file and read it back, the categorical dtype is not preserved.

In the scheme of things, this isn't a big deal, but it's a small surprise.

import pandas as pd
import pyarrow as pa


df = pd.DataFrame({'x': pd.Categorical(['a', 'a', 'b', 'b'])})
df.dtypes  # category

# This works:
pa.Table.from_pandas(df).to_pandas().dtypes  # category

df.to_parquet("categories.parquet")
# This reads back object, but I expected category
pd.read_parquet("categories.parquet").dtypes  # object


# Numeric categories have the same issue:
df_num = pd.DataFrame({'x': pd.Categorical([1, 1, 2, 2])})
df_num.dtypes # category

pa.Table.from_pandas(df_num).to_pandas().dtypes  # category

df_num.to_parquet("categories_num.parquet")
# This reads back int64, but I expected category
pd.read_parquet("categories_num.parquet").dtypes  # int64

Environment:
python: 3.7.3.final.0
python-bits: 64
OS: Linux
OS-release: 5.0.0-15-generic
machine: x86_64
processor: x86_64
byteorder: little
pandas: 0.24.2
numpy: 1.16.4
pyarrow: 0.13.0

Reporter: Karl Dunkle Werner / @karldw
Assignee: Wes McKinney / @wesm

Related issues:

Externally tracked issue: pandas-dev/pandas#26616

PRs and other links:

Note: This issue was originally created as ARROW-5480. Please see the migration documentation for further details.

Wes McKinney / @wesm:
Parquet has dictionary encoding as a compression strategy but does not have a Categorical type per se. As part of ARROW-3246 we should eventually be able to preserve Categorical through Parquet round trips, but there are some tricky issues to sort out.
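
A minimal sketch of the distinction drawn here, assuming the pyarrow.parquet API (write_table's use_dictionary flag controls the physical page encoding; nothing here is taken from the original report beyond the example data):

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.Table.from_pandas(pd.DataFrame({'x': pd.Categorical(['a', 'a', 'b'])}))

# Dictionary encoding here is an on-disk compression choice for the column chunks.
pq.write_table(table, 'dict_encoded.parquet', use_dictionary=True)

# Whether the column comes back as an Arrow dictionary (pandas categorical) type
# is a separate, logical-type question -- the subject of this issue.
print(pq.read_table('dict_encoded.parquet').schema)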

Joris Van den Bossche / @jorisvandenbossche:
@wesm I think this can be closed as a duplicate of the other issue?

Wes McKinney / @wesm:
I'm not sure – I think the scope of work in ARROW-3246 may be slightly different. I'd like to look at the Parquet-Categorical stuff sometime this month so I can look at both issues more closely then

Wes McKinney / @wesm:
One slightly higher level issue is the extent to which we store Arrow schema information in the Parquet metadata. I have been thinking that we should actually store the whole serialized schema in the Parquet footer as an IPC message, so that we can refer to it when reading the file to set various read options
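
This is roughly the approach pyarrow later adopted; a short sketch, assuming a recent pyarrow that writes the serialized Arrow schema into the Parquet key-value metadata under the 'ARROW:schema' key:

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({'x': pd.Categorical(['a', 'a', 'b', 'b'])})
pq.write_table(pa.Table.from_pandas(df), 'categories.parquet')

# Inspect the key-value metadata in the Parquet footer: recent pyarrow versions
# store the serialized Arrow schema (an IPC message) next to the pandas metadata.
meta = pq.read_metadata('categories.parquet')
print(list(meta.metadata.keys()))  # e.g. [b'pandas', b'ARROW:schema']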

Joris Van den Bossche / @jorisvandenbossche:

One slightly higher level issue is the extent to which we store Arrow schema information in the Parquet metadata.

Possibly related to ARROW-5888, where we also need to store arrow-specific metadata for faithful roundtrip (in that case the timezone).

Spark stores all the column types (and optional column metadata) in the key_value_metadata of the FileMetadata:

For example, for a file with a single int column:

>>> import pyarrow.parquet as pq
>>> meta = pq.read_metadata('test_pyspark_dataset/_metadata')
>>> meta.metadata
{b'org.apache.spark.sql.parquet.row.metadata': b'{"type":"struct","fields":[{"name":"a","type":"integer","nullable":true,"metadata":{}}]}'}

Wes McKinney / @wesm:
I think this should be working now. I'll add some unit tests to close out this issue

Wes McKinney / @wesm:
Resolved in c4b8cb6

Anton Kukushkin:
The second example, with numeric categoricals, still doesn't work (pyarrow 7.0.0). Can you reopen, please?

Joris Van den Bossche / @jorisvandenbossche:
[~kukughking] there are already several other issues about this. The underlying issue is that we need to support reading types other than BYTE_ARRAY into dictionary type for Parquet, which is covered in ARROW-6140. As a result, for now the categorical dtype is only preserved for string data and not for boolean or numeric data types; see e.g. ARROW-13342 and ARROW-11157 for other reports on this topic with some additional explanation/discussion.
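
A short sketch of the behavior described above, assuming a pyarrow version recent enough to preserve string categoricals (>= 1.0); the file names are illustrative:

import pandas as pd

# String categoricals survive the round trip (BYTE_ARRAY columns can be read
# back directly as dictionary type)...
pd.DataFrame({'x': pd.Categorical(['a', 'b'])}).to_parquet('str_cat.parquet')
print(pd.read_parquet('str_cat.parquet').dtypes)   # x    category

# ...while numeric categoricals currently do not (ARROW-6140): the column comes
# back as its plain value type.
pd.DataFrame({'x': pd.Categorical([1, 2])}).to_parquet('num_cat.parquet')
print(pd.read_parquet('num_cat.parquet').dtypes)   # x    int64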
