
[Python] Pandas categorical type doesn't survive a round-trip through parquet #21930

Closed
asfimport opened this issue Jun 2, 2019 · 9 comments

asfimport (Collaborator) commented Jun 2, 2019

Writing a string categorical variable from pandas to parquet and reading it back yields a string (object dtype) column. I expected it to be read back as category.
The same thing happens if the category is numeric – a numeric category is read back as int64.

In the code below, I also tried an in-memory Arrow Table, which successfully translates categories back to pandas. However, when I write to a parquet file and read it back, the categorical dtype is not preserved.

In the scheme of things, this isn't a big deal, but it's a small surprise.

import pandas as pd
import pyarrow as pa


df = pd.DataFrame({'x': pd.Categorical(['a', 'a', 'b', 'b'])})
df.dtypes  # category

# This works:
pa.Table.from_pandas(df).to_pandas().dtypes  # category

df.to_parquet("categories.parquet")
# This reads back object, but I expected category
pd.read_parquet("categories.parquet").dtypes  # object


# Numeric categories have the same issue:
df_num = pd.DataFrame({'x': pd.Categorical([1, 1, 2, 2])})
df_num.dtypes # category

pa.Table.from_pandas(df_num).to_pandas().dtypes  # category

df_num.to_parquet("categories_num.parquet")
# This reads back int64, but I expected category
pd.read_parquet("categories_num.parquet").dtypes  # int64

Environment:
python: 3.7.3.final.0
python-bits: 64
OS: Linux
OS-release: 5.0.0-15-generic
machine: x86_64
processor: x86_64
byteorder: little
pandas: 0.24.2
numpy: 1.16.4
pyarrow: 0.13.0

Reporter: Karl Dunkle Werner / @karldw
Assignee: Wes McKinney / @wesm

Related issues:

Externally tracked issue: pandas-dev/pandas#26616

PRs and other links:

Note: This issue was originally created as ARROW-5480. Please see the migration documentation for further details.

Wes McKinney / @wesm:
Parquet has dictionary encoding as a compression strategy but does not have a Categorical type per se. As part of ARROW-3246 we should eventually be able to preserve Categorical through Parquet round trips, but there are some tricky issues to sort out.
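
A minimal sketch of the distinction drawn here, assuming the pyarrow.parquet API (write_table's use_dictionary flag controls the physical page encoding; nothing here is taken from the original report beyond the example data):

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.Table.from_pandas(pd.DataFrame({'x': pd.Categorical(['a', 'a', 'b'])}))

# Dictionary encoding here is an on-disk compression choice for the column chunks.
pq.write_table(table, 'dict_encoded.parquet', use_dictionary=True)

# Whether the column comes back as an Arrow dictionary (pandas categorical) type
# is a separate, logical-type question -- the subject of this issue.
print(pq.read_table('dict_encoded.parquet').schema)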

Joris Van den Bossche / @jorisvandenbossche:
@wesm I think this can be closed as a duplicate of the other issue?

Wes McKinney / @wesm:
I'm not sure – I think the scope of work in ARROW-3246 may be slightly different. I'd like to look at the Parquet-Categorical stuff sometime this month so I can look at both issues more closely then

Wes McKinney / @wesm:
One slightly higher level issue is the extent to which we store Arrow schema information in the Parquet metadata. I have been thinking that we should actually store the whole serialized schema in the Parquet footer as an IPC message, so that we can refer to it when reading the file to set various read options
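
This is roughly the approach pyarrow later adopted; a short sketch, assuming a recent pyarrow that writes the serialized Arrow schema into the Parquet key-value metadata under the 'ARROW:schema' key:

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({'x': pd.Categorical(['a', 'a', 'b', 'b'])})
pq.write_table(pa.Table.from_pandas(df), 'categories.parquet')

# Inspect the key-value metadata in the Parquet footer: recent pyarrow versions
# store the serialized Arrow schema (an IPC message) next to the pandas metadata.
meta = pq.read_metadata('categories.parquet')
print(list(meta.metadata.keys()))  # e.g. [b'pandas', b'ARROW:schema']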

Joris Van den Bossche / @jorisvandenbossche:

One slightly higher level issue is the extent to which we store Arrow schema information in the Parquet metadata.

Possibly related to ARROW-5888, where we also need to store arrow-specific metadata for faithful roundtrip (in that case the timezone).

Spark stores all the column types (and optional column metadata) in the key_value_metadata of the FileMetadata:

For example, for a file with a single int column:

>>> import pyarrow.parquet as pq
>>> meta = pq.read_metadata('test_pyspark_dataset/_metadata')
>>> meta.metadata
{b'org.apache.spark.sql.parquet.row.metadata': b'{"type":"struct","fields":[{"name":"a","type":"integer","nullable":true,"metadata":{}}]}'}

Wes McKinney / @wesm:
I think this should be working now. I'll add some unit tests to close out this issue

Wes McKinney / @wesm:
Resolved in c4b8cb6

Anton Kukushkin:
The second example, with numeric categoricals, still doesn't work (pyarrow 7.0.0). Can you reopen, please?

Joris Van den Bossche / @jorisvandenbossche:
[~kukughking] there are already several other issues about this. The underlying issue is that we need to support reading types other than BYTE_ARRAY into dictionary type for Parquet, which is covered in ARROW-6140. As a result, for now the categorical dtype is only preserved for string data and not for boolean or numeric data types; see e.g. ARROW-13342 and ARROW-11157 for other reports on this topic with some additional explanation/discussion.
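
A short sketch of the behavior described above, assuming a pyarrow version recent enough to preserve string categoricals (>= 1.0); the file names are illustrative:

import pandas as pd

# String categoricals survive the round trip (BYTE_ARRAY columns can be read
# back directly as dictionary type)...
pd.DataFrame({'x': pd.Categorical(['a', 'b'])}).to_parquet('str_cat.parquet')
print(pd.read_parquet('str_cat.parquet').dtypes)   # x    category

# ...while numeric categoricals currently do not (ARROW-6140): the column comes
# back as its plain value type.
pd.DataFrame({'x': pd.Categorical([1, 2])}).to_parquet('num_cat.parquet')
print(pd.read_parquet('num_cat.parquet').dtypes)   # x    int64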
