[Python][Parquet] direct reading/writing of pandas categoricals in parquet #19588
Comments
Wes McKinney / @wesm: |
Martin Durant / @martindurant:
Yes, exactly what I was trying to say. However, a great optimisation in that specific case.
Wes McKinney / @wesm: |
Wes McKinney / @wesm: I think the only way to fix the current situation is to add a cc @xhochy @hatemhelal for any thoughts
Wes McKinney / @wesm: |
Hatem Helal / @hatemhelal: As an aside, has it ever been considered to automatically tune the size of the dictionary page? I think for the limited case where of writing
Wes McKinney / @wesm: The dictionary page size issue is usually handled through the WriterProperties: https://github.com/apache/arrow/blob/master/cpp/src/parquet/properties.h#L178. If the dictionary is written all at once, then this property can be circumvented; that would be my plan.
Wes McKinney / @wesm: |
I like that plan.
Parquet supports "dictionary encoding" of column data in a manner very similar to the concept of Categoricals in pandas. It is natural to use this encoding for a column which originated as a categorical. Conversely, when loading, if the file metadata says that a given column came from a pandas (or arrow) categorical, then we can trust that the whole of the column is dictionary-encoded and load the data directly into a categorical column, rather than expanding the labels upon load and recategorising later.
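To make the two load paths concrete, here is a minimal pure-Python sketch of dictionary encoding. It does not use the Parquet, Arrow, or pandas APIs; `dictionary_encode`, `load_expanded`, and `load_direct` are hypothetical names chosen purely for illustration.

```python
from typing import Hashable, Dict, List, Sequence, Tuple


def dictionary_encode(values: Sequence[Hashable]) -> Tuple[List[Hashable], List[int]]:
    """Split values into (distinct labels in first-seen order, integer codes)."""
    dictionary: List[Hashable] = []
    positions: Dict[Hashable, int] = {}
    codes: List[int] = []
    for value in values:
        if value not in positions:
            positions[value] = len(dictionary)
            dictionary.append(value)
        codes.append(positions[value])
    return dictionary, codes


def load_expanded(dictionary, codes):
    """Expand every code back to its label, then categorise again."""
    expanded = [dictionary[code] for code in codes]
    return dictionary_encode(expanded)


def load_direct(dictionary, codes):
    """Reuse the stored dictionary and codes as they are (no expansion)."""
    return list(dictionary), list(codes)


labels = ["a", "b", "a", "c", "b", "a"]
dictionary, codes = dictionary_encode(labels)

# Both paths yield the same categorical; the direct path skips the
# intermediate list of expanded labels.
same_result = load_direct(dictionary, codes) == load_expanded(dictionary, codes)
```

In this toy model the direct path is only safe because the whole column uses a single dictionary, which is the guarantee the pandas metadata is meant to provide.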
If the data does not have the pandas metadata, then the guarantee cannot hold, and we cannot assume either that the whole column is dictionary encoded or that the labels are the same throughout. In this case, the current behaviour is fine.
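The case without the guarantee can be sketched as a toy model in which a column is a sequence of chunks, each either dictionary-encoded or plainly encoded. The `Chunk` layout and the functions `shares_one_dictionary` and `load_column` are assumptions made for illustration, not the Parquet or Arrow reader API.

```python
from typing import List, Optional, Sequence, Tuple

# Toy column chunk: (dictionary, payload).
# If dictionary is None the chunk is plainly encoded and payload holds labels;
# otherwise payload holds integer codes into that dictionary.
Chunk = Tuple[Optional[List[str]], list]


def shares_one_dictionary(chunks: Sequence[Chunk]) -> bool:
    """True only if every chunk is dictionary-encoded with the same labels."""
    if not chunks:
        return False
    first = chunks[0][0]
    return first is not None and all(d == first for d, _ in chunks)


def load_column(chunks: Sequence[Chunk]):
    """Return (dictionary, codes) when a direct load is safe,
    otherwise (None, expanded labels)."""
    if shares_one_dictionary(chunks):
        codes = [code for _, payload in chunks for code in payload]
        return list(chunks[0][0]), codes
    labels: List[str] = []
    for dictionary, payload in chunks:
        if dictionary is None:
            labels.extend(payload)  # already plain labels
        else:
            labels.extend(dictionary[code] for code in payload)
    return None, labels


uniform = [(["a", "b"], [0, 1]), (["a", "b"], [1, 1])]
mixed = [(["a", "b"], [0, 1]), (None, ["c", "a"])]
differing = [(["a", "b"], [0, 1]), (["b", "a"], [0, 1])]
```

Here `uniform` loads directly into a dictionary and codes, while `mixed` and `differing` fall back to expanded labels, mirroring the current behaviour described above.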
(please forgive that some of this has already been mentioned elsewhere; this is one of the entries in the list at dask/fastparquet#374 as a feature that is useful in fastparquet)
Reporter: Martin Durant / @martindurant
Assignee: Wes McKinney / @wesm
Note: This issue was originally created as ARROW-3246. Please see the migration documentation for further details.