
[Python][Parquet] direct reading/writing of pandas categoricals in parquet #19588

Closed
asfimport opened this issue Sep 17, 2018 · 12 comments
asfimport commented Sep 17, 2018

Parquet supports "dictionary encoding" of column data in a manner very similar to the concept of Categoricals in pandas. It is natural to use this encoding for a column which originated as a categorical. Conversely, when loading, if the file metadata says that a given column came from a pandas (or arrow) categorical, then we can trust that the whole of the column is dictionary-encoded and load the data directly into a categorical column, rather than expanding the labels upon load and recategorising later.

If the data does not have the pandas metadata, then the guarantee cannot hold, and we cannot assume either that the whole column is dictionary encoded or that the labels are the same throughout. In this case, the current behaviour is fine.
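
For illustration, here is a minimal pyarrow sketch of the round trip this issue is about (file and column names are arbitrary, and the behaviour shown assumes a reasonably recent pyarrow):

```python
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# A pandas categorical becomes an Arrow DictionaryArray, which Parquet
# stores using dictionary encoding.
df = pd.DataFrame({"label": pd.Categorical(["a", "b", "a", "c"])})
pq.write_table(pa.Table.from_pandas(df), "example.parquet")

# On read, the pandas metadata stored in the file lets to_pandas() restore
# the column as a categorical. The optimization proposed here is to build
# that categorical directly from the file's dictionary pages instead of
# expanding the labels and re-hashing them.
restored = pq.read_table("example.parquet").to_pandas()
print(restored["label"].dtype)  # category
```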

 

(Please forgive that some of this has already been mentioned elsewhere; this is one of the entries in the list at dask/fastparquet#374 of features that are useful in fastparquet.)

Reporter: Martin Durant / @martindurant
Assignee: Wes McKinney / @wesm

Note: This issue was originally created as ARROW-3246. Please see the migration documentation for further details.

Wes McKinney / @wesm:
This can only be implemented in the narrow case where there is metadata indicating that the dictionary in each row group is expected to be the same (as a result of having been written by pandas). Otherwise, in general, the observed dictionaries may not be the same from row group to row group.

Martin Durant / @martindurant:

  can only be implemented in the narrow case

Yes, exactly what I was trying to say. However, a great optimisation in that specific case.

Wes McKinney / @wesm:
I moved this to 0.14. A bit of work will be needed in order to be able to sidestep hashing to categorical. If we can read BYTE_ARRAY columns directly back as Categorical (but have to hash), that is a good first step.
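
As a Python-level sketch of that first step, to_pandas already accepts a categories argument that converts named string columns to pandas categoricals after the read, which still hashes the expanded values rather than reusing the Parquet dictionary (column and file names here are illustrative):

```python
import pyarrow.parquet as pq

# Read the BYTE_ARRAY (string) column back, then convert it to a pandas
# categorical during the to_pandas() step; the categories are rebuilt by
# hashing the values rather than taken directly from the Parquet dictionary.
table = pq.read_table("example.parquet")
df = table.to_pandas(categories=["label"])
print(df["label"].dtype)  # category
```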

Wes McKinney / @wesm:
I've been looking at what's required to write arrow::DictionaryArray directly into the appropriate lower-level ColumnWriter class. The trouble with the way the software is layered right now is that there is a "Chinese wall" between TypedColumnWriter<T> and the Arrow write layer. We can only communicate with this class using the Parquet C types such as ByteArray and FixedLenByteArray. This is also a performance issue since we cannot write directly into the writer from arrow::BinaryArray or similar cases where it might make sense.

I think the only way to fix the current situation is to add a TypedColumnWriter<T>::WriteArrow(const ::arrow::Array&) method and "push down" a lot of the logic that's currently in parquet/arrow/writer.cc into the TypedColumnWriter<T> implementation. This will enable us to do various write performance optimizations and also address the direct dictionary write issue. This is not a small project, but I would say that it's overdue and will put us on a better footing going forward.

cc @xhochy @hatemhelal for any thoughts

Wes McKinney / @wesm:
I created ARROW-6152 to cover the initial feature-preserving refactoring. I estimate about a day of effort for that and will report in once I make a little progress.

Hatem Helal / @hatemhelal:
Adding TypedColumnWriter<T>::WriteArrow(const ::arrow::Array&) makes a lot of sense to me. @wesm do you have a list of cases that you know can be optimized? The main one I'm aware of is the dictionary array case, but I'm curious whether there are other Arrow types that could be handled more efficiently.

As an aside, has it ever been considered to automatically tune the size of the dictionary page? I think for the limited case of writing arrow::DictionaryArray we might want to ensure that the encoder doesn't fall back to plain encoding. That could be handled as a separate feature.

Wes McKinney / @wesm:
Writing BYTE_ARRAY can also definitely be made more efficient. See logic at

https://github.com/apache/arrow/blob/master/cpp/src/parquet/arrow/writer.cc#L858

The dictionary page size issue is usually handled through the WriterProperties

https://github.com/apache/arrow/blob/master/cpp/src/parquet/properties.h#L178

If the dictionary is written all at once then this property can be circumvented; that would be my plan.
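
On the Python side the equivalent knobs are passed through write_table; a hedged sketch, assuming a pyarrow recent enough to expose the dictionary_pagesize_limit keyword (the column name and threshold are arbitrary):

```python
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.Table.from_pandas(
    pd.DataFrame({"label": pd.Categorical(["a", "b", "a", "c"])})
)

# Restrict dictionary encoding to the chosen column and raise the dictionary
# page size limit so the encoder is less likely to fall back to plain
# encoding partway through the file.
pq.write_table(
    table,
    "example.parquet",
    use_dictionary=["label"],
    dictionary_pagesize_limit=4 * 1024 * 1024,  # assumption: exposed in newer pyarrow
)
```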

Wes McKinney / @wesm:
OK, I was able to get the initial refactor done today. Now we need the plumbing to be able to write dictionary values and indices separately to DictEncoder<T>.

Hatem Helal / @hatemhelal:

  If the dictionary is written all at once then this property can be circumvented; that would be my plan.

I like that plan.

Wes McKinney / @wesm:
Making some progress on this. It's a can of worms because of the interplay between the ColumnWriter, Encoder, and Statistics types.

Wes McKinney / @wesm:
This has been quite the saga, but I should be able to get a patch up for this tomorrow. I have to decide how to get the dictionary types to be automatically read correctly without setting the read_dictionary property.
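
The read_dictionary property mentioned above is exposed on the Python side as a read_table argument; a small sketch of forcing a column back as dictionary-encoded (and hence categorical) without relying on the pandas metadata, with an illustrative column name:

```python
import pyarrow.parquet as pq

# Ask the reader to keep the column dictionary-encoded so that to_pandas()
# can produce a categorical without re-hashing the expanded values.
table = pq.read_table("example.parquet", read_dictionary=["label"])
print(table.schema.field("label").type)   # dictionary<values=string, indices=int32>
print(table.to_pandas()["label"].dtype)   # category
```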

Wes McKinney / @wesm:
Issue resolved by pull request #5077.
