
[Python][Parquet] direct reading/writing of pandas categoricals in parquet #19588

Closed
asfimport opened this issue Sep 17, 2018 · 12 comments
asfimport commented Sep 17, 2018

Parquet supports "dictionary encoding" of column data in a manner very similar to the concept of Categoricals in pandas. It is natural to use this encoding for a column which originated as a categorical. Conversely, when loading, if the file metadata says that a given column came from a pandas (or arrow) categorical, then we can trust that the whole of the column is dictionary-encoded and load the data directly into a categorical column, rather than expanding the labels upon load and recategorising later.

If the data does not have the pandas metadata, then the guarantee cannot hold, and we cannot assume either that the whole column is dictionary encoded or that the labels are the same throughout. In this case, the current behaviour is fine.
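
For illustration, here is a minimal pyarrow sketch of the round trip this issue is about (file and column names are arbitrary, and the behaviour shown assumes a reasonably recent pyarrow):

```python
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# A pandas categorical becomes an Arrow DictionaryArray, which Parquet
# stores using dictionary encoding.
df = pd.DataFrame({"label": pd.Categorical(["a", "b", "a", "c"])})
pq.write_table(pa.Table.from_pandas(df), "example.parquet")

# On read, the pandas metadata stored in the file lets to_pandas() restore
# the column as a categorical. The optimization proposed here is to build
# that categorical directly from the file's dictionary pages instead of
# expanding the labels and re-hashing them.
restored = pq.read_table("example.parquet").to_pandas()
print(restored["label"].dtype)  # category
```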

 

(Please forgive that some of this has already been mentioned elsewhere; this is one of the entries in the list at dask/fastparquet#374 of features that are useful in fastparquet.)

Reporter: Martin Durant / @martindurant
Assignee: Wes McKinney / @wesm

Note: This issue was originally created as ARROW-3246. Please see the migration documentation for further details.

Wes McKinney / @wesm:
This can only be implemented in the narrow case where there is metadata indicating that the dictionary in each row group is expected to be the same (as a result of having been written by pandas). Otherwise, in general, the observed dictionaries may not be the same from row group to row group.

Martin Durant / @martindurant:

  can only be implemented in the narrow case

Yes, exactly what I was trying to say. However, a great optimisation in that specific case.

Wes McKinney / @wesm:
I moved this to 0.14. A bit of work will be needed in order to be able to sidestep hashing to categorical. If we can read BYTE_ARRAY columns directly back as Categorical (but have to hash), that is a good first step.
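
As a Python-level sketch of that first step, to_pandas already accepts a categories argument that converts named string columns to pandas categoricals after the read, which still hashes the expanded values rather than reusing the Parquet dictionary (column and file names here are illustrative):

```python
import pyarrow.parquet as pq

# Read the BYTE_ARRAY (string) column back, then convert it to a pandas
# categorical during the to_pandas() step; the categories are rebuilt by
# hashing the values rather than taken directly from the Parquet dictionary.
table = pq.read_table("example.parquet")
df = table.to_pandas(categories=["label"])
print(df["label"].dtype)  # category
```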

Wes McKinney / @wesm:
I've been looking at what's required to write arrow::DictionaryArray directly into the appropriate lower-level ColumnWriter class. The trouble with the way the software is layered right now is that there is a "Chinese wall" between TypedColumnWriter<T> and the Arrow write layer. We can only communicate with this class using the Parquet C types such as ByteArray and FixedLenByteArray. This is also a performance issue since we cannot write directly into the writer from arrow::BinaryArray or similar cases where it might make sense.

I think the only way to fix the current situation is to add a TypedColumnWriter<T>::WriteArrow(const ::arrow::Array&) method and "push down" a lot of the logic that's currently in parquet/arrow/writer.cc into the TypedColumnWriter<T> implementation. This will enable us to do various write performance optimizations and also address the direct dictionary write issue. This is not a small project, but I would say that it's overdue and will put us on a better footing going forward.

cc @xhochy @hatemhelal for any thoughts

Wes McKinney / @wesm:
I created ARROW-6152 to cover the initial feature-preserving refactoring. I estimate about a day of effort for that and will report in once I make a little progress.

Hatem Helal / @hatemhelal:
Adding TypedColumnWriter<T>::WriteArrow(const ::arrow::Array&) makes a lot of sense to me. @wesm do you have a list of cases that you know can be optimized? The main one I'm aware of is the dictionary array case, but I'm curious whether there are other Arrow types that could be handled more efficiently.

As an aside, has it ever been considered to automatically tune the size of the dictionary page? I think for the limited case of writing arrow::DictionaryArray we might want to ensure that the encoder doesn't fall back to plain encoding. That could be handled as a separate feature.

Wes McKinney / @wesm:
Writing BYTE_ARRAY can also definitely be made more efficient. See logic at

https://github.com/apache/arrow/blob/master/cpp/src/parquet/arrow/writer.cc#L858

The dictionary page size issue is usually handled through the WriterProperties

https://github.com/apache/arrow/blob/master/cpp/src/parquet/properties.h#L178

If the dictionary is written all at once then this property can be circumvented; that would be my plan.
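
On the Python side the equivalent knobs are passed through write_table; a hedged sketch, assuming a pyarrow recent enough to expose the dictionary_pagesize_limit keyword (the column name and threshold are arbitrary):

```python
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.Table.from_pandas(
    pd.DataFrame({"label": pd.Categorical(["a", "b", "a", "c"])})
)

# Restrict dictionary encoding to the chosen column and raise the dictionary
# page size limit so the encoder is less likely to fall back to plain
# encoding partway through the file.
pq.write_table(
    table,
    "example.parquet",
    use_dictionary=["label"],
    dictionary_pagesize_limit=4 * 1024 * 1024,  # assumption: exposed in newer pyarrow
)
```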

Wes McKinney / @wesm:
OK, I was able to get the initial refactor done today. Now we need the plumbing to be able to write dictionary values and indices separately to DictEncoder<T>.

Hatem Helal / @hatemhelal:

  If the dictionary is written all at once then this property can be circumvented; that would be my plan.

I like that plan.

Wes McKinney / @wesm:
Making some progress on this. It's a can of worms because of the interplay between the ColumnWriter, Encoder, and Statistics types.

Wes McKinney / @wesm:
This has been quite the saga, but I should be able to get a patch up for this tomorrow. I have to decide how to get the dictionary types to be automatically read correctly without setting the read_dictionary property.
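
The read_dictionary property mentioned above is exposed on the Python side as a read_table argument; a small sketch of forcing a column back as dictionary-encoded (and hence categorical) without relying on the pandas metadata, with an illustrative column name:

```python
import pyarrow.parquet as pq

# Ask the reader to keep the column dictionary-encoded so that to_pandas()
# can produce a categorical without re-hashing the expanded values.
table = pq.read_table("example.parquet", read_dictionary=["label"])
print(table.schema.field("label").type)   # dictionary<values=string, indices=int32>
print(table.to_pandas()["label"].dtype)   # category
```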

Wes McKinney / @wesm:
Issue resolved by pull request #5077.
