[C++] Read Parquet dictionary encoded ColumnChunks directly into an Arrow DictionaryArray #20110

Closed
asfimport opened this issue Jun 13, 2018 · 10 comments


Dictionary data is very common in Parquet. In the current implementation, parquet-cpp always decodes dictionary-encoded data before creating a plain Arrow array. This is wasteful, since we could use Arrow's DictionaryArray directly and gain several benefits:

  1. Smaller memory footprint, both during decoding and in the resulting Arrow table, especially when the dictionary values are large.

  2. Better decoding performance, mostly as a result of the first point: fewer memory fetches and fewer allocations.

I think these benefits could yield significant runtime improvements.

My direction for the implementation is to read the indices (through the DictionaryDecoder, after the RLE decoding) and the values separately into two arrays and create a DictionaryArray from them; a short sketch follows below.

There are some questions to discuss:

  1. Should this be the default behavior for dictionary-encoded data?

  2. Should it be controlled by a parameter in the API?

  3. What should the policy be when some of the chunks are dictionary encoded and some are not?

I started implementing this but would like to hear your opinions.
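For illustration, a minimal sketch of the proposed direction using the public Arrow C++ API: build the index and value arrays separately and combine them with arrow::DictionaryArray::FromArrays, never materializing the dense values. The literal values and the Result-returning FromArrays overload are assumptions for the example, not part of the original proposal.

```cpp
#include <arrow/api.h>

#include <memory>
#include <vector>

// Sketch: decode the dictionary indices and the dictionary values into two
// separate Arrow arrays, then combine them into a DictionaryArray. The
// literal values stand in for what the RLE-decoded indices and the
// dictionary page would actually contain.
arrow::Status MakeDictionaryColumn(std::shared_ptr<arrow::Array>* out) {
  // Indices as decoded from the data pages (after RLE decoding).
  arrow::Int32Builder index_builder;
  std::vector<int32_t> index_values = {0, 1, 0, 2, 1};
  ARROW_RETURN_NOT_OK(index_builder.AppendValues(index_values));
  std::shared_ptr<arrow::Array> indices;
  ARROW_RETURN_NOT_OK(index_builder.Finish(&indices));

  // Values as decoded from the dictionary page.
  arrow::StringBuilder value_builder;
  std::vector<std::string> dict_values = {"apple", "banana", "cherry"};
  ARROW_RETURN_NOT_OK(value_builder.AppendValues(dict_values));
  std::shared_ptr<arrow::Array> dictionary;
  ARROW_RETURN_NOT_OK(value_builder.Finish(&dictionary));

  // No dense materialization: the result references indices + dictionary only.
  auto dict_type = arrow::dictionary(arrow::int32(), arrow::utf8());
  ARROW_ASSIGN_OR_RAISE(
      *out, arrow::DictionaryArray::FromArrays(dict_type, indices, dictionary));
  return arrow::Status::OK();
}
```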

Reporter: Stav Nir
Assignee: Wes McKinney / @wesm

Note: This issue was originally created as ARROW-3772. Please see the migration documentation for further details.


Stav Nir:
@wesm Would be great to hear your thoughts on this.


Wes McKinney / @wesm:
This has been discussed many other places – there may be a JIRA already about this.

To your questions:

  1. No, I think this should be optional
  2. Yes
  3. Probably want to fall back to dense output. We could also dictionary-encode the non-dictionary-encoded chunks using tools in arrow::compute, but that will be quite a bit of work (a sketch of that approach follows this list).
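As an illustration of the arrow::compute route in point 3, a minimal sketch that dictionary-encodes an already-decoded dense chunk so it can sit next to chunks read directly as DictionaryArray. The Result-based DictionaryEncode signature shown is from the current compute API and is an assumption here.

```cpp
#include <memory>

#include <arrow/api.h>
#include <arrow/compute/api.h>

// Sketch: dictionary-encode a plain-decoded dense chunk on the Arrow side.
arrow::Result<std::shared_ptr<arrow::Array>> EncodeDenseChunk(
    const std::shared_ptr<arrow::Array>& dense) {
  ARROW_ASSIGN_OR_RAISE(arrow::Datum encoded,
                        arrow::compute::DictionaryEncode(dense));
  // Result type is dictionary(int32(), dense->type()).
  return encoded.make_array();
}
```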

This task is loaded with pitfalls:

  • The dictionary will not be known up front, so the Arrow schema cannot be inferred from the file metadata alone like it is now. The schema will have to be modified later.

  • The dictionary will often be different from row group to row group and from file to file – data written by Spark or Impala or most systems will have this problem.

  • A ColumnChunk may start with dictionary encoding and then switch to plain encoding mid stream.

It's probably because of these complexities that we have not undertaken this work yet. Your help would be appreciated, though. Keep in mind there's a pretty good-sized patch up right now on the Arrow decode path: ARROW-1644: [C++] Initial cut of implementing deserialization of arbitrary nested groups from Parquet to Arrow (parquet-cpp#462).


Wes McKinney / @wesm:
Moved issue to Arrow issue tracker


Hatem Helal / @hatemhelal:
I'd like to take a stab at this after ARROW-3769


Wes McKinney / @wesm:
Realistically I don't think I can get this done this week (or in time for 0.14.0), and I think it would be worth giving the feature some care and attention rather than rushing it. Moving to the 1.0.0 milestone


Wes McKinney / @wesm:
I'm looking at this. This is not a small project – the assumption that values are fully materialized is pretty deeply baked into the library. We also have to deal with the "fallback" case where a column chunk starts out dictionary encoded and switches mid-stream because the dictionary got too big. What to do in that case is ambiguous:

  • One option is to dictionary-encode the additional pages, so we could end up with one big dictionary

  • Another option is to optimistically leave things dictionary-encoded, and if we hit the fallback case then we fully materialize. We can always do a cast on the Arrow side after the fact in this case (a sketch of that cast follows this list).
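A minimal sketch of the second option's "cast after the fact" step, assuming the current arrow::compute::Cast API and that casting from a dictionary type back to its value type is supported:

```cpp
#include <memory>

#include <arrow/api.h>
#include <arrow/compute/api.h>

// Sketch of the "materialize after the fact" option: cast a dictionary-encoded
// array back to its dense value type on the Arrow side.
arrow::Result<std::shared_ptr<arrow::Array>> Materialize(
    const std::shared_ptr<arrow::DictionaryArray>& dict_array) {
  // The dense target type is the type of the dictionary values.
  std::shared_ptr<arrow::DataType> dense_type =
      dict_array->dictionary()->type();
  ARROW_ASSIGN_OR_RAISE(
      arrow::Datum dense,
      arrow::compute::Cast(arrow::Datum(dict_array), dense_type));
  return dense.make_array();
}
```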

FWIW, the fallback scenario is not at all esoteric, because the default dictionary page size limit in the C++ library is 1MB. I think Java is the same:

https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/ParquetProperties.java#L44

I think adding an option to raise the limit to 2GB or so when writing Arrow DictionaryArray would help (see the sketch below).

Things are made a bit more complex by the code duplication between parquet/column_reader.cc and parquet/arrow/record_reader.cc. I'll see if there are things I can do to fix that while I'm working on this.
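A minimal sketch of the suggested writer-side option, assuming the existing parquet::WriterProperties::Builder::dictionary_pagesize_limit knob; the 2 GB value is only illustrative.

```cpp
#include <memory>

#include <parquet/properties.h>

// Sketch: raise the writer-side dictionary page size limit (the parquet-cpp
// default is 1 MB) so large dictionary columns are less likely to hit the
// plain-encoding fallback mid column chunk. The 2 GB figure mirrors the
// suggestion above.
std::shared_ptr<parquet::WriterProperties> MakeWriterProperties() {
  parquet::WriterProperties::Builder builder;
  builder.enable_dictionary()
      ->dictionary_pagesize_limit(2LL * 1024 * 1024 * 1024);
  return builder.build();
}
```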


Micah Kornfield / @emkornfield:
"I'm looking at this. This is not a small project – the assumption that values are fully materialized is pretty deeply baked into the library. We also have to deal with the "fallback" case where a column chunk starts out dictionary encoded and switches mid-stream because the dictionary got too big"

I don't have context on how we originally decided to designate an entire column as dictionary-encoded rather than doing so per chunk/record batch column, but it seems like this might be another use case where the proposal on encoding/compression could make things easier to implement (i.e., specify dictionary encoding only on the SparseRecordBatches where it makes sense, and fall back to dense encoding where it no longer does).


Wes McKinney / @wesm:
At least after ARROW-3144 we have broken the constraint of a constant dictionary across arrays. Having a mix of dictionary-encoded and non-dictionary-encoded arrays is interesting, but regardless there's a lot of refactoring to do in the Parquet library to expose these details


Wes McKinney / @wesm:
I'm getting close to having something PR-worthy here; it ended up being a can of worms. There are going to be a lot of follow-up issues, so I'll try to contain the scope of the work and leave polishing to subsequent PRs.


Wes McKinney / @wesm:
Issue resolved by pull request #4949
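For later readers, a minimal usage sketch of the dictionary-read path; it assumes ArrowReaderProperties::set_read_dictionary and FileReaderBuilder from current Arrow C++ (both postdate this thread), and the file path and column index are placeholders.

```cpp
#include <memory>

#include <arrow/api.h>
#include <parquet/arrow/reader.h>
#include <parquet/exception.h>
#include <parquet/properties.h>

// Sketch: read a Parquet column directly as an Arrow DictionaryArray.
std::shared_ptr<arrow::Table> ReadWithDictionary() {
  parquet::ArrowReaderProperties props;
  props.set_read_dictionary(/*column_index=*/0, true);

  parquet::arrow::FileReaderBuilder builder;
  PARQUET_THROW_NOT_OK(builder.OpenFile("data.parquet"));
  std::unique_ptr<parquet::arrow::FileReader> reader;
  PARQUET_THROW_NOT_OK(builder.properties(props)->Build(&reader));

  std::shared_ptr<arrow::Table> table;
  PARQUET_THROW_NOT_OK(reader->ReadTable(&table));
  // The dictionary-read column now has type dictionary(int32(), <value type>)
  // instead of the dense value type.
  return table;
}
```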
