[C++] Read Parquet dictionary encoded ColumnChunks directly into an Arrow DictionaryArray #20110

Closed
asfimport opened this issue Jun 13, 2018 · 10 comments


Dictionary data is very common in Parquet. In the current implementation, parquet-cpp always decodes dictionary-encoded data before creating a plain Arrow array. This is wasteful, since we could use Arrow's DictionaryArray directly and gain several benefits:

  1. Smaller memory footprint, both during decoding and in the resulting Arrow table, especially when the dictionary values are large.

  2. Better decoding performance, mostly as a result of the first point: fewer memory fetches and fewer allocations.

I think these benefits could yield significant runtime improvements.

My direction for the implementation is to read the indices (through the DictionaryDecoder, after the RLE decoding) and the values separately into two arrays and create a DictionaryArray from them; a short sketch follows below.

There are some questions to discuss:

  1. Should this be the default behavior for dictionary-encoded data?

  2. Should it be controlled by a parameter in the API?

  3. What should the policy be when some of the chunks are dictionary encoded and some are not?

I started implementing this but would like to hear your opinions.
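For illustration, a minimal sketch of the proposed direction using the public Arrow C++ API: build the index and value arrays separately and combine them with arrow::DictionaryArray::FromArrays, never materializing the dense values. The literal values and the Result-returning FromArrays overload are assumptions for the example, not part of the original proposal.

```cpp
#include <arrow/api.h>

#include <memory>
#include <vector>

// Sketch: decode the dictionary indices and the dictionary values into two
// separate Arrow arrays, then combine them into a DictionaryArray. The
// literal values stand in for what the RLE-decoded indices and the
// dictionary page would actually contain.
arrow::Status MakeDictionaryColumn(std::shared_ptr<arrow::Array>* out) {
  // Indices as decoded from the data pages (after RLE decoding).
  arrow::Int32Builder index_builder;
  std::vector<int32_t> index_values = {0, 1, 0, 2, 1};
  ARROW_RETURN_NOT_OK(index_builder.AppendValues(index_values));
  std::shared_ptr<arrow::Array> indices;
  ARROW_RETURN_NOT_OK(index_builder.Finish(&indices));

  // Values as decoded from the dictionary page.
  arrow::StringBuilder value_builder;
  std::vector<std::string> dict_values = {"apple", "banana", "cherry"};
  ARROW_RETURN_NOT_OK(value_builder.AppendValues(dict_values));
  std::shared_ptr<arrow::Array> dictionary;
  ARROW_RETURN_NOT_OK(value_builder.Finish(&dictionary));

  // No dense materialization: the result references indices + dictionary only.
  auto dict_type = arrow::dictionary(arrow::int32(), arrow::utf8());
  ARROW_ASSIGN_OR_RAISE(
      *out, arrow::DictionaryArray::FromArrays(dict_type, indices, dictionary));
  return arrow::Status::OK();
}
```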

Reporter: Stav Nir
Assignee: Wes McKinney / @wesm

Note: This issue was originally created as ARROW-3772. Please see the migration documentation for further details.


Stav Nir:
@wesm Would be great to hear your thoughts on this.


Wes McKinney / @wesm:
This has been discussed many other places – there may be a JIRA already about this.

To your questions:

  1. No, I think this should be optional
  2. Yes
  3. Probably want to fall back to dense output. We could also dictionary-encode the non-dictionary-encoded chunks using tools in arrow::compute, but that will be quite a bit of work (a sketch of that approach follows this list).
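As an illustration of the arrow::compute route in point 3, a minimal sketch that dictionary-encodes an already-decoded dense chunk so it can sit next to chunks read directly as DictionaryArray. The Result-based DictionaryEncode signature shown is from the current compute API and is an assumption here.

```cpp
#include <memory>

#include <arrow/api.h>
#include <arrow/compute/api.h>

// Sketch: dictionary-encode a plain-decoded dense chunk on the Arrow side.
arrow::Result<std::shared_ptr<arrow::Array>> EncodeDenseChunk(
    const std::shared_ptr<arrow::Array>& dense) {
  ARROW_ASSIGN_OR_RAISE(arrow::Datum encoded,
                        arrow::compute::DictionaryEncode(dense));
  // Result type is dictionary(int32(), dense->type()).
  return encoded.make_array();
}
```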

This task is loaded with pitfalls:

  • The dictionary will not be known up front, so the Arrow schema cannot be inferred from the file metadata alone like it is now. The schema will have to be modified later.

  • The dictionary will often be different from row group to row group and from file to file – data written by Spark or Impala or most systems will have this problem.

  • A ColumnChunk may start with dictionary encoding and then switch to plain encoding mid stream.

It's probably because of these complexities that we have not undertaken this work yet. Your help would be appreciated, though. Keep in mind there's a pretty good-sized patch up right now on the Arrow decode path: ARROW-1644: [C++] Initial cut of implementing deserialization of arbitrary nested groups from Parquet to Arrow (parquet-cpp#462).


Wes McKinney / @wesm:
Moved issue to Arrow issue tracker


Hatem Helal / @hatemhelal:
I'd like to take a stab at this after ARROW-3769


Wes McKinney / @wesm:
Realistically I don't think I can get this done this week (or in time for 0.14.0), and I think it would be worth giving the feature some care and attention rather than rushing it. Moving to the 1.0.0 milestone


Wes McKinney / @wesm:
I'm looking at this. This is not a small project – the assumption that values are fully materialized is pretty deeply baked into the library. We also have to deal with the "fallback" case where a column chunk starts out dictionary encoded and switches mid-stream because the dictionary got too big. What to do in that case is ambiguous:

  • One option is to dictionary-encode the additional pages, so we could end up with one big dictionary

  • Another option is to optimistically leave things dictionary-encoded, and if we hit the fallback case then we fully materialize. We can always do a cast on the Arrow side after the fact in this case (a sketch of that cast follows this list).
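A minimal sketch of the second option's "cast after the fact" step, assuming the current arrow::compute::Cast API and that casting from a dictionary type back to its value type is supported:

```cpp
#include <memory>

#include <arrow/api.h>
#include <arrow/compute/api.h>

// Sketch of the "materialize after the fact" option: cast a dictionary-encoded
// array back to its dense value type on the Arrow side.
arrow::Result<std::shared_ptr<arrow::Array>> Materialize(
    const std::shared_ptr<arrow::DictionaryArray>& dict_array) {
  // The dense target type is the type of the dictionary values.
  std::shared_ptr<arrow::DataType> dense_type =
      dict_array->dictionary()->type();
  ARROW_ASSIGN_OR_RAISE(
      arrow::Datum dense,
      arrow::compute::Cast(arrow::Datum(dict_array), dense_type));
  return dense.make_array();
}
```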

FWIW, the fallback scenario is not at all esoteric, because the default dictionary page size limit in the C++ library is 1MB. I think Java is the same:

https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/ParquetProperties.java#L44

I think adding an option to raise the limit to 2GB or so when writing Arrow DictionaryArray would help (see the sketch below).

Things are made a bit more complex by the code duplication between parquet/column_reader.cc and parquet/arrow/record_reader.cc. I'll see if there are things I can do to fix that while I'm working on this.
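A minimal sketch of the suggested writer-side option, assuming the existing parquet::WriterProperties::Builder::dictionary_pagesize_limit knob; the 2 GB value is only illustrative.

```cpp
#include <memory>

#include <parquet/properties.h>

// Sketch: raise the writer-side dictionary page size limit (the parquet-cpp
// default is 1 MB) so large dictionary columns are less likely to hit the
// plain-encoding fallback mid column chunk. The 2 GB figure mirrors the
// suggestion above.
std::shared_ptr<parquet::WriterProperties> MakeWriterProperties() {
  parquet::WriterProperties::Builder builder;
  builder.enable_dictionary()
      ->dictionary_pagesize_limit(2LL * 1024 * 1024 * 1024);
  return builder.build();
}
```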


Micah Kornfield / @emkornfield:
"I'm looking at this. This is not a small project – the assumption that values are fully materialized is pretty deeply baked into the library. We also have to deal with the "fallback" case where a column chunk starts out dictionary encoded and switches mid-stream because the dictionary got too big"

I don't have context on how we originally decided to designate an entire column as dictionary-encoded rather than doing so per chunk/record batch column, but it seems like this might be another use case where the proposal on encoding/compression could make things easier to implement (i.e., specify dictionary encoding only on the SparseRecordBatches where it makes sense, and fall back to dense encoding where it no longer does).


Wes McKinney / @wesm:
At least after ARROW-3144 we have broken the constraint of a constant dictionary across arrays. Having a mix of dictionary-encoded and non-dictionary-encoded arrays is interesting, but regardless there's a lot of refactoring to do in the Parquet library to expose these details


Wes McKinney / @wesm:
I'm getting close to having something PR-worthy here; it ended up being a can of worms. There are going to be a lot of follow-up issues, so I'll try to contain the scope of the work and leave polishing to subsequent PRs.


Wes McKinney / @wesm:
Issue resolved by pull request #4949
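For later readers, a minimal usage sketch of the dictionary-read path; it assumes ArrowReaderProperties::set_read_dictionary and FileReaderBuilder from current Arrow C++ (both postdate this thread), and the file path and column index are placeholders.

```cpp
#include <memory>

#include <arrow/api.h>
#include <parquet/arrow/reader.h>
#include <parquet/exception.h>
#include <parquet/properties.h>

// Sketch: read a Parquet column directly as an Arrow DictionaryArray.
std::shared_ptr<arrow::Table> ReadWithDictionary() {
  parquet::ArrowReaderProperties props;
  props.set_read_dictionary(/*column_index=*/0, true);

  parquet::arrow::FileReaderBuilder builder;
  PARQUET_THROW_NOT_OK(builder.OpenFile("data.parquet"));
  std::unique_ptr<parquet::arrow::FileReader> reader;
  PARQUET_THROW_NOT_OK(builder.properties(props)->Build(&reader));

  std::shared_ptr<arrow::Table> table;
  PARQUET_THROW_NOT_OK(reader->ReadTable(&table));
  // The dictionary-read column now has type dictionary(int32(), <value type>)
  // instead of the dense value type.
  return table;
}
```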
