[C++] Read Parquet dictionary encoded ColumnChunks directly into an Arrow DictionaryArray #20110
Comments
Wes McKinney / @wesm: To your questions
This task is loaded with pitfalls:
Micah Kornfield / @emkornfield: I don't have context on how we originally decided to designate an entire column as dictionary encoded, rather than a single chunk or record batch column. Still, this looks like another use case where the encoding/compression proposal might make things easier to code (i.e. specify dictionary encoding only on SparseRecordBatches where it makes sense, and fall back to dense encoding where it no longer does).
Dictionary data is very common in Parquet. In the current implementation, parquet-cpp always decodes dictionary-encoded data before creating a plain Arrow array. This is wasteful, since we could use Arrow's DictionaryArray directly and gain several benefits:
Smaller memory footprint - both during decoding and in the resulting Arrow table - especially when the dictionary values are large
Better decoding performance - mostly as a result of the first bullet - fewer memory fetches and fewer allocations.
I think these benefits could yield significant runtime improvements.
My direction for the implementation is to read the indices (through the DictionaryDecoder, after the RLE decoding) and the dictionary values separately into two arrays, and then create a DictionaryArray from them.
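To make the idea concrete, here is a minimal, library-free sketch of keeping indices and values side by side instead of expanding them. It does not use the real Arrow or parquet-cpp APIs: `DictArraySketch`, `MakeDictArray`, `Materialize`, and the two byte-counting helpers are hypothetical names invented for illustration, and strings stand in for whatever the column's value type is.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for an Arrow DictionaryArray (NOT the real class).
struct DictArraySketch {
  std::vector<int32_t> indices;     // one small integer per row
  std::vector<std::string> values;  // each distinct value stored once
};

// Keep the two decoded pieces as they are instead of expanding them.
DictArraySketch MakeDictArray(std::vector<int32_t> indices,
                              std::vector<std::string> values) {
  for (int32_t idx : indices) {
    // Every index must point into the dictionary.
    assert(idx >= 0 && static_cast<std::size_t>(idx) < values.size());
  }
  return DictArraySketch{std::move(indices), std::move(values)};
}

// The behaviour the issue calls wasteful: expand to a plain (dense) array.
std::vector<std::string> Materialize(const DictArraySketch& arr) {
  std::vector<std::string> dense;
  dense.reserve(arr.indices.size());
  for (int32_t idx : arr.indices) {
    dense.push_back(arr.values[static_cast<std::size_t>(idx)]);
  }
  return dense;
}

// Rough payload size in bytes of the dictionary form (ignores container overhead).
std::size_t DictPayloadBytes(const DictArraySketch& arr) {
  std::size_t bytes = arr.indices.size() * sizeof(int32_t);
  for (const auto& v : arr.values) bytes += v.size();
  return bytes;
}

// Rough payload size in bytes of the dense form (ignores container overhead).
std::size_t DensePayloadBytes(const std::vector<std::string>& dense) {
  std::size_t bytes = 0;
  for (const auto& v : dense) bytes += v.size();
  return bytes;
}
```

With five rows drawn from two 20-byte values, the dictionary form carries 5 * 4 + 40 = 60 payload bytes, while the dense form carries 5 * 20 = 100. The gap widens as the values get larger or repeat more often, which is the memory argument made above.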
There are some questions to discuss:
Should this be the default behavior for dictionary-encoded data?
Should it be controlled with a parameter in the API?
What should the policy be when some of the chunks are dictionary encoded and some are not?
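For the last question, one possible policy (only an illustration, not something decided in this issue) is to re-encode any plain chunk on the fly so that every chunk comes out in dictionary form. The sketch below uses only the standard library; `DictChunkSketch` and `EncodePlainChunk` are hypothetical names, not Arrow or parquet-cpp APIs. The opposite policy, falling back to a dense array for the whole column, is equally plausible.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical container for one dictionary-encoded chunk (NOT an Arrow type).
struct DictChunkSketch {
  std::vector<int32_t> indices;     // one index per row
  std::vector<std::string> values;  // distinct values in first-seen order
};

// Re-encode a plain (non-dictionary) chunk so it matches its
// dictionary-encoded neighbours.
DictChunkSketch EncodePlainChunk(const std::vector<std::string>& plain) {
  DictChunkSketch out;
  std::unordered_map<std::string, int32_t> seen;  // value -> dictionary index
  out.indices.reserve(plain.size());
  for (const auto& v : plain) {
    auto it = seen.find(v);
    if (it == seen.end()) {
      // The dictionary must stay addressable by a 32-bit index.
      assert(out.values.size() < static_cast<std::size_t>(INT32_MAX));
      it = seen.emplace(v, static_cast<int32_t>(out.values.size())).first;
      out.values.push_back(v);
    }
    out.indices.push_back(it->second);
  }
  return out;
}
```

Note that each chunk re-encoded this way gets its own dictionary, so a reader following this policy would still need to reconcile dictionaries across chunks before presenting them as a single column.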
I started implementing this but would like to hear your opinions.
Reporter: Stav Nir
Assignee: Wes McKinney / @wesm
Related issues:
PRs and other links:
Note: This issue was originally created as ARROW-3772. Please see the migration documentation for further details.