[C++] Support reading non-dictionary encoded binary Parquet columns directly as DictionaryArray #20103

Closed
asfimport opened this issue Sep 25, 2018 · 7 comments

Comments

@asfimport
Collaborator

asfimport commented Sep 25, 2018

If the goal is to hash this data into a categorical-type array anyway, it would be better to offer an option to "push down" the hashing into the Parquet read hot path, rather than first fully materializing a dense vector of ByteArray values, which could use a lot of memory after decompression.
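
To make the idea concrete, here is a minimal sketch of hashing values as they are decoded, so that each distinct string is stored once and every row only keeps an integer index. It uses only the C++ standard library; `DictColumn` and `DictBuilder` are hypothetical names for illustration and are not the Arrow or Parquet API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical stand-in for a dictionary-encoded column: each distinct value
// is stored once, and every row holds an index into that dictionary.
struct DictColumn {
  std::vector<std::string> dictionary;  // distinct values, in first-seen order
  std::vector<int32_t> indices;         // one dictionary index per row

  // Look up the value of a row through its dictionary index.
  const std::string& Value(std::size_t row) const {
    assert(row < indices.size());
    return dictionary[static_cast<std::size_t>(indices[row])];
  }
};

// Hypothetical builder that a decoder could call once per decoded value.
// No dense vector of all row values is ever materialized; repeated values
// cost only one int32_t each.
class DictBuilder {
 public:
  void Append(const std::string& value) {
    const int32_t next = static_cast<int32_t>(column_.dictionary.size());
    // emplace inserts only if the value has not been seen before.
    auto result = index_of_.emplace(value, next);
    if (result.second) {
      column_.dictionary.push_back(value);
    }
    column_.indices.push_back(result.first->second);
  }

  // Hand over the accumulated column and reset the builder.
  DictColumn Finish() {
    DictColumn out = std::move(column_);
    column_ = DictColumn{};
    index_of_.clear();
    return out;
  }

 private:
  std::unordered_map<std::string, int32_t> index_of_;  // value -> dictionary index
  DictColumn column_;
};
```

In this sketch, a reader would call `Append` for each value as it comes out of the decoder and `Finish` at the end of the column chunk. The memory saving depends on how often values repeat; for a column of mostly unique strings the dictionary itself approaches the size of the dense representation.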

Reporter: Wes McKinney / @wesm
Assignee: Hatem Helal / @hatemhelal

Note: This issue was originally created as ARROW-3769. Please see the migration documentation for further details.

@asfimport

Wes McKinney / @wesm:
Moved this here from the Parquet JIRA

@asfimport

Wes McKinney / @wesm:
This is implemented in #3492 for PARQUET-1508, but not tested. This JIRA should be used to add unit tests and probably some benchmarks, too.

@asfimport

Hatem Helal / @hatemhelal:
I've started looking into this, beginning with some unit tests to make sure I understand the inner workings.

@asfimport

Wes McKinney / @wesm:
Cool. This is only implemented at the encoder level, so you should be able to use ArrayFromJSON to make writing the unit tests easier. For this JIRA I am expecting tests in parquet-encoding-test.

@asfimport

Hatem Helal / @hatemhelal:
Made a start on the unit tests here:

mathworks#12

@wesm, could you take a look and let me know if this is heading in the right direction?

@asfimport

Wes McKinney / @wesm:
Yes, I think that's the basic idea.

@asfimport

Wes McKinney / @wesm:
Issue resolved by pull request 3721
#3721
