ARROW-3408: [C++] Add CSV option to automatically attempt dict encoding #5785
Conversation
Force-pushed from 2592a2d to d9fe437.
Quick performance rundown:
Also, if auto dict encoding is enabled but max cardinality is reached in the first chunks, performance doesn't suffer.
Force-pushed from d9fe437 to abca07d.
Force-pushed from abca07d to 1a9016e.
wesm left a comment:
This seems good to me; it will be pretty valuable as far as reducing memory use goes.
Some high-level questions (and I haven't read closely enough to determine the answers already):
- Are different converted chunks allowed to yield different dictionaries? Overall I would say there's little benefit in going to extra effort to have a "global" dictionary. This might also be made configurable.
- Is the max cardinality enforced on a per-chunk basis or globally? How does this impact parallel conversions? In a worst-case scenario you could imagine chunks having smallish dictionaries but the global dictionary being larger than the threshold.
In the future this could certainly be const T& to save on some typing below
Nice code-deduping in this file
cpp/src/arrow/csv/options.h (outdated)
Would it be possible to explicitly request dictionary encoding for a column (it would probably have to be forced to yield int32 indices)? Can be follow-up work.
Probably as a follow-up JIRA, since the configuration mechanism needs to be discussed.
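For illustration, a hedged sketch of what such explicit per-column dictionary encoding could look like if routed through ConvertOptions::column_types. Whether a dictionary type is accepted there depends on the follow-up work discussed above, and the column name "category" is purely hypothetical:

```cpp
#include "arrow/csv/options.h"
#include "arrow/type.h"

// Hypothetical usage: ask for dictionary encoding (int32 indices over utf8
// values) of a specific column up front, instead of relying on inference.
arrow::csv::ConvertOptions MakeOptionsWithExplicitDict() {
  auto convert_options = arrow::csv::ConvertOptions::Defaults();
  convert_options.column_types["category"] =
      arrow::dictionary(arrow::int32(), arrow::utf8());
  return convert_options;
}
```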
Yes. This is mandatory for parallel processing, actually.
On a per-chunk basis.
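To make the per-chunk behavior concrete, here is a minimal sketch, assuming the converted column comes back as a ChunkedArray whose chunks may each be a DictionaryArray with its own dictionary; the helper name and printed output are for illustration only:

```cpp
#include <iostream>
#include <memory>

#include "arrow/api.h"

// Each converted chunk may carry its own dictionary; there is no global
// dictionary shared across the chunks of a column.
void InspectDictionaryChunks(const std::shared_ptr<arrow::ChunkedArray>& column) {
  for (int i = 0; i < column->num_chunks(); ++i) {
    std::shared_ptr<arrow::Array> chunk = column->chunk(i);
    if (chunk->type_id() == arrow::Type::DICTIONARY) {
      auto dict_chunk = std::static_pointer_cast<arrow::DictionaryArray>(chunk);
      std::cout << "chunk " << i << ": dictionary with "
                << dict_chunk->dictionary()->length() << " unique values\n";
    } else {
      // The whole column fell back to plain encoding.
      std::cout << "chunk " << i << ": plain " << chunk->type()->ToString() << "\n";
    }
  }
}
```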
Force-pushed from 1a9016e to 3ccc14a.
Rebased and comments addressed.
This is tied to type inference, and only triggers on string or binary columns. Each chunk is dict-encoded up to a certain cardinality, after which the whole column falls back on plain encoding.
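As a usage illustration, here is a sketch of enabling this from the reader's ConvertOptions. It assumes the option names auto_dict_encode and auto_dict_max_cardinality and the Result-based TableReader::Make factory, which have varied slightly across Arrow versions; the threshold value is arbitrary:

```cpp
#include <memory>
#include <string>

#include "arrow/csv/api.h"
#include "arrow/io/api.h"
#include "arrow/result.h"
#include "arrow/table.h"

// Read a CSV file, letting the reader attempt dictionary encoding of
// inferred string/binary columns up to a cardinality threshold.
arrow::Result<std::shared_ptr<arrow::Table>> ReadCsvWithAutoDict(
    const std::string& path) {
  ARROW_ASSIGN_OR_RAISE(auto input, arrow::io::ReadableFile::Open(path));

  auto read_options = arrow::csv::ReadOptions::Defaults();
  auto parse_options = arrow::csv::ParseOptions::Defaults();
  auto convert_options = arrow::csv::ConvertOptions::Defaults();
  convert_options.auto_dict_encode = true;
  // Fall back to plain encoding once a chunk's dictionary grows past this size.
  convert_options.auto_dict_max_cardinality = 1000;

  ARROW_ASSIGN_OR_RAISE(
      auto reader, arrow::csv::TableReader::Make(arrow::io::default_io_context(),
                                                 input, read_options, parse_options,
                                                 convert_options));
  return reader->Read();
}
```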
Force-pushed from 3ccc14a to 8d909fd.
What will happen if one chunk overflows the limit but others do not? Will the dictionary-encoded chunks be cast to dense?
Yes, all of them.
Got it, thanks. I wasn't sure if there were tests specifically about this, but I trust you =) +1
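To make the fallback behavior described in this exchange concrete, here is a small hedged check, assuming a table read with the options sketched earlier and a hypothetical string column named "city" whose values exceeded the cardinality threshold:

```cpp
#include <cassert>
#include <memory>

#include "arrow/api.h"

// When any chunk exceeds the cardinality threshold, every chunk of that
// column is converted back to dense strings, so the column's type is plain
// utf8 rather than dictionary(int32, utf8).
void CheckFallbackToDense(const std::shared_ptr<arrow::Table>& table) {
  auto column = table->GetColumnByName("city");  // hypothetical column name
  assert(column != nullptr);
  assert(column->type()->id() == arrow::Type::STRING);
}
```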