Conversation

@pitrou pitrou commented Nov 6, 2019

This is tied to type inference, and only triggers on string or binary columns.
Each chunk is dict-encoded up to a certain cardinality, after which the whole
column falls back on plain encoding.
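
As a rough usage sketch (the option names `auto_dict_encode` and `auto_dict_max_cardinality` on `ConvertOptions` are assumed here, and the reader signatures follow a more recent Arrow C++ API than this PR, so details may differ):

```cpp
#include <memory>
#include <string>

#include <arrow/csv/api.h>
#include <arrow/io/api.h>
#include <arrow/result.h>
#include <arrow/table.h>

// Read a CSV file, letting the reader try to dictionary-encode inferred
// string/binary columns (option names assumed, see note above).
arrow::Result<std::shared_ptr<arrow::Table>> ReadCsvWithAutoDict(
    const std::string& path) {
  ARROW_ASSIGN_OR_RAISE(auto input, arrow::io::ReadableFile::Open(path));

  auto read_options = arrow::csv::ReadOptions::Defaults();
  auto parse_options = arrow::csv::ParseOptions::Defaults();
  auto convert_options = arrow::csv::ConvertOptions::Defaults();
  convert_options.auto_dict_encode = true;
  // Per-chunk cardinality threshold; above it, the whole column falls back
  // to plain encoding.
  convert_options.auto_dict_max_cardinality = 1000;

  ARROW_ASSIGN_OR_RAISE(
      auto reader,
      arrow::csv::TableReader::Make(arrow::io::default_io_context(), input,
                                    read_options, parse_options,
                                    convert_options));
  return reader->Read();
}
```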

@pitrou pitrou changed the title Arrow 3408 csv auto dict encode ARROW-3408: [C++] Add CSV option to automatically attempt dict encoding Nov 6, 2019
@pitrou pitrou force-pushed the ARROW-3408-csv-auto-dict-encode branch from 2592a2d to d9fe437 Compare November 6, 2019 16:09

pitrou commented Nov 6, 2019

Quick performance rundown:

  • dict encoding is ~30% slower than regular encoding in single-thread mode
  • dict encoding is as fast as regular encoding in multi-thread mode

Also, if auto dict encoding is enabled but the max cardinality is reached in the first chunks, performance doesn't suffer.

@pitrou pitrou force-pushed the ARROW-3408-csv-auto-dict-encode branch from d9fe437 to abca07d Compare November 6, 2019 16:14

@pitrou pitrou force-pushed the ARROW-3408-csv-auto-dict-encode branch from abca07d to 1a9016e Compare November 6, 2019 16:31

pitrou commented Nov 6, 2019

@wesm

@wesm wesm self-requested a review November 6, 2019 21:09

@wesm wesm left a comment

This seems good to me; it will be pretty valuable for reducing memory use.

Some high-level questions (I haven't read closely enough to determine the answers already):

  • Are different converted chunks allowed to yield different dictionaries? Overall, I would say there's little benefit in going to extra effort to produce a "global" dictionary. This might also be made configurable
  • Is the max cardinality enforced on a per-chunk basis or globally? How does this impact parallel conversions? In a worst-case scenario, each chunk could have a smallish dictionary while the global dictionary exceeds the threshold

Member

In the future this could certainly be const T& to save on some typing below

Member

Nice code-deduping in this file

Member

Would it be possible to explicitly request dictionary encoding for a column (it would probably have to be forced to yield int32 indices)? Can be follow-up work

Member Author

Probably as a follow-up JIRA, since the configuration mechanism must be discussed.
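
For illustration only, if the configuration ends up going through `ConvertOptions::column_types`, forcing dictionary encoding for one column might look like the sketch below. This is purely an assumption about a possible future API, the column name "symbol" is made up, and this PR does not add such support:

```cpp
#include <arrow/api.h>
#include <arrow/csv/api.h>

// Hypothetical: request dictionary<int32, utf8> for one column by name via
// column_types ("symbol" is a made-up column name; whether the CSV converter
// honors an explicit dictionary type here is an assumption, not part of this PR).
arrow::csv::ConvertOptions MakeOptionsWithForcedDictColumn() {
  auto convert_options = arrow::csv::ConvertOptions::Defaults();
  convert_options.column_types["symbol"] =
      arrow::dictionary(arrow::int32(), arrow::utf8());
  return convert_options;
}
```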

pitrou commented Nov 7, 2019

Are different converted chunks allowed to yield different dictionaries?

Yes. This is mandatory for parallel processing, actually.
(We could reconcile them at the end, though it might as well be a manual step done by the user; a sketch of that is below.)
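
A possible shape for that manual reconciliation step, assuming the `DictionaryUnifier::UnifyChunkedArray` helper available in later Arrow C++ releases (it is not part of this PR):

```cpp
#include <memory>
#include <utility>
#include <vector>

#include <arrow/api.h>

// Post-processing sketch: replace the per-chunk dictionaries of every
// dictionary-encoded column with a single, table-wide dictionary.
// DictionaryUnifier::UnifyChunkedArray is assumed here (not added by this PR).
arrow::Result<std::shared_ptr<arrow::Table>> UnifyTableDictionaries(
    const std::shared_ptr<arrow::Table>& table) {
  std::vector<std::shared_ptr<arrow::ChunkedArray>> columns;
  columns.reserve(table->num_columns());
  for (const auto& column : table->columns()) {
    if (column->type()->id() == arrow::Type::DICTIONARY) {
      ARROW_ASSIGN_OR_RAISE(
          auto unified, arrow::DictionaryUnifier::UnifyChunkedArray(column));
      columns.push_back(std::move(unified));
    } else {
      columns.push_back(column);
    }
  }
  return arrow::Table::Make(table->schema(), std::move(columns),
                            table->num_rows());
}
```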

Is the max cardinality enforced on a per-chunk basis or globally?

On a per-chunk basis.

@pitrou pitrou force-pushed the ARROW-3408-csv-auto-dict-encode branch from 1a9016e to 3ccc14a Compare November 7, 2019 10:33

pitrou commented Nov 7, 2019

Rebased and comments addressed.

@pitrou pitrou force-pushed the ARROW-3408-csv-auto-dict-encode branch from 3ccc14a to 8d909fd Compare November 7, 2019 10:54

wesm commented Nov 7, 2019

On a per-chunk basis.

What will happen if one chunk overflows the limit but others do not? Will the dictionary-encoded chunks be cast to dense?

pitrou commented Nov 7, 2019

Will the dictionary-encoded chunks be cast to dense?

Yes, all of them.
(Concretely, they are not cast; the conversion is done again.)

wesm commented Nov 7, 2019

Got it, thanks. I wasn't sure whether there were tests specifically for this, but I trust you =)

+1
