ARROW-3408: [C++] Add CSV option to automatically attempt dict encoding #5785
Conversation
Force-pushed from 2592a2d to d9fe437.
Quick performance rundown:
Also, if auto dict encoding is enabled but max cardinality is reached in the first chunks, performance doesn't suffer.
Force-pushed from d9fe437 to abca07d.
Force-pushed from abca07d to 1a9016e.
wesm left a comment:
This seems good to me; it will be pretty valuable as far as reducing memory use goes.
Some high-level questions (and I haven't read closely enough to determine the answers already):
- Are different converted chunks allowed to yield different dictionaries? Overall I would say there's little benefit in going to extra effort to have a "global" dictionary. This might also be made configurable.
- Is the max cardinality enforced on a per-chunk basis or globally? How does this impact parallel conversions? In a worst-case scenario you could imagine chunks having smallish dictionaries but the global dictionary being larger than the threshold.
In the future this could certainly be const T& to save on some typing below
Nice code-deduping in this file
cpp/src/arrow/csv/options.h (outdated)
Would it be possible to explicitly request dictionary encoding for a column (it would probably have to be forced to yield int32 indices)? Can be follow-up work.
Probably as a follow-up JIRA, since the configuration mechanism needs to be discussed.
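For illustration, a hedged sketch of what such explicit per-column dictionary encoding could look like if routed through ConvertOptions::column_types. Whether a dictionary type is accepted there depends on the follow-up work discussed above, and the column name "category" is purely hypothetical:

```cpp
#include "arrow/csv/options.h"
#include "arrow/type.h"

// Hypothetical usage: ask for dictionary encoding (int32 indices over utf8
// values) of a specific column up front, instead of relying on inference.
arrow::csv::ConvertOptions MakeOptionsWithExplicitDict() {
  auto convert_options = arrow::csv::ConvertOptions::Defaults();
  convert_options.column_types["category"] =
      arrow::dictionary(arrow::int32(), arrow::utf8());
  return convert_options;
}
```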
Yes. This is mandatory for parallel processing, actually.
On a per-chunk basis.
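To make the per-chunk behavior concrete, here is a minimal sketch, assuming the converted column comes back as a ChunkedArray whose chunks may each be a DictionaryArray with its own dictionary; the helper name and printed output are for illustration only:

```cpp
#include <iostream>
#include <memory>

#include "arrow/api.h"

// Each converted chunk may carry its own dictionary; there is no global
// dictionary shared across the chunks of a column.
void InspectDictionaryChunks(const std::shared_ptr<arrow::ChunkedArray>& column) {
  for (int i = 0; i < column->num_chunks(); ++i) {
    std::shared_ptr<arrow::Array> chunk = column->chunk(i);
    if (chunk->type_id() == arrow::Type::DICTIONARY) {
      auto dict_chunk = std::static_pointer_cast<arrow::DictionaryArray>(chunk);
      std::cout << "chunk " << i << ": dictionary with "
                << dict_chunk->dictionary()->length() << " unique values\n";
    } else {
      // The whole column fell back to plain encoding.
      std::cout << "chunk " << i << ": plain " << chunk->type()->ToString() << "\n";
    }
  }
}
```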
Force-pushed from 1a9016e to 3ccc14a.
Rebased and comments addressed.
This is tied to type inference, and only triggers on string or binary columns. Each chunk is dict-encoded up to a certain cardinality, after which the whole column falls back on plain encoding.
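As a usage illustration, here is a sketch of enabling this from the reader's ConvertOptions. It assumes the option names auto_dict_encode and auto_dict_max_cardinality and the Result-based TableReader::Make factory, which have varied slightly across Arrow versions; the threshold value is arbitrary:

```cpp
#include <memory>
#include <string>

#include "arrow/csv/api.h"
#include "arrow/io/api.h"
#include "arrow/result.h"
#include "arrow/table.h"

// Read a CSV file, letting the reader attempt dictionary encoding of
// inferred string/binary columns up to a cardinality threshold.
arrow::Result<std::shared_ptr<arrow::Table>> ReadCsvWithAutoDict(
    const std::string& path) {
  ARROW_ASSIGN_OR_RAISE(auto input, arrow::io::ReadableFile::Open(path));

  auto read_options = arrow::csv::ReadOptions::Defaults();
  auto parse_options = arrow::csv::ParseOptions::Defaults();
  auto convert_options = arrow::csv::ConvertOptions::Defaults();
  convert_options.auto_dict_encode = true;
  // Fall back to plain encoding once a chunk's dictionary grows past this size.
  convert_options.auto_dict_max_cardinality = 1000;

  ARROW_ASSIGN_OR_RAISE(
      auto reader, arrow::csv::TableReader::Make(arrow::io::default_io_context(),
                                                 input, read_options, parse_options,
                                                 convert_options));
  return reader->Read();
}
```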
Force-pushed from 3ccc14a to 8d909fd.
What will happen if one chunk overflows the limit but others do not? Will the dictionary-encoded chunks be cast to dense?
Yes, all of them.
Got it, thanks. I wasn't sure if there were tests specifically about this, but I trust you =) +1
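To make the fallback behavior described in this exchange concrete, here is a small hedged check, assuming a table read with the options sketched earlier and a hypothetical string column named "city" whose values exceeded the cardinality threshold:

```cpp
#include <cassert>
#include <memory>

#include "arrow/api.h"

// When any chunk exceeds the cardinality threshold, every chunk of that
// column is converted back to dense strings, so the column's type is plain
// utf8 rather than dictionary(int32, utf8).
void CheckFallbackToDense(const std::shared_ptr<arrow::Table>& table) {
  auto column = table->GetColumnByName("city");  // hypothetical column name
  assert(column != nullptr);
  assert(column->type()->id() == arrow::Type::STRING);
}
```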