
[C++/Python] Writing dictionary encoded columns to parquet is extremely slow when using chunk size #21577

Closed

asfimport opened this issue Apr 2, 2019 · 1 comment

asfimport (Collaborator) commented Apr 2, 2019:

Currently, there is a workaround in place for writing dictionary-encoded columns to Parquet.

The workaround converts the dictionary-encoded array to its plain representation before writing. This is painfully slow because the entire array is converted over and over again, once for every row group.

The following example is orders of magnitude slower than the non-dictionary-encoded version:

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# A 200,000-row categorical column becomes a dictionary-encoded Arrow column.
df = pd.DataFrame({"col": ["A", "B"] * 100000}).astype("category")
table = pa.Table.from_pandas(df)

# A small chunk_size forces many row groups; the workaround re-converts
# the full column to its plain representation for each of them.
buf = pa.BufferOutputStream()
pq.write_table(table, buf, chunk_size=100)
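The scaling is easy to see: 200,000 rows at chunk_size=100 means 2,000 row groups, and the workaround decodes the full column once per row group, on the order of 4×10^8 element conversions in total. Below is a minimal sketch of that cost model, assuming the per-row-group decode behaves like a cast of the dictionary array to its plain value type (the exact internal code path is an assumption, not taken from this issue):

import time
import pyarrow as pa

arr = pa.array(["A", "B"] * 100000).dictionary_encode()
n_row_groups = 200000 // 100  # 2,000 row groups at chunk_size=100

start = time.perf_counter()
for _ in range(n_row_groups):
    # Simulate the workaround: decode the *entire* column for each row group.
    arr.cast(pa.string())
print(f"{n_row_groups} full-column decodes took {time.perf_counter() - start:.2f} s")

A single decode is cheap; repeating it for every row group is what dominates the runtime.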
 

Reporter: Florian Jetter / @fjetter
Assignee: Wes McKinney / @wesm

Note: This issue was originally created as ARROW-5089. Please see the migration documentation for further details.

asfimport (Collaborator, Author) commented:

Wes McKinney / @wesm:
This is resolved in master through my patches related to writing DictionaryArray to Parquet.

0.14.1:

In [2]: timeit pq.write_table(table, pa.BufferOutputStream(), chunk_size=100)                               
3.18 s ± 16.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [3]: timeit pq.write_table(table, pa.BufferOutputStream())                                               
6.28 ms ± 120 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

master (to be 0.15.0):

In [3]: timeit pq.write_table(table, pa.BufferOutputStream(), chunk_size=100)              
27.5 ms ± 2.49 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [4]: timeit pq.write_table(table, pa.BufferOutputStream())                              
5.8 ms ± 22.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Indeed, a difference of about 100x (3.18 s vs. 27.5 ms for the chunk_size=100 case).
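For completeness, on a release containing the fix the dictionary column also round-trips: pyarrow.parquet.read_table accepts a read_dictionary argument to return the column dictionary-encoded on read. A quick sketch of such a check (the buffer handling shown here is one way to do it, not taken from this issue):

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({"col": ["A", "B"] * 100000}).astype("category")
table = pa.Table.from_pandas(df)

buf = pa.BufferOutputStream()
pq.write_table(table, buf, chunk_size=100)

# Ask for the column back as dictionary-encoded rather than plain strings.
result = pq.read_table(pa.BufferReader(buf.getvalue()), read_dictionary=["col"])
print(result.schema.field("col").type)  # dictionary<values=string, indices=int32, ...>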
