
[C++/Python] Writing dictionary encoded columns to parquet is extremely slow when using chunk size #21577

Closed

asfimport opened this issue Apr 2, 2019 · 1 comment

asfimport (Collaborator) commented Apr 2, 2019:

Currently, there is a workaround in place for writing dictionary-encoded columns to Parquet.

The workaround converts the dictionary-encoded array to its plain representation before writing. This is painfully slow because the entire array is converted over and over again, once for every row group.

The following example is orders of magnitude slower than the non-dictionary-encoded version:

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# A 200,000-row categorical column becomes a dictionary-encoded Arrow column.
df = pd.DataFrame({"col": ["A", "B"] * 100000}).astype("category")
table = pa.Table.from_pandas(df)

# A small chunk_size forces many row groups; the workaround re-converts
# the full column to its plain representation for each of them.
buf = pa.BufferOutputStream()
pq.write_table(table, buf, chunk_size=100)
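The scaling is easy to see: 200,000 rows at chunk_size=100 means 2,000 row groups, and the workaround decodes the full column once per row group, on the order of 4×10^8 element conversions in total. Below is a minimal sketch of that cost model, assuming the per-row-group decode behaves like a cast of the dictionary array to its plain value type (the exact internal code path is an assumption, not taken from this issue):

import time
import pyarrow as pa

arr = pa.array(["A", "B"] * 100000).dictionary_encode()
n_row_groups = 200000 // 100  # 2,000 row groups at chunk_size=100

start = time.perf_counter()
for _ in range(n_row_groups):
    # Simulate the workaround: decode the *entire* column for each row group.
    arr.cast(pa.string())
print(f"{n_row_groups} full-column decodes took {time.perf_counter() - start:.2f} s")

A single decode is cheap; repeating it for every row group is what dominates the runtime.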
 

Reporter: Florian Jetter / @fjetter
Assignee: Wes McKinney / @wesm

Note: This issue was originally created as ARROW-5089. Please see the migration documentation for further details.

asfimport (Collaborator, Author) commented:

Wes McKinney / @wesm:
This is resolved in master through my patches related to writing DictionaryArray to Parquet.

0.14.1:

In [2]: timeit pq.write_table(table, pa.BufferOutputStream(), chunk_size=100)                               
3.18 s ± 16.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [3]: timeit pq.write_table(table, pa.BufferOutputStream())                                               
6.28 ms ± 120 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

master (to be 0.15.0):

In [3]: timeit pq.write_table(table, pa.BufferOutputStream(), chunk_size=100)              
27.5 ms ± 2.49 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [4]: timeit pq.write_table(table, pa.BufferOutputStream())                              
5.8 ms ± 22.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Indeed, a difference of about 100x (3.18 s vs. 27.5 ms for the chunk_size=100 case).
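For completeness, on a release containing the fix the dictionary column also round-trips: pyarrow.parquet.read_table accepts a read_dictionary argument to return the column dictionary-encoded on read. A quick sketch of such a check (the buffer handling shown here is one way to do it, not taken from this issue):

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({"col": ["A", "B"] * 100000}).astype("category")
table = pa.Table.from_pandas(df)

buf = pa.BufferOutputStream()
pq.write_table(table, buf, chunk_size=100)

# Ask for the column back as dictionary-encoded rather than plain strings.
result = pq.read_table(pa.BufferReader(buf.getvalue()), read_dictionary=["col"])
print(result.schema.field("col").type)  # dictionary<values=string, indices=int32, ...>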
