Avoid Buffering Arrow Data for Entire Row Group in parquet::ArrowWriter #3871
Comments
This ticket will improve https://github.com/influxdata/influxdb_iox/issues/7783 -- thank you for filing it. As part of this feature, I would like to request some user-definable, best-effort limit on how much memory the parquet writer will buffer (so flushing is a function of both "max_row_group_size" and a "buffer_limit"). If for some reason that is not possible or advisable, exposing the currently buffered size would be acceptable too, so external users can implement the buffer limiting themselves.
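For context, a minimal sketch of the caller-side workaround described above, assuming the writer exposes something like an `in_progress_size()` accessor and an explicit `flush()` (illustrative names, not a commitment to a final API):

```rust
use std::sync::Arc;

use arrow::array::Int64Array;
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;
use parquet::arrow::ArrowWriter;
use parquet::errors::Result;

/// Write batches, flushing a row group early whenever the writer's
/// in-progress memory estimate exceeds `buffer_limit` bytes.
fn write_with_buffer_limit(
    batches: impl IntoIterator<Item = RecordBatch>,
    schema: Arc<Schema>,
    buffer_limit: usize,
) -> Result<Vec<u8>> {
    let mut writer = ArrowWriter::try_new(Vec::new(), schema, None)?;
    for batch in batches {
        writer.write(&batch)?;
        // Hypothetical accessor: estimated memory of the not-yet-flushed
        // row group; flush early if it grows beyond the limit.
        if writer.in_progress_size() > buffer_limit {
            writer.flush()?;
        }
    }
    writer.into_inner()
}

fn main() -> Result<()> {
    let schema = Arc::new(Schema::new(vec![Field::new("v", DataType::Int64, false)]));
    let batch = RecordBatch::try_new(
        schema.clone(),
        vec![Arc::new(Int64Array::from_iter_values(0..1024))],
    )
    .expect("valid batch");
    let bytes = write_with_buffer_limit(std::iter::repeat(batch).take(8), schema, 1 << 20)?;
    println!("wrote {} bytes", bytes.len());
    Ok(())
}
```

Note the flush decision stays entirely on the caller's side here; a built-in "buffer_limit" writer property would move that check inside `write`.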
I think #4155 is a precursor to this, as it provides the necessary APIs to encode the columns separately and then stitch them together again. I therefore intend to work on it first.
I wonder if you might also think about #1718 ("encode the columns in parallel while writing parquet") while working on this.
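For what it's worth, once each column is encoded into its own buffer, parallelising across columns becomes straightforward. A toy sketch (the `encode` closure is a stand-in for the real page encoders, not an existing API):

```rust
use std::thread;

fn main() {
    // One vector of values per "ColumnChunk"; encode each on its own thread.
    let columns: Vec<Vec<i64>> = vec![(0..1000).collect(), (1000..2000).collect()];

    // Stand-in encoder: the real writer would produce parquet pages here.
    let encode = |values: &[i64]| -> Vec<u8> {
        values.iter().flat_map(|v| v.to_le_bytes()).collect()
    };

    // Because each column ends up in its own buffer, the chunks can be
    // encoded independently and stitched together in column order afterwards.
    let encoded: Vec<Vec<u8>> = thread::scope(|s| {
        let handles: Vec<_> = columns
            .iter()
            .map(|col| s.spawn(|| encode(col)))
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    });

    let row_group: Vec<u8> = encoded.into_iter().flatten().collect();
    println!("row group bytes: {}", row_group.len());
}
```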
Buffer Pages in ArrowWriter instead of RecordBatch (apache#3871) (apache#4280)
* Buffer Pages in ArrowWriter instead of RecordBatch (apache#3871)
* Review feedback
* Improved memory accounting
* Clippy
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Currently ArrowWriter buffers up RecordBatch until it has enough rows to populate an entire row group, and then proceeds to write each column in turn to the output buffer.
Describe the solution you'd like
The encoded parquet data is often orders of magnitude smaller than the corresponding arrow data. The read path goes to great lengths to allow incremental reading of data within a row group. It may therefore be desirable to instead encode arrow data eagerly, writing each ColumnChunk to its own temporary buffer, and then stitching these back together.
This would allow writing larger row groups, whilst potentially consuming less memory in the arrow writer.
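To make the shape of this concrete, here is a small standalone sketch of the scheme; `ColumnChunkBuffer`, `encode_column` and `flush_row_group` are hypothetical stand-ins for the real page encoders and row group writer, not the parquet crate's actual API:

```rust
/// Hypothetical per-column buffer: encoded pages for one ColumnChunk
/// accumulate here instead of holding the original arrow data.
#[derive(Default)]
struct ColumnChunkBuffer {
    encoded_pages: Vec<u8>,
    num_rows: usize,
}

impl ColumnChunkBuffer {
    /// Placeholder for the real page encoder: eagerly encode one column
    /// of an incoming batch and append the (much smaller) bytes.
    fn encode_column(&mut self, values: &[i64]) {
        // Stand-in "encoding": in reality this would produce parquet pages.
        for v in values {
            self.encoded_pages.extend_from_slice(&v.to_le_bytes());
        }
        self.num_rows += values.len();
    }
}

/// On flush, the per-column buffers are stitched together in column
/// order to form a single row group in the output file.
fn flush_row_group(columns: &mut [ColumnChunkBuffer], out: &mut Vec<u8>) {
    for col in columns.iter_mut() {
        out.append(&mut col.encoded_pages);
        col.num_rows = 0;
    }
}

fn main() {
    // Two-column "schema"; each write encodes immediately, so only the
    // compact encoded bytes are retained between flushes, not the arrow data.
    let mut columns = vec![ColumnChunkBuffer::default(), ColumnChunkBuffer::default()];
    let mut file = Vec::new();

    for batch in 0..4i64 {
        columns[0].encode_column(&[batch, batch + 1]);
        columns[1].encode_column(&[batch * 10, batch * 10 + 1]);
    }
    println!("buffered rows per column: {}", columns[0].num_rows);

    // Stitch the per-column buffers together to emit one row group.
    flush_row_group(&mut columns, &mut file);
    println!("row group bytes: {}", file.len());
}
```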
This would likely involve extending or possibly replacing SerializedRowGroupWriter to allow writing to the same column multiple times.
Describe alternatives you've considered
We could not do this; parquet is inherently a read-optimised format, and write performance may therefore be less of a priority for many workloads.
Additional context