Conversation

@lilianm
Contributor

@lilianm lilianm commented Oct 1, 2025

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

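Reduce memory usage when writing compressed V1 data pages: the compressed buffer is reserved at the uncompressed size, so it is shrunk to its actual length after compression.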

Are these changes tested?

N/A, already covered by existing tests

Are there any user-facing changes?

No

@github-actions github-actions bot added the parquet Changes to the parquet crate label Oct 1, 2025
Contributor

@alamb alamb left a comment

Thanks @lilianm -- this seems like a reasonable plan to me. My only concern is that it may slow down the encoder for other use cases, as shrink_to_fit may copy the bytes.

I'll start some benchmarks to see if we can see any difference.

If we do, perhaps we can apply a heuristic, such as only calling shrink_to_fit when less than half of the buffer is used.
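A minimal sketch of the "less than half used" heuristic mentioned above, in case it helps the discussion. The helper name `shrink_if_mostly_unused` is hypothetical and is not part of the parquet crate; it only uses `Vec::len`, `Vec::capacity`, and `Vec::shrink_to_fit` from the standard library.

```rust
/// Hypothetical helper (not part of the parquet crate): request a shrink only
/// when less than half of the reserved capacity is actually in use.
/// Returns `true` if `shrink_to_fit` was called.
fn shrink_if_mostly_unused(buf: &mut Vec<u8>) -> bool {
    if buf.len() < buf.capacity() / 2 {
        // The allocator may shrink in place or reallocate and copy.
        buf.shrink_to_fit();
        true
    } else {
        // Enough of the buffer is used; keep it as is and avoid a possible copy.
        false
    }
}

fn main() {
    // Reserve 1024 bytes but use only 100: the heuristic asks for a shrink.
    let mut sparse: Vec<u8> = Vec::with_capacity(1024);
    sparse.extend_from_slice(&[0u8; 100]);
    let shrunk = shrink_if_mostly_unused(&mut sparse);
    println!("sparse: shrunk={}, len={}, capacity={}", shrunk, sparse.len(), sparse.capacity());

    // Fill the whole reservation: the heuristic leaves the buffer alone.
    let mut full: Vec<u8> = Vec::with_capacity(1024);
    let cap = full.capacity();
    full.resize(cap, 0);
    let shrunk = shrink_if_mostly_unused(&mut full);
    println!("full: shrunk={}, len={}, capacity={}", shrunk, full.len(), full.capacity());
}
```

The threshold of one half is only the value floated in this comment; a different ratio could be chosen once benchmark numbers are available.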

if let Some(ref mut cmpr) = self.compressor {
    let mut compressed_buf = Vec::with_capacity(uncompressed_size);
    cmpr.compress(&buffer[..], &mut compressed_buf)?;
    compressed_buf.shrink_to_fit();
Contributor

I wonder if we should apply the same optimization to the V2 path below 🤔

Also, @mapleFU recently updated the compression check for V2 pages to use the uncompressed values if compression didn't actually reduce the size. Maybe we should apply that to V1 pages too.
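For readers unfamiliar with the V2 check mentioned above, here is a rough sketch of the idea: keep the compressed bytes only when they are strictly smaller than the input. The `PagePayload` enum and `choose_payload` function are hypothetical names for illustration and do not reflect the parquet crate's actual code. The V2 data page header carries an `is_compressed` flag so a reader can tell which form was written; whether the same approach can be carried over to V1 pages is not settled in this thread.

```rust
/// Hypothetical outcome of a "keep compression only if it helps" check.
#[derive(Debug, PartialEq)]
enum PagePayload {
    /// Compression reduced the size, so the compressed bytes are kept.
    Compressed(Vec<u8>),
    /// Compression did not help, so the original bytes are kept.
    Uncompressed(Vec<u8>),
}

/// Hypothetical helper: pick whichever buffer is smaller, preferring the
/// uncompressed bytes on a tie (no decompression work for the reader).
fn choose_payload(uncompressed: Vec<u8>, compressed: Vec<u8>) -> PagePayload {
    if compressed.len() < uncompressed.len() {
        PagePayload::Compressed(compressed)
    } else {
        PagePayload::Uncompressed(uncompressed)
    }
}

fn main() {
    // Stand-in buffers; a real writer would obtain `compressed` from a codec.
    let helped = choose_payload(vec![1u8; 100], vec![2u8; 40]);
    let not_helped = choose_payload(vec![1u8; 8], vec![2u8; 12]);
    println!("{:?}", matches!(helped, PagePayload::Compressed(_)));
    println!("{:?}", matches!(not_helped, PagePayload::Uncompressed(_)));
}
```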

Contributor Author

The cost of the copy is fairly insignificant, because memcpy runs at around 10000 MB/s while compression runs at around 600 MB/s. The underlying vector uses the allocator's shrink method: https://doc.rust-lang.org/alloc/alloc/trait.Allocator.html#method.shrink. With standard malloc, the threshold for switching to mmap allocation is 128k, and when shrinking such an allocation the system only unmaps pages, so no memory copy is needed.

In the V2 page path, the buffer is not reserved up front.

Skipping compression for a page when compression does not help could be a good idea to apply to V1 as well.
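As a rough illustration of the two points above, the sketch below shows how `Vec::with_capacity` followed by `shrink_to_fit` changes the reserved capacity, and works out the worst-case copy cost from the quoted throughputs. The function names and the 1 MiB / 64 KiB sizes are assumptions for illustration only; the throughput figures are the approximate ones quoted in this comment, not measurements.

```rust
/// Reserve `reserved` bytes, fill only `used` of them (a stand-in for
/// compressor output), then shrink. Returns (capacity before, capacity after).
fn shrink_demo(reserved: usize, used: usize) -> (usize, usize) {
    let mut compressed_buf: Vec<u8> = Vec::with_capacity(reserved);
    compressed_buf.resize(used, 0);
    let before = compressed_buf.capacity();
    // May shrink in place or reallocate, depending on the allocator; the
    // result can still keep some excess capacity.
    compressed_buf.shrink_to_fit();
    (before, compressed_buf.capacity())
}

/// Upper bound on copy time relative to compression time, assuming the copy
/// moves no more bytes than the compressor consumed:
/// (n / memcpy_rate) / (n / compress_rate) = compress_rate / memcpy_rate.
fn copy_overhead_ratio(memcpy_mb_per_s: f64, compress_mb_per_s: f64) -> f64 {
    compress_mb_per_s / memcpy_mb_per_s
}

fn main() {
    // Assumed sizes: 1 MiB reserved, 64 KiB actually produced.
    let (before, after) = shrink_demo(1 << 20, 64 * 1024);
    println!("capacity before shrink: {} bytes", before);
    println!("capacity after shrink:  {} bytes", after);

    // With the figures quoted above (10000 MB/s memcpy, 600 MB/s compression).
    let ratio = copy_overhead_ratio(10_000.0, 600.0);
    println!("worst-case copy overhead: {:.1}% of compression time", ratio * 100.0);
}
```

With those figures the ratio is 600 / 10000 = 0.06, so even if every shrink copied the full buffer, the copy would add at most about 6% on top of the compression time.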

@alamb
Contributor

alamb commented Oct 2, 2025

🤖 ./gh_compare_arrow.sh Benchmark Script Running
Linux aal-dev 6.14.0-1016-gcp #17~24.04.1-Ubuntu SMP Wed Sep 3 01:55:36 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing parquet_writer_1_0_compressed_memory_improvement (e79a01e) to ca3b3be diff
BENCH_NAME=arrow_writer
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental --bench arrow_writer
BENCH_FILTER=
BENCH_BRANCH_NAME=parquet_writer_1_0_compressed_memory_improvement
Results will be posted here when complete

@alamb
Contributor

alamb commented Oct 2, 2025

🤖: Benchmark completed

Details

group                                                                     main                                   parquet_writer_1_0_compressed_memory_improvement
-----                                                                     ----                                   ------------------------------------------------
write_batch nested/4096 values primitive list                             1.00   1850.0±9.93µs  1152.4 MB/sec    1.00   1852.5±8.16µs  1150.8 MB/sec
write_batch nested/4096 values primitive list non-null                    1.01      2.0±0.02ms  1039.7 MB/sec    1.00      2.0±0.02ms  1053.8 MB/sec
write_batch primitive/4096 values bool                                    1.00     64.3±0.15µs    16.5 MB/sec    1.02     65.4±0.55µs    16.2 MB/sec
write_batch primitive/4096 values bool non-null                           1.00     30.2±0.08µs    18.9 MB/sec    1.01     30.5±0.38µs    18.8 MB/sec
write_batch primitive/4096 values float with NaNs                         1.00    553.3±1.15µs    99.3 MB/sec    1.01    560.0±5.97µs    98.2 MB/sec
write_batch primitive/4096 values primitive                               1.01    790.4±3.49µs   222.6 MB/sec    1.00    780.4±5.14µs   225.4 MB/sec
write_batch primitive/4096 values primitive non-null                      1.00    596.5±2.69µs   289.2 MB/sec    1.00    596.8±5.66µs   289.1 MB/sec
write_batch primitive/4096 values primitive non-null with bloom filter    1.00      7.9±0.25ms    21.9 MB/sec    1.00      7.9±0.26ms    21.8 MB/sec
write_batch primitive/4096 values primitive with bloom filter             1.02      8.0±0.20ms    22.1 MB/sec    1.00      7.8±0.25ms    22.5 MB/sec
write_batch primitive/4096 values string                                  1.01    384.6±0.74µs   328.1 MB/sec    1.00    381.4±5.32µs   330.9 MB/sec
write_batch primitive/4096 values string dictionary                       1.00    513.5±3.47µs  2009.8 MB/sec    1.04    533.5±4.79µs  1934.4 MB/sec
write_batch primitive/4096 values string dictionary with bloom filter     1.00    764.2±8.25µs  1350.4 MB/sec    1.05    800.3±6.90µs  1289.6 MB/sec
write_batch primitive/4096 values string non-null                         1.00  1352.2±12.07µs  1513.9 MB/sec    1.01  1363.5±10.95µs  1501.3 MB/sec
write_batch primitive/4096 values string non-null with bloom filter       1.01      2.2±0.03ms   939.4 MB/sec    1.00      2.2±0.01ms   951.0 MB/sec
write_batch primitive/4096 values string with bloom filter                1.00  1587.3±18.29µs  1290.3 MB/sec    1.03  1631.7±17.31µs  1255.2 MB/sec

@lilianm
Contributor Author

lilianm commented Oct 2, 2025

@alamb I can't find a benchmark with compression enabled in the write_batch tests. Maybe I am missing something.

@alamb
Contributor

alamb commented Oct 3, 2025

> @alamb I can't find a benchmark with compression enabled in the write_batch tests. Maybe I am missing something.

You are probably right -- could you make a new PR to add such a benchmark so that we can evaluate the performance impact of this one?

@lilianm
Contributor Author

lilianm commented Oct 6, 2025

@alamb you can find it on issue #8559

@alamb
Contributor

alamb commented Oct 6, 2025

I updated this PR to get the new parquet writer benchmarks and am rerunning them. Thank you @lilianm

@alamb
Contributor

alamb commented Oct 6, 2025

🤖 ./gh_compare_arrow.sh Benchmark Script Running
Linux aal-dev 6.14.0-1016-gcp #17~24.04.1-Ubuntu SMP Wed Sep 3 01:55:36 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing parquet_writer_1_0_compressed_memory_improvement (1b967d9) to 760b7b6 diff
BENCH_NAME=arrow_writer
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental --bench arrow_writer
BENCH_FILTER=
BENCH_BRANCH_NAME=parquet_writer_1_0_compressed_memory_improvement
Results will be posted here when complete

@alamb
Contributor

alamb commented Oct 6, 2025

🤖: Benchmark completed

Details

group                                     main                                    parquet_writer_1_0_compressed_memory_improvement
-----                                     ----                                    ------------------------------------------------
bool/bloom_filter                         1.00    120.4±0.83µs     8.8 MB/sec     1.02    122.9±1.24µs     8.6 MB/sec
bool/default                              1.00     52.8±0.13µs    20.1 MB/sec     1.04     54.8±0.20µs    19.4 MB/sec
bool/parquet_2                            1.00     67.8±0.20µs    15.6 MB/sec     1.01     68.2±0.52µs    15.6 MB/sec
bool/zstd                                 1.00     63.6±0.20µs    16.7 MB/sec     1.04     66.1±0.52µs    16.0 MB/sec
bool/zstd_parquet_2                       1.00     78.0±0.42µs    13.6 MB/sec     1.01     78.5±0.38µs    13.5 MB/sec
bool_non_null/bloom_filter                1.00     97.5±1.27µs     5.9 MB/sec     1.01     98.1±0.68µs     5.8 MB/sec
bool_non_null/default                     1.00     20.1±0.18µs    28.5 MB/sec     1.04     20.9±0.10µs    27.4 MB/sec
bool_non_null/parquet_2                   1.05     37.6±0.13µs    15.2 MB/sec     1.00     35.9±0.11µs    16.0 MB/sec
bool_non_null/zstd                        1.00     28.9±0.15µs    19.8 MB/sec     1.02     29.4±0.15µs    19.5 MB/sec
bool_non_null/zstd_parquet_2              1.04     46.9±0.29µs    12.2 MB/sec     1.00     45.1±0.35µs    12.7 MB/sec
float_with_nans/bloom_filter              1.00    873.2±3.85µs    62.9 MB/sec     1.00    872.5±6.83µs    63.0 MB/sec
float_with_nans/default                   1.01    524.3±1.58µs   104.8 MB/sec     1.00    520.7±1.10µs   105.6 MB/sec
float_with_nans/parquet_2                 1.01    777.7±2.46µs    70.7 MB/sec     1.00    772.8±3.52µs    71.1 MB/sec
float_with_nans/zstd                      1.00    696.6±1.83µs    78.9 MB/sec     1.00    696.0±2.93µs    79.0 MB/sec
float_with_nans/zstd_parquet_2            1.00    954.5±3.20µs    57.6 MB/sec     1.00    952.3±3.45µs    57.7 MB/sec
list_primitive/bloom_filter               1.00      2.3±0.02ms   918.5 MB/sec     1.01      2.3±0.02ms   910.1 MB/sec
list_primitive/default                    1.00  1670.1±18.55µs  1276.4 MB/sec     1.01   1682.9±6.91µs  1266.7 MB/sec
list_primitive/parquet_2                  1.00   1701.1±7.61µs  1253.2 MB/sec     1.01   1717.4±8.63µs  1241.3 MB/sec
list_primitive/zstd                       1.00      2.9±0.01ms   731.2 MB/sec     1.01      3.0±0.02ms   721.8 MB/sec
list_primitive/zstd_parquet_2             1.00      3.0±0.01ms   722.0 MB/sec     1.00      3.0±0.01ms   720.2 MB/sec
list_primitive_non_null/bloom_filter      1.00      2.7±0.02ms   798.7 MB/sec     1.02      2.7±0.04ms   780.7 MB/sec
list_primitive_non_null/default           1.00   1732.2±8.49µs  1228.1 MB/sec     1.00  1737.1±11.70µs  1224.6 MB/sec
list_primitive_non_null/parquet_2         1.00   1885.8±8.84µs  1128.1 MB/sec     1.03  1936.8±15.06µs  1098.3 MB/sec
list_primitive_non_null/zstd              1.00      3.9±0.02ms   548.9 MB/sec     1.02      3.9±0.03ms   538.7 MB/sec
list_primitive_non_null/zstd_parquet_2    1.00      4.0±0.02ms   533.1 MB/sec     1.03      4.1±0.03ms   515.9 MB/sec
primitive/bloom_filter                    1.00  1816.5±749.62µs    96.9 MB/sec    2.26      4.1±0.16ms    42.8 MB/sec
primitive/default                         1.00    683.5±3.84µs   257.4 MB/sec     1.00    686.3±3.48µs   256.4 MB/sec
primitive/parquet_2                       1.00    692.1±2.20µs   254.2 MB/sec     1.00    692.4±2.71µs   254.1 MB/sec
primitive/zstd                            1.00    977.8±3.14µs   179.9 MB/sec     1.02    995.3±5.29µs   176.8 MB/sec
primitive/zstd_parquet_2                  1.00    933.5±4.66µs   188.5 MB/sec     1.01    938.9±9.78µs   187.4 MB/sec
primitive_non_null/bloom_filter           1.00  1495.1±25.62µs   115.4 MB/sec     1.00  1490.6±28.22µs   115.7 MB/sec
primitive_non_null/default                1.01    508.2±3.12µs   339.5 MB/sec     1.00    501.1±2.57µs   344.2 MB/sec
primitive_non_null/parquet_2              1.01    512.4±3.34µs   336.7 MB/sec     1.00    505.2±3.61µs   341.5 MB/sec
primitive_non_null/zstd                   1.00    793.4±9.19µs   217.4 MB/sec     1.00    794.9±6.21µs   217.0 MB/sec
primitive_non_null/zstd_parquet_2         1.00    795.6±6.51µs   216.8 MB/sec     1.01    802.4±3.79µs   215.0 MB/sec
string/bloom_filter                       1.00   1257.8±8.01µs  1628.3 MB/sec     1.00  1258.8±12.72µs  1627.0 MB/sec
string/default                            1.00    776.0±6.71µs     2.6 GB/sec     1.00    777.4±4.40µs     2.6 GB/sec
string/parquet_2                          1.00    774.7±7.51µs     2.6 GB/sec     1.01    786.3±3.53µs     2.5 GB/sec
string/zstd                               1.00      2.3±0.01ms   885.1 MB/sec     1.00      2.3±0.01ms   882.0 MB/sec
string/zstd_parquet_2                     1.00      2.3±0.01ms   882.8 MB/sec     1.01      2.3±0.01ms   877.2 MB/sec
string_and_binary_view/bloom_filter       1.01    598.1±7.81µs   211.0 MB/sec     1.00    591.2±8.36µs   213.5 MB/sec
string_and_binary_view/default            1.00    350.3±1.01µs   360.2 MB/sec     1.01    352.3±4.58µs   358.3 MB/sec
string_and_binary_view/parquet_2          1.00    351.8±3.68µs   358.8 MB/sec     1.01    355.1±4.64µs   355.4 MB/sec
string_and_binary_view/zstd               1.00    602.0±2.03µs   209.6 MB/sec     1.00    602.5±1.98µs   209.5 MB/sec
string_and_binary_view/zstd_parquet_2     1.00    594.4±1.60µs   212.3 MB/sec     1.00    596.7±2.40µs   211.5 MB/sec
string_dictionary/bloom_filter            1.00    615.7±5.43µs  1676.1 MB/sec     1.00    616.4±5.65µs  1674.2 MB/sec
string_dictionary/default                 1.00    393.8±2.63µs     2.6 GB/sec     1.00    394.2±5.61µs     2.6 GB/sec
string_dictionary/parquet_2               1.01    395.6±2.76µs     2.5 GB/sec     1.00    393.2±1.69µs     2.6 GB/sec
string_dictionary/zstd                    1.00   1165.8±9.33µs   885.2 MB/sec     1.00   1163.8±7.16µs   886.7 MB/sec
string_dictionary/zstd_parquet_2          1.00   1156.5±5.75µs   892.3 MB/sec     1.00   1161.9±7.90µs   888.2 MB/sec
string_non_null/bloom_filter              1.00  1757.0±22.35µs  1165.1 MB/sec     1.00  1761.2±23.67µs  1162.4 MB/sec
string_non_null/default                   1.00  1126.2±10.36µs  1817.7 MB/sec     1.00  1131.6±11.18µs  1809.1 MB/sec
string_non_null/parquet_2                 1.00  1137.1±14.10µs  1800.3 MB/sec     1.00  1140.4±10.03µs  1795.1 MB/sec
string_non_null/zstd                      1.00      3.2±0.02ms   639.9 MB/sec     1.00      3.2±0.02ms   637.6 MB/sec
string_non_null/zstd_parquet_2            1.00      3.2±0.01ms   640.7 MB/sec     1.01      3.2±0.02ms   632.4 MB/sec

Contributor

@alamb alamb left a comment

My analysis of the benchmark results is that this PR does not change performance significantly.

Thank you @lilianm for your contribution and attention to detail 🦾

@alamb alamb merged commit 2273cdc into apache:main Oct 7, 2025
16 checks passed
@alamb
Contributor

alamb commented Oct 7, 2025

Thanks again @lilianm

Labels

parquet Changes to the parquet crate performance

Development

Successfully merging this pull request may close these issues.

[Parquet] Over memory consumation for writer page v1 compressed

2 participants