
Conversation

Member

@klion26 klion26 commented Jul 24, 2025

This commit reuses the parent buffer for ListBuilder, so that it doesn't need to copy the buffer when finishing the builder.

Which issue does this PR close?

Closes #7977 ([Variant] Avoid extra allocation in list builder).

Rationale for this change

This PR avoids the extra buffer allocation in ListBuilder.

What changes are included in this PR?

  • Reuse the parent's buffer when creating a ListBuilder; all contents are written directly to the parent's buffer
  • When ListBuilder::finish is called, fill in the header for the current list in the parent's buffer
  • In drop, roll back the bytes that have been written into the parent's buffer if ListBuilder::finish has not been called (see the sketch after this list)
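
A minimal sketch of the rollback idea (the field names parent_value_offset_base and has_been_finished come from the PR; parent_buffer and the truncate call are illustrative stand-ins for the real parent-state plumbing, which also rolls back the metadata buffer):

struct ListBuilder<'a> {
    // Illustrative stand-in for the real parent-state/buffer plumbing.
    parent_buffer: &'a mut Vec<u8>,
    // Offset in the parent's value buffer where this list started.
    parent_value_offset_base: usize,
    // Set by `finish`; while false, `drop` undoes any partial write.
    has_been_finished: bool,
}

impl Drop for ListBuilder<'_> {
    fn drop(&mut self) {
        if !self.has_been_finished {
            // Roll back everything this unfinished list wrote into the parent.
            self.parent_buffer.truncate(self.parent_value_offset_base);
        }
    }
}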

Are these changes tested?

The change is covered by existing tests, mainly test_nested_list_with_heterogeneous_fields_for_buffer_reuse.

Are there any user-facing changes?

No

@github-actions github-actions bot added the parquet Changes to the parquet crate label Jul 24, 2025
Member Author

klion26 commented Jul 24, 2025

@alamb @scovich @viirya, please help review this when you're free, thanks.

I've created benchmarks for various implementations. The current implementation is the winner, and the alternatives are

  1. Current implementation with PackedU32Iterator (a sketch of this iterator follows the list)
  2. Splice with Iterator (code here)
  3. Collect the header with iterator before splice (code here)
  4. Splice with actual header bytes (code here)
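
For context, the PackedU32Iterator in alternative 1 packs each u32 down to its low offset_size little-endian bytes. A minimal sketch of what such an iterator could look like (the PR's actual implementation may differ):

struct PackedU32Iterator<I: Iterator<Item = [u8; 4]>> {
    nbytes: usize,          // how many low-order bytes of each u32 to emit
    inner: I,               // little-endian byte arrays, one per u32
    current_item: [u8; 4],
    pos: usize,             // next byte index; == nbytes means exhausted
}

impl<I: Iterator<Item = [u8; 4]>> PackedU32Iterator<I> {
    fn new(nbytes: usize, inner: I) -> Self {
        assert!((1..=4).contains(&nbytes));
        Self { nbytes, inner, current_item: [0; 4], pos: nbytes }
    }
}

impl<I: Iterator<Item = [u8; 4]>> Iterator for PackedU32Iterator<I> {
    type Item = u8;
    fn next(&mut self) -> Option<u8> {
        if self.pos == self.nbytes {
            // Fetch the LE bytes of the next u32.
            let next_item = self.inner.next()?;
            self.current_item = next_item;
            self.pos = 0;
        }
        let byte = self.current_item[self.pos];
        self.pos += 1;
        Some(byte)
    }
}

Note that a hand-rolled iterator like this reports the default size_hint of (0, None) unless it overrides it, which becomes relevant to the splice-based alternatives discussed below.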

The benchmark comparison results from my laptop.

The steps are:

  1. Created all four branches with modifications
  2. Ran cargo bench --features=arrow,async,test_common,experimental --bench variant_kernels -- --save-baseline $BRANCH_NAME on each branch
  3. Ran critcmp main $BRANCH_NAME to get the comparison result

1 PackedU32 Iterator

group                                                                7977_packedu32_iterator                main
-----                                                                -----------------------                ----
batch_json_string_to_variant json_list 8k string                     1.00     41.7±5.53ms        ? ?/sec    1.22     51.0±7.14ms        ? ?/sec
batch_json_string_to_variant random_json(2633 bytes per document)    1.00   414.0±41.45ms        ? ?/sec    1.11   458.7±48.08ms        ? ?/sec
batch_json_string_to_variant repeated_struct 8k string               1.00     15.7±2.04ms        ? ?/sec    1.01     15.9±1.67ms        ? ?/sec
variant_get_primitive                                                1.09      2.7±0.34ms        ? ?/sec    1.00      2.5±0.28ms        ? ?/sec

2 Splice with Iterator

group                                                                7977_avoid_allocation_for_list_builder    main
-----                                                                --------------------------------------    ----
batch_json_string_to_variant json_list 8k string                     1.00     46.7±6.23ms        ? ?/sec       1.09     51.0±7.14ms        ? ?/sec
batch_json_string_to_variant random_json(2633 bytes per document)    1.00   418.0±42.38ms        ? ?/sec       1.10   458.7±48.08ms        ? ?/sec
batch_json_string_to_variant repeated_struct 8k string               1.00     15.9±1.97ms        ? ?/sec       1.00     15.9±1.67ms        ? ?/sec
variant_get_primitive                                                1.01      2.5±0.28ms        ? ?/sec       1.00      2.5±0.28ms        ? ?/sec

3 Collect the header with the iterator before splice

group                                                                7977_collect_before_splice             main
-----                                                                --------------------------             ----
batch_json_string_to_variant json_list 8k string                     1.00     46.4±4.60ms        ? ?/sec    1.10     51.0±7.14ms        ? ?/sec
batch_json_string_to_variant random_json(2633 bytes per document)    1.00   424.5±43.27ms        ? ?/sec    1.08   458.7±48.08ms        ? ?/sec
batch_json_string_to_variant repeated_struct 8k string               1.00     15.9±1.83ms        ? ?/sec    1.00     15.9±1.67ms        ? ?/sec
variant_get_primitive                                                1.02      2.5±0.31ms        ? ?/sec    1.00      2.5±0.28ms        ? ?/sec

4 Splice with actual header bytes

group                                                                7977_fill_before_splice                main
-----                                                                -----------------------                ----
batch_json_string_to_variant json_list 8k string                     1.00     45.1±2.68ms        ? ?/sec    1.13     51.0±7.14ms        ? ?/sec
batch_json_string_to_variant random_json(2633 bytes per document)    1.00   419.6±40.92ms        ? ?/sec    1.09   458.7±48.08ms        ? ?/sec
batch_json_string_to_variant repeated_struct 8k string               1.04     16.5±1.20ms        ? ?/sec    1.00     15.9±1.67ms        ? ?/sec
variant_get_primitive                                                1.12      2.8±0.26ms        ? ?/sec    1.00      2.5±0.28ms        ? ?/sec

Member Author

Verified that the drop for ListBuilder was covered with cargo llvm-cov --html test -p parquet-variant


@klion26 klion26 force-pushed the 7977_avoid_extra_buffer_with_packedu32_iterator branch from c209728 to 51e3fa9 on July 24, 2025 05:04
Member Author

Added clone() here to make the compiler happy; otherwise the compiler throws cannot move out of type ListBuilder<'_>, which implements the Drop trait.

Contributor

Just do self.offsets.iter().map(|&offset| ...), relying on the fact that u32 is Copy -- instead of cloning the whole vec?

Contributor

Or,

let offsets = std::mem::take(&mut self.offsets).into_iter();
let offsets = offsets.map(|offset| (offset as u32).to_le_bytes());
let offsets = PackedU32Iterator::new(offset_size as usize, offsets);

@klion26 klion26 force-pushed the 7977_avoid_extra_buffer_with_packedu32_iterator branch from 51e3fa9 to e4603c1 on July 24, 2025 05:18
Contributor

@scovich scovich left a comment


Very nice!

I've created benchmarks for various implementations.

Do the benchmarks cover different offset sizes, is_large true/false, etc? Or are they always the same offset size?

The current implementation is the winner, and the alternatives are

  1. Current implementation with PackedU32Iterator

This one unnecessarily clones the offsets array; based on the other benchmark results, I would expect removing that to speed up the runs by ~4ms.

  2. Splice with Iterator (code here)

This one will perform poorly because the chained iterator doesn't infer an accurate lower bound, so Vec::splice has to shift bytes twice (once to fit the lower bound, and again to fix the remainder).
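
For reference, a standalone illustration of the loose lower bound (not PR code): any iterator that doesn't override size_hint reports (0, None), so splice cannot pre-open a gap of the right size in a single pass.

struct Zeros(usize);

impl Iterator for Zeros {
    type Item = u8;
    fn next(&mut self) -> Option<u8> {
        if self.0 == 0 {
            return None;
        }
        self.0 -= 1;
        Some(0)
    }
    // No size_hint override, so the default (0, None) applies.
}

fn main() {
    assert_eq!(Zeros(16).size_hint(), (0, None));
}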

  3. Collect the header with iterator before splice (code here)

No reason to expect this would be faster than 2/, because it allocates and immediately consumes an extra Vec

  4. Splice with actual header bytes (code here)

This is still iterator-based like 1/, but with all the unsafety of indexing into a pre-allocated temp buffer (and the overhead of allocating said temp buffer).

A fifth approach would be to use the packed u32 iterator from 1/, and splice in a pre-populated temp buffer like 4/, but to populate the temp buffer by push+extend calls instead of chain+collect:

let mut bytes_to_splice = vec![header];
  ...
bytes_to_splice.extend(num_elements_bytes);
  ...
bytes_to_splice.extend(offsets);
  ...
bytes_to_splice.extend(data_size_bytes);
buffer
    .inner_mut()
    .splice(starting_offset..starting_offset, bytes_to_splice);

I would expect that to outperform 4/ and possibly match 1/, but not necessarily outperform clone-free 1/.

A sixth approach would also use a pre-populated temp buffer, but ditch the packed u32 iterator from 1/ and just directly append the bytes:

fn append_packed_u32(dest: &mut Vec<u8>, value: u32, value_bytes: usize) {
    let n = dest.len() + value_bytes;
    dest.extend(value.to_le_bytes());
    dest.truncate(n);
}

// Calculated header size becomes a hint; being wrong only risks extra allocations.
// Make sure to reserve enough capacity to handle the extra bytes we'll truncate.
let mut bytes_to_splice = Vec::with_capacity(header_size + 3);
bytes_to_splice.push(header);

append_packed_u32(&mut bytes_to_splice, num_elements, if is_large { 4 } else { 1 });

for offset in std::mem::take(&mut self.offsets) {
    append_packed_u32(&mut bytes_to_splice, offset as u32, offset_size as usize);
}

append_packed_u32(&mut bytes_to_splice, data_size as u32, offset_size as usize);

buffer
    .inner_mut()
    .splice(starting_offset..starting_offset, bytes_to_splice);

This one should be a lot faster than a chained iterator (and works equally well regardless of how many bytes we pack to), but pays for the extra temp buffer allocation. I suspect it will be faster than even optimized 1/, but the extra allocation may prove too expensive.
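
A quick sanity check of the truncation trick in append_packed_u32 (standalone snippet, assuming the definition above):

fn main() {
    let mut dest = Vec::new();
    // 0x0403_0201 is [0x01, 0x02, 0x03, 0x04] in little-endian order.
    append_packed_u32(&mut dest, 0x0403_0201, 3);
    assert_eq!(dest, [0x01, 0x02, 0x03]); // only the low 3 bytes survive
}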

Comment on lines 1115 to 1116
let next_item = self.iterator.next()?;
self.current_item = next_item;
Contributor

Split into two statements due to lifetime issues, I suppose?

Member Author

klion26 commented Jul 25, 2025

@scovich thanks for the detailed review and suggestion.

Do the benchmarks cover different offset sizes, is_large true/false, etc? Or are they always the same offset size?

The benchmarks contain lists of varying lengths, but all lengths are less than 255 -- is_large is always false.

I've run the following benchmarks

  1. Previous approach 1, clone-free
  2. Fifth approach
  3. Sixth approach

The results show that the sixth is better. I've updated the implementation to the sixth approach.

The steps to generate the results:

  1. Change the previous approach 1 to be clone-free
  2. Implement the fifth approach
  3. Implement the sixth approach
  4. Run cargo bench --features=arrow,async,test_common,experimental --bench variant_kernels -- --save-baseline $BRANCH_NAME for the three approaches and the main branch, one by one
  5. Run critcmp main ${BRANCH_NAME} for the three approaches

previous 1 with clone-free

group                                                                7977_avoid_extra_buffer_with_packedu32_iterator    main
-----                                                                -----------------------------------------------    ----
batch_json_string_to_variant json_list 8k string                     1.00     33.5±1.17ms        ? ?/sec                1.10     36.8±1.49ms        ? ?/sec
batch_json_string_to_variant random_json(2633 bytes per document)    1.00    338.9±7.71ms        ? ?/sec                1.10    373.5±8.07ms        ? ?/sec
batch_json_string_to_variant repeated_struct 8k string               1.02     13.6±0.45ms        ? ?/sec                1.00     13.3±0.47ms        ? ?/sec
variant_get_primitive                                                1.00      2.1±0.07ms        ? ?/sec                1.01      2.1±0.07ms        ? ?/sec

fifth approach

group                                                                7977_pre_populate_with_push_exten      main
-----                                                                ---------------------------------      ----
batch_json_string_to_variant json_list 8k string                     1.00     35.8±1.52ms        ? ?/sec    1.03     36.8±1.49ms        ? ?/sec
batch_json_string_to_variant random_json(2633 bytes per document)    1.00    342.9±9.65ms        ? ?/sec    1.09    373.5±8.07ms        ? ?/sec
batch_json_string_to_variant repeated_struct 8k string               1.00     13.1±0.46ms        ? ?/sec    1.01     13.3±0.47ms        ? ?/sec
variant_get_primitive                                                1.00      2.1±0.07ms        ? ?/sec    1.00      2.1±0.07ms        ? ?/sec
code for fifth approach
    let header = array_header(is_large, offset_size);

    let mut bytes_to_splice = vec![header];
    let num_elements_bytes =
        num_elements
            .to_le_bytes()
            .into_iter()
            .take(if is_large { 4 } else { 1 });
    bytes_to_splice.extend(num_elements_bytes);
    let offsets = PackedU32Iterator::new(
        offset_size as usize,
        self.offsets
            .iter()
            .map(|&offset| (offset as u32).to_le_bytes()),
    );
    bytes_to_splice.extend(offsets);
    let data_size_bytes = data_size
        .to_le_bytes()
        .into_iter()
        .take(offset_size as usize);
    bytes_to_splice.extend(data_size_bytes);

    buffer
        .inner_mut()
        .splice(starting_offset..starting_offset, bytes_to_splice);

sixth approach

group                                                                7977_pre_populate_with_directly_append_bytes    main
-----                                                                --------------------------------------------    ----
batch_json_string_to_variant json_list 8k string                     1.00     33.7±1.21ms        ? ?/sec             1.09     36.8±1.49ms        ? ?/sec
batch_json_string_to_variant random_json(2633 bytes per document)    1.00    333.4±7.89ms        ? ?/sec             1.12    373.5±8.07ms        ? ?/sec
batch_json_string_to_variant repeated_struct 8k string               1.00     13.2±0.46ms        ? ?/sec             1.01     13.3±0.47ms        ? ?/sec
variant_get_primitive                                                1.00      2.1±0.08ms        ? ?/sec             1.00      2.1±0.07ms        ? ?/sec

Contributor

alamb commented Jul 25, 2025

🤖 ./gh_compare_arrow.sh Benchmark Script Running
Linux aal-dev 6.11.0-1016-gcp #16~24.04.1-Ubuntu SMP Wed May 28 02:40:52 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing 7977_avoid_extra_buffer_with_packedu32_iterator (bec3ba8) to ec81db3 diff
BENCH_NAME=variant_kernels
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental --bench variant_kernels
BENCH_FILTER=
BENCH_BRANCH_NAME=7977_avoid_extra_buffer_with_packedu32_iterator
Results will be posted here when complete

Contributor

@alamb alamb left a comment


Thanks @klion26 -- this is looking very nice.

I had a question about the use of splice vs just shifting the vec over and appending the bytes. However, I think this PR is already an improvement over what is on main, so we could also merge it as is and revisit the allocations.

I also kicked off the benchmarks and hopefully we'll see some good results

parent_value_offset_base: usize,
/// The starting offset in the parent's metadata buffer where this list starts
/// used to truncate the written fields in `drop` if the current list has not been finished
parent_metadata_offset_base: usize,
Contributor

this is a good idea

let data_size = self.buffer.offset();
let buffer = self.parent_state.buffer();

let data_size = buffer.offset() - self.parent_value_offset_base;
Contributor

should we do a checked sub here to avoid underflow? An underflow would only happen with a bug in the implementation so this is probably fine

Member Author

fixed
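
The fix presumably looks something like the following (a hedged sketch; the exact expression and panic message are assumptions, not the merged code):

let data_size = buffer
    .offset()
    .checked_sub(self.parent_value_offset_base)
    .expect("parent buffer offset moved backwards");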

let metadata = VariantMetadata::try_new(&metadata).unwrap();
assert_eq!(metadata.len(), 1);
assert_eq!(&metadata[0], "name"); // not rolled back
assert!(metadata.is_empty()); // rolled back
Contributor

nice!

Contributor

alamb commented Jul 25, 2025

🤖: Benchmark completed


group                                                                7977_avoid_extra_buffer_with_packedu32_iterator    main
-----                                                                -----------------------------------------------    ----
batch_json_string_to_variant json_list 8k string                     1.00     28.3±0.13ms        ? ?/sec                1.02     28.8±0.10ms        ? ?/sec
batch_json_string_to_variant random_json(2633 bytes per document)    1.00    335.7±1.21ms        ? ?/sec                1.09    366.4±4.37ms        ? ?/sec
batch_json_string_to_variant repeated_struct 8k string               1.00      8.2±0.02ms        ? ?/sec                1.00      8.2±0.02ms        ? ?/sec
variant_get_primitive                                                1.00   1359.2±2.98µs        ? ?/sec                1.03   1396.4±3.18µs        ? ?/sec

Contributor

@scovich scovich left a comment


Looks great!

The results show that the sixth is better. I've updated the implementation to the sixth approach.

Maybe we should update the PR description as well?

I had a question about the use of splice vs just shifting the vec over and appending the bytes

What do you mean by "shifting" and "appending" sorry? The buffer already contains the value bytes by the time we know the header info, so AFAIK we only have three choices:

  1. Guess (correctly!) beforehand how many header bytes are needed, and allocate space for them before appending the value bytes. Error-prone unless splice is used to replace the pre-allocated space with the actual header bytes.
  2. Directly splice in the header bytes (what this PR does). Safe, but still has to shift bytes over.
  3. Splice in a zero-byte region of the correct size to shift the bytes, and then loop back over in order to populate the region. Error-prone but doesn't need a temp vector.

Were you referring to one of the above? Or something else?


let header_size = 1 + // header
if is_large { 4 } else { 1 } + // is_large
(self.offsets.len() + 1) * offset_size as usize; // offsets and data size
Contributor

Suggested change
(self.offsets.len() + 1) * offset_size as usize; // offsets and data size
(num_elements + 1) * offset_size as usize; // offsets and data size

Member Author

Fixed

Comment on lines 1224 to 1225
let header_size = 1 + // header
if is_large { 4 } else { 1 } + // is_large
Contributor

Suggested change
let header_size = 1 + // header
if is_large { 4 } else { 1 } + // is_large
let num_elements_size = if is_large { 4 } else { 1 };
let header_size = 1 + // header
num_elements_size + // num_elements

(and then can reuse num_elements_size below)

Member Author

fixed

append_packed_u32(
&mut bytes_to_splice,
num_elements as u32,
if is_large { 4 } else { 1 },
Contributor

Suggested change
if is_large { 4 } else { 1 },
num_elements_size,

Member Author

fixed

Comment on lines 1120 to 1122
parent_value_offset_base: offset_base,
has_been_finished: false,
parent_metadata_offset_base: meta_offset_base,
Contributor

If we're writing out explicit field: value pairs anyway, why not just fold in the logic directly?

Suggested change
parent_value_offset_base: offset_base,
has_been_finished: false,
parent_metadata_offset_base: meta_offset_base,
parent_value_offset_base: parent_state.buffer_current_offset(),
has_been_finished: false,
parent_metadata_offset_base: parent_state.metadata_current_offset(),

Alternatively, the let above could give the correct name from the start, so it can just be passed directly:

Suggested change
parent_value_offset_base: offset_base,
has_been_finished: false,
parent_metadata_offset_base: meta_offset_base,
parent_value_offset_base,
has_been_finished: false,
parent_metadata_offset_base,

Member Author

Changed the local variable name. The current implementation aims to make the compiler happy, as parent_state has been moved earlier (it is the first parameter).

buf[start_pos..start_pos + nbytes as usize].copy_from_slice(&bytes[..nbytes as usize]);
}

/// Append `value_bytes` of given `value` into `dest`.
Member

value_bytes is the byte width of the value?

Suggested change
/// Append `value_bytes` of given `value` into `dest`.
/// Append `value_bytes` bytes of given `value` into `dest`.

Contributor

Or we could just call it value_size like most of the other parts of the code do?

Member Author

fixed

Comment on lines +1228 to +1229
// Calculated header size becomes a hint; being wrong only risks extra allocations.
// Make sure to reserve enough capacity to handle the extra bytes we'll truncate.
Member

Hmm, can we rephrase the comment? I don't quite get what it means. Do you mean that header_size is just a hint and we will allocate extra space?

Member

When will header_size be incorrect?

Contributor

When will header_size be incorrect?

The size is calculated separately, and then the actual bytes are appended. That opens up a bug surface -- any time the two disagree, header_size will be wrong. If the code directly relied on the size being correct, e.g. because we allocate that many bytes and then index them, we could produce a bad variant value (either because there's an extra run of inserted bytes, or because of a buffer overflow while indexing). But because the calculated size is only a capacity hint for the vec, the cost of being wrong is very low.

let starting_offset = self.parent_value_offset_base;

let header_size = 1 + // header
if is_large { 4 } else { 1 } + // is_large
Member

Suggested change
if is_large { 4 } else { 1 } + // is_large
if is_large { 4 } else { 1 } + // is_large: 4 bytes, else 1 byte.

Member Author

fixed

let starting_offset = parent_buffer.offset();
let starting_offset = self.parent_value_offset_base;

let header_size = 1 + // header
Member

Suggested change
let header_size = 1 + // header
let header_size = 1 + // header (i.e., `array_header`)

Member Author

fixed

Contributor

alamb commented Jul 26, 2025

What do you mean by "shifting" and "appending" sorry? The buffer already contains the value bytes by the time we know the header info, so AFAIK we only have three choices:

  1. Guess (correctly!) beforehand how many header bytes are needed, and allocate space for them before appending the value bytes. Error-prone unless splice is used to replace the pre-allocated space with the actual header bytes.
  2. Directly splice in the header bytes (what this PR does). Safe, but still has to shift bytes over.
  3. Splice in a zero-byte region of the correct size to shift the bytes, and then loop back over in order to populate the region. Error-prone but doesn't need a temp vector.

Were you referring to one of the above? Or something else?

I meant 3. specifically, https://github.com/apache/arrow-rs/pull/7987/files#diff-19c7b0b0d73ef11489af7932f49046a19ec7790896a8960add5a3ded21d5657aR1230 ( I thought I left a specific comment about this but I can't find it now 🤔 )

Basically rather than allocating a new temporary vector to create the header and then splicing those bytes in like

        let mut bytes_to_splice = Vec::with_capacity(header_size + 3);
        // .... build header
        // splice
        buffer
            .inner_mut()
            .splice(starting_offset..starting_offset, bytes_to_splice);

I meant avoiding that allocation by shifting the bytes over in one go and then writing directly into the output buffer:

        // insert header_size bytes of zeros into the output in one go, shifting existing bytes down
        buffer.splice(starting_offset..starting_offset, std::iter::repeat(0u8).take(header_size));
        // write header directly into buffer[starting_offset], buffer[starting_offset+1], etc

This looks somewhat similar to what @klion26 did in

Splice with Iterator (code here)

Though in that example the header is created during the insertion

My suggestion is to merge this PR as is and then we can fiddle around with potential other optimizations as a follow-on PR

Contributor

alamb commented Jul 26, 2025

@klion26 it looks like there are some good suggestions from @viirya and @scovich -- so I will wait to merge this PR until you have a chance to review them. I think it would be fine to either

  1. merge this PR as is and address suggestions as a follow-on
  2. address the suggestions directly before merging

Please let us know what you prefer

Contributor

scovich commented Jul 26, 2025

I meant avoiding that allocation by shifting the bytes over in one go and then writing directly into the output buffer:

        // insert header_size bytes of zeros into the output in one go, shifting existing bytes down
        buffer.splice(starting_offset..starting_offset, std::iter::repeat(0u8).take(header_size));
        // write header directly into buffer[starting_offset], buffer[starting_offset+1], etc

Yeah, IIRC that was the original approach, but I had cautioned that calculating the splice size incorrectly would cause subsequent indexing to corrupt the variant (either by leaving unused zeros or by overflowing the spliced region). And since the original approach was anyway using vec![0u8; header_length] as the source of zeros, I suggested populating the vec directly instead. Safe and not more expensive.

It could be that not allocating the temp buffer at all does improve performance even further, tho it would come with the risk of corruption if the spliced region were ever the wrong size.

Contributor

alamb commented Jul 27, 2025

It could be that not allocating the temp buffer at all does improve performance even further, tho it would come with the risk of corruption if the spliced region were ever the wrong size.

I agree

So I think we should proceed with the approach in this PR, and then we can go with the even-fewer-allocations approach if we can get some benchmarks that show it makes a measurable difference.

Member Author

@klion26 klion26 left a comment


@alamb @scovich @viirya Thanks for the review. I've addressed the comments. Sorry for the late response—I was out yesterday and am just back today.

@alamb alamb merged commit 73c3e97 into apache:main Jul 28, 2025
12 checks passed
Contributor

alamb commented Jul 28, 2025

Thanks again @klion26 @scovich @viirya 🚀

@klion26 klion26 deleted the 7977_avoid_extra_buffer_with_packedu32_iterator branch July 29, 2025 06:48
Member Author

klion26 commented Jul 29, 2025

@alamb @scovich @viirya thanks very much for the review and merging!

@klion26 klion26 restored the 7977_avoid_extra_buffer_with_packedu32_iterator branch July 29, 2025 09:38

Labels

parquet Changes to the parquet crate


Development

Successfully merging this pull request may close these issues.

[Variant] Avoid extra allocation in list builder
