[Variant] Avoid extra buffer allocation in ListBuilder #7987
Conversation
@alamb @scovich @viirya, please help review this when you're free, thanks. I've created benchmarks for various implementations. The current implementation is the winner, and the alternatives are:
The benchmark comparison result from my laptop:
1. PackedU32 Iterator
2. Splice with ...
parquet-variant/src/builder.rs
Outdated
Force-pushed from c209728 to 51e3fa9
parquet-variant/src/builder.rs
Outdated
Added clone() here to make the compiler happy; otherwise the compiler throws: cannot move out of type ListBuilder<'_>, which implements the Drop trait
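For context, a self-contained illustration of that error (Holder is a made-up type, not the builder's real code): a field cannot be moved out of a type that implements Drop, so it has to be cloned or taken instead.

struct Holder { data: Vec<u32> }
impl Drop for Holder {
    fn drop(&mut self) {}
}

fn consume(h: Holder) -> Vec<u32> {
    // let data = h.data; // error[E0509]: cannot move out of type `Holder`,
    //                    // which implements the `Drop` trait
    h.data.clone() // cloning sidesteps the move; std::mem::take on a &mut field is the allocation-free alternative
}

fn main() {
    let h = Holder { data: vec![1, 2, 3] };
    assert_eq!(consume(h), vec![1, 2, 3]);
}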
Just do self.offsets.iter().map(|&offset| ...), relying on the fact that u32 is Copy -- instead of cloning the whole vec?
Or,
let offsets = std::mem::take(&mut self.offsets).into_iter();
let offsets = offsets.map(|offset| (offset as u32).to_le_bytes());
let offsets = PackedU32Iterator::new(offset_size as usize, offsets);
Force-pushed from 51e3fa9 to e4603c1
Very nice!
I've created benchmarks for various implementations.
Do the benchmarks cover different offset sizes, is_large true/false, etc? Or are they always the same offset size?
The current implementation is the winner, and the alternatives are

- Current implementation with PackedU32Iterator
  This one unnecessarily clones the offsets array; based on the other benchmark results, I would expect removing that to speed up the runs by ~4ms.
- Splice with Iterator (code here)
  This one will perform poorly because the chained iterator doesn't infer an accurate lower bound, so Vec::splice has to shift bytes twice (once to fit the lower bound, and again to fix the remainder); see the sketch after this list.
- Collect the header with iterator before splice (code here)
  No reason to expect this would be faster than 2/, because it allocates and immediately consumes an extra Vec.
- Splice with actual header bytes (code here)
  This is still iterator-based like 1/, but with all the unsafety of indexing into a pre-allocated temp buffer (and the overhead of allocating said temp buffer).
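To make the lower-bound point concrete, here is a standalone sketch (not the PR's PackedU32Iterator or chained iterator): flattening per-offset byte arrays leaves the iterator unable to promise how many bytes remain, so Vec::splice cannot pre-shift by the exact amount.

fn main() {
    let offsets: Vec<u32> = vec![10, 20, 30];
    // Flatten each offset into its little-endian bytes.
    let bytes = offsets.iter().flat_map(|o| o.to_le_bytes());
    // The true remaining length is 12 bytes, but the flattened iterator can
    // only report a loose lower bound, so a splice may have to shift twice.
    let (lower, upper) = bytes.size_hint();
    assert!(lower < 12);
    assert_eq!(upper, None);
}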
A fifth approach would be to use the packed u32 iterator from 1/, and splice in a pre-populated temp buffer like 5/, but to populate the temp buffer by push+extend calls instead of chain+collect:
let mut bytes_to_splice = vec![header];
...
bytes_to_splice.extend(num_elements_bytes);
...
bytes_to_splice.extend(offsets);
...
bytes_to_splice.extend(data_size_bytes);
buffer
    .inner_mut()
    .splice(starting_offset..starting_offset, bytes_to_splice);

I would expect that to outperform 5/ and possibly match 1/, but not necessarily outperform clone-free 1/.
A sixth approach would also use a pre-populated temp buffer, but ditch the packed u32 iterator from 1/ and just directly append the bytes:
fn append_packed_u32(dest: &mut Vec<u8>, value: u32, value_bytes: usize) {
let n = dest.len() + value_bytes;
dest.extend(value.to_le_bytes());
dest.truncate(n);
}
// Calculated header size becomes a hint; being wrong only risks extra allocations.
// Make sure to reserve enough capacity to handle the extra bytes we'll truncate.
let mut bytes_to_splice = Vec::with_capacity(header_size + 3);
bytes_to_splice.push(header);
append_packed_u32(&mut bytes_to_splice, num_elements, if is_large { 4 } else { 1 });
for offset in std::mem::take(&mut self.offsets) {
append_packed_u32(&mut bytes_to_splice, offset as u32, offset_size as usize);
}
append_packed_u32(&mut bytes_to_splice, data_size as u32, offset_size as usize);
buffer
    .inner_mut()
    .splice(starting_offset..starting_offset, bytes_to_splice);

This one should be a lot faster than a chained iterator (and works equally well regardless of how many bytes we pack to), but pays for the extra temp buffer allocation. I suspect it will be faster than even optimized 1/, but the extra allocation may prove too expensive.
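For reference, a standalone check of the append_packed_u32 sketch above (the helper is copied verbatim so the snippet runs on its own):

fn append_packed_u32(dest: &mut Vec<u8>, value: u32, value_bytes: usize) {
    let n = dest.len() + value_bytes;
    dest.extend(value.to_le_bytes());
    dest.truncate(n);
}

fn main() {
    let mut dest = Vec::new();
    append_packed_u32(&mut dest, 0x0102, 2); // keep only the two low little-endian bytes
    append_packed_u32(&mut dest, 7, 1);      // a one-byte field
    assert_eq!(dest, [0x02, 0x01, 0x07]);
}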
parquet-variant/src/builder.rs
Outdated
let next_item = self.iterator.next()?;
self.current_item = next_item;
Split into two statements due to lifetime issues, I suppose?
@scovich thanks for the detailed review and suggestion.
The benchmark contains lists of varying lengths, but all the lengths are less than 255 -- I've run the following benchmarks.
The results show that the previous ...
alamb left a comment
Thanks @klion26 -- this is looking very nice.
I had a question about the use of splice vs just shifting the vec over and appending the bytes. However, I think this PR is already an improvement over what is on main, so we could also merge it as is and revisit the allocations.
I also kicked off the benchmarks and hopefully we'll see some good results
parent_value_offset_base: usize,
/// The starting offset in the parent's metadata buffer where this list starts
/// used to truncate the written fields in `drop` if the current list has not been finished
parent_metadata_offset_base: usize,
this is a good idea
parquet-variant/src/builder.rs
Outdated
let data_size = self.buffer.offset();
let buffer = self.parent_state.buffer();

let data_size = buffer.offset() - self.parent_value_offset_base;
should we do a checked sub here to avoid underflow? An underflow would only happen with a bug in the implementation so this is probably fine
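For reference, the checked subtraction could look roughly like this (a sketch; buffer_offset and parent_value_offset_base stand in for the values in the diff above, and the panic message is illustrative):

fn data_size(buffer_offset: usize, parent_value_offset_base: usize) -> usize {
    buffer_offset
        .checked_sub(parent_value_offset_base)
        .expect("parent buffer offset moved backwards; this indicates a builder bug")
}

fn main() {
    assert_eq!(data_size(120, 100), 20);
}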
fixed
let metadata = VariantMetadata::try_new(&metadata).unwrap();
assert_eq!(metadata.len(), 1);
assert_eq!(&metadata[0], "name"); // not rolled back
assert!(metadata.is_empty()); // rolled back
nice!
🤖: Benchmark completed
scovich left a comment
Looks great!
The results show that the sixth is better. I've updated the implementation to the sixth approach.
Maybe we should update the PR description as well?
I had a question about the use of splice vs just shifting the vec over and appending the bytes
What do you mean by "shifting" and "appending" sorry? The buffer already contains the value bytes by the time we know the header info, so AFAIK we only have three choices:
- Guess (correctly!) beforehand how many header bytes are needed, and allocate space for them before appending the value bytes. Error-prone unless splice is used to replace the pre-allocated space with the actual header bytes (a sketch of this option follows this comment).
- Directly splice in the header bytes (what this PR does). Safe, but still has to shift bytes over.
- Splice in a zero-byte region of the correct size to shift the bytes, and then loop back over in order to populate the region. Error-prone but doesn't need a temp vector.
Were you referring to one of the above? Or something else?
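As promised above, a self-contained sketch of the first choice (all sizes and bytes here are made up for illustration): reserve placeholder bytes up front, append the value bytes, then splice the real header over the placeholder range, which replaces in place without shifting as long as the guess was exact.

fn main() {
    let guessed_header_size = 2;
    let mut buf = vec![0u8; guessed_header_size]; // placeholder header bytes
    buf.extend_from_slice(&[0xAA, 0xBB, 0xCC]);   // value bytes appended afterwards
    let header = [0x01, 0x02];                    // actual header, same size as the guess
    buf.splice(0..guessed_header_size, header);   // replace the placeholders in place
    assert_eq!(buf, [0x01, 0x02, 0xAA, 0xBB, 0xCC]);
}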
parquet-variant/src/builder.rs
Outdated
let header_size = 1 + // header
    if is_large { 4 } else { 1 } + // is_large
    (self.offsets.len() + 1) * offset_size as usize; // offsets and data size
Suggested change:
-    (self.offsets.len() + 1) * offset_size as usize; // offsets and data size
+    (num_elements + 1) * offset_size as usize; // offsets and data size
Fixed
parquet-variant/src/builder.rs
Outdated
let header_size = 1 + // header
    if is_large { 4 } else { 1 } + // is_large
Suggested change:
-    let header_size = 1 + // header
-    if is_large { 4 } else { 1 } + // is_large
+    let num_elements_size = if is_large { 4 } else { 1 };
+    let header_size = 1 + // header
+        num_elements_size + // num_elements
(and then can reuse num_elements_size below)
fixed
parquet-variant/src/builder.rs
Outdated
append_packed_u32(
    &mut bytes_to_splice,
    num_elements as u32,
    if is_large { 4 } else { 1 },
Suggested change:
-    if is_large { 4 } else { 1 },
+    num_elements_size,
fixed
parquet-variant/src/builder.rs
Outdated
parent_value_offset_base: offset_base,
has_been_finished: false,
parent_metadata_offset_base: meta_offset_base,
If we're anyway doing field: value here, why not just fold in the logic directly?
Suggested change:
-    parent_value_offset_base: offset_base,
-    has_been_finished: false,
-    parent_metadata_offset_base: meta_offset_base,
+    parent_value_offset_base: parent_state.buffer_current_offset(),
+    has_been_finished: false,
+    parent_metadata_offset_base: parent_state.metadata_current_offset(),
Alternatively, the let above could give the correct name from the start, so it can just be passed directly:
Suggested change:
-    parent_value_offset_base: offset_base,
-    has_been_finished: false,
-    parent_metadata_offset_base: meta_offset_base,
+    parent_value_offset_base,
+    has_been_finished: false,
+    parent_metadata_offset_base,
I have changed the local variable name; the current implementation aims to make the compiler happy, as parent_state has been moved before (it is the first parameter).
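To illustrate the constraint (with made-up State/Builder types, not the crate's real ones): once the value is moved into the struct literal's first field, it can no longer be used to compute the later fields, so the offsets are read into locals first.

struct State { value_len: usize, metadata_len: usize }
struct Builder { state: State, parent_value_offset_base: usize, parent_metadata_offset_base: usize }

fn new_builder(state: State) -> Builder {
    // Read the offsets before `state` is moved into the struct literal below;
    // folding these reads into the literal after the move would not compile.
    let parent_value_offset_base = state.value_len;
    let parent_metadata_offset_base = state.metadata_len;
    Builder { state, parent_value_offset_base, parent_metadata_offset_base }
}

fn main() {
    let b = new_builder(State { value_len: 10, metadata_len: 3 });
    assert_eq!(b.parent_value_offset_base, 10);
    assert_eq!(b.parent_metadata_offset_base, b.state.metadata_len);
}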
parquet-variant/src/builder.rs
Outdated
buf[start_pos..start_pos + nbytes as usize].copy_from_slice(&bytes[..nbytes as usize]);
}

/// Append `value_bytes` of given `value` into `dest`.
value_bytes is the byte width of the value?
Suggested change:
-    /// Append `value_bytes` of given `value` into `dest`.
+    /// Append `value_bytes` bytes of given `value` into `dest`.
Or we could just call it value_size like most of the other parts of the code do?
fixed
// Calculated header size becomes a hint; being wrong only risks extra allocations.
// Make sure to reserve enough capacity to handle the extra bytes we'll truncate.
Hmm, can we rephrase the comment? I don't quite get what it means. Are you trying to say that header_size is just a hint and we will allocate extra space?
When would header_size be incorrect?
When would header_size be incorrect?
The size is calculated separately, and then the actual bytes are appended. That opens up a bug surface -- any time the two disagree, header_size will be wrong. If the code directly relied on the size being correct, e.g. because we allocate that many bytes and then index them, we could produce a bad variant value (either because there's an extra run of inserted bytes, or because of a buffer overflow while indexing). But because the calculated size is only a capacity hint for the vec, the cost of being wrong is very low.
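A tiny standalone illustration of that point: a capacity hint that turns out to be wrong never corrupts a Vec's contents, it only changes whether the Vec reallocates while growing.

fn main() {
    let mut v: Vec<u8> = Vec::with_capacity(2); // deliberately too small a hint
    v.extend_from_slice(&[1, 2, 3, 4, 5]);      // the Vec reallocates as needed
    assert_eq!(v, [1, 2, 3, 4, 5]);             // contents are still exactly right
}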
parquet-variant/src/builder.rs
Outdated
let starting_offset = self.parent_value_offset_base;

let header_size = 1 + // header
    if is_large { 4 } else { 1 } + // is_large
Suggested change:
-    if is_large { 4 } else { 1 } + // is_large
+    if is_large { 4 } else { 1 } + // is_large: 4 bytes, else 1 byte.
fixed
parquet-variant/src/builder.rs
Outdated
let starting_offset = parent_buffer.offset();
let starting_offset = self.parent_value_offset_base;

let header_size = 1 + // header
Suggested change:
-    let header_size = 1 + // header
+    let header_size = 1 + // header (i.e., `array_header`)
fixed
I meant 3. specifically, https://github.com/apache/arrow-rs/pull/7987/files#diff-19c7b0b0d73ef11489af7932f49046a19ec7790896a8960add5a3ded21d5657aR1230 (I thought I left a specific comment about this but I can't find it now 🤔). Basically, rather than allocating a new temporary vector to create the header and then splicing those bytes in, like

let mut bytes_to_splice = Vec::with_capacity(header_size + 3);
// .... build header
// splice
buffer
    .inner_mut()
    .splice(starting_offset..starting_offset, bytes_to_splice);

I meant avoiding that allocation by shifting the bytes over in one go and then writing directly into the output buffer:

// insert header_size bytes into the output, shifting existing bytes down
buffer.splice(starting_offset..starting_offset, std::iter::repeat(0u8).take(header_size));
// write header directly into buffer[starting_offset], buffer[starting_offset+1], etc.

This looks somewhat similar to what @klion26 did in ...
Though in that example the header is created during the insertion.
My suggestion is to merge this PR as is and then we can fiddle around with potentially other optimizations as a follow-on PR.
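For concreteness, a runnable version of that zero-splice idea (insert_header_in_place is a hypothetical helper, not code from this PR): shift the existing value bytes once by inserting zeros, then overwrite the reserved region with the real header.

fn insert_header_in_place(buffer: &mut Vec<u8>, starting_offset: usize, header: &[u8]) {
    let header_size = header.len();
    // Insert header_size zero bytes at starting_offset; the tail of the buffer
    // only needs to move once, since take(header_size) reports an exact size hint.
    buffer.splice(
        starting_offset..starting_offset,
        std::iter::repeat(0u8).take(header_size),
    );
    // Overwrite the zeroed region with the actual header bytes.
    buffer[starting_offset..starting_offset + header_size].copy_from_slice(header);
}

fn main() {
    let mut buf = vec![0xAA, 0xBB, 0xCC]; // pretend these are already-written value bytes
    insert_header_in_place(&mut buf, 0, &[0x01, 0x02]);
    assert_eq!(buf, [0x01, 0x02, 0xAA, 0xBB, 0xCC]);
}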
@klion26 it looks like there are some good suggestions from @viirya and @scovich -- so I will wait to merge this PR until you have a chance to review them. I think it would be fine to either ...
Please let us know what you prefer.
Yeah, IIRC that was the original approach, but I had cautioned that calculating the splice size incorrectly would cause subsequent indexing to corrupt the variant (either by leaving unused zeros or by overflowing the spliced region). And since the original approach was anyway using ..., it could be that not allocating the temp buffer at all does improve performance even further, though it would come with the risk of corruption if the spliced region were ever the wrong size.
I agree. So I think we should proceed with the approach in this PR, and then we can go with the even-fewer-allocations approach if we can get some benchmarks that show it makes any measurable difference.
klion26 left a comment
This commit will reuse parent buffer for ListBuilder, so that it doesn't need to copy the buffer when finishing the builder.
Which issue does this PR close?
We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax.
Rationale for this change
This PR wants to avoid the extra buffer allocation in ListBuilder.
What changes are included in this PR?
- In ListBuilder, all contents will be written to the buffer of the parent directly
- In ListBuilder::finish, we'll fill the header for the current list in the parent's buffer
- Roll back the written contents in drop if ListBuilder::finish has not been called (see the sketch after this list)
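As referenced in the list above, an illustrative sketch of the rollback-on-drop behavior (minimal hypothetical types, not the parquet-variant API): the builder records where it started writing into the parent buffer and truncates back to that point on drop unless finish() was called.

struct MiniListBuilder<'a> {
    parent: &'a mut Vec<u8>,
    start: usize,
    finished: bool,
}

impl<'a> MiniListBuilder<'a> {
    fn new(parent: &'a mut Vec<u8>) -> Self {
        let start = parent.len();
        MiniListBuilder { parent, start, finished: false }
    }
    fn append_value(&mut self, byte: u8) {
        self.parent.push(byte); // contents go straight into the parent buffer
    }
    fn finish(mut self) {
        // The real builder splices the list header into the parent here.
        self.finished = true;
    }
}

impl<'a> Drop for MiniListBuilder<'a> {
    fn drop(&mut self) {
        if !self.finished {
            // Unfinished list: roll back everything written since `start`.
            self.parent.truncate(self.start);
        }
    }
}

fn main() {
    let mut buf = vec![1u8, 2];
    {
        let mut b = MiniListBuilder::new(&mut buf);
        b.append_value(9);
        // dropped without finish(): the 9 is rolled back
    }
    assert_eq!(buf, [1, 2]);

    let mut b = MiniListBuilder::new(&mut buf);
    b.append_value(9);
    b.finish();
    assert_eq!(buf, [1, 2, 9]);
}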
Are these changes tested?
The change is covered mainly by existing tests and by test_nested_list_with_heterogeneous_fields_for_buffer_reuse.
Are there any user-facing changes?
No