[Parquet] perf: Create StructArrays directly rather than via ArrayData (1% improvement) #9120
Conversation
force-pushed cdceb7e to 75ea40b
run benchmark arrow_reader arrow_reader_clickbench
I suspect this is the biggest offender in terms of overhead. The same thing can be done to the other readers, which I think will also reduce some overhead (a single allocation).
🤖
🤖: Benchmark completed (Details)

🤖: Benchmark completed (Details)
My analysis of the results: not a huge win, but some improvement.
The reason I think this will actually help ClickBench is that the ClickBench dataset has 105 columns. I could probably make a benchmark that shows this helping for reading really wide tables.
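To make the wide-table intuition concrete, here is a minimal sketch (hypothetical `wide_struct` helper, not code from this PR) that builds a `StructArray` with ClickBench-like width, so any per-child cost is paid roughly 100 times per batch:

```rust
use std::sync::Arc;
use arrow_array::{ArrayRef, Int32Array, StructArray};
use arrow_schema::{DataType, Field, Fields};

/// Hypothetical helper: a struct array with `num_columns` Int32 children,
/// so per-child costs (e.g. one Vec allocation each) dominate.
fn wide_struct(num_columns: usize, num_rows: usize) -> StructArray {
    let fields: Fields = (0..num_columns)
        .map(|i| Field::new(format!("c{i}"), DataType::Int32, false))
        .collect();
    let children: Vec<ArrayRef> = (0..num_columns)
        .map(|_| Arc::new(Int32Array::from(vec![0; num_rows])) as ArrayRef)
        .collect();
    // Direct construction: the child ArrayRefs are moved in as-is.
    StructArray::new(fields, children, None)
}

fn main() {
    let array = wide_struct(105, 8192); // ClickBench-like width
    assert_eq!(array.num_columns(), 105);
}
```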
```diff
 }
-array_data_builder = array_data_builder.null_bit_buffer(Some(bitmap_builder.into()));
+nulls = Some(NullBuffer::new(bitmap_builder.finish()));
```
`NullBuffer::new` counts the set bits, but so do the existing code paths.
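For reference, a minimal sketch of the counting in question, assuming arrow-rs's public `BooleanBufferBuilder`/`NullBuffer` APIs: constructing the `NullBuffer` eagerly computes the null count from the bitmap.

```rust
use arrow_buffer::{BooleanBufferBuilder, NullBuffer};

fn main() {
    let mut bits = BooleanBufferBuilder::new(4);
    bits.append_slice(&[true, false, true, true]);

    // NullBuffer::new scans the finished bitmap to compute the null
    // count (the number of unset bits) up front.
    let nulls = NullBuffer::new(bits.finish());
    assert_eq!(nulls.null_count(), 1);
}
```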
```rust
.child_data(
    children_array
        .into_iter()
        .map(|x| x.into_data())
```
Converting the child arrays into `ArrayData` is wasteful for at least 2 reasons:
- They are just converted back to `ArrayRef`s below
- Each child `ArrayData` has at least one new allocation (the `Vec` of buffers)
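A minimal sketch of that round trip, assuming arrow-rs's public `into_data`/`make_array` APIs (illustrative, not the PR's code): the intermediate `ArrayData` owns freshly allocated `Vec`s, only to be converted straight back into an `ArrayRef`.

```rust
use std::sync::Arc;
use arrow_array::{make_array, Array, ArrayRef, Int32Array};

fn main() {
    let child: ArrayRef = Arc::new(Int32Array::from(vec![1, 2, 3]));

    // into_data() materializes an ArrayData, which owns fresh Vecs of
    // buffers (and child data); make_array then just converts it back.
    let data = child.into_data();
    let round_tripped: ArrayRef = make_array(data);
    assert_eq!(round_tripped.len(), 3);
}
```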
This "regression" comes up twice in a row? |
scovich left a comment
LGTM. I'm struggling to see how this change could have regressed performance on any benchmark?
I was able to reproduce a smaller regression (about 1%) -- I'll see what I can find.

```
cargo bench --features="arrow async" --bench arrow_reader_clickbench -- Q20
```

Here it is with this PR: (Details)

Here it is on this branch: (Details)
I'll rerun the benchmarks and hopefully we'll see an improvement.
run benchmark arrow_reader_clickbench

🤖

🤖: Benchmark completed (Details)

run benchmark arrow_reader_clickbench

🤖

🤖: Benchmark completed (Details)
Now Q14/Q21 show a reproducible slowdown on the VM.
🤔 Given my experience with allocation-related performance in the boolean kernels (e.g. what led to …), I think we should conclude the difference is related to the allocator state. I will take one more look at the code paths involved as well, and run the benchmarks again, on the VM and locally.
run benchmark arrow_reader_clickbench

🤖
Here are my measurements on my machine:

```
cargo bench --features="arrow async" --bench arrow_reader_clickbench -- Q21
```

Merge base vs. this branch: (table of results)
Same measurements on Q21/Q14, merge base vs. this branch: (table of results)
🤖: Benchmark completed (Details)
I re-reviewed the code paths involved (and had codex do the same) and we found no additional allocations or work, so I am going to claim this is an improvement via inspection and discount the measurements. Thank you @scovich and @Dandandan for the reviews.
This seems... odd. A new …

Oh...
Maybe it's worth considering an alternative approach to the builder construction/finish protocol? Instead of allocating capacity on creation (and re-allocating on finish), allocate on first touch. Every append to a builder has to do capacity checks anyway, so the compiler should inline and optimize the branching so that the first-touch check is basically free in the common case where the allocation is already large enough? But looking at the code:

```rust
pub fn with_capacity(capacity: usize) -> Self {
    let capacity = bit_util::round_upto_multiple_of_64(capacity);
    let layout = Layout::from_size_align(capacity, ALIGNMENT)
        .expect("failed to create layout for MutableBuffer");
    let data = match layout.size() {
        0 => dangling_ptr(),
        _ => {
            // Safety: Verified size != 0
            let raw_ptr = unsafe { std::alloc::alloc(layout) };
            NonNull::new(raw_ptr).unwrap_or_else(|| handle_alloc_error(layout))
        }
    };
    Self {
        data,
        len: 0,
        layout,
        #[cfg(feature = "pool")]
        reservation: std::sync::Mutex::new(None),
    }
}
```
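For illustration, a toy sketch of the "allocate on first touch" idea (hypothetical `LazyBuffer`, not arrow-rs code): the first-touch branch folds into the capacity check every append performs anyway.

```rust
/// Hypothetical builder buffer that defers its allocation until the
/// first append, instead of allocating in `with_capacity`.
struct LazyBuffer {
    data: Vec<u8>,        // stand-in for the raw allocation
    capacity_hint: usize, // remembered from with_capacity
}

impl LazyBuffer {
    fn with_capacity(capacity: usize) -> Self {
        // No allocation here: creating (and dropping) an unused
        // builder costs nothing.
        Self { data: Vec::new(), capacity_hint: capacity }
    }

    fn push(&mut self, byte: u8) {
        // Every append needs a capacity check anyway; the first-touch
        // branch is predictable and essentially free once allocated.
        if self.data.capacity() == 0 {
            self.data.reserve(self.capacity_hint.max(1));
        }
        self.data.push(byte);
    }
}

fn main() {
    let mut buf = LazyBuffer::with_capacity(1024);
    buf.push(42);
    assert!(buf.data.capacity() >= 1024);
}
```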
…rayData` (#9122)

# Which issue does this PR close?

- Related to #9061
- Part of #9128

# Rationale for this change

Similarly to #9120, creating arrays via `ArrayData` / `make_array` has overhead (at least 2 `Vec` allocations) compared to simply creating the arrays directly.

# What changes are included in this PR?

Update the parquet reader to create `PrimitiveArray`s directly.

# Are these changes tested?

By CI

# Are there any user-facing changes?
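A minimal sketch of the direct construction #9122 describes, assuming arrow-rs's public `PrimitiveArray::new` API: values and nulls go straight into the array, with no intermediate `ArrayData`.

```rust
use arrow_array::{types::Int32Type, Array, PrimitiveArray};
use arrow_buffer::ScalarBuffer;

fn main() {
    let values = ScalarBuffer::<i32>::from(vec![1, 2, 3]);

    // Build the array from its value buffer directly; no ArrayData
    // (and none of its Vec allocations) is involved.
    let array: PrimitiveArray<Int32Type> = PrimitiveArray::new(values, None);
    assert_eq!(array.len(), 3);
}
```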
Yeah, I am not sure how expensive that is. I'll make a follow-on PR to document how this works. I also wonder why the builders don't just have a (consuming) build method 🤔
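On the consuming-build question, a toy sketch of that shape (hypothetical `IntBuilder`; arrow-rs builders instead expose `finish(&mut self)`, which resets the builder for reuse):

```rust
/// Hypothetical builder with a consuming build step.
struct IntBuilder {
    values: Vec<i32>,
}

impl IntBuilder {
    fn append(&mut self, v: i32) {
        self.values.push(v);
    }

    /// Consuming `self` moves the values out directly: there is no
    /// empty builder left behind, so no reset or re-allocation logic.
    fn build(self) -> Vec<i32> {
        self.values
    }
}

fn main() {
    let mut b = IntBuilder { values: Vec::new() };
    b.append(7);
    assert_eq!(b.build(), vec![7]);
}
```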
Which issue does this PR close?

- `make_array` performance: #9061

Rationale for this change

I noticed on #9061 that there is non-trivial overhead to create struct arrays. I am trying to improve `make_array` in parallel, but @tustvold had an even better idea in #9058 (comment).

What changes are included in this PR?

Update the parquet `StructArray` reader (used for the top-level `RecordBatch`) to directly construct `StructArray` rather than using `ArrayData`.

Are these changes tested?

By existing CI.

Benchmarks show a small repeatable improvement of a few percent.

I am pretty sure this is because the ClickBench dataset has more than 100 columns. Creating such a struct array requires cloning 100 `ArrayData` (one for each child), each of which has a `Vec`, so this saves (at least) 100 allocations per batch.

Are there any user-facing changes?