ARROW-9922: [Rust] Add StructArray::TryFrom (+40%) #8118
Maybe I am misunderstanding, but I think there may be a flaw with this approach and we're not comparing apples with apples when looking at the benchmarks. The original code is dynamically building a struct using the builder. The new code starts with a |
I thought that Regardless, the reason I used this approach was that I looked through the code where we use Builders, and I found two main inputs:
In all cases, we use the builders to append rows row-by-row:
Based on this analysis, I thought that:
|
|
Sorry, I should have been clearer with my comments. I was referring to the code samples in the PR description. The first example uses the builders to push data into the contiguous buffers that will eventually become the arrays. The second example builds an intermediate data structure (the Vec) and then copies from the Vec into buffers, so there is 2x the memory usage plus additional memory allocations and copies. Maybe I'm reading too much into these examples though. Aside from that, I can't see how we can remove the StructBuilder. The builder has semantics that I don't think we can cover by creating a struct from individual ArrayRefs representing fields. For example, how would we append a null struct versus appending a struct with null fields?
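To make the distinction in that last question concrete, here is a minimal sketch in plain Rust. It does not use the arrow crate: `ToyStructColumn`, `append` and `append_null` are hypothetical names for illustration, and validity is modelled as `Vec<bool>` rather than a packed bitmap.

```rust
/// Toy model of a two-field struct column (hypothetical, not the arrow crate's types).
/// Each child keeps its own validity, and the struct keeps a separate one.
#[derive(Debug, Default)]
struct ToyStructColumn {
    struct_validity: Vec<bool>,
    a: Vec<i32>,
    a_validity: Vec<bool>,
    b: Vec<i32>,
    b_validity: Vec<bool>,
}

impl ToyStructColumn {
    /// Append a valid struct whose fields may individually be null.
    fn append(&mut self, a: Option<i32>, b: Option<i32>) {
        self.struct_validity.push(true);
        self.push_fields(a, b);
    }

    /// Append a null struct; the children still get a (null) slot so lengths stay aligned.
    fn append_null(&mut self) {
        self.struct_validity.push(false);
        self.push_fields(None, None);
    }

    fn push_fields(&mut self, a: Option<i32>, b: Option<i32>) {
        self.a_validity.push(a.is_some());
        self.a.push(a.unwrap_or_default());
        self.b_validity.push(b.is_some());
        self.b.push(b.unwrap_or_default());
    }
}
```

In this model, `append(None, None)` and `append_null()` leave the children in the same state and differ only in the struct-level validity, which is the information that cannot be recovered from the fields alone.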
I will roll back the builder and associated tests. Just for my understanding, since I do not know the background: is it fair to say that the primary purpose of builders is to enable users to create Arrow Arrays without having to fiddle with Buffers, ArrayData? I.e. they are like helpers.
|
Yes, that's correct. These are the main high-level APIs for users to create
Arrow structures. The convenience methods that convert from Vec are great
for examples and tests but I would guess that they are not widely used for
code that has high performance requirements.
…On Sun, Sep 6, 2020 at 3:34 PM Jorge Leitao ***@***.***> wrote:
Aside from that, I can't see how we can remove the StructBuilder though.
The builder has semantics that I don't think we can cover by creating a
struct from individual ArrayRefs representing fields. For example, how
would we append a null struct versus appending a struct with null fields?
I will roll back the builder and associated tests.
Just for my understanding, since I do not know the background: is it fair
to say that the primary purpose of builders is to enable users to create
Arrow Arrays without having to fiddle with Buffers, ArrayData? I.e. they
are like helpers.
|
That is interesting. So, in terms of performance, is it fair to rank them as:
|
|
@nevi-me and @andygrove, I reverted the change to the builder, so that this is an additive PR. @andygrove, regarding dynamically building the array, note that a StructArray is almost entirely composed of child data: the struct itself is a null bitmap and some pointers. Therefore, the cost of building a Struct will always be driven by the allocation of those buffers. With that said, you are right that during the creation of the fields, the benchmark clones the arrays, while a builder will build them on the fly and thus reduce the memory footprint. IMO that issue is separate from the creation of the struct itself (but related to the building of its children): it is how we efficiently build non-struct arrays without first allocating vectors, which is what the builders aimed to solve. I am outlining some of this in #8211, which allows building primitive Arrays from an iterator without exposing an unsafe API to users and would avoid the double allocation that you refer to.
let num_byte = bit_util::ceil(data_len, 8);
let mut null_buf = MutableBuffer::new(num_byte).with_bitset(num_byte, false);
let mut null_buf = make_null_buffer(data.len());
let mut val_buf = MutableBuffer::new(num_byte).with_bitset(num_byte, false);
Shouldn't val_buf also be initialized by calling make_null_buffer?
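For context on what both buffers in the snippet have in common: each is a zero-initialized bitmap holding one bit per element, so its size is the element count divided by 8, rounded up. The body of `make_null_buffer` is not shown in this thread, so the following is only a sketch of that sizing in plain Rust, with `Vec<u8>` standing in for `MutableBuffer` and hypothetical function names.

```rust
/// Number of bytes needed to hold `len` bits (ceiling division by 8).
fn bitmap_len_bytes(len: usize) -> usize {
    (len + 7) / 8
}

/// Hypothetical stand-in for a "make a zeroed bitmap" helper:
/// every bit starts unset, i.e. every slot starts out as null/false.
fn make_zeroed_bitmap(len: usize) -> Vec<u8> {
    vec![0u8; bitmap_len_bytes(len)]
}
```

Under that assumption, a single helper would serve any buffer that needs one zeroed bit per element.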
andygrove
left a comment
LGTM
The core problem that this PR addresses is the construction of a StructArray, whose spec can be found here.
The current API to build a StructArray of 4 entries of fixed type is (part of a test):
This PR's proposal for the same array:
Note that:
Field, only name: the attributes (type and nullability) are obtained from the ArrayData itself, and thus there is a guarantee that the field's attributes are aligned with the Data.
Array::data_type, instead of having to match Field's datatype to each field's builders
Option is used to specify whether the quantity is null or not
The construction uses an OR on the entry's null bitmaps to decide whether the struct null bitmap is null at a given index. I.e. the third index of the example in the spec is obtained by checking if all fields are null at that index.
There is an edge case that this constructor is unable to build (and the user needs to use the other From): a struct with a 0 at position X and all fields' bitmaps at position X set to 1:
This is because, in this TryFrom, the bitmap of the struct is computed from a bitwise or of the fields' entries.
IMO this is a non-issue because a null in the struct already implies an unspecified value on every field, and thus that field's value is already assumed to be undefined. However, this is important to mention as a round-trip with this case will fail: in the example above, bitmap struct will have a 1.
Finally, this has a performance improvement of 40%.
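The bitmap rule and its edge case can be sketched in a few lines of plain Rust. This is not the PR's implementation: `struct_validity_from_children` is a hypothetical name, validity is modelled as `Vec<bool>` (true = valid) instead of packed bits, and the `Result` return only mirrors the fallible nature of a `TryFrom`.

```rust
/// Derive a struct-level validity from its children: a slot is valid (true)
/// if at least one child is valid at that index, i.e. a bitwise OR across children.
/// Hypothetical sketch; errors if there are no children or their lengths differ.
fn struct_validity_from_children(children: &[Vec<bool>]) -> Result<Vec<bool>, String> {
    let len = match children.first() {
        Some(first) => first.len(),
        None => return Err("a struct needs at least one child".to_string()),
    };
    if children.iter().any(|c| c.len() != len) {
        return Err("all children must have the same length".to_string());
    }
    Ok((0..len)
        .map(|i| children.iter().any(|c| c[i]))
        .collect())
}
```

With children `[true, false, true]` and `[true, false, false]`, the derived struct validity is `[true, false, true]`. The edge case is a struct that is null at an index where every child is valid: the OR yields a valid slot there, so the original struct-level null cannot be reproduced from the children.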
Benchmark results
Final note:
The general direction that I am heading with this is to minimize the usage of builders. My issue with builders is that they are statically typed and perform incremental changes, but almost all our operations are dynamically typed and in bulk: batch read, batch write, etc. As such, it is often faster (and much simpler from a UX perspective) to create a Vec<Option<_>> and use it to create an Arrow Array.
FYI @nevi-me @andygrove @alamb
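As an illustration of the bulk conversion described in the final note, the following plain-Rust sketch turns a slice of `Option<i32>` into the two buffers a primitive array is made of: the values and a packed validity bitmap. It does not use the arrow crate; `to_buffers` is a hypothetical name, and storing 0 in null slots is an arbitrary choice for the sketch.

```rust
/// Convert optional values into (values, validity bitmap) in one pass.
/// Bit `i` of the bitmap (least-significant bit first within each byte)
/// is set when element `i` is non-null. Null slots store 0 as a placeholder.
fn to_buffers(data: &[Option<i32>]) -> (Vec<i32>, Vec<u8>) {
    let mut values = Vec::with_capacity(data.len());
    let mut validity = vec![0u8; (data.len() + 7) / 8];
    for (i, item) in data.iter().enumerate() {
        match item {
            Some(v) => {
                validity[i / 8] |= 1 << (i % 8);
                values.push(*v);
            }
            None => values.push(0),
        }
    }
    (values, validity)
}
```

Because the length is known up front, both buffers are allocated once, which is the bulk-operation advantage the note refers to; the intermediate `Vec<Option<_>>` itself is the extra allocation raised earlier in the conversation.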