ARROW-9922: [Rust] Add StructArray::TryFrom (+40%) #8118

jorgecarleitao · 2020-09-06T08:47:30Z

The core problem that this PR addresses is the construction of a StructArray, whose spec can be found here.

The current API to build a StructArray of 4 entries of fixed type is (part of a test):

let string_builder = StringBuilder::new(4);
let int_builder = Int32Builder::new(4);

let mut fields = Vec::new();
let mut field_builders = Vec::new();
fields.push(Field::new("f1", DataType::Utf8, false));
field_builders.push(Box::new(string_builder) as Box<ArrayBuilder>);
fields.push(Field::new("f2", DataType::Int32, false));
field_builders.push(Box::new(int_builder) as Box<ArrayBuilder>);

let mut builder = StructBuilder::new(fields, field_builders);
assert_eq!(2, builder.num_fields());

let string_builder = builder
    .field_builder::<StringBuilder>(0)
    .expect("builder at field 0 should be string builder");
string_builder.append_value("joe").unwrap();
string_builder.append_null().unwrap();
string_builder.append_null().unwrap();
string_builder.append_value("mark").unwrap();

let int_builder = builder
    .field_builder::<Int32Builder>(1)
    .expect("builder at field 1 should be int builder");
int_builder.append_value(1).unwrap();
int_builder.append_value(2).unwrap();
int_builder.append_null().unwrap();
int_builder.append_value(4).unwrap();

builder.append(true).unwrap();
builder.append(true).unwrap();
builder.append_null().unwrap();
builder.append(true).unwrap();

let arr = builder.finish();

This PR's proposal for the same array:

let strings: ArrayRef = Arc::new(StringArray::from(vec![
    Some("joe"),
    None,
    None,
    Some("mark"),
]));
let ints: ArrayRef = Arc::new(Int32Array::from(vec![Some(1), Some(2), None, Some(4)]));

let arr = StructArray::try_from(vec![("f1", strings.clone()), ("f2", ints.clone())]).unwrap();

Note that:

There is no Field, only name: the attributes (type and nullability) are obtained from the ArrayData's itself, and thus there a guarantee that the field's attributes are aligned with the Data.
The implementation is dynamically typed: the type is obtained from Array::data_type, instead of having to match Field's datatype to each field' builders
Option is used to specify whether the quantity is null or not

The construction uses an OR on the entry's null bitmaps to decide whether the struct null bitmap is null at a given index. I.e. the third index of the example in the spec is obtained by checking if all fields are null at that index.

There is an edge case, that this constructor is unable to build (and the user needs to use the other From): a struct with a 0 at position X and all field's bitmap at position X to be 1:

# array of 1 entry:
bitmap struct = [0]
bitmap field1 = [1]
bitmap field2 = [1]

this is because, in this TryFrom, the bitmap of the struct is computed from a bitwise or of the field's entries.

IMO this is a non-issue because a null in the struct already implies an unspecified value on every field and thus that field's value is already assumed to be undefined. However, this is important to mention as a round-trip with this case will fail: in the example above, bitmap struct will have a 1.

Finally, this has a performance improvement of 40%.

Benchmark results

git checkout HEAD^ && cargo bench --bench array_from_vec -- struct_array_from_vec && git checkout no_builder1 && cargo bench --bench array_from_vec -- struct_array_from_vec

struct_array_from_vec 128                                                                             
                        time:   [7.7464 us 7.7586 us 7.7731 us]
                        change: [-39.227% -38.313% -37.128%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 11 outliers among 100 measurements (11.00%)
  5 (5.00%) high mild
  6 (6.00%) high severe

struct_array_from_vec 256                                                                             
                        time:   [9.3386 us 9.3611 us 9.3896 us]
                        change: [-45.035% -44.498% -43.914%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 12 outliers among 100 measurements (12.00%)
  1 (1.00%) low mild
  4 (4.00%) high mild
  7 (7.00%) high severe

struct_array_from_vec 512                                                                             
                        time:   [13.107 us 13.148 us 13.199 us]
                        change: [-49.213% -48.705% -48.208%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 16 outliers among 100 measurements (16.00%)
  2 (2.00%) low mild
  4 (4.00%) high mild
  10 (10.00%) high severe

struct_array_from_vec 1024                                                                             
                        time:   [20.036 us 20.061 us 20.087 us]
                        change: [-54.254% -53.479% -52.776%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  1 (1.00%) high mild
  6 (6.00%) high severe

Final note:

The general direction that I am heading with this is to minimize the usage of builders. My issue with builders is that they are statically typed and perform incremental changes, but almost all our operations are dynamically typed and in bulk: batch read, batch write, etc. As such, it is often faster (and much simpler from UX's perspective) to create a Vec<Option<_>> and use it to create an Arrow Array.

FYI @nevi-me @andygrove @alamb

github-actions · 2020-09-06T09:05:30Z

https://issues.apache.org/jira/browse/ARROW-9922

rust/arrow/src/array/builder.rs

andygrove · 2020-09-06T17:08:59Z

Maybe I am misunderstanding, but I think there may be a flaw with this approach and we're not comparing apples with apples when looking at the benchmarks.

The original code is dynamically building a struct using the builder. The new code starts with a vec! where everything is known at compile time. In theory, the builders should be more efficient than building a Vec and then converting it.

jorgecarleitao · 2020-09-06T20:23:18Z

Maybe I am misunderstanding, but I think there may be a flaw with this approach and we're not comparing apples with apples when looking at the benchmarks.

The original code is dynamically building a struct using the builder. The new code starts with a vec! where everything is known at compile time. In theory, the builders should be more efficient than building a Vec and then converting it.

I though that criterion::black_box() would block the compiler from optimizing the code on it, so that the benchmark would not be tainted by compiler optimizations. I use these in both the Builder and From.

Regardless, the reason I used this approach was because I looked through the code on where we use Builders, and I found two main inputs:

a vector:
- constructed from reading batches of rows (e.g. StringRecord in CSV, &[Value] in json)
- constructed in memory from some external source (e.g. MemoryScan)
an Arrow Array, in most in-memory calculations (e.g. RecordBatch and ArrayRef, in compute and DataFusion)

In all cases, we use the builders to append rows row-by-row:

see here for CSV
see here for JSON
in parquet we do not use Array builders
see here for an example in compute

Based on this analysis, I though that:

this benchmark was a good representation of our use-cases
we can use [Try]From to build our results instead of a builder. The from is essentially builder.append_many().finish(), with a significantly simpler API

andygrove · 2020-09-06T20:40:02Z

Sorry, I should have been clearer with my comments. I was referring to the code samples in the PR description. The first example is using the builders to push data into the contiguous buffers that will eventually become the arrays. The second example is building an intermediate data structure (the Vec) and then copying from the Vec into buffers. so there is 2x the memory usage and additional memory allocations and copies. Maybe I'm reading too much into these examples though.

Aside from that, I can't see how we can remove the StructBuilder though. The builder has semantics that I don't think we can cover by creating a struct from individual ArrayRefs representing fields. For example, how would we append a null struct versus appending a struct with null fields?

jorgecarleitao · 2020-09-06T21:34:13Z

Aside from that, I can't see how we can remove the StructBuilder though. The builder has semantics that I don't think we can cover by creating a struct from individual ArrayRefs representing fields. For example, how would we append a null struct versus appending a struct with null fields?

I will roll back the builder and associated tests.

Just for my understanding, since I do not know the background: is it fair to say that the primary purpose of builders is to enable users to create Arrow Arrays without having to fiddle with Buffers, ArrayData? I.e. they are like helpers.

andygrove · 2020-09-06T21:40:54Z

Yes, that's correct. These are the main high-level APIs for users to create Arrow structures. The convenience methods that convert from Vec are great for examples and tests but I would guess that they are not widely used for code that has high performance requirements.

…

On Sun, Sep 6, 2020 at 3:34 PM Jorge Leitao ***@***.***> wrote: Aside from that, I can't see how we can remove the StructBuilder though. The builder has semantics that I don't think we can cover by creating a struct from individual ArrayRefs representing fields. For example, how would we append a null struct versus appending a struct with null fields? I will roll back the builder and associated tests. Just for my understanding, since I do not know the background: is it fair to say that the primary purpose of builders is to enable users to create Arrow Arrays without having to fiddle with Buffers, ArrayData? I.e. they are like helpers. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#8118 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AAHEBRD56FPE45RQ4ILMOQ3SEP56DANCNFSM4Q4DPXMQ> .

jorgecarleitao · 2020-09-06T21:56:54Z

but I would guess that they are not widely used for code that has high performance requirements.

That is interesting. So, in terms of performance, is it fair to rank then as:

Buffers / ArrayData
From
Builder*

jorgecarleitao · 2020-09-19T17:31:33Z

@nevi-me and @andygrove , I reverted the change wrt to the builder, so that this is an additive PR.

@andygrove, wrt to the dynamically building the array, note that a StructArray is almost only composed by child data: the struct itself is a null bitmap and some pointers. Therefore, the cost of building a Struct will always be driven by the allocation of those buffers.

With that said, you are right that during the creation of the fields, the benchmark clones the arrays, while a builder will build them on the fly and thus reduce memory footprint.

IMO that issue is separated from the creation of the struct itself (but related to the build of its childs): it is how we efficiently build non-struct arrays without first allocating vectors, that the builders aimed at solving. I am outlying some of this on #8211, which allows to build primitive Arrays from an iterator without exposing a unsafe API to users and would avoid the double allocation that you refer to.

andygrove · 2020-09-21T13:57:12Z

rust/arrow/src/array/array.rs

        let num_byte = bit_util::ceil(data_len, 8);
-        let mut null_buf = MutableBuffer::new(num_byte).with_bitset(num_byte, false);
+        let mut null_buf = make_null_buffer(data.len());
        let mut val_buf = MutableBuffer::new(num_byte).with_bitset(num_byte, false);


Shouldn't val_buf also be initialized by calling make_null_buffer?

andygrove

LGTM

nevi-me reviewed Sep 6, 2020

View reviewed changes

rust/arrow/src/array/builder.rs Outdated Show resolved Hide resolved

andygrove added the Component: Rust label Sep 6, 2020

jorgecarleitao changed the title ~~ARROW-9922: [Rust] Add StructArray::TryFrom and deprecate StructBuilder (+40%)~~ ARROW-9922: [Rust] Add StructArray::TryFrom (+40%) Sep 19, 2020

jorgecarleitao added 4 commits September 19, 2020 18:55

Added benchmark.

041b9b9

Added StructArray::TryFrom.

8d31dd5

updated bench.

231121e

Clippy.

379ed4d

andygrove reviewed Sep 21, 2020

View reviewed changes

andygrove approved these changes Sep 21, 2020

View reviewed changes

andygrove closed this in bc987cd Sep 21, 2020

jorgecarleitao deleted the no_builder1 branch October 28, 2020 04:18

asfimport mentioned this pull request Sep 22, 2020

[Rust] Add try_from(Vec<Option<(&str, ArrayRef)>>) to StructArray #25953

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

ARROW-9922: [Rust] Add StructArray::TryFrom (+40%) #8118

ARROW-9922: [Rust] Add StructArray::TryFrom (+40%) #8118

Uh oh!

jorgecarleitao commented Sep 6, 2020 •

edited

Loading

Uh oh!

github-actions bot commented Sep 6, 2020

Uh oh!

Uh oh!

andygrove commented Sep 6, 2020

Uh oh!

jorgecarleitao commented Sep 6, 2020 •

edited

Loading

Uh oh!

andygrove commented Sep 6, 2020

Uh oh!

jorgecarleitao commented Sep 6, 2020

Uh oh!

andygrove commented Sep 6, 2020 via email

Uh oh!

jorgecarleitao commented Sep 6, 2020

Uh oh!

jorgecarleitao commented Sep 19, 2020

Uh oh!

andygrove Sep 21, 2020

Uh oh!

andygrove left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

ARROW-9922: [Rust] Add StructArray::TryFrom (+40%) #8118

ARROW-9922: [Rust] Add StructArray::TryFrom (+40%) #8118

Uh oh!

Conversation

jorgecarleitao commented Sep 6, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Final note:

Uh oh!

github-actions bot commented Sep 6, 2020

Uh oh!

Uh oh!

andygrove commented Sep 6, 2020

Uh oh!

jorgecarleitao commented Sep 6, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

andygrove commented Sep 6, 2020

Uh oh!

jorgecarleitao commented Sep 6, 2020

Uh oh!

andygrove commented Sep 6, 2020 via email

Uh oh!

jorgecarleitao commented Sep 6, 2020

Uh oh!

jorgecarleitao commented Sep 19, 2020

Uh oh!

andygrove Sep 21, 2020

Choose a reason for hiding this comment

Uh oh!

andygrove left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

jorgecarleitao commented Sep 6, 2020 •

edited

Loading

jorgecarleitao commented Sep 6, 2020 •

edited

Loading