Skip to content

Conversation

@jorgecarleitao
Copy link
Member

@jorgecarleitao jorgecarleitao commented Sep 6, 2020

The core problem that this PR addresses is the construction of a StructArray, whose spec can be found here.

The current API to build a StructArray of 4 entries of fixed type is (part of a test):

let string_builder = StringBuilder::new(4);
let int_builder = Int32Builder::new(4);

let mut fields = Vec::new();
let mut field_builders = Vec::new();
fields.push(Field::new("f1", DataType::Utf8, false));
field_builders.push(Box::new(string_builder) as Box<ArrayBuilder>);
fields.push(Field::new("f2", DataType::Int32, false));
field_builders.push(Box::new(int_builder) as Box<ArrayBuilder>);

let mut builder = StructBuilder::new(fields, field_builders);
assert_eq!(2, builder.num_fields());

let string_builder = builder
    .field_builder::<StringBuilder>(0)
    .expect("builder at field 0 should be string builder");
string_builder.append_value("joe").unwrap();
string_builder.append_null().unwrap();
string_builder.append_null().unwrap();
string_builder.append_value("mark").unwrap();

let int_builder = builder
    .field_builder::<Int32Builder>(1)
    .expect("builder at field 1 should be int builder");
int_builder.append_value(1).unwrap();
int_builder.append_value(2).unwrap();
int_builder.append_null().unwrap();
int_builder.append_value(4).unwrap();

builder.append(true).unwrap();
builder.append(true).unwrap();
builder.append_null().unwrap();
builder.append(true).unwrap();

let arr = builder.finish();

This PR's proposal for the same array:

let strings: ArrayRef = Arc::new(StringArray::from(vec![
    Some("joe"),
    None,
    None,
    Some("mark"),
]));
let ints: ArrayRef = Arc::new(Int32Array::from(vec![Some(1), Some(2), None, Some(4)]));

let arr = StructArray::try_from(vec![("f1", strings.clone()), ("f2", ints.clone())]).unwrap();

Note that:

  • There is no Field, only name: the attributes (type and nullability) are obtained from the ArrayData's itself, and thus there a guarantee that the field's attributes are aligned with the Data.
  • The implementation is dynamically typed: the type is obtained from Array::data_type, instead of having to match Field's datatype to each field' builders
  • Option is used to specify whether the quantity is null or not

The construction uses an OR on the entry's null bitmaps to decide whether the struct null bitmap is null at a given index. I.e. the third index of the example in the spec is obtained by checking if all fields are null at that index.

There is an edge case, that this constructor is unable to build (and the user needs to use the other From): a struct with a 0 at position X and all field's bitmap at position X to be 1:

# array of 1 entry:
bitmap struct = [0]
bitmap field1 = [1]
bitmap field2 = [1]

this is because, in this TryFrom, the bitmap of the struct is computed from a bitwise or of the field's entries.

IMO this is a non-issue because a null in the struct already implies an unspecified value on every field and thus that field's value is already assumed to be undefined. However, this is important to mention as a round-trip with this case will fail: in the example above, bitmap struct will have a 1.

Finally, this has a performance improvement of 40%.

Benchmark results
git checkout HEAD^ && cargo bench --bench array_from_vec -- struct_array_from_vec && git checkout no_builder1 && cargo bench --bench array_from_vec -- struct_array_from_vec
struct_array_from_vec 128                                                                             
                        time:   [7.7464 us 7.7586 us 7.7731 us]
                        change: [-39.227% -38.313% -37.128%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 11 outliers among 100 measurements (11.00%)
  5 (5.00%) high mild
  6 (6.00%) high severe

struct_array_from_vec 256                                                                             
                        time:   [9.3386 us 9.3611 us 9.3896 us]
                        change: [-45.035% -44.498% -43.914%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 12 outliers among 100 measurements (12.00%)
  1 (1.00%) low mild
  4 (4.00%) high mild
  7 (7.00%) high severe

struct_array_from_vec 512                                                                             
                        time:   [13.107 us 13.148 us 13.199 us]
                        change: [-49.213% -48.705% -48.208%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 16 outliers among 100 measurements (16.00%)
  2 (2.00%) low mild
  4 (4.00%) high mild
  10 (10.00%) high severe

struct_array_from_vec 1024                                                                             
                        time:   [20.036 us 20.061 us 20.087 us]
                        change: [-54.254% -53.479% -52.776%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  1 (1.00%) high mild
  6 (6.00%) high severe

Final note:

The general direction that I am heading with this is to minimize the usage of builders. My issue with builders is that they are statically typed and perform incremental changes, but almost all our operations are dynamically typed and in bulk: batch read, batch write, etc. As such, it is often faster (and much simpler from UX's perspective) to create a Vec<Option<_>> and use it to create an Arrow Array.

FYI @nevi-me @andygrove @alamb

@github-actions
Copy link

github-actions bot commented Sep 6, 2020

@andygrove
Copy link
Member

Maybe I am misunderstanding, but I think there may be a flaw with this approach and we're not comparing apples with apples when looking at the benchmarks.

The original code is dynamically building a struct using the builder. The new code starts with a vec! where everything is known at compile time. In theory, the builders should be more efficient than building a Vec and then converting it.

@jorgecarleitao
Copy link
Member Author

jorgecarleitao commented Sep 6, 2020

Maybe I am misunderstanding, but I think there may be a flaw with this approach and we're not comparing apples with apples when looking at the benchmarks.

The original code is dynamically building a struct using the builder. The new code starts with a vec! where everything is known at compile time. In theory, the builders should be more efficient than building a Vec and then converting it.

I though that criterion::black_box() would block the compiler from optimizing the code on it, so that the benchmark would not be tainted by compiler optimizations. I use these in both the Builder and From.

Regardless, the reason I used this approach was because I looked through the code on where we use Builders, and I found two main inputs:

  • a vector:
    • constructed from reading batches of rows (e.g. StringRecord in CSV, &[Value] in json)
    • constructed in memory from some external source (e.g. MemoryScan)
  • an Arrow Array, in most in-memory calculations (e.g. RecordBatch and ArrayRef, in compute and DataFusion)

In all cases, we use the builders to append rows row-by-row:

Based on this analysis, I though that:

  • this benchmark was a good representation of our use-cases
  • we can use [Try]From to build our results instead of a builder. The from is essentially builder.append_many().finish(), with a significantly simpler API

@andygrove
Copy link
Member

Sorry, I should have been clearer with my comments. I was referring to the code samples in the PR description. The first example is using the builders to push data into the contiguous buffers that will eventually become the arrays. The second example is building an intermediate data structure (the Vec) and then copying from the Vec into buffers. so there is 2x the memory usage and additional memory allocations and copies. Maybe I'm reading too much into these examples though.

Aside from that, I can't see how we can remove the StructBuilder though. The builder has semantics that I don't think we can cover by creating a struct from individual ArrayRefs representing fields. For example, how would we append a null struct versus appending a struct with null fields?

@jorgecarleitao
Copy link
Member Author

Aside from that, I can't see how we can remove the StructBuilder though. The builder has semantics that I don't think we can cover by creating a struct from individual ArrayRefs representing fields. For example, how would we append a null struct versus appending a struct with null fields?

I will roll back the builder and associated tests.

Just for my understanding, since I do not know the background: is it fair to say that the primary purpose of builders is to enable users to create Arrow Arrays without having to fiddle with Buffers, ArrayData? I.e. they are like helpers.

@andygrove
Copy link
Member

andygrove commented Sep 6, 2020 via email

@jorgecarleitao
Copy link
Member Author

but I would guess that they are not widely used for code that has high performance requirements.

That is interesting. So, in terms of performance, is it fair to rank then as:

  1. Buffers / ArrayData
  2. From
  3. Builder*

@jorgecarleitao jorgecarleitao changed the title ARROW-9922: [Rust] Add StructArray::TryFrom and deprecate StructBuilder (+40%) ARROW-9922: [Rust] Add StructArray::TryFrom (+40%) Sep 19, 2020
@jorgecarleitao
Copy link
Member Author

@nevi-me and @andygrove , I reverted the change wrt to the builder, so that this is an additive PR.

@andygrove, wrt to the dynamically building the array, note that a StructArray is almost only composed by child data: the struct itself is a null bitmap and some pointers. Therefore, the cost of building a Struct will always be driven by the allocation of those buffers.

With that said, you are right that during the creation of the fields, the benchmark clones the arrays, while a builder will build them on the fly and thus reduce memory footprint.

IMO that issue is separated from the creation of the struct itself (but related to the build of its childs): it is how we efficiently build non-struct arrays without first allocating vectors, that the builders aimed at solving. I am outlying some of this on #8211, which allows to build primitive Arrays from an iterator without exposing a unsafe API to users and would avoid the double allocation that you refer to.

let num_byte = bit_util::ceil(data_len, 8);
let mut null_buf = MutableBuffer::new(num_byte).with_bitset(num_byte, false);
let mut null_buf = make_null_buffer(data.len());
let mut val_buf = MutableBuffer::new(num_byte).with_bitset(num_byte, false);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't val_buf also be initialized by calling make_null_buffer?

Copy link
Member

@andygrove andygrove left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants