Skip to content

Conversation

@jorgecarleitao
Copy link
Member

@jorgecarleitao jorgecarleitao commented Dec 6, 2020

This PR:

  • extends the types that concat support for all types that MutableArrayData supports (i.e. it now supports nested Lists, all primitives, boolean, string and large string, etc.)
  • makes concat 6x faster for primitive types and 2x faster for string types (and likely also for the other types)
  • changes concat's signature to &[&Array] instead of &Vec<Arc<Array>>, to avoid an Arc::clone.

Since XBuilder::append_data was specifically built for this kernel but is not used, and MutableArrayData offers a more generic API for it, this PR removes that code.

The overall principle for this removal is that Builder is the API to build an arrow array from elements or slices of rust native types, while the MutableArrayData (for a lack of a better name) is suited to build an arrow array from an existing set of arrow arrays. In the case of concat, this corresponds to mem-copies of the individual arrays (taking into account nulls and all that stuff) in sequence.

Based on this principle, Builder does not need to know how to build an array from existing arrays (the append_data).

Benchmarks:

benchmark variation (%)
concat str 1024 -45.3
concat str nulls 1024 -61.1
concat i32 1024 -83.5
concat i32 nulls 1024 -86.1
git checkout 66468daf0b3ac3ef08b7c99c690e7b845f23ad2b
cargo bench --bench concatenate_kernel
git checkout concat
cargo bench --bench concatenate_kernel
Previous HEAD position was 66468daf0 Added concatenate bench
Switched to branch 'concat'
   Compiling arrow v3.0.0-SNAPSHOT (/Users/jorgecarleitao/projects/arrow/rust/arrow)
    Finished bench [optimized] target(s) in 58.72s
     Running /Users/jorgecarleitao/projects/arrow/rust/target/release/deps/concatenate_kernel-94b8f5621cd4f767
Gnuplot not found, using plotters backend
concat i32 1024         time:   [4.2852 us 4.2912 us 4.2973 us]                             
                        change: [-83.690% -83.469% -83.188%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 13 outliers among 100 measurements (13.00%)
  1 (1.00%) low severe
  4 (4.00%) low mild
  3 (3.00%) high mild
  5 (5.00%) high severe

concat i32 nulls 1024   time:   [4.8617 us 4.8820 us 4.9080 us]                                   
                        change: [-86.335% -86.101% -85.813%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
  2 (2.00%) low mild
  4 (4.00%) high mild
  4 (4.00%) high severe

concat str 1024         time:   [19.472 us 19.527 us 19.593 us]                             
                        change: [-46.212% -45.314% -44.341%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 11 outliers among 100 measurements (11.00%)
  4 (4.00%) low mild
  4 (4.00%) high mild
  3 (3.00%) high severe

concat str nulls 1024   time:   [39.447 us 39.525 us 39.613 us]                                   
                        change: [-61.858% -61.091% -60.311%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 13 outliers among 100 measurements (13.00%)
  3 (3.00%) low mild
  5 (5.00%) high mild
  5 (5.00%) high severe

@github-actions
Copy link

github-actions bot commented Dec 6, 2020

Copy link
Member

@houqp houqp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is really great simplification, awesome work @jorgecarleitao 👍

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there value in porting some of these edge-case tests for mutable array?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, there is. PR's description:

I would like to migrate all the tests for the XBuilder::append_data to the MutableArrayData, to not lose them, but for that #8850 #8852 #8851 and #8849 and #8848 needs to land first (thus being a draft).

;)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ha, my bad, my eyes couldn't resist to skip that part and went into the bechmark section :P

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have incorporated all tests. There were some small bugs in the tests wrt to nulls, and also a bug in the MutableDataArray. So, overall, everyone won there.

@github-actions github-actions bot added the needs-rebase A PR that needs to be rebased by the author label Dec 11, 2020
@github-actions github-actions bot removed the needs-rebase A PR that needs to be rebased by the author label Dec 11, 2020
@jorgecarleitao jorgecarleitao self-assigned this Dec 12, 2020
@jorgecarleitao jorgecarleitao marked this pull request as ready for review December 12, 2020 04:30
@jorgecarleitao
Copy link
Member Author

jorgecarleitao commented Dec 12, 2020

I have now migrated all tests to the MutableArrayData, which demonstrated a bug. The bug is fixed as part of this PR.

This is now ready to review.

@jorgecarleitao jorgecarleitao removed their assignment Dec 12, 2020
@codecov-io
Copy link

codecov-io commented Dec 12, 2020

Codecov Report

Merging #8853 (1f3c773) into master (2816f37) will decrease coverage by 23.19%.
The diff coverage is 99.19%.

Impacted file tree graph

@@             Coverage Diff             @@
##           master    #8853       +/-   ##
===========================================
- Coverage   76.99%   53.79%   -23.20%     
===========================================
  Files         174      170        -4     
  Lines       40392    30069    -10323     
===========================================
- Hits        31099    16176    -14923     
- Misses       9293    13893     +4600     
Impacted Files Coverage Δ
rust/arrow/src/array/builder.rs 83.10% <ø> (+1.36%) ⬆️
...ust/datafusion/src/physical_plan/hash_aggregate.rs 0.00% <ø> (ø)
rust/datafusion/src/physical_plan/sort.rs 0.00% <ø> (ø)
rust/arrow/src/array/array_binary.rs 90.51% <91.30%> (+0.04%) ⬆️
rust/arrow/src/array/transform/fixed_binary.rs 78.94% <100.00%> (+36.84%) ⬆️
rust/arrow/src/array/transform/mod.rs 83.90% <100.00%> (+7.05%) ⬆️
rust/arrow/src/compute/kernels/concat.rs 100.00% <100.00%> (+18.75%) ⬆️
rust/arrow/src/ffi.rs 70.61% <100.00%> (ø)
rust/parquet/src/column/page.rs 0.00% <0.00%> (-98.69%) ⬇️
rust/parquet/src/record/api.rs 0.00% <0.00%> (-98.15%) ⬇️
... and 60 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2816f37...1f3c773. Read the comment docs.

Copy link
Contributor

@nevi-me nevi-me left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a great simplification, I like it

@alamb alamb changed the title ARROW-10827: [Rust] Extended concat and made it faster (2-6x) ARROW-10827: [Rust] Move concat from builders to a compute kernel and made it faster (2-6x) Dec 13, 2020
@alamb alamb changed the title ARROW-10827: [Rust] Move concat from builders to a compute kernel and made it faster (2-6x) ARROW-10827: [Rust] Move concat from builders to a compute kernel and make it faster (2-6x) Dec 13, 2020
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like it -- nice (epic!) work @jorgecarleitao

fn from(data: Vec<Option<Vec<u8>>>) -> Self {
let len = data.len();
assert!(len > 0);
// try to estimate the size. This may not be possible no entry is valid => panic
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This behavior (panic'ing') doesn't seem ideal, though I realize there isn't much useful to do when converting a Vec of entirely None -- maybe we could just return a zero length array.

Could definitely be done as a follow on PR

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that in general we should avoid using these because they require two allocations (rust's Vec and arrow buffers). This function is mostly useful for testing.

I would be ok with replacing them by the FromIter constructor, which is more performance, more general, and has the same ergonomics (from(vec![].into_iter()) instead of from(vec![...]) for a vector). This way we do not need to worry about these.

The challenge with fixed sized items is that they require knowledge of the size. This would be nicely solved by accepting Option<[T; T: usize]>, but Rust's support for constant generics is slim atm.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

assert!(len > 0);
// try to estimate the size. This may not be possible no entry is valid => panic
let size = data.iter().filter_map(|e| e.as_ref()).next().unwrap().len();
assert!(data
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given this operation can fail (if all the elements are not the same length) perhaps we should implement TryFrom instead of From and panic -- again, this would be an excellent follow on PR (or maybe file a ticket and it could be an excellent first contribution for someone who wanted to contribute)

let mut mutable = MutableArrayData::new(arrays, false, capacity);

for (i, len) in lengths.iter().enumerate() {
mutable.extend(i, 0, *len)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

well, that is certainly nicer

fn from(data: Vec<Option<Vec<u8>>>) -> Self {
let len = data.len();
assert!(len > 0);
// try to estimate the size. This may not be possible no entry is valid => panic
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alamb
Copy link
Contributor

alamb commented Dec 14, 2020

Time to 🚢 🇮🇹 !

@alamb alamb closed this in 7f3794c Dec 14, 2020
@jorgecarleitao jorgecarleitao deleted the concat branch December 17, 2020 05:55
alamb pushed a commit that referenced this pull request Mar 18, 2021
…r FixedSizeBinaryArray

This Pr is follow up for #8853 (comment) .
I was not able to utilize `TryFrom` because of conflicting implementations, so instead I created two new functions `try_from_sparse_iter` and `try_from_iter` in place of `impl From<Vec<Vec<u8>>> for FixedSizeBinaryArray` and `impl From<Vec<Option<Vec<u8>>>> for FixedSizeBinaryArray`

Closes #9647 from ivanvankov/ARROW-10903

Authored-by: ivan <ivan@comp5328>
Signed-off-by: Andrew Lamb <[email protected]>
alamb pushed a commit to apache/arrow-rs that referenced this pull request Apr 20, 2021
…r FixedSizeBinaryArray

This Pr is follow up for apache/arrow#8853 (comment) .
I was not able to utilize `TryFrom` because of conflicting implementations, so instead I created two new functions `try_from_sparse_iter` and `try_from_iter` in place of `impl From<Vec<Vec<u8>>> for FixedSizeBinaryArray` and `impl From<Vec<Option<Vec<u8>>>> for FixedSizeBinaryArray`

Closes #9647 from ivanvankov/ARROW-10903

Authored-by: ivan <ivan@comp5328>
Signed-off-by: Andrew Lamb <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants