ARROW-10827: [Rust] Move concat from builders to a compute kernel and make it faster (2-6x) #8853
houqp
left a comment
This is a really great simplification, awesome work @jorgecarleitao 👍
rust/arrow/src/array/builder.rs
Outdated
is there value in porting some of these edge-case tests for mutable array?
Ha, my bad, my eyes couldn't resist skipping that part and went straight to the benchmark section :P
I have incorporated all tests. There were some small bugs in the tests with respect to nulls, and also a bug in the MutableArrayData. So, overall, everyone won there.
I have now migrated all tests to the This is now ready to review.
Codecov Report
@@ Coverage Diff @@
## master #8853 +/- ##
===========================================
- Coverage 76.99% 53.79% -23.20%
===========================================
Files 174 170 -4
Lines 40392 30069 -10323
===========================================
- Hits 31099 16176 -14923
- Misses 9293 13893 +4600
nevi-me
left a comment
This is a great simplification, I like it
alamb
left a comment
I like it -- nice (epic!) work @jorgecarleitao
fn from(data: Vec<Option<Vec<u8>>>) -> Self {
    let len = data.len();
    assert!(len > 0);
    // try to estimate the size. This may not be possible no entry is valid => panic
This behavior (panicking) doesn't seem ideal, though I realize there isn't much useful to do when converting a Vec of entirely None. Maybe we could just return a zero-length array.
This could definitely be done as a follow-on PR.
Note that in general we should avoid using these because they require two allocations (rust's Vec and arrow buffers). This function is mostly useful for testing.
I would be ok with replacing them with the FromIter constructor, which is more performant, more general, and has the same ergonomics (from(vec![...].into_iter()) instead of from(vec![...]) for a vector). This way we do not need to worry about these.
The challenge with fixed-size items is that they require knowledge of the size. This would be nicely solved by accepting Option<[T; N]> with a constant N: usize, but Rust's support for const generics is limited at the moment.
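To illustrate the fixed-size point, here is a minimal sketch of what accepting arrays of a compile-time size could look like, assuming a toolchain where const generics are available. `flatten_fixed` is a hypothetical helper written for this example, not part of the arrow crate; it only shows that the item size comes from the type, so an input with no valid entry no longer needs a panic.

```rust
/// Hypothetical helper (not an arrow API): flattens optional fixed-size
/// items into a contiguous value buffer plus a validity mask.
/// The item size `N` is part of the type, so it never has to be guessed
/// from the data.
fn flatten_fixed<const N: usize>(data: &[Option<[u8; N]>]) -> (Vec<u8>, Vec<bool>) {
    let mut values = Vec::with_capacity(data.len() * N);
    let mut validity = Vec::with_capacity(data.len());
    for item in data {
        match item {
            Some(bytes) => {
                values.extend_from_slice(bytes);
                validity.push(true);
            }
            None => {
                // Null slots still occupy N bytes so every slot has the same width.
                values.extend_from_slice(&[0u8; N]);
                validity.push(false);
            }
        }
    }
    (values, validity)
}
```

Calling `flatten_fixed(&[Some([1u8, 2]), None, Some([3, 4])])` would produce six value bytes and a three-entry validity mask, and `flatten_fixed::<2>(&[None, None])` is well defined because the size is known statically.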
    assert!(len > 0);
    // try to estimate the size. This may not be possible no entry is valid => panic
    let size = data.iter().filter_map(|e| e.as_ref()).next().unwrap().len();
    assert!(data
Given this operation can fail (if the elements are not all the same length), perhaps we should implement TryFrom instead of implementing From and panicking. Again, this would be an excellent follow-on PR (or maybe file a ticket; it could be an excellent first contribution for someone who wants to contribute).
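As a rough sketch of the TryFrom idea, the conversion below returns an error instead of panicking. `FixedBytes` and `FixedBytesError` are hypothetical stand-ins invented for this example, not arrow types, and reporting an error for an all-None input is just one possible choice (returning a zero-length array is another).

```rust
use std::convert::TryFrom;

/// Hypothetical stand-in for a fixed-size binary array (not an arrow type).
#[derive(Debug, PartialEq)]
struct FixedBytes {
    size: usize,
    values: Vec<u8>,
    validity: Vec<bool>,
}

/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq)]
enum FixedBytesError {
    NoValidEntry,
    LengthMismatch { expected: usize, found: usize },
}

impl TryFrom<Vec<Option<Vec<u8>>>> for FixedBytes {
    type Error = FixedBytesError;

    fn try_from(data: Vec<Option<Vec<u8>>>) -> Result<Self, Self::Error> {
        // Infer the item size from the first valid entry; report an error
        // (rather than panicking) when every entry is None.
        let size = data
            .iter()
            .flatten()
            .next()
            .map(|v| v.len())
            .ok_or(FixedBytesError::NoValidEntry)?;
        let mut values = Vec::with_capacity(data.len() * size);
        let mut validity = Vec::with_capacity(data.len());
        for item in &data {
            match item {
                Some(v) if v.len() == size => {
                    values.extend_from_slice(v);
                    validity.push(true);
                }
                Some(v) => {
                    // A valid entry with a different length is a caller error.
                    return Err(FixedBytesError::LengthMismatch {
                        expected: size,
                        found: v.len(),
                    });
                }
                None => {
                    // Null slots are zero-filled so every slot has `size` bytes.
                    values.resize(values.len() + size, 0);
                    validity.push(false);
                }
            }
        }
        Ok(FixedBytes { size, values, validity })
    }
}
```

With this shape, callers handle mismatched lengths through the returned `Result` rather than through an `assert!`.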
let mut mutable = MutableArrayData::new(arrays, false, capacity);

for (i, len) in lengths.iter().enumerate() {
    mutable.extend(i, 0, *len)
well, that is certainly nicer
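For readers without the diff at hand, the loop above can be mimicked with a toy version that works on plain slices. This is only a sketch: `MutableInts` is a hypothetical, heavily simplified analogue of `MutableArrayData` for `i32` values, and the real type handles many data types, buffers and null bitmaps that are ignored here.

```rust
/// Hypothetical, simplified analogue of `MutableArrayData` (not the arrow type):
/// it grows one output "array" by copying ranges out of existing source arrays.
struct MutableInts<'a> {
    sources: Vec<&'a [Option<i32>]>,
    values: Vec<i32>,
    validity: Vec<bool>,
}

impl<'a> MutableInts<'a> {
    fn new(sources: Vec<&'a [Option<i32>]>, capacity: usize) -> Self {
        Self {
            sources,
            values: Vec::with_capacity(capacity),
            validity: Vec::with_capacity(capacity),
        }
    }

    /// Copies `len` slots, starting at `start`, from source number `index`.
    fn extend(&mut self, index: usize, start: usize, len: usize) {
        let source = self.sources[index];
        for slot in &source[start..start + len] {
            // Nulls are stored as a zero value plus a `false` validity flag.
            self.values.push(slot.unwrap_or(0));
            self.validity.push(slot.is_some());
        }
    }
}

/// Concatenation is then just "extend with every source, in order".
/// Taking a slice of references means callers do not need to clone anything.
fn concat(arrays: &[&[Option<i32>]]) -> (Vec<i32>, Vec<bool>) {
    let lengths: Vec<usize> = arrays.iter().map(|a| a.len()).collect();
    let capacity: usize = lengths.iter().sum();
    let mut mutable = MutableInts::new(arrays.to_vec(), capacity);
    for (i, len) in lengths.iter().enumerate() {
        mutable.extend(i, 0, *len);
    }
    (mutable.values, mutable.validity)
}
```

The total capacity is computed up front so the output buffers are allocated once, which is the same idea as passing `capacity` to the constructor in the excerpt.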
Time to 🚢 🇮🇹!
…r FixedSizeBinaryArray
This PR is a follow-up to #8853 (comment). I was not able to use `TryFrom` because of conflicting implementations, so instead I created two new functions, `try_from_sparse_iter` and `try_from_iter`, in place of `impl From<Vec<Vec<u8>>> for FixedSizeBinaryArray` and `impl From<Vec<Option<Vec<u8>>>> for FixedSizeBinaryArray`.
Closes #9647 from ivanvankov/ARROW-10903
Authored-by: ivan <ivan@comp5328>
Signed-off-by: Andrew Lamb <[email protected]>
This PR:
- adds `concat` support for all types that `MutableArrayData` supports (i.e. it now supports nested Lists, all primitives, boolean, string and large string, etc.)
- makes `concat` 6x faster for primitive types and 2x faster for string types (and likely also for the other types)
- changes `concat`'s signature to `&[&Array]` instead of `&Vec<Arc<Array>>`, to avoid an `Arc::clone`.

Since `XBuilder::append_data` was specifically built for this kernel but is not used, and `MutableArrayData` offers a more generic API for it, this PR removes that code.

The overall principle for this removal is that `Builder` is the API to build an arrow array from elements or slices of Rust native types, while `MutableArrayData` (for lack of a better name) is suited to building an arrow array from an existing set of arrow arrays. In the case of `concat`, this corresponds to mem-copies of the individual arrays (taking nulls into account) in sequence.

Based on this principle, `Builder` does not need to know how to build an array from existing arrays (the `append_data`).

Benchmarks: