-
Notifications
You must be signed in to change notification settings - Fork 4k
ARROW-4605: [Rust] Move filter and limit code from DataFusion into compute module #3741
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
rust/arrow/src/compute/array_ops.rs
Outdated
| } | ||
|
|
||
| pub fn limit(array: &Array, num_rows_to_read: usize) -> Result<ArrayRef> { | ||
| if array.len() < num_rows_to_read { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm curious as to why we return an error here. Why not set the limit to the lower of the 2 instead?
vec![1,2,3,4,5].iter().take(7) doesn't throw an error, and SELECT * FROM table_with_100_records LIMIT 200 will also only return the 100 records. What do you think about the suggestion?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My initial thought was that if someone calls this function with a num_rows_to_read larger than array.len(), they will probably want to know rather than silently choosing the minimum.
However given take and LIMIT behavior this is indeed probably not what is expected from this function. I will change it, good call.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think if you limit the num_rows_to_read to the array length.
Have a look at https://issues.apache.org/jira/browse/ARROW-3954, which will make filter more efficient in future.
…String > &str > &[u8]
create BinaryArray directly from byte slice
| /// | ||
| /// Returns the whole array if the number of elements specified is larger than the length of the array | ||
| pub fn limit(array: &Array, num_elements: usize) -> Result<ArrayRef> { | ||
| let num_elements_safe: usize = cmp::min(array.len(), num_elements); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One last nit, we could return the array as ArrayRef immediately if the limit >= len. I sold have thought of it earlier, my apologies.
I'm happy with everything else
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe you can help me here: how can I wrap the reference to array in an Arc<Array> / ArrayRef? Since the reference to array can be freed at any time it would leave the Arc invalid, so I have to specify the lifetime somehow. Can I do that with Arc?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, you're right, I missed that part. We can improve limit when we have zero-copy array slicing 👍🏾
andygrove
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM but I made one suggestion
rust/arrow/src/array.rs
Outdated
|
|
||
| impl<'a> From<Vec<&[u8]>> for BinaryArray { | ||
| fn from(v: Vec<&[u8]>) -> Self { | ||
| let mut offsets = vec![]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be more efficient to initialize these vectors with Vec::with_capacity(v.len())
rust/arrow/src/array.rs
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I guess this should be Vec::with_capacity(v.len() + 1). Also you have an extra ;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
andygrove
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, one more comment ...
rust/arrow/src/array.rs
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After reviewing this again, I realize that this change is good for offsets but maybe not for values. We can only allocate values to the correct size if we first go and sum the lengths of all of the input strings, which would be expensive. Maybe we should just go back to vec![] or Vec::new() for this one?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, yes. Also since v can contain empty strings, we do not even have a lower bound for an initial capacity, so I'll revert to Vec::new() for values.
andygrove
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM - thanks @ntrinquier !
No description provided.