ARROW-5351: [Rust] Take kernel #4330
|
@sunchao I've used ArrayData comparison in the tests, and I had expected this test to fail because of it. I use Windows, and it looks like ArrayData comparison sometimes fails on Windows when comparing 2 similar arrays by their data, because null/invalid array slots return non-deterministic data. I'll attach an example tomorrow when I continue with this PR. I'm mentioning this because you suggested that we avoid using string/debug for comparison of arrays. |
|
Yes, more efforts are required to improve equality check for |
|
@andygrove @sunchao @paddyhoran I have a few decisions to make, and would appreciate some guidance/opinions. I'm mostly done with this, but need to optimise bounds checking and implement options (such as whether to even check for bounds). Question: pandas, numpy, and the C++ version (cc @bkietz) have the option of supplying negative indices. Is this something that we'd want in Rust? If so, I'll change the function's |
|
What does a negative index mean? Can you show some code examples where this is applied? |
|
It's worth noting that out of bounds and negative indices are currently just an error in C++ |
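For readers unfamiliar with the convention raised above: in numpy and pandas, a negative index counts from the end of the array, so -1 is the last element. A minimal sketch of that behaviour on plain slices follows. `resolve_index` and `take_wrapping` are hypothetical names used for illustration only; they are not part of the Arrow API, and this is not how the PR implements `take`.

```rust
/// Hypothetical helper (not part of Arrow): resolves a possibly negative index
/// against an array of length `len`, numpy-style, where -1 is the last element.
/// Returns `None` if the index is out of bounds in either direction.
fn resolve_index(index: i64, len: usize) -> Option<usize> {
    let len = len as i64;
    // A negative index is shifted by the length: -1 becomes len - 1.
    let resolved = if index < 0 { index + len } else { index };
    if resolved >= 0 && resolved < len {
        Some(resolved as usize)
    } else {
        None
    }
}

/// Hypothetical sketch: takes the values at `indices`, allowing negative
/// indices, and returns an error for any index that is out of bounds.
fn take_wrapping<T: Clone>(values: &[T], indices: &[i64]) -> Result<Vec<T>, String> {
    indices
        .iter()
        .map(|&i| {
            resolve_index(i, values.len())
                .map(|pos| values[pos].clone())
                .ok_or_else(|| {
                    format!("index {} out of bounds for length {}", i, values.len())
                })
        })
        .collect()
}
```

With values `[10, 20, 30, 40]`, the indices `[0, -1, -4]` would select the first, the last, and the first element again, while `4` or `-5` would be rejected.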
|
Thanks @bkietz, I noticed that there are open JIRAs for them. I wanted to get them out of the way earlier so we can stabilise the |
c37d72f to
b124dd4
|
I've done all I can here. I noticed that if I have an array with 6 values, |
sunchao
left a comment
Thanks @nevi-me for the update and sorry for the delay on reviewing!
Why use import here instead of putting it on the top-level?
Just a habit, I've become accustomed to importing enums inside the scope that I need them, to avoid repeating TimeUnit::A, TimeUnit::B, etc. Would you prefer I move it to the top-level?
I'm not sure whether there's an "official style guide" on this, but it may be better to put it at the top level to 1) make it easier to find out what the dependencies of a module are, and 2) potentially avoid repeating the use clause in all the places that need it.
nit: instead of Option<&TakeOptions>, can we use Option<TakeOptions>? It might be more convenient to use.
Should we call this indices?
Same as above, s/index/indices?
Similar to the C++ version, it may be better if we can optimize this by considering whether values/indices are all valid or not. This doesn't necessarily have to be done in this PR though.
Do you mean checking if we have 0 nulls and bypassing the null checks? I'll have a look at what C++ does, and can update this in a subsequent PR
Yes exactly. Sure it can be done in a follow-up.
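The optimisation discussed here (skipping per-element null checks when the values contain no nulls) can be sketched roughly as follows. `SimpleArray` and `take_simple` are hypothetical stand-ins invented for this illustration; a real Arrow array stores a validity bitmap and a cached null count rather than a `Vec<bool>`, and this sketch does no bounds checking on the indices.

```rust
/// Simplified, hypothetical stand-in for an Arrow primitive array:
/// `validity[i] == false` marks slot `i` as null.
struct SimpleArray {
    values: Vec<i32>,
    validity: Vec<bool>,
}

impl SimpleArray {
    /// Counts null slots. Computed on demand here to keep the sketch short;
    /// the assumption is that a real array would have this value cached.
    fn null_count(&self) -> usize {
        self.validity.iter().filter(|valid| !**valid).count()
    }
}

/// Hypothetical sketch of a take with a no-nulls fast path.
/// Panics if an index is out of bounds.
fn take_simple(array: &SimpleArray, indices: &[usize]) -> Vec<Option<i32>> {
    if array.null_count() == 0 {
        // Fast path: every slot is valid, so validity is never consulted.
        indices.iter().map(|&i| Some(array.values[i])).collect()
    } else {
        // Slow path: check validity for every taken slot.
        indices
            .iter()
            .map(|&i| {
                if array.validity[i] {
                    Some(array.values[i])
                } else {
                    None
                }
            })
            .collect()
    }
}
```

The same idea would apply to the indices array: if it has no nulls, the per-index null check can be skipped as well.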
Should we call this values? array is self explanatory given that the type is ArrayRef.
rust/arrow/src/compute/util.rs
Outdated
- `pub(crate)` could be changed to `pub(super)` if this is only used within the `compute` module? Or is it better to put this in `take.rs`?
- The name is a little bit misleading - it gives the impression that this is taking indices from a list - maybe rename to `take_value_indices_from_list` or something else?
Can we keep it here for now? I wanted to use it in a sort kernel, which would sort an array and return indices; then use this function (for lists) to take the appropriate value indices. Agree on the name, I like your suggestion
I see. Yes, let's keep it here then.
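To make the purpose of this helper concrete: for a list array, taking list `i` means taking every child value in the range `offsets[i]..offsets[i + 1]`, and the result needs fresh offsets. The sketch below is a simplified, hypothetical version operating on plain slices. The signature is an assumption and not the one in the PR, and it ignores null indices and does no bounds checking.

```rust
/// Hypothetical, simplified sketch of the helper discussed above.
///
/// `offsets` has one more entry than the number of lists; list `i` spans the
/// child value positions `offsets[i]..offsets[i + 1]`.
/// Returns `(new_offsets, value_indices)`, where `value_indices` are the child
/// positions to take and `new_offsets` delimit the lists in the result.
fn take_value_indices_from_list(offsets: &[i32], indices: &[usize]) -> (Vec<i32>, Vec<i32>) {
    let mut new_offsets = Vec::with_capacity(indices.len() + 1);
    let mut value_indices = Vec::new();
    let mut current = 0i32;
    new_offsets.push(current);
    for &ix in indices {
        let start = offsets[ix];
        let end = offsets[ix + 1];
        // Each taken list contributes (end - start) child values.
        current += end - start;
        new_offsets.push(current);
        value_indices.extend(start..end);
    }
    (new_offsets, value_indices)
}
```

For example, with offsets `[0, 2, 2, 5]` (lists of length 2, 0 and 3), taking lists `[2, 0, 1]` would select child values 2, 3, 4, 0, 1 and produce the new offsets `[0, 3, 5, 5]`.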
Some optimizations can be done here such as if it is taking the whole list or a contiguous sublist. This can be left as a TODO though.
Is this correct? It seems to be using the value array's null count.
Hi @sunchao, this was quite difficult to deal with, but I've fixed it. The index's null count was only correct if the values array had no nulls. I now recompute the null count from the offsets, where an offset of vec![0,1,1,2,2] contains 2 null values, as 1 and 2 repeat. I'm getting ready to push the latest changes.
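The offset-based count described in this comment can be sketched in a few lines. `null_count_from_offsets` is a hypothetical name, and the sketch follows the comment's rule literally: every pair of equal adjacent offsets is counted as a null slot. Under that rule a valid but empty list would be counted too, so this is an illustration of the idea rather than the PR's exact code.

```rust
/// Hypothetical sketch: counts slots whose start and end offsets are equal,
/// treating each such zero-length slot as null (as described in the comment).
fn null_count_from_offsets(offsets: &[i32]) -> usize {
    offsets
        .windows(2)
        // A window [start, end] with start == end is a zero-length slot.
        .filter(|pair| pair[0] == pair[1])
        .count()
}
```

For `vec![0, 1, 1, 2, 2]` the adjacent pairs are (0, 1), (1, 1), (1, 2) and (2, 2), so two slots are counted.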
nit: might be better to use ArrayDataBuilder here.
I still prefer ArrayData::new() as it's less verbose and reduces my chances of missing something with the builder, but I'll change to ArrayDataBuilder
Ideally ArrayDataBuilder should validate that the result is well-formed before returning the constructed ArrayData. The advantages of ArrayDataBuilder are that 1) you don't need to handle default values by passing None, and 2) it is a little bit clearer on the argument-parameter mapping (e.g., which buffer is used as the null bitmap, which buffer is the value buffer, etc.).
|
Hi @sunchao, PTAL. I've used @liurenjie1024 @andygrove @paddyhoran may you please also review when you can. Thanks |
Codecov Report
@@ Coverage Diff @@
## master #4330 +/- ##
==========================================
- Coverage 82.62% 82.49% -0.14%
==========================================
Files 335 87 -248
Lines 43377 24787 -18590
Branches 1418 0 -1418
==========================================
- Hits 35841 20448 -15393
+ Misses 7174 4339 -2835
+ Partials 362 0 -362
Continue to review full report at Codecov.
|
sunchao
left a comment
Thanks @nevi-me ! This looks good. I just have a few more nits - after that we can get this committed.
nit: do we still need clone here?
This may be a little confusing. Should we document the behavior of the take method using an example? e.g., an index array with both null & non-null elements, and a value array with both null & non-null elements.
I've added an example, but not as a runnable doc example as the function is private. PTAL
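As an illustration of the behaviour being asked about (an index array and a value array that both contain nulls), here is a minimal sketch using `Option` to model null slots. `take_nullable` is a hypothetical function written for this example; it is not the documented example added in the PR, and the out-of-bounds error is an assumption about the desired behaviour.

```rust
/// Hypothetical sketch of take semantics with nulls on both sides:
/// - a null index produces a null output slot;
/// - a valid index pointing at a null value produces a null output slot;
/// - an out-of-bounds index is reported as an error (an assumption here).
fn take_nullable<T: Clone>(
    values: &[Option<T>],
    indices: &[Option<usize>],
) -> Result<Vec<Option<T>>, String> {
    indices
        .iter()
        .map(|index| match index {
            // Null index: the output slot is null regardless of the values.
            None => Ok(None),
            // Valid index: copy the slot, which may itself be null.
            Some(i) => values.get(*i).cloned().ok_or_else(|| {
                format!("index {} out of bounds for length {}", i, values.len())
            }),
        })
        .collect()
}
```

With values `[Some(1), None, Some(3)]` and indices `[Some(2), None, Some(1), Some(0)]`, the result would be `[Some(3), None, None, Some(1)]`: the second slot is null because the index is null, and the third because the value it points at is null.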
nit: let's fix the format: TODO Some -> TODO: some. Also break this long line into two.
rust/arrow/src/compute/util.rs
Outdated
nit: add a : after TODO.
rust/arrow/src/compute/util.rs
Outdated
nit: we can check this right after line 75, on whether start = end, so that we don't need to do the extra map operation.
rust/arrow/src/compute/util.rs
Outdated
nit: remove this line?
Tests are still incomplete, and there aren't benchmarks yet
runtime is linear as would be expected
|
Thanks @sunchao, I've addressed your feedback. |
|
Looks good @nevi-me ! could you try to commit this yourself as you're already a committer now? it's also good for you to learn about the workflow :) |
|
Thanks @sunchao, I'm still setting up my machine for the process, I'll commit it 😃🙏🏾 |
https://issues.apache.org/jira/browse/ARROW-5351

Implement `take` kernel, initial draft that hasn't been benchmarked, and is expected to be rough around the edges.

Author: Neville Dipale <[email protected]>

Closes #4330 from nevi-me/ARROW-5351 and squashes the following commits:

6e7af2d <Neville Dipale> address review comments
2467241 <Neville Dipale> address review feedback
0fc3f73 <Neville Dipale> address review comments
1adfccd <Neville Dipale> update tests, add bounds test
38c7a23 <Neville Dipale> add take kernel to mod exports
b29e160 <Neville Dipale> add some benchmarks
5f44899 <Neville Dipale> add list and struct tests
0301468 <Neville Dipale> complete take functions for different arrays
c675410 <Neville Dipale> ARROW-5351: Take kernel