Skip to content

Conversation

@nevi-me
Copy link
Contributor

@nevi-me nevi-me commented May 16, 2019

https://issues.apache.org/jira/browse/ARROW-5351

Implement take kernel, initial draft that hasn't been benchmarked, and is expected to be rough around the edges.

@nevi-me
Copy link
Contributor Author

nevi-me commented May 16, 2019

@sunchao I've used ArrayData comparison to test, I had expected this test to fail because of this.

I use Windows, so it looks like ArrayData sometimes fails on Windows when comparing 2 similar arrays by their data. The reason being that null/invalid array slots return non-deterministic data.

I'll attach an example tomorrow when I continue with this PR.

I'm mentioning this because you suggested that we avoid using string/debug for comparison of arrays

@sunchao
Copy link
Member

sunchao commented May 16, 2019

Yes, more efforts are required to improve equality check for ArrayData, Array and others. Currently they just use the derived equality which is not good. We should find a way to ignore the null/invalid data when doing the equality check.

@nevi-me
Copy link
Contributor Author

nevi-me commented May 23, 2019

@andygrove @sunchao @paddyhoran I have a few decisions to make, and would appreciate some guidance/opinions.

I'm mostly done with this, but need to optimise bounds checking and implement options (such as whether to even check for bounds).

Question: pandas, numpy, and the C++ version (cc @bkietz) have the option of supplying negative indices. Is this something that we'd want in Rust? If so, I'll change the function's index parameter to Int32Array from the current unsigned one.

@nevi-me nevi-me marked this pull request as ready for review May 23, 2019 21:26
@sunchao
Copy link
Member

sunchao commented May 24, 2019

What does a negative index mean? can you show some code examples where this is applied?

@nevi-me
Copy link
Contributor Author

nevi-me commented May 24, 2019

@bkietz
Copy link
Member

bkietz commented May 24, 2019

It's worth noting that out of bounds and negative indices are currently just an error in C++

@nevi-me
Copy link
Contributor Author

nevi-me commented May 24, 2019

Thanks @bkietz, I noticed that there's open JIRAs for them. I wanted to get them out of the way earlier so we can stabilise the take signature.

@nevi-me nevi-me force-pushed the ARROW-5351 branch 2 times, most recently from c37d72f to b124dd4 Compare June 2, 2019 16:52
@nevi-me
Copy link
Contributor Author

nevi-me commented Jun 2, 2019

I've done all I can here.

I noticed that if I have an array with 6 values, array.is_null(6) and array.is_null(7) won't panic because the array length's within a multiple of 8; so the 7th and 8th values are returned as null in the buffer. I don't know if this is an issue, and if we should document this behaviour.

@wesm
Copy link
Member

wesm commented Jun 12, 2019

@sunchao can you review? It seems then #4331 can be finished after this is merged

Copy link
Member

@sunchao sunchao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @nevi-me for the update and sorry for the delay on reviewing!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why use import here instead of putting it on the top-level?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a habit, I've become accustomed to importing enums inside the scope that I need them, to avoid repeating TimeUnit::A, TimeUnit::B, etc. Would you prefer I move it to the top-level?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure whether there's a "official style guide" on this but it may be better to put it in top-level to 1) make it easier to find out what are the dependencies for a module, 2) potentially avoid repeating the use clause in all the places that need it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: instead of Option<&TakeOptions> can we use Option<TakeOptions>? it might be more convenient to use.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we call this indices?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above, s/index/indices?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar to the C++ version, it may be better if we can optimize this by considering whether values/indices are all valid or not. This doesn't necessarily have to be done in this PR though.

Copy link
Contributor Author

@nevi-me nevi-me Jun 18, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean checking if we have 0 nulls and bypassing the null checks? I'll have a look at what C++ does, and can update this in a subsequent PR

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes exactly. Sure it can be done in a follow-up.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we call this values? array is self explanatory given that the type is ArrayRef.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • pub(crate could be changed to pub(super) if this is only used within the compute module? or, is it better to put this in take.rs?
  • the name is a little bit misleading - it gives the impression that this is taking indices from a list - maybe rename to take_value_indices_from_list or something else?

Copy link
Contributor Author

@nevi-me nevi-me Jun 18, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we keep it here for now? I wanted to use it in a sort kernel, which would sort an array and return indices; then use this function (for lists) to take the appropriate value indices. Agree on the name, I like your suggestion

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. Yes lets keep it here then.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some optimizations can be done here such as if it is taking the whole list or a contiguous sublist. This can be left as a TODO though.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this correct? seems it is using value array's null count.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @sunchao, this was quite difficult to deal with; but I've fixed it. The index's null count was only correct if the values array had no null counts. I now recompute the null count from the offsets, where an offset of vec![0,1,1,2,2] contains 2 null values as 1 and 2 repeat. I'm getting ready to push the latest changes.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: might be better to use ArrayDataBuilder here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still prefer ArrayData::new() as it's less verbose and reduces my chances of missing something with the builder, but I'll change to ArrayDataBuilder

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally ArrayDataBuilder should do a validation to make sure itself is well-formed before returning the constructed ArrayData. The advantage with ArrayDataBuilder is that 1): you don't need to handle default values by passing None; 2) it is a little bit clearer on the argument-parameter mapping (e.g., which buffer is used as null bit map, which buffer is value buffer, etc).

@nevi-me
Copy link
Contributor Author

nevi-me commented Jul 6, 2019

Hi @sunchao, PTAL. I've used ArrayEqual for my comparisons, correctly calculated list offsets, and fixed other review comments. Apologies for taking very long on this, haven't had capacity in past weeks. I'm available now again, so I'll look at other Rust work.

@liurenjie1024 @andygrove @paddyhoran may you please also review when you can. Thanks

@codecov-io
Copy link

codecov-io commented Jul 7, 2019

Codecov Report

Merging #4330 into master will decrease coverage by 0.13%.
The diff coverage is 93.15%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #4330      +/-   ##
==========================================
- Coverage   82.62%   82.49%   -0.14%     
==========================================
  Files         335       87     -248     
  Lines       43377    24787   -18590     
  Branches     1418        0    -1418     
==========================================
- Hits        35841    20448   -15393     
+ Misses       7174     4339    -2835     
+ Partials      362        0     -362
Impacted Files Coverage Δ
rust/arrow/src/array/builder.rs 92.3% <100%> (+0.04%) ⬆️
rust/arrow/src/array/array.rs 92.57% <100%> (+0.21%) ⬆️
rust/arrow/src/compute/kernels/take.rs 92.5% <92.5%> (ø)
rust/arrow/src/compute/util.rs 88.57% <95.34%> (+10.79%) ⬆️
r/src/recordbatch.cpp
r/R/Table.R
js/src/util/fn.ts
r/src/symbols.cpp
r/R/Column.R
go/arrow/ipc/file_reader.go
... and 246 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7838886...6e7af2d. Read the comment docs.

Copy link
Member

@sunchao sunchao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @nevi-me ! This looks good. I just have a few more nits - after that we can get this committed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: do we still need clone here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This may be a little confusing. Should we document the behavior of the take method using an example? e.g., an index array with both null & non-null elements, and a value array with both null & non-null elements.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added an example, but not as a runnable doc example as the function is private. PTAL

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: let's fix the format: TODO Some -> TODO: some. Also break this long line into two.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: add a : after TODO.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: we can check this right after line 75, on whether start = end, so that we don't need to do the extra map operation.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: remove this line?

@nevi-me
Copy link
Contributor Author

nevi-me commented Jul 9, 2019

Thanks @sunchao, I've addressed your feedback.

@sunchao
Copy link
Member

sunchao commented Jul 10, 2019

Looks good @nevi-me ! could you try to commit this yourself as you're already a committer now? it's also good for you to learn about the workflow :)

@nevi-me
Copy link
Contributor Author

nevi-me commented Jul 13, 2019

Thanks @sunchao, I'm still setting up machine for the process, I'll commit it 😃🙏🏾

@nevi-me nevi-me closed this in a54888a Jul 17, 2019
kszucs pushed a commit that referenced this pull request Jul 22, 2019
https://issues.apache.org/jira/browse/ARROW-5351

Implement `take` kernel, initial draft that hasn't been benchmarked, and is expected to be rough around the edges.

Author: Neville Dipale <[email protected]>

Closes #4330 from nevi-me/ARROW-5351 and squashes the following commits:

6e7af2d <Neville Dipale> address review comments
2467241 <Neville Dipale> address review feedback
0fc3f73 <Neville Dipale> address review comments
1adfccd <Neville Dipale> update tests, add bounds test
38c7a23 <Neville Dipale> add take kernel to mod exports
b29e160 <Neville Dipale> add some benchmarks
5f44899 <Neville Dipale> add list and struct tests
0301468 <Neville Dipale> complete take functions for different arrays
c675410 <Neville Dipale> ARROW-5351:  Take kernel
@nevi-me nevi-me deleted the ARROW-5351 branch October 17, 2020 19:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants