-
Notifications
You must be signed in to change notification settings - Fork 4k
ARROW-10136: [Rust]: Fix null handling in StringArray and BinaryArray filtering, add BinaryArray::from_opt_vec #8303
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the same pattern as in the handling for primative array: https://github.com/apache/arrow/pull/8303/files#diff-d7b0b7cde1850e8744ceda458c6dea81R294-L298
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(this is the code for BinaryArray that @nevi-me referred to in #8303 (comment))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Likewise, this special case appears to miss the null check too
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note using an Option is likely to increase the temporary storage requirements a bit.
It would likely be possible to avoid this allocation entirely if we used the lower level ArrayBuilder::with_bit_buffer.
I chose to follow the style of the rest of this module, though I would love opinions on trying to perf check this / optimize it (maybe a follow on JIRA ticket is enough)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doesn't an Option of a reference leverages null pointer optimization (on which None is represented by the null pointer, e.g. rust-lang/rust#9378)?
IMO we should follow up on this: for kernels we have been using a mutable buffer with null masks as much as possible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doesn't an Option of a reference leverages null pointer optimization (on which None is represented by the null pointer, e.g. rust-lang/rust#9378)?
Yes, I believe you are correct.
This program:
fn main() {
println!("The size of a &str is {}", std::mem::size_of::<&str>());
println!("The size of an Option<&str> is {}", std::mem::size_of::<Option<&str>>());
}
Produces the following on my machine:
The size of a &str is 16
The size of an Option<&str> is 16
|
Thanks @alamb, this existed as https://issues.apache.org/jira/browse/ARROW-5352; so I'll close that out. The same behaviour would occur with a And yes, the issue didn't affect primitive arrays, I think it was because we were pushing an empty string where we should have pushed a null slot on the null bitmap |
jorgecarleitao
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a lot, @alamb . Looks good so far. I left some minor comments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doesn't an Option of a reference leverages null pointer optimization (on which None is represented by the null pointer, e.g. rust-lang/rust#9378)?
IMO we should follow up on this: for kernels we have been using a mutable buffer with null masks as much as possible.
rust/arrow/src/array/array.rs
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this code tested somewhere?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is tested (indirectly) in https://github.com/apache/arrow/pull/8303/files#diff-d7b0b7cde1850e8744ceda458c6dea81R700 -- but I think a more specific test would be valuable. I will add one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This turned out to be a great call @jorgecarleitao -- I found a bug in this implementation while writing a test. Thank you for the suggestion. 💯
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggest that we test the 3 quantities: d.is_null(0), d.value(0), d.is_null(1). Alternatively,
let expected = StringArray::from(vec![Some("hello"), None]);
assert_eq!(d, expected);There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same here.
Thanks @nevi-me -- yes I updated the code tried to handle both types in the same way (and including tests) |
@jorgecarleitao I believe you are correct: This program: Produces the following on my machine: |
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jorgecarleitao you said:
IMO we should follow up on this: for kernels we have been using a mutable buffer with null masks as much as possible.
Can you perhaps let me know what you mean by this? Perhaps there is an example in the code you are thinking of?
rust/arrow/src/array/array.rs
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is tested (indirectly) in https://github.com/apache/arrow/pull/8303/files#diff-d7b0b7cde1850e8744ceda458c6dea81R700 -- but I think a more specific test would be valuable. I will add one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doesn't an Option of a reference leverages null pointer optimization (on which None is represented by the null pointer, e.g. rust-lang/rust#9378)?
Yes, I believe you are correct.
This program:
fn main() {
println!("The size of a &str is {}", std::mem::size_of::<&str>());
println!("The size of an Option<&str> is {}", std::mem::size_of::<Option<&str>>());
}
Produces the following on my machine:
The size of a &str is 16
The size of an Option<&str> is 16
|
I think I have addressed all your (very helpful) comments @jorgecarleitao . |
I am really, sorry, @alamb , I should have offered more context in the first place. :/ This in no way blocks this PR: IMO it is ready to merge if the relevant tests pass. What I meant is that this code currently:
it may be more efficient to create the buffers during the iteration, so that we avoid the copy (Vec -> buffers). In other words, the code in (as a side note, this is why I am proposing #8211 : IMO there is some boiler-plate copy-pasting to
which will continue to grow as we add more kernels, and whose pattern seems to be a |
jorgecarleitao
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, really nice additions.
|
@jorgecarleitao -- yes thank you that makes a lot of sense. I have filed https://issues.apache.org/jira/browse/ARROW-10141 to track that |
|
@andygrove / @jorgecarleitao / @nevi-me I wonder if this PR might be merged anytime soon (I have a downstream project relying on this change) The integration test failure https://github.com/apache/arrow/pull/8303/checks?check_run_id=1187275161 seems due to a network failure (not anything with this PR): |
|
I'll rebase to try and get a clean test run |
53f04f0 to
b83d867
Compare
|
All tests are passing now. 🎉 |
When I use the
filterkernel with Null strings, any input column that was Null turns into an empty string after filtering.And the filter
Will result in
Rather than
It appears to work fine for primitive arrays (I'll comment inline). I also added
BinaryArray::from_opt_vecfollowing the model ofPrimativeArrayandStringArraymostly so I could write a test.