-
Notifications
You must be signed in to change notification settings - Fork 4k
ARROW-10103: [Rust] Add contains kernel #8280
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Checks if a list contains a value in either a primitive or string Large lists are also supported
|
This is extracted from #6770, |
| left.len(), | ||
| None, | ||
| None, | ||
| left.offset(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it correct for us to reuse the offset, or should this be 0? My intuition says the latter, but I'm too tired to figure it out. Same applies above
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also think that this should be zero: we are building a new buffer (result.finish()) and thus there is no need to start from an offset. I would expect that a test with left.offset() != 0 to not pass with this code, as we read result's buffer from left's offset.
jorgecarleitao
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I went through this. Looks great. Thanks a lot, @nevi-me !
I think that compare_option_bitmap would benefit from some tests, and that there is a potential fix needed wrt to offsets.
| let left = left_data.null_buffer(); | ||
| let right = right_data.null_buffer(); | ||
|
|
||
| if (left.is_some() && left_offset_in_bits % 8 != 0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Out of curiosity: is this true in general, or is for these (this and above) implementation, that uses this assumption?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm trying to solve this problem in #8262
The issue is that buffer.slice and buffer_bin_or/and currently work with byte offsets, but for boolean arrays and bitmaps the offset can start in the middle of a byte.
| // contains(null, [null]) = true | ||
| if !bit_util::get_bit(left_null_bitmap, i) { | ||
| if list.null_count() > 0 { | ||
| is_in = true; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that we can short-circuit here (result.append(true)?; break) (and equivalent instances). In rust's notation, I think that we are using the Iterator::any with some null logic sprinkled on top.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PTAL at how I've done it now. Instead of using a booelan builder, I set the bits directly on the buffer like we did with take
| left.len(), | ||
| None, | ||
| None, | ||
| left.offset(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also think that this should be zero: we are building a new buffer (result.finish()) and thus there is no need to start from an offset. I would expect that a test with left.offset() != 0 to not pass with this code, as we read result's buffer from left's offset.
| for i in 0..left.len() { | ||
| let mut is_in = false; | ||
|
|
||
| // contains(null, null) = false |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The sql behaviour regarding nulls would be different, if that is intended it should probably be noted a bit more prominently. In Postgres:
SELECT
'foo' = ANY(ARRAY['foo','bar']::text[]) AS non_null,
'foo' = ANY(ARRAY[]::text[]) AS empty,
'foo' = ANY(ARRAY[NULL]::text[]) AS null_value,
'foo' = ANY(NULL::text[]) AS null_array| non_null | empty | null_value | null_array |
| -------- | ----- | ---------- | ---------- |
| true | false | | |
(empty cells in that table being null values)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the test is
SELECT
'foo' = ANY(ARRAY['foo','bar']::text[]) AS non_null,
'foo' = ANY(ARRAY[]::text[]) AS empty,
NULL::text = ANY(ARRAY[NULL, 'foo']::text[]) AS null_value,
'foo' = ANY(NULL::text[]) AS null_arraybut nonetheless it still yields the same result.
I copied this from another PR mostly-verbatim, so I'm only scrutinising the code today myself.
I would prefer aligning with the SQL behaviour, that a null value can't be contained in an array if the array has nulls.
If users want to find out if an array has nulls, they can use the below or something better:
let array: ListArray<i32> = ...;
let ceil = bit_util::ceil(array.len(), 8);
let mut bools = MutableBuffer::new(ceil).
for i in -..array.len() {
if array.is_valid(i) && array.value(i).null_count() > 0 {
bools.set_bit(i);
}
}
// create bool array from bools|
@jhorstmann @jorgecarleitao I've had a look at this with a fresh head, and I've simplified the logic to: Given a list array, return true if a non-null value exists in the array. |
Intuitively this makes sense, unfortunately the sql rules are slightly more complex and can also return null in several cases. I'm ok with leaving it like this for now, but we might need to change it later to align more closely with sql. |
|
@nevi-me Hi there, I'm teammates with @mcassels and @maxburke at UrbanLogiq, and we'd like to check in on the status of this PR. From looking over it, this is what I can gleam (the PR desc is just one sentence):
Would love to have an ETA.... Thanks! |
|
Hi @velvia, I pulled these changes from UL's fork while I was rebasing the changes from #6770. I only refactored the existing kernel functions there, but didn't expand any supported data types. The behaviour does change from what you had on your fork (#8280 (comment)), so perhaps you can comment on whether this is fine, or bear that in mind when updating your fork. Regarding an ETA, the PR's still under review, so once it's approved, we'll be able to merge it in. Yes, you can open JIRAs (https://issues.apache.org/jira/projects/ARROW) then work on top of this PR. I've spent some time looking at the UL fork, and I think there might be changes there that the wider community would benefit from if you upstream them a bit more frequently. I understand that sometimes we might take longer than is ideal to complete PR reviews and merge them; but that's often a function of the sporadic availability of capacity from the Rust developers. We've also been very reliant on 2 people on the Parquet side, but I've been spending more time on the codebase so I can start picking up PRs on Parquet. |
|
@nevi-me thanks, will have a look |
We would love to get our changes into the mainline Arrow, and that is our goal!, we've just been very constriained lately in how much people-power we can use to get the patches into a state our committers are happy with 🙂 Slowly but surely we'll have more patches coming! |
jorgecarleitao
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
andygrove
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Checks if a list contains a value in either a primitive or string
Large lists are also supported