ARROW-10103: [Rust] Add contains kernel #8280

nevi-me · 2020-09-26T01:54:17Z

Checks if a list contains a value in either a primitive or string

Large lists are also supported

Checks if a list contains a value in either a primitive or string Large lists are also supported

nevi-me · 2020-09-26T01:56:21Z

This is extracted from #6770,

PTAL @alamb @jorgecarleitao @jhorstmann

CC @mcassels @maxburke

nevi-me · 2020-09-26T01:57:33Z

rust/arrow/src/compute/kernels/comparison.rs

+        left.len(),
+        None,
+        None,
+        left.offset(),


Is it correct for us to reuse the offset, or should this be 0? My intuition says the latter, but I'm too tired to figure it out. Same applies above

I also think that this should be zero: we are building a new buffer (result.finish()) and thus there is no need to start from an offset. I would expect that a test with left.offset() != 0 to not pass with this code, as we read result's buffer from left's offset.

github-actions · 2020-09-26T02:06:14Z

https://issues.apache.org/jira/browse/ARROW-10103

jorgecarleitao

I went through this. Looks great. Thanks a lot, @nevi-me !

I think that compare_option_bitmap would benefit from some tests, and that there is a potential fix needed wrt to offsets.

jorgecarleitao · 2020-09-26T02:42:56Z

rust/arrow/src/compute/util.rs

+    let left = left_data.null_buffer();
+    let right = right_data.null_buffer();
+
+    if (left.is_some() && left_offset_in_bits % 8 != 0)


Out of curiosity: is this true in general, or is for these (this and above) implementation, that uses this assumption?

I'm trying to solve this problem in #8262

The issue is that buffer.slice and buffer_bin_or/and currently work with byte offsets, but for boolean arrays and bitmaps the offset can start in the middle of a byte.

rust/arrow/src/compute/util.rs

jorgecarleitao · 2020-09-26T02:47:46Z

rust/arrow/src/compute/kernels/comparison.rs

+            // contains(null, [null]) = true
+            if !bit_util::get_bit(left_null_bitmap, i) {
+                if list.null_count() > 0 {
+                    is_in = true;


I think that we can short-circuit here (result.append(true)?; break) (and equivalent instances). In rust's notation, I think that we are using the Iterator::any with some null logic sprinkled on top.

PTAL at how I've done it now. Instead of using a booelan builder, I set the bits directly on the buffer like we did with take

jorgecarleitao · 2020-09-26T02:53:31Z

rust/arrow/src/compute/kernels/comparison.rs

+        left.len(),
+        None,
+        None,
+        left.offset(),


I also think that this should be zero: we are building a new buffer (result.finish()) and thus there is no need to start from an offset. I would expect that a test with left.offset() != 0 to not pass with this code, as we read result's buffer from left's offset.

rust/arrow/src/compute/kernels/comparison.rs

jhorstmann · 2020-09-26T08:49:20Z

rust/arrow/src/compute/kernels/comparison.rs

+    for i in 0..left.len() {
+        let mut is_in = false;
+
+        // contains(null, null) = false


The sql behaviour regarding nulls would be different, if that is intended it should probably be noted a bit more prominently. In Postgres:

SELECT 'foo' = ANY(ARRAY['foo','bar']::text[]) AS non_null, 'foo' = ANY(ARRAY[]::text[]) AS empty, 'foo' = ANY(ARRAY[NULL]::text[]) AS null_value, 'foo' = ANY(NULL::text[]) AS null_array

| non_null | empty | null_value | null_array | | -------- | ----- | ---------- | ---------- | | true | false | | |

(empty cells in that table being null values)

I think the test is

SELECT 'foo' = ANY(ARRAY['foo','bar']::text[]) AS non_null, 'foo' = ANY(ARRAY[]::text[]) AS empty, NULL::text = ANY(ARRAY[NULL, 'foo']::text[]) AS null_value, 'foo' = ANY(NULL::text[]) AS null_array

but nonetheless it still yields the same result.
I copied this from another PR mostly-verbatim, so I'm only scrutinising the code today myself.

I would prefer aligning with the SQL behaviour, that a null value can't be contained in an array if the array has nulls.
If users want to find out if an array has nulls, they can use the below or something better:

let array: ListArray<i32> = ...; let ceil = bit_util::ceil(array.len(), 8); let mut bools = MutableBuffer::new(ceil). for i in -..array.len() { if array.is_valid(i) && array.value(i).null_count() > 0 { bools.set_bit(i); } } // create bool array from bools

nevi-me · 2020-09-26T09:23:21Z

@jhorstmann @jorgecarleitao I've had a look at this with a fresh head, and I've simplified the logic to:

Given a list array, return true if a non-null value exists in the array.

jhorstmann · 2020-09-27T10:20:04Z

Given a list array, return true if a non-null value exists in the array.

Intuitively this makes sense, unfortunately the sql rules are slightly more complex and can also return null in several cases. I'm ok with leaving it like this for now, but we might need to change it later to align more closely with sql.

velvia · 2020-09-29T23:31:47Z

@nevi-me Hi there, I'm teammates with @mcassels and @maxburke at UrbanLogiq, and we'd like to check in on the status of this PR. From looking over it, this is what I can gleam (the PR desc is just one sentence):

This just has kernel primitives for contains for generic strings and primitives
No specific support for dictionary-encoded string arrays? I guess this could be built on the primitive kernel?
Is a separate PR for the rest of the support, ie logical plan, etc. coming? Would be interested in building on top of this to finish the PR as we would like to use it on a newer branch.
(we'd be happy to finish the rest of it or work on the follow on PR)

Would love to have an ETA....

Thanks!
-Evan

nevi-me · 2020-09-29T23:42:10Z

Hi @velvia, I pulled these changes from UL's fork while I was rebasing the changes from #6770. I only refactored the existing kernel functions there, but didn't expand any supported data types.

The behaviour does change from what you had on your fork (#8280 (comment)), so perhaps you can comment on whether this is fine, or bear that in mind when updating your fork.

Regarding an ETA, the PR's still under review, so once it's approved, we'll be able to merge it in.

Yes, you can open JIRAs (https://issues.apache.org/jira/projects/ARROW) then work on top of this PR.

I've spent some time looking at the UL fork, and I think there might be changes there that the wider community would benefit from if you upstream them a bit more frequently. I understand that sometimes we might take longer than is ideal to complete PR reviews and merge them; but that's often a function of the sporadic availability of capacity from the Rust developers.

We've also been very reliant on 2 people on the Parquet side, but I've been spending more time on the codebase so I can start picking up PRs on Parquet.

velvia · 2020-09-30T17:43:04Z

@nevi-me thanks, will have a look

maxburke · 2020-10-01T05:49:35Z

I've spent some time looking at the UL fork, and I think there might be changes there that the wider community would benefit from if you upstream them a bit more frequently. I understand that sometimes we might take longer than is ideal to complete PR reviews and merge them; but that's often a function of the sporadic availability of capacity from the Rust developers.

We would love to get our changes into the mainline Arrow, and that is our goal!, we've just been very constriained lately in how much people-power we can use to get the patches into a state our committers are happy with 🙂

Slowly but surely we'll have more patches coming!

jorgecarleitao

LGTM

andygrove

LGTM

ARROW-10103: [Rust] Add contains kernel

a33460d

Checks if a list contains a value in either a primitive or string Large lists are also supported

nevi-me added the Component: Rust label Sep 26, 2020

nevi-me requested a review from andygrove September 26, 2020 01:56

nevi-me commented Sep 26, 2020

View reviewed changes

remove gated imports

b012544

jorgecarleitao reviewed Sep 26, 2020

View reviewed changes

jhorstmann reviewed Sep 26, 2020

View reviewed changes

rust/arrow/src/compute/kernels/comparison.rs Outdated Show resolved Hide resolved

jhorstmann reviewed Sep 26, 2020

View reviewed changes

rust/arrow/src/compute/kernels/comparison.rs Outdated Show resolved Hide resolved

jhorstmann reviewed Sep 26, 2020

View reviewed changes

rust/arrow/src/compute/kernels/comparison.rs Outdated Show resolved Hide resolved

jhorstmann reviewed Sep 26, 2020

View reviewed changes

simplify contains kernel

32fcaf6

jorgecarleitao approved these changes Oct 1, 2020

View reviewed changes

andygrove approved these changes Oct 2, 2020

View reviewed changes

andygrove closed this in ad712e5 Oct 2, 2020

asfimport mentioned this pull request Oct 2, 2020

[Rust] Add a Contains kernel #26117

Closed

ARROW-10103: [Rust] Add contains kernel #8280

ARROW-10103: [Rust] Add contains kernel #8280

Uh oh!

Conversation

nevi-me commented Sep 26, 2020

Uh oh!

nevi-me commented Sep 26, 2020

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

github-actions bot commented Sep 26, 2020

Uh oh!

jorgecarleitao left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

nevi-me commented Sep 26, 2020

Uh oh!

jhorstmann commented Sep 27, 2020

Uh oh!

velvia commented Sep 29, 2020

Uh oh!

nevi-me commented Sep 29, 2020

Uh oh!

velvia commented Sep 30, 2020

Uh oh!

maxburke commented Oct 1, 2020

Uh oh!

jorgecarleitao left a comment

Choose a reason for hiding this comment

Uh oh!

andygrove left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants