@nevi-me
Contributor

@nevi-me nevi-me commented Sep 26, 2020

Checks whether a list contains a value, for both primitive and string types.

Large lists are also supported.

@nevi-me
Contributor Author

nevi-me commented Sep 26, 2020

This is extracted from #6770.

PTAL @alamb @jorgecarleitao @jhorstmann

CC @mcassels @maxburke

left.len(),
None,
None,
left.offset(),
Contributor Author

Is it correct for us to reuse the offset, or should this be 0? My intuition says the latter, but I'm too tired to figure it out. The same applies above.

Member

I also think that this should be zero: we are building a new buffer (result.finish()), so there is no need to start from an offset. I would expect a test with left.offset() != 0 to fail with this code, because we would read result's buffer starting from left's offset.
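
To make the offset concern concrete, here is a minimal sketch in plain Rust, without the arrow crate. `pack_bits` and `read_bits` are hypothetical helpers, not Arrow APIs; they only model a packed, LSB-first bitmap that is read starting at a given bit offset.

```rust
/// Packs booleans into an LSB-first bitmap (hypothetical helper, not an Arrow API).
fn pack_bits(values: &[bool]) -> Vec<u8> {
    let mut buf = vec![0u8; (values.len() + 7) / 8];
    for (i, &v) in values.iter().enumerate() {
        if v {
            buf[i / 8] |= 1u8 << (i % 8);
        }
    }
    buf
}

/// Reads `len` bits starting at bit `offset`.
/// Bits past the end of the buffer are treated as false.
fn read_bits(buf: &[u8], offset: usize, len: usize) -> Vec<bool> {
    (offset..offset + len)
        .map(|i| buf.get(i / 8).map_or(false, |b| (*b >> (i % 8)) & 1 == 1))
        .collect()
}
```

With `result = pack_bits(&[true, false, true])`, reading at offset 0 returns the three values that were written, while reading the same fresh buffer at offset 2 returns `[true, false, false]`. That is the mismatch a test with a non-zero `left.offset()` would be expected to expose.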

Member

@jorgecarleitao jorgecarleitao left a comment

I went through this. Looks great. Thanks a lot, @nevi-me !

I think that compare_option_bitmap would benefit from some tests, and that a fix is potentially needed with respect to offsets.

let left = left_data.null_buffer();
let right = right_data.null_buffer();

if (left.is_some() && left_offset_in_bits % 8 != 0)
Member

Out of curiosity: is this true in general, or is it only these implementations (this one and the one above) that rely on this assumption?

Contributor

I'm trying to solve this problem in #8262

The issue is that buffer.slice and buffer_bin_or/and currently work with byte offsets, but for boolean arrays and bitmaps the offset can start in the middle of a byte.
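
A minimal sketch of why byte offsets are not enough, in plain Rust rather than against the arrow crate. `slice_bitmap` is a hypothetical helper: when the bit offset is a multiple of 8 a byte slice suffices, otherwise every bit has to be re-read and re-packed so that the result starts at bit 0.

```rust
/// Copies `len` bits starting at bit `offset` into a new LSB-first bitmap
/// that starts at bit 0 (hypothetical helper, not an Arrow API).
fn slice_bitmap(buf: &[u8], offset: usize, len: usize) -> Vec<u8> {
    if offset % 8 == 0 {
        // Byte-aligned: slicing whole bytes is enough.
        // Unused trailing bits of the last byte are copied as they are.
        return buf[offset / 8..(offset + len + 7) / 8].to_vec();
    }
    // Unaligned: the slice starts in the middle of a byte,
    // so each bit is read at its source position and written at its new one.
    let mut out = vec![0u8; (len + 7) / 8];
    for i in 0..len {
        let src = offset + i;
        if (buf[src / 8] >> (src % 8)) & 1 == 1 {
            out[i / 8] |= 1u8 << (i % 8);
        }
    }
    out
}
```

For the two bytes `[0b1010_1100, 0b0000_0011]`, taking 8 bits from bit offset 2 yields the single byte `0b1110_1011`, which matches neither input byte. A byte-based slice cannot produce that result.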

// contains(null, [null]) = true
if !bit_util::get_bit(left_null_bitmap, i) {
if list.null_count() > 0 {
is_in = true;
Member

I think that we can short-circuit here (result.append(true)?; break), and in the equivalent instances. In Rust terms, I think we are using Iterator::any with some null logic sprinkled on top.
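
A minimal sketch of that reading, assuming a simplified model in which `None` stands in for a null slot and a list is a plain slice. The `contains` function below is hypothetical and is not the kernel under review; it only mirrors the `contains(null, [null]) = true` rule from the excerpt above.

```rust
/// Returns true if `value` occurs in `list`.
/// `None` models a null, and a null value matches a null element,
/// as in the `contains(null, [null]) = true` comment above.
fn contains(value: Option<i32>, list: &[Option<i32>]) -> bool {
    // `Iterator::any` stops at the first element for which the closure
    // returns true, which is the short-circuit being suggested.
    list.iter().any(|item| *item == value)
}
```

Because `any` returns as soon as a match is found, no explicit `break` is needed in this form.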

Contributor Author

PTAL at how I've done it now. Instead of using a boolean builder, I set the bits directly on the buffer, like we did with take.
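
A rough sketch of the "set the bits directly" idea, using a plain `Vec<u8>` instead of Arrow's buffer types. `build_bitmap` and its `pred` parameter are hypothetical names chosen for illustration; the actual kernel in this PR may be structured differently.

```rust
/// Builds a packed, LSB-first bitmap of `len` bits,
/// setting bit `i` whenever `pred(i)` returns true.
fn build_bitmap(len: usize, pred: impl Fn(usize) -> bool) -> Vec<u8> {
    // The buffer is zero-initialised, so only the true slots need a write.
    let mut buf = vec![0u8; (len + 7) / 8];
    for i in 0..len {
        if pred(i) {
            buf[i / 8] |= 1u8 << (i % 8);
        }
    }
    buf
}
```

The buffer is allocated once at its final size and each `true` result costs a single OR into the right byte, instead of one append call per slot on a builder.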

for i in 0..left.len() {
let mut is_in = false;

// contains(null, null) = false
Contributor

The SQL behaviour regarding nulls would be different. If that is intended, it should probably be noted a bit more prominently. In Postgres:

SELECT
  'foo' = ANY(ARRAY['foo','bar']::text[]) AS non_null,
  'foo' = ANY(ARRAY[]::text[]) AS empty,
  'foo' = ANY(ARRAY[NULL]::text[]) AS null_value,
  'foo' = ANY(NULL::text[]) AS null_array
| non_null | empty | null_value | null_array |
| -------- | ----- | ---------- | ---------- |
| true     | false |            |            |

(empty cells in that table being null values)
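
For comparison, the four results in that table can be reproduced with a small three-valued sketch in plain Rust. `sql_any` is a hypothetical function written for illustration, with `None` standing in for SQL NULL. It models only `value = ANY(list)` as shown in the query above and is not part of Arrow.

```rust
/// `value = ANY(list)` under SQL-style three-valued logic.
/// `None` models NULL, both for the value, the list and its elements.
fn sql_any(value: Option<&str>, list: Option<&[Option<&str>]>) -> Option<bool> {
    // A NULL array gives NULL.
    let list = list?;
    let mut saw_null = false;
    for &item in list {
        match (value, item) {
            // A definite match decides the result immediately.
            (Some(v), Some(x)) if v == x => return Some(true),
            // A definite mismatch decides nothing yet.
            (Some(_), Some(_)) => {}
            // Any comparison involving NULL is itself NULL.
            _ => saw_null = true,
        }
    }
    // No match: NULL if any comparison was NULL, otherwise false.
    if saw_null { None } else { Some(false) }
}
```

Under this model a lookup never returns `true` purely because both sides are null, which is the difference from `contains(null, [null]) = true`.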

Contributor Author

I think the test is

SELECT
  'foo' = ANY(ARRAY['foo','bar']::text[]) AS non_null,
  'foo' = ANY(ARRAY[]::text[]) AS empty,
  NULL::text = ANY(ARRAY[NULL, 'foo']::text[]) AS null_value,
  'foo' = ANY(NULL::text[]) AS null_array

but nonetheless it still yields the same result.
I copied this from another PR mostly-verbatim, so I'm only scrutinising the code today myself.

I would prefer aligning with the SQL behaviour: a null value can't be contained in an array, even if the array has nulls.
If users want to find out if an array has nulls, they can use the below or something better:

let array: ListArray<i32> = ...;

let ceil = bit_util::ceil(array.len(), 8);
let mut bools = MutableBuffer::new(ceil);
for i in 0..array.len() {
  if array.is_valid(i) && array.value(i).null_count() > 0 {
    bools.set_bit(i);
  }
}

// create bool array from bools

@nevi-me
Contributor Author

nevi-me commented Sep 26, 2020

@jhorstmann @jorgecarleitao I've had a look at this with a fresh head, and I've simplified the logic to:

Given a list array, return true if a non-null value exists in the array.

@jhorstmann
Contributor

Given a list array, return true if a non-null value exists in the array.

Intuitively this makes sense. Unfortunately, the SQL rules are slightly more complex and can also return null in several cases. I'm OK with leaving it like this for now, but we might need to change it later to align more closely with SQL.

@velvia
Contributor

velvia commented Sep 29, 2020

@nevi-me Hi there, I'm teammates with @mcassels and @maxburke at UrbanLogiq, and we'd like to check in on the status of this PR. From looking over it, this is what I can glean (the PR description is just one sentence):

  • This just has kernel primitives for contains for generic strings and primitives
  • No specific support for dictionary-encoded string arrays? I guess this could be built on the primitive kernel?
  • Is a separate PR for the rest of the support, i.e. logical plan etc., coming? We would be interested in building on top of this to finish the PR, as we would like to use it on a newer branch.
    (we'd be happy to finish the rest of it or work on the follow-on PR)

Would love to have an ETA....

Thanks!
-Evan

@nevi-me
Contributor Author

nevi-me commented Sep 29, 2020

Hi @velvia, I pulled these changes from UL's fork while I was rebasing the changes from #6770. I only refactored the existing kernel functions there, but didn't expand any supported data types.

The behaviour does change from what you had on your fork (#8280 (comment)), so perhaps you can comment on whether this is fine, or bear that in mind when updating your fork.

Regarding an ETA, the PR's still under review, so once it's approved, we'll be able to merge it in.

Yes, you can open JIRAs (https://issues.apache.org/jira/projects/ARROW) then work on top of this PR.

I've spent some time looking at the UL fork, and I think there might be changes there that the wider community would benefit from if you upstream them a bit more frequently. I understand that sometimes we might take longer than is ideal to complete PR reviews and merge them; but that's often a function of the sporadic availability of capacity from the Rust developers.

We've also been very reliant on 2 people on the Parquet side, but I've been spending more time on the codebase so I can start picking up PRs on Parquet.

@velvia
Contributor

velvia commented Sep 30, 2020

@nevi-me thanks, will have a look

@maxburke
Contributor

maxburke commented Oct 1, 2020

I've spent some time looking at the UL fork, and I think there might be changes there that the wider community would benefit from if you upstream them a bit more frequently. I understand that sometimes we might take longer than is ideal to complete PR reviews and merge them; but that's often a function of the sporadic availability of capacity from the Rust developers.

We would love to get our changes into mainline Arrow, and that is our goal! We've just been very constrained lately in how much people-power we can use to get the patches into a state our committers are happy with 🙂

Slowly but surely we'll have more patches coming!

Member

@jorgecarleitao jorgecarleitao left a comment

LGTM

Member

@andygrove andygrove left a comment

LGTM
