@alamb
Contributor

@alamb alamb commented Sep 29, 2020

When I use the filter kernel on string arrays that contain nulls, any input value that was NULL turns into an empty string after filtering. For example, the input

"foo"
"bar"
NULL

And the filter

true
true
true

Will result in

"foo"
"bar"
""

Rather than

"foo"
"bar"
NULL

It appears to work fine for primitive arrays (I'll comment inline). I also added BinaryArray::from_opt_vec, following the model of PrimitiveArray and StringArray, mostly so I could write a test.
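
For readers without the Arrow code at hand, the expected semantics can be sketched with plain standard-library types. This is only an illustration: the column is modelled as a Vec<Option<&str>>, and filter_nullable is a hypothetical helper, not the kernel changed in this PR.

```rust
/// Hypothetical stand-in for the filter kernel (not the arrow crate's API):
/// keeps the slots whose mask entry is true and passes null slots (None)
/// through unchanged.
fn filter_nullable<'a>(values: &[Option<&'a str>], mask: &[bool]) -> Vec<Option<&'a str>> {
    values
        .iter()
        .zip(mask.iter())
        .filter(|(_, keep)| **keep)
        .map(|(value, _)| *value)
        .collect()
}

fn main() {
    let input = vec![Some("foo"), Some("bar"), None];
    let mask = vec![true, true, true];
    let filtered = filter_nullable(&input, &mask);
    // The null slot must stay null rather than turning into Some("").
    assert_eq!(filtered, vec![Some("foo"), Some("bar"), None]);
    println!("{:?}", filtered);
}
```

The point of the sketch is only that a null and an empty string are distinct values, so a filter that keeps every row should return its input unchanged.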

Contributor Author

(this is the code for BinaryArray that @nevi-me referred to in #8303 (comment))

Contributor Author

Likewise, this special case appears to miss the null check too

Contributor Author

Note using an Option is likely to increase the temporary storage requirements a bit.

It would likely be possible to avoid this allocation entirely if we used the lower level ArrayBuilder::with_bit_buffer.

I chose to follow the style of the rest of this module, though I would love opinions on trying to perf check this / optimize it (maybe a follow on JIRA ticket is enough)?

Member

Doesn't an Option of a reference leverage the null pointer optimization (in which None is represented by the null pointer, e.g. rust-lang/rust#9378)?

IMO we should follow up on this: for kernels we have been using a mutable buffer with null masks as much as possible.

Contributor Author

Doesn't an Option of a reference leverage the null pointer optimization (in which None is represented by the null pointer, e.g. rust-lang/rust#9378)?

Yes, I believe you are correct.

This program:

fn main() {
    println!("The size of a &str is {}", std::mem::size_of::<&str>());
    println!("The size of an Option<&str> is {}", std::mem::size_of::<Option<&str>>());
}

Produces the following on my machine:

The size of a &str is 16
The size of an Option<&str> is 16

@nevi-me
Contributor

nevi-me commented Sep 29, 2020

Thanks @alamb, this existed as https://issues.apache.org/jira/browse/ARROW-5352; so I'll close that out.

The same behaviour would occur with a BinaryArray. I haven't looked at this PR yet, but it's worthwhile to ensure that StringArray and BinaryArray are treated consistently.

And yes, the issue didn't affect primitive arrays. I think it happened because we were pushing an empty string where we should have marked the slot as null in the null bitmap.
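
To make the distinction concrete, here is a simplified model of a variable-length string column. It is a sketch only: the StringColumn struct, its field names, and the build function are illustrative assumptions, and the validity flags are stored as a Vec<bool> where Arrow would pack them into a bitmap.

```rust
/// Simplified model of a variable-length string column. The names are
/// illustrative; this is not the arrow crate's representation.
struct StringColumn {
    offsets: Vec<usize>, // slot i spans data[offsets[i]..offsets[i + 1]]
    data: Vec<u8>,       // concatenated UTF-8 bytes of all non-null slots
    validity: Vec<bool>, // false marks a null slot
}

/// Builds the three buffers from a slice of optional strings.
fn build(values: &[Option<&str>]) -> StringColumn {
    let mut col = StringColumn {
        offsets: vec![0],
        data: Vec::new(),
        validity: Vec::new(),
    };
    for value in values {
        if let Some(s) = value {
            col.data.extend_from_slice(s.as_bytes());
        }
        // A null slot and an empty string both add zero bytes...
        col.offsets.push(col.data.len());
        // ...so only the validity flag tells them apart.
        col.validity.push(value.is_some());
    }
    col
}

fn main() {
    let with_null = build(&[Some("foo"), None]);
    let with_empty = build(&[Some("foo"), Some("")]);
    assert_eq!(with_null.offsets, with_empty.offsets);
    assert_eq!(with_null.data, with_empty.data);
    assert_ne!(with_null.validity, with_empty.validity);
}
```

In this model, forgetting to record the validity flag produces exactly the symptom reported above: the slot reads back as an empty string.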

@nevi-me nevi-me self-requested a review September 29, 2020 23:19
@nevi-me nevi-me changed the title from "ARROW-10136: [Rust][Arrow]: Fix null handling in StringArray and BinaryArray filtering, add BinaryArray::from_opt_vec" to "ARROW-10136: [Rust]: Fix null handling in StringArray and BinaryArray filtering, add BinaryArray::from_opt_vec" Sep 29, 2020
Member

@jorgecarleitao jorgecarleitao left a comment

Thanks a lot, @alamb . Looks good so far. I left some minor comments.

Member

Is this code tested somewhere?

Contributor Author

It is tested (indirectly) in https://github.com/apache/arrow/pull/8303/files#diff-d7b0b7cde1850e8744ceda458c6dea81R700 -- but I think a more specific test would be valuable. I will add one.

Contributor Author

This turned out to be a great call @jorgecarleitao -- I found a bug in this implementation while writing a test. Thank you for the suggestion. 💯

Member

I suggest that we test the 3 quantities: d.is_null(0), d.value(0), d.is_null(1). Alternatively,

let expected = StringArray::from(vec![Some("hello"), None]);
assert_eq!(d, expected);

Member

same here.

@alamb
Contributor Author

alamb commented Sep 30, 2020

The same behaviour would occur with a BinaryArray, I haven't looked at this PR, but it's worthwhile to ensure that StringArray and BinaryArray are treated consistently.

Thanks @nevi-me -- yes, I updated the code to handle both types in the same way (including tests).

@alamb
Contributor Author

alamb commented Sep 30, 2020

Doesn't an Option of a reference leverage the null pointer optimization (in which None is represented by the null pointer, e.g. rust-lang/rust#9378)?

@jorgecarleitao I believe you are correct:

This program:

fn main() {
    println!("The size of a &str is {}", std::mem::size_of::<&str>());
    println!("The size of an Option<&str> is {}", std::mem::size_of::<Option<&str>>());
}

Produces the following on my machine:

The size of a &str is 16
The size of an Option<&str> is 16

Contributor Author

@alamb alamb left a comment

@jorgecarleitao you said:

IMO we should follow up on this: for kernels we have been using a mutable buffer with null masks as much as possible.

Can you perhaps let me know what you mean by this? Perhaps there is an example in the code you are thinking of?

@alamb
Contributor Author

alamb commented Sep 30, 2020

I think I have addressed all your (very helpful) comments @jorgecarleitao .

@jorgecarleitao
Member

@jorgecarleitao you said:

IMO we should follow up on this: for kernels we have been using a mutable buffer with null masks as much as possible.

Can you perhaps let me know what you mean by this? Perhaps there is an example in the code you are thinking of?

I am really sorry, @alamb, I should have offered more context in the first place. :/

This in no way blocks this PR: IMO it is ready to merge if the relevant tests pass.

What I meant is that this code currently:

  • creates Vec<Option<T>> through an iteration
  • copies Vec<Option<T>> to the two buffers (when from_opt_vec is called)

It may be more efficient to create the buffers during the iteration, so that we avoid the copy (Vec -> buffers). In other words, the code in from_opt_vec could have been "injected" into the filter execution, where the MutableBuffer and the offsets and values buffers are created before the loop, and new elements are written directly to them. Does this make any sense?

(as a side note, this is why I am proposing #8211 : IMO there is some boiler-plate copy-pasting to

  1. initialize buffers
  2. iterate
  3. create ArrayData from buffers

which will continue to grow as we add more kernels, and whose pattern seems to be a FromIter of fixed size)
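
The three steps above can be sketched without the arrow crate. This is a rough illustration under stated assumptions: Buffers and filter_into_buffers are hypothetical names, plain Vecs stand in for MutableBuffer, and the validity flags are kept as a Vec<bool> rather than a packed bitmap.

```rust
/// Hypothetical output buffers for a filtered string column; plain Vecs
/// stand in for Arrow's MutableBuffer.
struct Buffers {
    offsets: Vec<i32>,
    values: Vec<u8>,
    validity: Vec<bool>,
}

/// Filters `input` by `mask`, writing each kept slot straight into the
/// output buffers instead of collecting an intermediate Vec<Option<&str>>.
fn filter_into_buffers(input: &[Option<&str>], mask: &[bool]) -> Buffers {
    // 1. initialize buffers
    let mut out = Buffers {
        offsets: vec![0],
        values: Vec::new(),
        validity: Vec::new(),
    };
    // 2. iterate, appending only the slots selected by the mask
    for (value, keep) in input.iter().zip(mask) {
        if !*keep {
            continue;
        }
        if let Some(s) = value {
            out.values.extend_from_slice(s.as_bytes());
        }
        out.offsets.push(out.values.len() as i32);
        out.validity.push(value.is_some());
    }
    // 3. a real kernel would now create ArrayData from these buffers
    out
}

fn main() {
    let out = filter_into_buffers(&[Some("foo"), None, Some("bar")], &[true, true, false]);
    assert_eq!(out.offsets, vec![0, 3, 3]);
    assert_eq!(out.values, b"foo".to_vec());
    assert_eq!(out.validity, vec![true, false]);
}
```

Compared with collecting into a Vec of Options first, this variant touches each kept value once, which is the copy the comment suggests avoiding.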

Member

@jorgecarleitao jorgecarleitao left a comment

LGTM, really nice additions.

@alamb
Contributor Author

alamb commented Sep 30, 2020

@jorgecarleitao -- yes thank you that makes a lot of sense. I have filed https://issues.apache.org/jira/browse/ARROW-10141 to track that

@alamb
Contributor Author

alamb commented Oct 1, 2020

@andygrove / @jorgecarleitao / @nevi-me I wonder if this PR might be merged anytime soon (I have a downstream project relying on this change)

The integration test failure https://github.com/apache/arrow/pull/8303/checks?check_run_id=1187275161 seems due to a network failure (not anything with this PR):

Error:  Failed to execute goal org.apache.maven.plugins:maven-site-plugin:3.5.1:attach-descriptor (attach-descriptor) on project arrow-java-root: Execution attach-descriptor of goal org.apache.maven.plugins:maven-site-plugin:3.5.1:attach-descriptor failed: Plugin org.apache.maven.plugins:maven-site-plugin:3.5.1 or one of its dependencies could not be resolved: Could not transfer artifact org.apache.maven:maven-archiver:jar:2.5 from/to central (https://repo.maven.apache.org/maven2): Connection reset -> [Help 1]

@alamb
Contributor Author

alamb commented Oct 1, 2020

I'll rebase to try and get a clean test run

@alamb alamb force-pushed the alamb/ARROW-10136-null-filter branch from 53f04f0 to b83d867 Compare October 1, 2020 12:52
@alamb
Contributor Author

alamb commented Oct 1, 2020

All tests are passing now. 🎉

3 participants