Skip to content

Conversation

@b41sh
Copy link
Contributor

@b41sh b41sh commented Aug 22, 2021

Which issue does this PR close?

Closes #697

Rationale for this change

What changes are included in this PR?

Implement arrow compute kernel comparison functions regexp_matches_utf8, regexp_matches_utf8_scalar, regexp_not_matches_utf8, regexp_not_matches_utf8_scalar and test cases.

Are there any user-facing changes?

@github-actions github-actions bot added the arrow Changes to the arrow crate label Aug 22, 2021
@codecov-commenter
Copy link

codecov-commenter commented Aug 22, 2021

Codecov Report

Merging #706 (b56b6ef) into master (8308615) will increase coverage by 0.02%.
The diff coverage is 91.74%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #706      +/-   ##
==========================================
+ Coverage   82.46%   82.48%   +0.02%     
==========================================
  Files         168      168              
  Lines       47419    47528     +109     
==========================================
+ Hits        39104    39204     +100     
- Misses       8315     8324       +9     
Impacted Files Coverage Δ
arrow/src/compute/kernels/comparison.rs 95.08% <91.74%> (-0.76%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 8308615...b56b6ef. Read the comment docs.

@alamb
Copy link
Contributor

alamb commented Aug 23, 2021

Thanks @b41sh -- I plan to review this tomorrow morning

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @b41sh -- I think this PR is looking quite good -- the code and changes are easy to understand and well tested.

I do wonder about the name of these kernels as well as the need to have explicit "not matches" kernels. I am hoping for others to give feedback as well

}

/// Perform SQL `array ~ regex_array` operation on [`StringArray`] / [`LargeStringArray`].
pub fn regexp_matches_utf8<OffsetSize: StringOffsetSizeTrait>(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As mentioned in #697 (comment), the name regexp_matches_utf8 is pretty similar to regexp_match which I think might cause confusion

It looks like the rust regular expression library uses is_match so I wonder if following that model might be clearer?

So instead of regexp_matches_utf8 maybe calling this regexp_is_match_utf8 (and similarly below).

@seddonm1 / @nevi-me / @jorgecarleitao do you have any thoughts regarding the naming of a function that checks if values match regular expressions?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with you, regexp_is_match_utf8 is a little bit better.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like the https://www.postgresql.org/docs/current/functions-matching.html#FUNCTIONS-POSIX-TABLE functionality which uses an Operator ~ to instead of a function. I think the kernel names are fine but it would be nice if we could retain the postgres semantics in order to achieve as much compatibility as possible.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

imo achieving full compatibility with postgres is beyond the scope of the arrow-rs crate. IMO the least surprising option here is an implementation of the regex semantics declared in the regex crate. Users interested in the postgres semantics are a subset of all users - postgres semantics is something relatively specific to databases.

Also, afaik achieving postgres semantics is relatively difficult as it requires a regex parser to act as the postgres one (but I may be mistaken here).

A possible approach is to move both of these kernels to datafusion and keep a non-postgres, regex-crate specific one here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed jorge, sorry I had missed this was being added at the Arrow crate level. I had previously implemented the Postgres regex behavior in the Datafusion layer and agree that due to differences in regex semantics maybe this does not belong in the arrow-rs crate.

My main point would be from the user experience point of view it would be good to implement the 'abcd' ~ 'bc' style (even if this requires sqlparser updates) rather than implement new non-standard SQL functions.

}

/// Perform SQL `array !~ regex_array` operation on [`StringArray`] / [`LargeStringArray`].
pub fn regexp_not_matches_utf8<OffsetSize: StringOffsetSizeTrait>(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we need the negative variants (e.g. do we need regexp_not_matches_utf8) kernel? Perhaps we could simply use not(regexp_matches_utf8(..))

@b41sh
Copy link
Contributor Author

b41sh commented Aug 24, 2021

Thank you @b41sh -- I think this PR is looking quite good -- the code and changes are easy to understand and well tested.

I do wonder about the name of these kernels as well as the need to have explicit "not matches" kernels. I am hoping for others to give feedback as well

@alamb I have added some comment and remove function regexp_not_matches_utf8. PTAL

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is looking great @b41sh -- thank you. I agree with @jorgecarleitao and @seddonm1 that this kernel should follow the model of the rust regex crate (which I think it does)

I'll plan to merge this tomorrow (and include it as part of arrow-rs 5.3.0) unless anyone has additional comments. Just let me know

///
/// `flags_array` are optional [`StringArray`] / [`LargeStringArray`] flag, which allow
/// special search modes, such as case insensitive and multi-line mode.
/// See the documentation [here](https://docs.rs/regex/1.5.4/regex/#grouping-and-flags)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@alamb alamb merged commit 48c7529 into apache:master Aug 26, 2021
alamb pushed a commit that referenced this pull request Aug 26, 2021
* impl regexp_matches_utf8

* fix clippy

* add bench

* optimize
alamb added a commit that referenced this pull request Aug 26, 2021
* impl regexp_matches_utf8

* fix clippy

* add bench

* optimize

Co-authored-by: baishen <baishen2009@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

arrow Changes to the arrow crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add optimized filter kernel for regular expression matching

5 participants