-
Notifications
You must be signed in to change notification settings - Fork 1.1k
Implement regexp_matches_utf8
#706
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Codecov Report
@@ Coverage Diff @@
## master #706 +/- ##
==========================================
+ Coverage 82.46% 82.48% +0.02%
==========================================
Files 168 168
Lines 47419 47528 +109
==========================================
+ Hits 39104 39204 +100
- Misses 8315 8324 +9
Continue to review full report at Codecov.
|
|
Thanks @b41sh -- I plan to review this tomorrow morning |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @b41sh -- I think this PR is looking quite good -- the code and changes are easy to understand and well tested.
I do wonder about the name of these kernels as well as the need to have explicit "not matches" kernels. I am hoping for others to give feedback as well
| } | ||
|
|
||
| /// Perform SQL `array ~ regex_array` operation on [`StringArray`] / [`LargeStringArray`]. | ||
| pub fn regexp_matches_utf8<OffsetSize: StringOffsetSizeTrait>( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As mentioned in #697 (comment), the name regexp_matches_utf8 is pretty similar to regexp_match which I think might cause confusion
It looks like the rust regular expression library uses is_match so I wonder if following that model might be clearer?
So instead of regexp_matches_utf8 maybe calling this regexp_is_match_utf8 (and similarly below).
@seddonm1 / @nevi-me / @jorgecarleitao do you have any thoughts regarding the naming of a function that checks if values match regular expressions?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with you, regexp_is_match_utf8 is a little bit better.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks like the https://www.postgresql.org/docs/current/functions-matching.html#FUNCTIONS-POSIX-TABLE functionality which uses an Operator ~ to instead of a function. I think the kernel names are fine but it would be nice if we could retain the postgres semantics in order to achieve as much compatibility as possible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
imo achieving full compatibility with postgres is beyond the scope of the arrow-rs crate. IMO the least surprising option here is an implementation of the regex semantics declared in the regex crate. Users interested in the postgres semantics are a subset of all users - postgres semantics is something relatively specific to databases.
Also, afaik achieving postgres semantics is relatively difficult as it requires a regex parser to act as the postgres one (but I may be mistaken here).
A possible approach is to move both of these kernels to datafusion and keep a non-postgres, regex-crate specific one here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed jorge, sorry I had missed this was being added at the Arrow crate level. I had previously implemented the Postgres regex behavior in the Datafusion layer and agree that due to differences in regex semantics maybe this does not belong in the arrow-rs crate.
My main point would be from the user experience point of view it would be good to implement the 'abcd' ~ 'bc' style (even if this requires sqlparser updates) rather than implement new non-standard SQL functions.
| } | ||
|
|
||
| /// Perform SQL `array !~ regex_array` operation on [`StringArray`] / [`LargeStringArray`]. | ||
| pub fn regexp_not_matches_utf8<OffsetSize: StringOffsetSizeTrait>( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if we need the negative variants (e.g. do we need regexp_not_matches_utf8) kernel? Perhaps we could simply use not(regexp_matches_utf8(..))
@alamb I have added some comment and remove function |
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is looking great @b41sh -- thank you. I agree with @jorgecarleitao and @seddonm1 that this kernel should follow the model of the rust regex crate (which I think it does)
I'll plan to merge this tomorrow (and include it as part of arrow-rs 5.3.0) unless anyone has additional comments. Just let me know
| /// | ||
| /// `flags_array` are optional [`StringArray`] / [`LargeStringArray`] flag, which allow | ||
| /// special search modes, such as case insensitive and multi-line mode. | ||
| /// See the documentation [here](https://docs.rs/regex/1.5.4/regex/#grouping-and-flags) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
Which issue does this PR close?
Closes #697
Rationale for this change
What changes are included in this PR?
Implement arrow compute kernel comparison functions
regexp_matches_utf8,regexp_matches_utf8_scalar,regexp_not_matches_utf8,regexp_not_matches_utf8_scalarand test cases.Are there any user-facing changes?