[FEAT] Add .str.count_matches()
#2580
pub fn count_matches(
    &self,
    patterns: &Self,
    whole_word: bool,
    case_sensitive: bool,
I wonder if it'd be simpler to accept a regex instead? The regex crate uses aho-corasick under the hood, so the performance implications should be negligible as long as we compile the regex only once. A regex would likely also be more intuitive and flexible from an end user's perspective.
res = s.str.count_matches(r'\b(fox|over|lazy dog|dog)\b').to_pylist()
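To make the suggestion concrete, here is a rough sketch of what a regex-based count would return for that pattern. It uses Python's standard `re` module as a stand-in, not the Rust regex crate or this PR's code; `count_matches_regex` and `rows` are hypothetical names, and passing `None` through unchanged is an assumption about null handling.

```python
import re

# Compile once and reuse for every row; recompiling per row would waste work.
PATTERN = re.compile(r"\b(fox|over|lazy dog|dog)\b")


def count_matches_regex(values):
    """Count non-overlapping regex matches per string; None is passed through."""
    return [
        None if v is None else sum(1 for _ in PATTERN.finditer(v))
        for v in values
    ]


rows = ["the quick brown fox jumps over the lazy dog", "dogs and foxes", None]
print(count_matches_regex(rows))  # [3, 0, None]
```

Note that the alternation consumes "lazy dog" as a single match, so the trailing "dog" is not counted again, and `\b` keeps "dogs" and "foxes" from matching.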
Yeah, I'll change it to that. I didn't originally do this because I was worried about performance - I'll run some tests to make sure it isn't too affected.
Unfortunately it does seem like using a regex is around 7x slower. I guess it just can't handle the large pattern created by concatenating the list of strings. So I think I'll keep it this way for now.
Even though it's slower, I think we should still allow regex for usability's sake.
We can do this in a follow-up PR and have the Python frontend map to a different backend implementation:
count_matches('<text>') -> count_matches
count_matches(r'<pattern>') -> count_matches_regex
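One caveat for that mapping: at runtime Python cannot tell `r'...'` from `'...'`, so the frontend would need some explicit signal to pick a backend. A minimal sketch, assuming a hypothetical `regex` keyword argument and pure-Python stand-ins for both backends (none of these names come from the PR):

```python
import re


def count_literal(values, patterns):
    """Hypothetical stand-in for the literal backend: plain substring counts."""
    return [sum(v.count(p) for p in patterns) for v in values]


def count_regex(values, pattern):
    """Hypothetical stand-in for a regex backend; the pattern is compiled once."""
    compiled = re.compile(pattern)
    return [sum(1 for _ in compiled.finditer(v)) for v in values]


def count_matches(values, patterns, *, regex=False):
    """Frontend entry point: one name, two backends, chosen by the `regex` flag."""
    if regex:
        if not isinstance(patterns, str):
            raise TypeError("regex=True expects a single pattern string")
        return count_regex(values, patterns)
    if isinstance(patterns, str):
        patterns = [patterns]  # allow a single literal as shorthand
    return count_literal(values, patterns)


print(count_matches(["a dog and a dog", "cat"], "dog"))
print(count_matches(["a dog and a dog", "cat"], r"\b(dog|cat)\b", regex=True))
```

An explicit flag keeps the fast literal path as the default and makes the slower regex path opt-in.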
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #2580 +/- ##
=======================================
Coverage ? 63.99%
=======================================
Files ? 953
Lines ? 108291
Branches ? 0
=======================================
Hits ? 69298
Misses ? 38993
Partials ? 0
Adds a method that counts how many times any of a set of patterns appears in each string of a column. One example use is counting dirty words when preprocessing data.
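As an illustration of the intended behavior, the following pure-Python sketch counts literal patterns per string with `whole_word` and `case_sensitive` switches mirroring the parameters in the Rust signature. The exact semantics here are assumptions rather than the PR's implementation: each pattern is counted independently, matches of the same pattern do not overlap, and a "word" character is alphanumeric or an underscore. The name `count_matches_reference` is hypothetical.

```python
def _is_whole_word(text, start, end):
    """True if text[start:end] is not glued to a word character on either side."""
    def is_word_char(ch):
        return ch.isalnum() or ch == "_"

    before_ok = start == 0 or not is_word_char(text[start - 1])
    after_ok = end == len(text) or not is_word_char(text[end])
    return before_ok and after_ok


def count_matches_reference(values, patterns, whole_word=False, case_sensitive=True):
    """Per string, sum the occurrences of every pattern (assumed semantics)."""
    if not case_sensitive:
        patterns = [p.lower() for p in patterns]
    patterns = [p for p in patterns if p]  # empty patterns would match everywhere
    counts = []
    for text in values:
        haystack = text if case_sensitive else text.lower()
        total = 0
        for pat in patterns:
            pos = haystack.find(pat)
            while pos != -1:
                end = pos + len(pat)
                if not whole_word or _is_whole_word(haystack, pos, end):
                    total += 1
                    pos = haystack.find(pat, end)  # skip past the counted match
                else:
                    pos = haystack.find(pat, pos + 1)  # rejected; try next position
        counts.append(total)
    return counts


words = ["The fox and the Dog", "dogma", "dog dog"]
print(count_matches_reference(words, ["dog", "fox"]))
print(count_matches_reference(words, ["dog", "fox"], whole_word=True, case_sensitive=False))
```

With the defaults, "dogma" counts as a hit for "dog"; with `whole_word=True` it does not, which is usually what a dirty-word filter wants.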