Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEAT] Add .str.count_matches() #2580

Merged
merged 6 commits into from
Jul 30, 2024
Merged

[FEAT] Add .str.count_matches() #2580

merged 6 commits into from
Jul 30, 2024

Conversation

Vince7778
Copy link
Contributor

Adds a method to count the number of appearances of some patterns in a column of strings. An example usage is for dirty word counting for preprocessing data.

@github-actions github-actions bot added the documentation Improvements or additions to documentation label Jul 29, 2024
@Vince7778 Vince7778 changed the title Add .str.count_matches() [FEAT] Add .str.count_matches() Jul 29, 2024
@Vince7778 Vince7778 requested a review from jaychia July 29, 2024 22:50
@github-actions github-actions bot added the enhancement New feature or request label Jul 29, 2024
Comment on lines +1388 to +1392
pub fn count_matches(
&self,
patterns: &Self,
whole_word: bool,
case_sensitive: bool,
Copy link
Collaborator

@universalmind303 universalmind303 Jul 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if it'd be simpler to instead accept a regex?

the regex crate uses aho-corasick under the hood, so the perf implications should be negligible as long as we're only compiling the regex once.

Regex would likely be a lot more intuitive/flexible from an end user perspective as well.

res = s.str.count_matches('\b(fox|over|lazy dog|dog)\b').to_pylist()

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I'll change it to that. I didn't originally do this because I was worried about performance - I'll run some tests to make sure it isn't too affected.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately it does seem like using a regex is around 7x slower. I guess it just can't handle the large pattern created by concatenating the list of strings. So I think I'll keep it this way for now.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even though it's slower, I think we should still allow for regex just for usability sake.

We can do this in a follow up PR, and have the python frontend map to a different backend implementation

count_matches('<text>') -> count_matches
count_matches(r'<pattern>') -> count_matches_regex

Copy link

codecov bot commented Jul 30, 2024

Codecov Report

Attention: Patch coverage is 81.02190% with 26 lines in your changes missing coverage. Please review.

Please upload report for BASE (main@ddabd34). Learn more about missing BASE report.

Files Patch % Lines
src/daft-functions/src/count_matches.rs 64.91% 20 Missing ⚠️
daft/series.py 66.66% 3 Missing ⚠️
src/daft-core/src/array/ops/utf8.rs 92.30% 3 Missing ⚠️
Additional details and impacted files

Impacted file tree graph

@@           Coverage Diff           @@
##             main    #2580   +/-   ##
=======================================
  Coverage        ?   63.99%           
=======================================
  Files           ?      953           
  Lines           ?   108291           
  Branches        ?        0           
=======================================
  Hits            ?    69298           
  Misses          ?    38993           
  Partials        ?        0           
Files Coverage Δ
daft/expressions/expressions.py 93.30% <100.00%> (ø)
src/daft-core/src/python/series.rs 93.95% <100.00%> (ø)
src/daft-core/src/series/ops/utf8.rs 94.00% <100.00%> (ø)
src/daft-functions/src/lib.rs 61.11% <100.00%> (ø)
daft/series.py 89.38% <66.66%> (ø)
src/daft-core/src/array/ops/utf8.rs 93.17% <92.30%> (ø)
src/daft-functions/src/count_matches.rs 64.91% <64.91%> (ø)

@Vince7778 Vince7778 enabled auto-merge (squash) July 30, 2024 21:04
@Vince7778 Vince7778 merged commit 8544b76 into main Jul 30, 2024
44 checks passed
@Vince7778 Vince7778 deleted the conor/str-count-matches branch July 30, 2024 21:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
documentation Improvements or additions to documentation enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants