Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reduce some calls to re.sub #50

Merged
merged 4 commits into from
Nov 13, 2019

Conversation

dakinggg
Copy link
Contributor

@dakinggg dakinggg commented Nov 8, 2019

So calls to re.compile are not a problem. The main thing slowing it down is lots of calls to re.sub in abbreviation_replacer.py. I reduced some of these calls which speeds it up by a factor of ~3-3.5x on my machine, for the specific (longish) document that I tested with. I also included the script I used to test timing. Given that you are much more familiar with the codebase, see if my changes look reasonable, but all the tests do still pass. There are probably some more ways to speed up the calls in that file.

@nipunsadvilkar
Copy link
Owner

LGTM! Yes, I concur there are a lot of inefficient loops and calls happening in abbreviation_replacer.py. Since I was porting it from ruby, I thought of keeping the same loops and calls in python to get tests to pass. In a few cases, I tried to be idiomatic wherever possible, but still, there are a lot of places where we can surely improve pysbd.

It would be nice if you could list those speed up calls in the comment here so we can tackle them one by one.

@dakinggg
Copy link
Contributor Author

I was focusing on the parts that were the majority of the time spent. Figured no need to optimize parts that weren't taking up much of the time. Here is some of the output from cProfile:

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    1.484    1.484 <string>:1(<module>)
      366    0.003    0.000    0.258    0.001 abbreviation_replacer.py:10(replace_pre_number_abbr)
     1492    0.004    0.000    1.145    0.001 abbreviation_replacer.py:103(scan_for_replacements)
      222    0.002    0.000    0.157    0.001 abbreviation_replacer.py:19(replace_prepositive_abbr)
      904    0.007    0.000    0.726    0.001 abbreviation_replacer.py:28(replace_period_of_abbr)
        1    0.000    0.000    0.013    0.013 abbreviation_replacer.py:43(replace_abbreviation_as_sentence_boundary)
        1    0.000    0.000    0.000    0.000 abbreviation_replacer.py:54(__init__)
        1    0.000    0.000    1.400    1.400 abbreviation_replacer.py:58(replace)
        1    0.000    0.000    0.001    0.001 abbreviation_replacer.py:70(replace_multi_period_abbreviations)
        4    0.000    0.000    0.000    0.000 abbreviation_replacer.py:71(mpa_replace)
        1    0.005    0.005    1.380    1.380 abbreviation_replacer.py:83(search_for_abbreviations_in_string)

Didn't want to cause code churn/potentially introduce bugs by over optimizing other pieces of code. I also don't actually have clear ideas at the moment for how to make those slow pieces even faster.

@dakinggg
Copy link
Contributor Author

Any hesitation on releasing this as is?

@nipunsadvilkar
Copy link
Owner

Although, I wanted to make more optimizations in abbreviation_replacer.py but didn't find much time to do it thoroughly so releasing it for now and we can do further optimizations in the future.

Just to segregate scripts as per purpose, moving your test_timing_script.py to examples for further reference.

@nipunsadvilkar nipunsadvilkar merged commit f7c640f into nipunsadvilkar:master Nov 13, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants