
perf: speed up g2p processing by caching the results #464

Merged
1 commit merged into main on Jun 13, 2024

Conversation

joanise
Member

@joanise joanise commented Jun 13, 2024

PR Goal?

Processing g2p is pretty slow because g2p-ing text is fairly expensive; the library was not implemented to be super fast.

However, while processing a corpus, the number of types should be much smaller than the number of tokens (at least for languages that are not heavily agglutinative or morphologically rich), so we can speed up processing by caching the results on a token-by-token basis.

For English, the results are great: the 13k sentences in the LJ corpus were processed at a rate of 700 items per second before this PR, and at 5k items per second with it.

For agglutinative and morphologically rich languages, benefits will be smaller, but it's really only for large and very large corpora that this optimization matters. And sometimes, we'll just have to be patient anyway.
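The token-level caching idea can be sketched with a memoized wrapper around the g2p call. This is an illustrative sketch, not the actual implementation in `everyvoice/text/phonemizer.py`: `expensive_g2p` is a hypothetical stand-in for the real (slow) conversion.

```python
from functools import lru_cache


def expensive_g2p(token: str) -> str:
    # Hypothetical stand-in for the real, slow g2p conversion;
    # here we just fake a "phonemization" for illustration.
    return "-".join(token)


@lru_cache(maxsize=None)  # one cache entry per distinct token (type)
def cached_g2p(token: str) -> str:
    return expensive_g2p(token)


def g2p_sentence(sentence: str) -> list[str]:
    # Caching per token keeps the output identical while reusing results
    # across the many repeated tokens in a large corpus.
    return [cached_g2p(tok) for tok in sentence.split()]
```

Because real corpora repeat common tokens constantly, most lookups after warm-up are cache hits, which is where the ~7x speedup on LJ comes from.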

Fixes?

Fixes: #446

Feedback sought?

I tested on English only. Please test on data from some other languages to make sure I didn't break anything. The filelist output of the new-project wizard should be identical before and after, with only the speed of the "Processing your characters:" step changing.

I didn't explicitly test generating phonological features, which depends on the same code, so if someone could do that it would be great.

Priority?

normal

Tests added?

Was already well covered by unit testing, for which I am very grateful.

How to test?

Run the wizard and pick "characters" as the representation.

Also, trigger the use of calculate_phonological_features(), which I don't know how to do off the top of my head.

Confidence?

medium-high:

  • wizard: high: I'm confident the wizard is good
  • pfs: medium: I'm less confident for pfs since I have not exercised it except via existing unit tests, if there are any

But overall high nonetheless, because my first implementation had a bug and tripped up three different unit test cases. Once that got fixed, the output test-filelist.psv from LJ through the wizard was bit-for-bit identical before and after.

Version change?

no

Related PRs?

no

The cache is implemented on a token basis to keep the results
identical but maximize reuse potential.

Fixes: #446

CLI load time: 0:00.24
Pull Request HEAD: 32a48131c4ad41b39975fbd198cd0645258e7a8c
Imports that take more than 0.1 s:
import time: self [us] | cumulative | imported package


codecov bot commented Jun 13, 2024

Codecov Report

Attention: Patch coverage is 96.87500% with 1 line in your changes missing coverage. Please review.

Project coverage is 74.28%. Comparing base (e044cd5) to head (32a4813).

Files Patch % Lines
everyvoice/text/phonemizer.py 96.87% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #464      +/-   ##
==========================================
+ Coverage   74.15%   74.28%   +0.13%     
==========================================
  Files          44       44              
  Lines        2766     2780      +14     
  Branches      428      430       +2     
==========================================
+ Hits         2051     2065      +14     
  Misses        630      630              
  Partials       85       85              


Member

@roedoejet roedoejet left a comment


Awesome! Tested, and it's much faster (1.5 hrs -> 5 minutes on a 110k-sentence French corpus).

Another speed issue, about validatewavs, was raised with @joanise directly.

@joanise
Member Author

joanise commented Jun 13, 2024

Discussion with AP about memory:

  • This PR never clears the cache. EJ Q: is that going to be an issue? Not likely, given the other structures we keep in memory.
  • Test done with the 110k sentences in the cml_tts_dataset_french_v0.1 corpus: even after running the whole wizard, we're still using only about half a GB of RAM, so memory is not a problem.
  • I did a quick hack to clear the cache, and it actually had no effect at all: the memory did not get released.
  • So we declared the whole question YAGNI, but I'm leaving this note here in case we wonder about it later.
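If bounding cache memory ever did become necessary, the standard approach would be an LRU cache with a size limit rather than manual clearing. This is a hedged sketch of that alternative, not what this PR does; the `maxsize` value and the placeholder g2p body are made up for illustration.

```python
from functools import lru_cache


@lru_cache(maxsize=100_000)  # evict least-recently-used entries beyond 100k tokens
def cached_g2p(token: str) -> str:
    # Placeholder for the real g2p call.
    return token.upper()


cached_g2p("hello")
# cache_clear() drops the Python-level references, but as noted in the
# discussion above, CPython may not return the freed memory to the OS,
# which is consistent with the "clearing had no effect" observation.
cached_g2p.cache_clear()
```

With a bounded `maxsize`, memory use plateaus regardless of corpus size, at the cost of occasional recomputation for evicted rare tokens.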

@joanise joanise merged commit 689f324 into main Jun 13, 2024
6 checks passed
@joanise joanise deleted the dev.ej/446-speed-up-g2p branch June 13, 2024 21:04