Message extraction is slow for sufficiently large repos (consider parallelizing) #253
Comments
Without having looked too deeply into the code, I think it should be possible to split the calls from extract_from_dir(..) to extract_from_file(..) across multiple processes, one per filename (somewhere near here: https://github.com/python-babel/babel/blob/master/babel/messages/extract.py#L143).
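For illustration, a minimal sketch of what that could look like, assuming the per-file extraction can be fanned out to a process pool while the output catalog is assembled in the parent process. The 'python' method name, worker count, and file list are placeholders, and the real extract_from_file signature takes more arguments than shown here:

```python
# Sketch only, not Babel's actual implementation: run extract_from_file
# in worker processes and collect the results in the parent.
from multiprocessing import Pool

from babel.messages.extract import extract_from_file


def _extract_one(filename):
    # Each worker scans a single source file; all writing of the output
    # catalog stays in the parent, so there are no concurrent writes.
    return filename, list(extract_from_file('python', filename))


def extract_parallel(filenames, processes=4):
    results = {}
    with Pool(processes=processes) as pool:
        for filename, messages in pool.imap_unordered(_extract_one, filenames):
            results[filename] = messages
    return results


if __name__ == '__main__':
    for filename, messages in extract_parallel(['app/views.py', 'app/models.py']).items():
        print(filename, len(messages))
```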
I poked at the extraction code, wanting to get a sense of how much time is spent scanning and extracting strings versus actually writing them to the new file. Hypothesis (and hope): scanning/extracting takes significantly longer. That part of the process we can parallelize; we can't (easily) have concurrent writes to the same output file. Results:
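As a rough way to get numbers like these, something along the following lines (not part of Babel) times the two phases separately. It assumes extract_from_dir yields (filename, lineno, message, comments, context) tuples, and the repo path and output name are placeholders:

```python
# Rough timing harness: separate the scan/extract phase from the
# catalog build/write phase.
import time

from babel.messages.catalog import Catalog
from babel.messages.extract import extract_from_dir
from babel.messages.pofile import write_po

start = time.perf_counter()
extracted = list(extract_from_dir('path/to/repo'))  # scan + extract only
scan_time = time.perf_counter() - start

start = time.perf_counter()
catalog = Catalog()
for filename, lineno, message, comments, context in extracted:
    catalog.add(message, locations=[(filename, lineno)],
                auto_comments=comments, context=context)
with open('messages.pot', 'wb') as outfile:
    write_po(outfile, catalog)  # build and write the output file
write_time = time.perf_counter() - start

print(f'scan/extract: {scan_time:.1f}s  write: {write_time:.1f}s')
```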
Thanks for the numbers, this looks like an interesting case for optimisation. Once our workflow is a bit smoother, I will take a look at it.
One can now supply a filename or a directory to be extracted. For large codebases, this lets the consumer optimize the string-extraction process by, for instance, supplying only the files that have actually changed on the given dev's branch compared to master. Relates to python-babel#253. I don't want to say "fixes", but it makes further optimization unnecessary for most use cases.
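As a hypothetical example of how a consumer could use that, the snippet below asks git for the Python files changed relative to master and extracts only from those; the git invocation and the 'python' extraction method are illustrative, not part of Babel:

```python
# Extract only from files touched on the current branch (sketch).
import subprocess

from babel.messages.extract import extract_from_file

changed = subprocess.run(
    ['git', 'diff', '--name-only', 'master'],
    capture_output=True, text=True, check=True,
).stdout.splitlines()

for filename in changed:
    if not filename.endswith('.py'):
        continue
    for extracted in extract_from_file('python', filename):
        print(filename, extracted)
```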
A large repo I am working with currently takes ~30 seconds to build its message catalog with xgettext (https://www.gnu.org/savannah-checkouts/gnu/gettext/manual/html_node/xgettext-Invocation.html). The same scan with pybabel extract takes ~2 min 49 secs.
The extraction process is, in both cases, blacklist-based, with the same directories blacklisted. That is, we scan the entire repo for new messages except the files matched by a list of [ignore: foo/**.py] blocks.
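For reference, a mapping file of that shape might look like the following; the ignored paths are placeholders, and the exact patterns depend on the repo:

```ini
# babel.cfg: scan all Python sources except the blacklisted directories.
[ignore: vendor/**]
[ignore: foo/**.py]
[python: **.py]
```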