
Message extraction is slow for sufficiently large repos (consider parallelizing) #253

Open
ENuge opened this issue Sep 24, 2015 · 3 comments

Comments


ENuge commented Sep 24, 2015

A large repo I am working with currently takes ~30 seconds to extract with xgettext (https://www.gnu.org/savannah-checkouts/gnu/gettext/manual/html_node/xgettext-Invocation.html). The same scan with pybabel extract takes ~2 min 49 sec.

The extraction process is, in both cases, blacklist-based, with the same directories blacklisted. That is, we scan the entire repo for new messages, except for paths matched by a list of [ignore: foo/**.py] blocks.
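For reference, a blacklist like that is typically expressed in a Babel method-mapping configuration file; a minimal sketch (the directory names here are placeholders, not the actual repo layout):

```ini
# babel.cfg -- extraction method mapping (paths are illustrative)
[ignore: foo/**.py]
[python: **.py]
```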


ENuge commented Sep 24, 2015

Without having looked too deeply into the code, I think it should be possible to fan the calls that extract_from_dir(..) makes to extract_from_file(..) out to multiple processes, one per filename (somewhere near here: https://github.com/python-babel/babel/blob/master/babel/messages/extract.py#L143).
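A minimal sketch of that idea, using a multiprocessing.Pool with one task per file. Note that extract_one below is a hypothetical stand-in for babel.messages.extract.extract_from_file, used only to keep the sketch self-contained; the real per-file extraction call would go in its place:

```python
import multiprocessing


def extract_one(filename):
    """Hypothetical stand-in for babel.messages.extract.extract_from_file:
    collect (filename, lineno, text) for lines containing a gettext-style
    call marker. Illustration only, not real parsing."""
    messages = []
    with open(filename, encoding="utf-8") as f:
        for lineno, line in enumerate(f, 1):
            if "_(" in line:
                messages.append((filename, lineno, line.strip()))
    return messages


def extract_parallel(filenames, processes=None):
    """Fan per-file extraction out across worker processes.

    Results come back to the parent process, so the single write to the
    output catalog stays serial."""
    with multiprocessing.Pool(processes) as pool:
        results = pool.map(extract_one, filenames)
    return [msg for file_messages in results for msg in file_messages]
```

The write phase stays in the parent process on purpose: only the CPU-bound scanning is parallelized, and the output file is still written by a single process.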


ENuge commented Sep 25, 2015

I poked at the extraction code, wanting to get a sense of how much time is spent scanning and extracting strings versus actually writing them to the new file.

Hypothesis (and hope): scanning/extracting takes significantly longer. That part of the process we can parallelize; we can't (easily) have concurrent writes to the same output file.

Results:
Total time extracting: 159.926753521 secs
Total time writing: 0.707601547241 secs
Code I ran: http://pastebin.com/5uSC15MN
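The pastebin above contains the actual measurement code; the shape of such a measurement is roughly a timing wrapper like this (a sketch under my own assumptions, not the pastebin contents):

```python
import time


def timed(fn, *args, **kwargs):
    """Run fn(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.monotonic()
    result = fn(*args, **kwargs)
    return result, time.monotonic() - start

# Usage sketch: time the two phases separately, e.g.
#   messages, t_extract = timed(run_extraction, dirname)
#   _, t_write = timed(write_catalog, messages, outfile)
# where run_extraction and write_catalog are hypothetical names for
# the scan/extract step and the catalog-write step.
```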


etanol commented Sep 25, 2015

Thanks for the numbers, this looks like an interesting case for optimisation. Once our workflow is a bit smoother, I will take a look at it.

ENuge pushed a commit to ENuge/babel that referenced this issue Jan 14, 2016
One can now supply a filename or a directory to be extracted. For
large codebases, this allows the consumer to optimize their
string extraction process by, for instance, only supplying the
files that have actually changed on the given dev's branch
compared to master.

Relates to python-babel#253. I don't want to say "fixes", but this
makes further optimization unnecessary for most use cases.
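With that change, a consumer can feed pybabel extract only the files touched on a branch, e.g. the output of git diff --name-only master. A sketch of that step, assuming a git checkout; filter_changed and changed_python_files are hypothetical helper names, not Babel API:

```python
import fnmatch
import subprocess


def filter_changed(diff_output, pattern="*.py"):
    """Keep only the paths from `git diff --name-only` output that
    match the given glob pattern."""
    return [path for path in diff_output.splitlines()
            if fnmatch.fnmatch(path, pattern)]


def changed_python_files(base="master"):
    """List Python files changed relative to `base`.

    Assumes it runs inside a git checkout; the resulting paths could
    then be passed to the extraction entry point one by one."""
    out = subprocess.check_output(
        ["git", "diff", "--name-only", base], text=True)
    return filter_changed(out)
```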