
Message extraction is slow for sufficiently large repos (consider parallelizing) #253

Open
ENuge opened this issue Sep 24, 2015 · 3 comments

Comments


ENuge commented Sep 24, 2015

A large repo I am working with currently takes ~30 seconds to extract with xgettext (https://www.gnu.org/savannah-checkouts/gnu/gettext/manual/html_node/xgettext-Invocation.html). The same scan with pybabel extract takes ~2 min 49 sec.

The extraction process is, in both cases, blacklist-based, with the same directories blacklisted. That is, we scan the entire repo for new messages, except for paths matched by a list of [ignore: foo/**.py] blocks.
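For reference, a blacklist like that is typically expressed in a Babel method-mapping configuration file; a minimal sketch (the directory names here are placeholders, not the actual repo layout):

```ini
# babel.cfg -- extraction method mapping (paths are illustrative)
[ignore: foo/**.py]
[python: **.py]
```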


ENuge commented Sep 24, 2015

Without having looked too deeply into the code, I think it should be possible to fan the calls that extract_from_dir(..) makes to extract_from_file(..) out to multiple processes, one per filename (somewhere near here: https://github.com/python-babel/babel/blob/master/babel/messages/extract.py#L143).
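A minimal sketch of that idea, using a multiprocessing.Pool with one task per file. Note that extract_one below is a hypothetical stand-in for babel.messages.extract.extract_from_file, used only to keep the sketch self-contained; the real per-file extraction call would go in its place:

```python
import multiprocessing


def extract_one(filename):
    """Hypothetical stand-in for babel.messages.extract.extract_from_file:
    collect (filename, lineno, text) for lines containing a gettext-style
    call marker. Illustration only, not real parsing."""
    messages = []
    with open(filename, encoding="utf-8") as f:
        for lineno, line in enumerate(f, 1):
            if "_(" in line:
                messages.append((filename, lineno, line.strip()))
    return messages


def extract_parallel(filenames, processes=None):
    """Fan per-file extraction out across worker processes.

    Results come back to the parent process, so the single write to the
    output catalog stays serial."""
    with multiprocessing.Pool(processes) as pool:
        results = pool.map(extract_one, filenames)
    return [msg for file_messages in results for msg in file_messages]
```

The write phase stays in the parent process on purpose: only the CPU-bound scanning is parallelized, and the output file is still written by a single process.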


ENuge commented Sep 25, 2015

I poked at the extraction code, wanting to get a sense of how much time is spent scanning and extracting strings versus actually writing them to the new file.

Hypothesis (and hope): scanning/extracting takes significantly longer. That part of the process we can parallelize; we can't (easily) have concurrent writes to the same output file.

Results:
Total time extracting: 159.926753521 secs
Total time writing: 0.707601547241 secs
Code I ran: http://pastebin.com/5uSC15MN
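The pastebin above contains the actual measurement code; the shape of such a measurement is roughly a timing wrapper like this (a sketch under my own assumptions, not the pastebin contents):

```python
import time


def timed(fn, *args, **kwargs):
    """Run fn(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.monotonic()
    result = fn(*args, **kwargs)
    return result, time.monotonic() - start

# Usage sketch: time the two phases separately, e.g.
#   messages, t_extract = timed(run_extraction, dirname)
#   _, t_write = timed(write_catalog, messages, outfile)
# where run_extraction and write_catalog are hypothetical names for
# the scan/extract step and the catalog-write step.
```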


etanol commented Sep 25, 2015

Thanks for the numbers, this looks like an interesting case for optimisation. Once our workflow is a bit smoother, I will take a look at it.

ENuge pushed a commit to ENuge/babel that referenced this issue Jan 14, 2016
One can now supply a filename or a directory to be extracted. For
large codebases, this allows the consumer to optimize their
string extraction process by, for instance, only supplying the
files that have actually changed on the given dev's branch
compared to master.

Relates to python-babel#253. I don't want to say "fixes", but this
makes further optimization unnecessary for most use cases.
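With that change, a consumer can feed pybabel extract only the files touched on a branch, e.g. the output of git diff --name-only master. A sketch of that step, assuming a git checkout; filter_changed and changed_python_files are hypothetical helper names, not Babel API:

```python
import fnmatch
import subprocess


def filter_changed(diff_output, pattern="*.py"):
    """Keep only the paths from `git diff --name-only` output that
    match the given glob pattern."""
    return [path for path in diff_output.splitlines()
            if fnmatch.fnmatch(path, pattern)]


def changed_python_files(base="master"):
    """List Python files changed relative to `base`.

    Assumes it runs inside a git checkout; the resulting paths could
    then be passed to the extraction entry point one by one."""
    out = subprocess.check_output(
        ["git", "diff", "--name-only", base], text=True)
    return filter_changed(out)
```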