
Evaluate documents in parallel #65

Closed
osma opened this issue Mar 20, 2018 · 4 comments · Fixed by #418
osma commented Mar 20, 2018

We could perhaps speed up the evaldir command by making use of multiple CPUs, using the multiprocessing module. There would be a pool of workers (as many as there are CPU cores) and documents would be handed to the workers for evaluation.
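A minimal sketch of that idea using the standard-library `multiprocessing` module. Note that `evaluate_document` and the `.txt` file filter here are placeholders for illustration, not Annif's actual API:

```python
import os
from multiprocessing import Pool, cpu_count

def evaluate_document(path):
    # Placeholder for per-document evaluation; a real implementation would
    # compare the suggested subjects against the gold-standard annotations.
    with open(path) as f:
        text = f.read()
    return path, len(text.split())

def evaluate_directory(dirpath):
    paths = [os.path.join(dirpath, name)
             for name in sorted(os.listdir(dirpath))
             if name.endswith(".txt")]
    # One worker per CPU core; idle workers pick up the next document.
    with Pool(processes=cpu_count()) as pool:
        return pool.map(evaluate_document, paths)
```

`Pool.map` blocks until all documents are done, which matches a batch command like evaldir; for streaming results as they complete, `imap_unordered` would be the usual alternative.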

@osma osma added this to the Long term milestone Mar 20, 2018

osma commented Mar 23, 2018

Logging is a bit of a challenge, but there is https://pypi.python.org/pypi/multiprocessing-logging/ which might help.
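Besides the third-party package above, the standard library's `QueueHandler`/`QueueListener` pair (Python 3.2+) solves the same problem: workers send log records over a queue and a single listener in the main process writes them out. A minimal sketch, with an illustrative `evaluate` function:

```python
import logging
import logging.handlers
import multiprocessing

def worker_init(log_queue):
    # Each worker forwards its log records to the main process via the queue,
    # so handlers never write concurrently from multiple processes.
    root = logging.getLogger()
    root.handlers = [logging.handlers.QueueHandler(log_queue)]
    root.setLevel(logging.INFO)

def evaluate(doc):
    logging.getLogger(__name__).info("evaluating %s", doc)
    return doc.upper()

def run(docs):
    log_queue = multiprocessing.Queue()
    # The listener runs in the main process and writes records safely.
    listener = logging.handlers.QueueListener(
        log_queue, logging.StreamHandler())
    listener.start()
    with multiprocessing.Pool(2, worker_init, (log_queue,)) as pool:
        results = pool.map(evaluate, docs)
    listener.stop()
    return results
```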


kinow commented May 12, 2019

For me the evaluation of a directory took a while, but I suspect over 80% of the time was loading the vectorizer from disk.

Is the idea here to perhaps first pre-load the model, and then evaluate the documents in parallel?


osma commented May 13, 2019

@kinow Yes, the initialization time (loading the vectorizer, models etc.) tends to dominate when you evaluate a small number of documents. Possibly some of this initialization could be parallelized as well...

Anyway, the idea of this feature was to load the model first, then evaluate the documents in parallel. With a large enough set of documents (thousands?) there should be a significant reduction in the overall time spent on evaluation.
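The "load first, then evaluate in parallel" pattern maps naturally onto `Pool`'s `initializer` argument: the expensive load runs once per worker process instead of once per document. A sketch with a stand-in `load_model` (not Annif's real loading code):

```python
from multiprocessing import Pool

_model = None  # populated once in each worker process

def load_model(model_id):
    # Stand-in for the expensive vectorizer/model load discussed above.
    return {"id": model_id, "weights": list(range(1000))}

def init_worker(model_id):
    # Runs once per worker: the load cost is paid a fixed number of times
    # (once per CPU core), not once per document.
    global _model
    _model = load_model(model_id)

def evaluate(doc):
    # Uses the worker-local model; no re-loading per document.
    return "%s:%d" % (_model["id"], len(doc))

def evaluate_all(docs, model_id, workers=4):
    with Pool(workers, initializer=init_worker, initargs=(model_id,)) as pool:
        return pool.map(evaluate, docs)
```

The module-level global is the conventional way to share per-worker state with `multiprocessing`, since the pool cannot pass large unpicklable objects into each task cheaply.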


osma commented May 27, 2020

The main thing holding this back currently is that Annif projects are too tightly tied with the Flask current_app object. This is especially problematic for ensemble backends (and vw_multi) that need to access other projects. They have to do that via current_app. This works fine in the main thread, but not in subprocesses launched by e.g. multiprocessing or joblib.Parallel.

I'm working on decoupling the Annif internals from Flask and will prepare a PR on that soon. I also have a mostly-working implementation on parallel evaluation of documents but it needs a bit more work (and perhaps first some more refactorings to avoid passing around large objects).
