Multiprocessing in eval command #418

osma · 2020-06-05T11:09:01Z

This PR makes it possible to use multiprocessing to speed up the eval command.

The majority of the changes are actually refactorings to decouple the subject index from the SuggestionResult classes. After the changes, SuggestionResult instances no longer keep a reference to SubjectIndex. Instead, a SubjectIndex is passed as a parameter to individual SuggestionResult methods as necessary. Since properties cannot take parameters, the hits property has been changed to the as_list method and the vector property has been changed to the as_vector method.
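
To illustrate the refactoring described above, here is a minimal sketch of the pattern, not the actual Annif code. The class bodies are simplified assumptions: plain Python lists stand in for whatever vector type the real classes use, and this `SubjectIndex` is a hypothetical stand-in that only maps subject IDs to a URI and a label. The point is that the result object stores scores only, and the index is supplied at call time.

```python
from typing import List, Tuple


class SubjectIndex:
    """Hypothetical stand-in for the real class: maps subject IDs to (URI, label)."""

    def __init__(self, subjects: List[Tuple[str, str]]) -> None:
        self._subjects = subjects

    def __len__(self) -> int:
        return len(self._subjects)

    def __getitem__(self, subject_id: int) -> Tuple[str, str]:
        return self._subjects[subject_id]


class VectorSuggestionResult:
    """Stores only scores, indexed by subject ID; keeps no SubjectIndex reference."""

    def __init__(self, scores: List[float]) -> None:
        self._scores = scores

    def as_vector(self, subject_index: SubjectIndex) -> List[float]:
        # Pad with zeros (or truncate) so the vector matches the index size.
        size = len(subject_index)
        return (self._scores + [0.0] * size)[:size]

    def as_list(self, subject_index: SubjectIndex) -> List[Tuple[str, str, float]]:
        # Resolve subject IDs to (URI, label) only when the caller asks for it.
        hits = [
            (*subject_index[subject_id], score)
            for subject_id, score in enumerate(self.as_vector(subject_index))
            if score > 0.0
        ]
        return sorted(hits, key=lambda hit: hit[2], reverse=True)


index = SubjectIndex([
    ("http://example.org/a", "A"),
    ("http://example.org/b", "B"),
    ("http://example.org/c", "C"),
])
result = VectorSuggestionResult([0.2, 0.0, 0.9])
print(result.as_list(index))
```

Because the result object holds no index reference in this sketch, pickling it for transfer between processes would not drag the whole index along, which is presumably part of the motivation.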

The actual multiprocessing implementation is still very rough and further changes are needed:

make it possible to select the number of parallel jobs (e.g. using a --jobs CLI argument)
don't use multiprocessing module if jobs=1
(try to) optimize VectorSuggestionResult.filter() so it won't return large vectors which cause serialization/deserialization overhead
test with different backends to make sure that they perform well (measure CPU time, wall time, peak memory) and don't have race conditions or other similar issues
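
The first two items in the list above could look roughly like the following sketch. It is only an assumption about the general shape, not the code in this PR: `suggest_one` and `evaluate` are hypothetical names, and the per-document work is a trivial placeholder. With `jobs=1` no subprocesses are created; otherwise the documents are mapped over a `multiprocessing.Pool`.

```python
import multiprocessing
from typing import List


def suggest_one(text: str) -> int:
    """Hypothetical per-document work; a placeholder for a backend's suggest call."""
    return len(text.split())


def evaluate(docs: List[str], jobs: int = 1) -> List[int]:
    """Process all documents, in parallel only when more than one job is requested."""
    if jobs == 1:
        # Serial path: avoid the multiprocessing module entirely.
        return [suggest_one(doc) for doc in docs]
    # Parallel path: arguments and results are pickled between processes.
    with multiprocessing.Pool(jobs) as pool:
        return pool.map(suggest_one, docs)


if __name__ == "__main__":
    docs = ["first test document", "second one", "third"]
    # Both paths should produce the same results in the same order.
    print(evaluate(docs, jobs=1) == evaluate(docs, jobs=2))
```

The `if __name__ == "__main__":` guard matters on platforms where worker processes are started by re-importing the main module.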

Fixes #65

codecov · 2020-06-05T11:09:44Z

Codecov Report

Merging #418 into master will decrease coverage by 0.10%.
The diff coverage is 98.25%.

@@            Coverage Diff             @@
##           master     #418      +/-   ##
==========================================
- Coverage   99.39%   99.29%   -0.11%     
==========================================
  Files          60       61       +1     
  Lines        4309     4371      +62     
==========================================
+ Hits         4283     4340      +57     
- Misses         26       31       +5

Impacted Files	Coverage Δ
annif/backend/dummy.py	`100.00% <ø> (ø)`
annif/backend/mixins.py	`95.12% <0.00%> (ø)`
tests/test_eval.py	`100.00% <ø> (ø)`
annif/datadir.py	`84.61% <50.00%> (-15.39%)`	⬇️
annif/cli.py	`98.76% <91.30%> (-0.81%)`	⬇️
annif/__init__.py	`88.46% <100.00%> (ø)`
annif/backend/ensemble.py	`97.72% <100.00%> (+0.29%)`	⬆️
annif/backend/fasttext.py	`97.77% <100.00%> (ø)`
annif/backend/http.py	`98.11% <100.00%> (ø)`
annif/backend/maui.py	`99.35% <100.00%> (ø)`
... and 25 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 321bf0a...495f4cc.

osma · 2020-06-05T13:59:22Z

Rebased on current master and force-pushed.

osma · 2020-06-26T08:03:08Z

After some testing, I'm fairly confident that this feature works with at least most types of backend (tested tfidf, omikuji, fasttext, maui and simple ensemble). However, the speedup is not very big: the parallelization overhead is quite significant, and the initialization of models and postprocessing of results, which cannot be parallelized, take up significant chunks of time too. In practice, with two parallel jobs, the evaluation takes around the same time as with one job. To get any improvement in overall evaluation time, you need to use jobs=4 or more.
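
The pattern described above is consistent with a simple cost model in which some work stays serial and parallel runs pay a fixed overhead. The sketch below is only a toy model: the function name `wall_time` and all the numbers are made up for illustration and are not measurements from this PR.

```python
def wall_time(jobs: int, serial: float = 40.0, parallel: float = 60.0,
              overhead: float = 30.0) -> float:
    """Estimated wall time in seconds under a toy cost model.

    serial:   model initialization and postprocessing (not parallelized)
    parallel: per-document work, divided evenly across jobs
    overhead: process startup and (de)serialization, paid only when jobs > 1
    All default values are hypothetical.
    """
    if jobs == 1:
        return serial + parallel
    return serial + parallel / jobs + overhead


for jobs in (1, 2, 4, 8):
    print(jobs, wall_time(jobs))
```

With these assumed values, two jobs only break even with one job, and a gain appears from four jobs onward.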

I changed the default to jobs=1 so that parallel evaluation is only performed when requested by the user.

I will open another issue on implementing a parallel optimize command.

Some more refactoring is still needed; then this can be merged, I think.

sonarcloud · 2020-06-26T08:23:52Z

Kudos, SonarCloud Quality Gate passed!

0 Bugs
0 Vulnerabilities (and 10 Security Hotspots to review)
2 Code Smells

No Coverage information
0.0% Duplication

lgtm-com · 2020-06-26T08:45:42Z

This pull request fixes 1 alert when merging 495f4cc into 321bf0a - view on LGTM.com

fixed alerts:

1 for Module is imported with 'import' and 'import from'

osma added the enhancement label Jun 5, 2020

osma added this to the 0.48 milestone Jun 5, 2020

osma added 10 commits June 5, 2020 16:55

Refactor: pass subject_index as parameter to SuggestionResult.filter

a6ec0f9

Refactor: replace SuggestionResult.hits with .as_list method that takes a SubjectIndex

419e2c5

Refactor: replace SuggestionResult.vector with .as_vector method that takes a SubjectIndex

49e7d72

Refactor: don't store unnecessary subject_index in SuggestionResult classes

e31d0f6

first rough implementation of multiprocessing in eval command

f255976

remove unused parameter

dfc7f3d

Implement --jobs parameter for eval CLI command

743b152

Avoid creating subprocesses for eval command with jobs=1

3ca4901

fix test failures caused by merging recent changes from master

f44ab74

Add unit tests for different numbers of eval jobs

b70191b

osma force-pushed the issue65-eval-multiprocess branch from 20b5e64 to b70191b Compare June 5, 2020 13:59

osma added 6 commits June 5, 2020 17:11

Optimization: VectorSuggestionResult.filter now returns a ListSuggestionResult, so less data to serialize/deserialize during multiprocessing

19d3707

Initialize the project (load model just once) before parallel eval

7e80068

Initializing an ensemble backend initializes the source projects as well

c545465

fix project creation/initialization sequence in AnnifRegistry

0ff3388

Merge branch 'master' into issue65-eval-multiprocess

7ea54ea

Change default to jobs=1: run eval in parallel only if requested by user

b8bf18b

Refactor: split off AnnifRegistry to a separate annif.registry module

495f4cc

osma mentioned this pull request Jun 26, 2020

Parallelize optimize command #423

Closed

osma marked this pull request as ready for review June 26, 2020 08:50

osma requested a review from juhoinkinen June 26, 2020 08:50

juhoinkinen approved these changes Jun 26, 2020

View reviewed changes

osma merged commit 8388693 into master Jun 26, 2020

osma deleted the issue65-eval-multiprocess branch June 26, 2020 09:23

osma mentioned this pull request Jun 26, 2020

Rename variable that would otherwise shadow the builtin function map #425

Merged

juhoinkinen mentioned this pull request Nov 24, 2020

Parallelized eval of nn_ensemble projects hangs #453

Closed
