Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Multiprocessing in eval command #418

Merged
merged 17 commits into from
Jun 26, 2020
Merged

Multiprocessing in eval command #418

merged 17 commits into from
Jun 26, 2020

Conversation

osma
Copy link
Member

@osma osma commented Jun 5, 2020

This PR makes it possible to use multiprocessing to speed up the eval command.

The majority of the changes are actually refactorings to decouple the subject index from the SuggestionResult classes. After the changes, SuggestionResult instances no longer keep a reference to SubjectIndex. Instead a SubjectIndex is passed as a parameter to individual SuggestionResult methods as necessary. Since properties cannot take parameters, the hits property has been changed to the as_list method and the vector property has been changed to the as_vector method.

The actual multiprocessing implementation is still very rough and further changes are needed:

  • make it possible to select the number of parallel jobs (e.g. using a --jobs CLI argument)
  • don't use multiprocessing module if jobs=1
  • (try to) optimize VectorSuggestionResult.filter() so it won't return large vectors which cause serialization/deserialization overhead
  • test with different backends to make sure that they perform well (measure CPU time, wall time, peak memory) and don't have race conditions or other similar issues

Fixes #65

@osma osma added this to the 0.48 milestone Jun 5, 2020
@codecov
Copy link

codecov bot commented Jun 5, 2020

Codecov Report

Merging #418 into master will decrease coverage by 0.10%.
The diff coverage is 98.25%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #418      +/-   ##
==========================================
- Coverage   99.39%   99.29%   -0.11%     
==========================================
  Files          60       61       +1     
  Lines        4309     4371      +62     
==========================================
+ Hits         4283     4340      +57     
- Misses         26       31       +5     
Impacted Files Coverage Δ
annif/backend/dummy.py 100.00% <ø> (ø)
annif/backend/mixins.py 95.12% <0.00%> (ø)
tests/test_eval.py 100.00% <ø> (ø)
annif/datadir.py 84.61% <50.00%> (-15.39%) ⬇️
annif/cli.py 98.76% <91.30%> (-0.81%) ⬇️
annif/__init__.py 88.46% <100.00%> (ø)
annif/backend/ensemble.py 97.72% <100.00%> (+0.29%) ⬆️
annif/backend/fasttext.py 97.77% <100.00%> (ø)
annif/backend/http.py 98.11% <100.00%> (ø)
annif/backend/maui.py 99.35% <100.00%> (ø)
... and 25 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 321bf0a...495f4cc. Read the comment docs.

@osma osma force-pushed the issue65-eval-multiprocess branch from 20b5e64 to b70191b Compare June 5, 2020 13:59
@osma
Copy link
Member Author

osma commented Jun 5, 2020

Rebased on current master and force-pushed.

@osma
Copy link
Member Author

osma commented Jun 26, 2020

After performing some testing, I'm fairly confident that this feature works with at least most types of backend (tested tfidf, omikuji, fasttext, maui and simple ensemble), but the speedup is not very big - the parallelization overhead is quite significant and the initialization of models and postprocessing of results, which cannot be parallelized, take up significant chunks of time too. In practice, with two parallel jobs, the evaluation takes around the same time as with one job, and to get any improvement in overall evaluation time, you need to use jobs=4 or more.

I changed the default to jobs=1 so that parallel evaluation is only performed when requested by the user.

I will open another issue on implementing a parallel optimize command.

Still some more refactoring, then this can be merged I think.

@sonarcloud
Copy link

sonarcloud bot commented Jun 26, 2020

Kudos, SonarCloud Quality Gate passed!

Bug A 0 Bugs
Vulnerability A 0 Vulnerabilities (and Security Hotspot 10 Security Hotspots to review)
Code Smell A 2 Code Smells

No Coverage information No Coverage information
0.0% 0.0% Duplication

@lgtm-com
Copy link

lgtm-com bot commented Jun 26, 2020

This pull request fixes 1 alert when merging 495f4cc into 321bf0a - view on LGTM.com

fixed alerts:

  • 1 for Module is imported with 'import' and 'import from'

@osma osma marked this pull request as ready for review June 26, 2020 08:50
@osma osma requested a review from juhoinkinen June 26, 2020 08:50
@osma osma merged commit 8388693 into master Jun 26, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Evaluate documents in parallel
2 participants