Optimize the optimize command #477

Merged
merged 1 commit into from
Mar 23, 2021
@osma osma commented Mar 22, 2021

I noticed that the `annif optimize` command is extremely slow for ensemble projects. The cause is that the ensemble backends return `VectorSuggestionResult` objects, while regular backends (tfidf, omikuji, stwfsa...) usually return `ListSuggestionResult` objects. The optimize command filters the results many times (using different limit and threshold values), and filtering is very slow with `VectorSuggestionResult`.

The fix is to ensure that the results given by the project are first converted to `ListSuggestionResult`. Conveniently, `VectorSuggestionResult.filter` already returns a `ListSuggestionResult`. In any case, it makes sense to pre-filter the results down to at most 15 suggestions, since only the top 15 will be used anyway and keeping the rest just creates more work in every later filtering step. However, since it's not guaranteed that `VectorSuggestionResult.filter` will always keep returning a `ListSuggestionResult`, I added an extra assert statement to verify this, so that the command fails fast instead of silently running extremely slowly.
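The idea can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the code from this PR: `ListResult`, `VectorResult`, `prefilter` and `MAX_LIMIT` are hypothetical stand-ins for Annif's result classes, and the toy `filter` methods only mimic the behaviour described above (the vector-based filter has to sort the whole vocabulary, the list-based one only walks a short list).

```python
class ListResult:
    """Toy stand-in: (subject_index, score) pairs, best first."""

    def __init__(self, hits):
        self.hits = list(hits)

    def filter(self, limit=None, threshold=0.0):
        # Cheap: only walks the (short) list of hits.
        kept = [(i, s) for i, s in self.hits if s >= threshold]
        return ListResult(kept[:limit])


class VectorResult:
    """Toy stand-in: one score per subject in the vocabulary."""

    def __init__(self, vector):
        self.vector = [float(s) for s in vector]

    def filter(self, limit=None, threshold=0.0):
        # Expensive: sorts the whole vocabulary on every call.
        order = sorted(range(len(self.vector)),
                       key=lambda i: self.vector[i], reverse=True)
        hits = [(i, self.vector[i]) for i in order
                if self.vector[i] > 0.0 and self.vector[i] >= threshold]
        return ListResult(hits[:limit])


MAX_LIMIT = 15  # assumption taken from the text: only the top 15 are used


def prefilter(result):
    """Convert once to a short list-based result before repeated filtering."""
    small = result.filter(limit=MAX_LIMIT)
    # Fail fast if the conversion ever stops producing a list-based result.
    assert isinstance(small, ListResult), "expected a list-based result"
    return small


# Pre-filter once, then try many limit/threshold combinations cheaply.
result = prefilter(VectorResult([0.1, 0.0, 0.9, 0.5]))
for limit in (1, 2, 3):
    for threshold in (0.0, 0.2, 0.6):
        result.filter(limit=limit, threshold=threshold)
```

With this shape, the expensive full-vocabulary sort happens once per document rather than once per limit/threshold combination.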

I tested this using STW thesaurus-based projects from the Annif tutorial. I defined a tfidf project and an omikuji project and trained them with the stw-econbiz-small corpus. Then I defined an ensemble combining both. Here are some benchmark results for the optimize command (targeting the test corpus):

| Backend | User time before | User time after | RAM before | RAM after |
|---|---|---|---|---|
| tfidf | 445s | 377s | 559544 | 557428 |
| omikuji | 452s | 398s | 630988 | 626408 |
| ensemble | 3129s | 426s | 733684 | 691048 |

This reduces user time by 12-15% for the regular projects and by a whopping 86% for the ensemble project (roughly 7 times faster). RAM use is practically unchanged except for a 6% reduction for the ensemble case.
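For reference, the percentages can be recomputed from the benchmark table with a short script. The helper name `reduction_pct` is just an illustration; the numbers are copied from the table above.

```python
# (before, after) user time in seconds, copied from the benchmark table
user_time = {
    "tfidf": (445, 377),
    "omikuji": (452, 398),
    "ensemble": (3129, 426),
}
# (before, after) RAM figures, copied from the benchmark table
ram = {
    "tfidf": (559544, 557428),
    "omikuji": (630988, 626408),
    "ensemble": (733684, 691048),
}


def reduction_pct(before, after):
    """Relative reduction from before to after, in percent."""
    return 100.0 * (before - after) / before


for name in user_time:
    t_before, t_after = user_time[name]
    m_before, m_after = ram[name]
    print(f"{name}: user time -{reduction_pct(t_before, t_after):.1f}%, "
          f"RAM -{reduction_pct(m_before, m_after):.1f}%, "
          f"{t_before / t_after:.1f}x faster")
```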

@osma osma added the bug label Mar 22, 2021
@osma osma added this to the 0.52 milestone Mar 22, 2021
@osma osma requested a review from juhoinkinen March 22, 2021 14:09
@sonarqubecloud

Kudos, SonarCloud Quality Gate passed!

0 Bugs (rating A)
0 Vulnerabilities (rating A)
0 Security Hotspots (rating A)
0 Code Smells (rating A)

No Coverage information
0.0% Duplication


@juhoinkinen juhoinkinen left a comment

LGTM

@osma osma merged commit 7f42c96 into master Mar 23, 2021
@osma osma deleted the fix-optimize-optimize-command branch March 23, 2021 07:05