I noticed that the `annif optimize` command is extremely slow for ensemble projects. The problem was that the ensemble backends return VectorSuggestionResult objects, while regular backends (tfidf, omikuji, stwfsa...) usually return ListSuggestionResult objects. The `optimize` command does a lot of filtering of results (using different `limit` and `threshold` values), and this is a very slow operation with VectorSuggestionResult.

The fix is to ensure that the results given by the project are first converted to ListSuggestionResult. Conveniently, `VectorSuggestionResult.filter` already returns a ListSuggestionResult. In any case, it makes sense to pre-filter the results down to at most 15 suggestions, since only the top 15 will be used anyway and keeping the others will just create more work when filtering. However, since it's not guaranteed that `VectorSuggestionResult.filter` will always keep returning a ListSuggestionResult, I added an extra `assert` statement to verify this and fail fast instead of working extremely slowly.

I tested this using STW thesaurus based projects from the Annif tutorial. I defined a `tfidf` project and an `omikuji` project and trained them with the `stw-econbiz-small` corpus. Then I defined an ensemble combining both. Here are some benchmark results for the optimize command (targeting the test corpus):

This brings a speedup of 12-15% for the regular projects and a whopping 86% for the ensemble project. RAM use is practically unchanged except for a 6% reduction for the ensemble case.
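
For readers who want to see the idea behind the fix in isolation, here is a minimal, self-contained sketch. It does not use Annif's actual classes: `ListResult`, `VectorResult`, `Hit`, `prefilter`, `sweep` and `MAX_LIMIT` are hypothetical stand-ins made up for illustration, and the real implementation may differ. The sketch only shows why filtering a vocabulary-sized vector many times is wasteful and how converting once to a short list, guarded by an `assert`, avoids that.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    subject_id: int
    score: float


class ListResult:
    """Hypothetical list-based result: only the actual hits, sorted by score."""

    def __init__(self, hits):
        self.hits = sorted(hits, key=lambda h: h.score, reverse=True)

    def filter(self, limit=None, threshold=0.0):
        # Cheap: works on a short, already sorted list.
        kept = [h for h in self.hits[:limit] if h.score >= threshold]
        return ListResult(kept)


class VectorResult:
    """Hypothetical vector-based result: one score per vocabulary subject."""

    def __init__(self, scores):
        self.scores = list(scores)

    def filter(self, limit=None, threshold=0.0):
        # Expensive: every call scans the whole vocabulary-sized vector.
        hits = [Hit(i, s) for i, s in enumerate(self.scores) if s > 0.0]
        return ListResult(hits).filter(limit, threshold)


MAX_LIMIT = 15  # assumption taken from the description: only the top 15 are used


def prefilter(result):
    """Convert any result to a short list-based result exactly once."""
    filtered = result.filter(limit=MAX_LIMIT)
    # Fail fast if filter() ever stops returning the list-based type.
    assert isinstance(filtered, ListResult)
    return filtered


def sweep(result, limits, thresholds):
    """Count surviving hits for every limit/threshold pair, as an optimizer would."""
    return {
        (lim, thr): len(result.filter(limit=lim, threshold=thr).hits)
        for lim in limits
        for thr in thresholds
    }
```

With this shape, a sweep over many `limit` and `threshold` combinations can call `prefilter` once per document and then filter only the short list, instead of rescanning the full vector for every combination. The sweep results are the same as long as no `limit` exceeds `MAX_LIMIT`.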