Use least frequent token as key in TokenSetIndex used by MLLM #518

osma · 2021-08-19T11:12:45Z

This PR improves the performance of the MLLM algorithm on large vocabularies with lots of repetition in the labels. The TokenSetIndex keeps the token sets indexed using a key token. Previously this key token was picked randomly, but this PR changes it so that the rarest token (among the terms of the subject vocabulary) is used as the key.

For example, if the vocabulary contains person entities like "John Smith", "John Taylor", "John Doe" and "John Johnson", they could all have been put in the index under the token john. Every time the document text contains the name "John", the index has to be searched and the tokens in the sentence compared with all the above token sets ({john, smith}, {john, taylor}, {john, doe}, {john, johnson}).

Instead, the token sets are now indexed under the rarest token (e.g. smith, taylor, doe, johnson, assuming here that there are no other persons with those surnames!), which is more efficient to search. Only when the text contains e.g. "Taylor" is the sentence compared to the set {john, taylor} (but not the others).
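To make the effect concrete, here is a minimal, self-contained sketch of the idea. It is not the actual code in annif/lexical/tokenset.py: the function names `build_index` and `search` and the `rarest` flag are made up for illustration, and the old random choice is stood in for by its worst case (always picking the most frequent token).

```python
from collections import Counter, defaultdict


def build_index(token_sets, rarest=True):
    """Index each token set under a single key token (illustrative only).

    rarest=True  -> key is the token occurring in the fewest token sets.
    rarest=False -> key is the most frequent token, i.e. the worst case
                    that an arbitrary choice of key could produce.
    """
    freq = Counter(tok for ts in token_sets for tok in ts)
    pick = min if rarest else max
    index = defaultdict(list)
    for ts in token_sets:
        # Ties are broken alphabetically so the result is deterministic.
        key = pick(ts, key=lambda tok: (freq[tok], tok))
        index[key].append(ts)
    return index


def search(index, sentence_tokens):
    """Return (matching token sets, number of token sets compared)."""
    matches, compared = [], 0
    for tok in sentence_tokens:
        for ts in index.get(tok, []):
            compared += 1
            if ts <= sentence_tokens:  # every token of the label is present
                matches.append(ts)
    return matches, compared


labels = ["John Smith", "John Taylor", "John Doe", "John Johnson"]
token_sets = [frozenset(label.lower().split()) for label in labels]
sentence = frozenset("yesterday john taylor spoke".split())

for rarest in (False, True):
    matches, compared = search(build_index(token_sets, rarest), sentence)
    print(f"rarest={rarest}: {compared} comparison(s), {len(matches)} match(es)")
# rarest=False: 4 comparison(s), 1 match(es)
# rarest=True: 1 comparison(s), 1 match(es)
```

Both variants find the same match; the only difference is how many token sets have to be compared against the sentence to find it.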

The difference in performance is very small in the case of YSO, where there isn't much repetition, but for GND this seems to reduce processing times by around 2 seconds per document!

sonarcloud · 2021-08-19T11:13:17Z

Kudos, SonarCloud Quality Gate passed!

0 Bugs
0 Vulnerabilities
0 Security Hotspots
0 Code Smells

No Coverage information
0.0% Duplication

codecov · 2021-08-19T11:17:10Z

Codecov Report

Merging #518 (59a1e8f) into master (7cc8dc8) will increase coverage by 0.01%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master     #518      +/-   ##
==========================================
+ Coverage   99.51%   99.53%   +0.01%     
==========================================
  Files          82       82              
  Lines        5831     5828       -3     
==========================================
- Hits         5803     5801       -2     
+ Misses         28       27       -1

| Impacted Files | Coverage Δ |
|---|---|
| annif/lexical/mllm.py | `100.00% <100.00%> (ø)` |
| annif/lexical/tokenset.py | `100.00% <100.00%> (ø)` |
| tests/test_lexical_tokenset.py | `100.00% <100.00%> (ø)` |
| annif/backend/stwfsa.py | `100.00% <0.00%> (+1.56%)` ⬆️ |

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

Scalability fix: Use least frequent token as key in TokenSetIndex used by MLLM (59a1e8f)

osma added the enhancement label Aug 19, 2021

osma added this to the 0.54 milestone Aug 19, 2021

osma self-assigned this Aug 19, 2021

osma merged commit b17448e into master Aug 19, 2021

osma deleted the feature-mllm-index-key-by-token-freq branch August 19, 2021 11:20
