Use least frequent token as key in TokenSetIndex used by MLLM #518

Merged: osma merged 1 commit into master from feature-mllm-index-key-by-token-freq on Aug 19, 2021

@osma osma commented Aug 19, 2021

This PR improves the performance of the MLLM algorithm on large vocabularies whose labels contain a lot of repetition. The TokenSetIndex stores token sets indexed under a single key token. Previously this key token was picked randomly; with this PR, the rarest token (measured across the terms of the subject vocabulary) is used as the key.

For example, if the vocabulary contains person entities like "John Smith", "John Taylor", "John Doe" and "John Johnson", they could all have ended up in the index under the token john. In that case, every time the document text contains the name "John", the index has to be searched and the tokens in the sentence compared with all of the above token sets ({john, smith}, {john, taylor}, {john, doe}, {john, johnson}).

Instead, the token sets are now indexed under their rarest token (e.g. smith, taylor, doe, johnson, assuming here that there are no other persons with those surnames!), which is more efficient to search. The sentence is compared to the set {john, taylor} (but not the others) only when the text contains e.g. "Taylor".
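
The effect can be illustrated with a small standalone sketch. This is not the actual TokenSetIndex code from Annif; the helper names `build_index` and `search` are hypothetical, and tokens are plain lowercase strings for readability. It only counts how many token set comparisons a lookup needs when the sets are keyed by the most common token versus the rarest one.

```python
from collections import Counter, defaultdict


def build_index(token_sets, pick_key):
    """Group token sets under a single key token chosen by pick_key (hypothetical helper)."""
    index = defaultdict(list)
    for tset in token_sets:
        index[pick_key(tset)].append(tset)
    return index


def search(index, sentence_tokens):
    """Return (matches, comparisons) for token sets fully contained in the sentence."""
    matches, comparisons = [], 0
    for token in sentence_tokens:
        # Only token sets stored under a token present in the sentence are candidates
        for tset in index.get(token, []):
            comparisons += 1
            if tset <= sentence_tokens:
                matches.append(tset)
    return matches, comparisons


labels = ["John Smith", "John Taylor", "John Doe", "John Johnson"]
token_sets = [frozenset(label.lower().split()) for label in labels]

# Number of vocabulary terms in which each token occurs
freq = Counter(token for tset in token_sets for token in tset)

# Unlucky case for the old behaviour: every set is keyed by the shared token "john"
common_index = build_index(
    token_sets, lambda tset: max(tset, key=lambda t: (freq[t], t)))
# Behaviour described in this PR: key each set by its rarest token
rare_index = build_index(
    token_sets, lambda tset: min(tset, key=lambda t: (freq[t], t)))

sentence = frozenset("john taylor wrote the report".split())
common_matches, common_comparisons = search(common_index, sentence)
rare_matches, rare_comparisons = search(rare_index, sentence)

print(common_comparisons, rare_comparisons)  # 4 1
```

Both indexes find the same match, {john, taylor}, but the rarest-token index needs one comparison where the "john"-keyed index needs four. With thousands of labels sharing a common token, that gap grows accordingly.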

The difference in performance is very small in the case of YSO, where there isn't much repetition, but for GND this seems to reduce processing times by around 2 seconds per document!

@osma osma added this to the 0.54 milestone Aug 19, 2021
@osma osma self-assigned this Aug 19, 2021
sonarcloud bot commented Aug 19, 2021

Kudos, SonarCloud Quality Gate passed!

Bugs: 0 (rating A)
Vulnerabilities: 0 (rating A)
Security Hotspots: 0 (rating A)
Code Smells: 0 (rating A)

Coverage: no coverage information
Duplication: 0.0%

codecov bot commented Aug 19, 2021

Codecov Report

Merging #518 (59a1e8f) into master (7cc8dc8) will increase coverage by 0.01%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master     #518      +/-   ##
==========================================
+ Coverage   99.51%   99.53%   +0.01%     
==========================================
  Files          82       82              
  Lines        5831     5828       -3     
==========================================
- Hits         5803     5801       -2     
+ Misses         28       27       -1     

| Impacted Files | Coverage Δ |
|---|---|
| annif/lexical/mllm.py | 100.00% <100.00%> (ø) |
| annif/lexical/tokenset.py | 100.00% <100.00%> (ø) |
| tests/test_lexical_tokenset.py | 100.00% <100.00%> (ø) |
| annif/backend/stwfsa.py | 100.00% <0.00%> (+1.56%) ⬆️ |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 7cc8dc8...59a1e8f.

@osma osma merged commit b17448e into master Aug 19, 2021
@osma osma deleted the feature-mllm-index-key-by-token-freq branch August 19, 2021 11:20