Use least frequent token as key in TokenSetIndex used by MLLM #518
This PR improves the performance of the MLLM algorithm on large vocabularies with lots of repetition in the labels. The TokenSetIndex keeps the token sets indexed using a key token. Previously this key token was picked randomly, but this PR changes it so that the rarest token (among the terms of the subject vocabulary) is used as the key.
For example, if the vocabulary contains person entities like "John Smith", "John Taylor", "John Doe" and "John Johnson", they could all have been put in the index under the token `john`. Every time the document text contained the name "John", the index had to be searched and the tokens in the sentence compared with all of the above token sets ({john, smith}, {john, taylor}, {john, doe}, {john, johnson}).

Instead, the token sets are now indexed under their rarest token (e.g. `smith`, `taylor`, `doe`, `johnson` - assuming here that there are no other persons with those surnames!), which is more efficient to search. Only when the text contains e.g. "Taylor" is the sentence compared with the set {john, taylor} (but not the others).

The difference in performance is very small in the case of YSO, where there isn't much repetition, but for GND this seems to reduce processing times by around 2 seconds per document!