YAKE backend #461
Codecov Report
@@ Coverage Diff @@
## master #461 +/- ##
========================================
Coverage 99.46% 99.47%
========================================
Files 73 76 +3
Lines 5280 5513 +233
========================================
+ Hits 5252 5484 +232
- Misses 28 29 +1
Force-pushed from 4fbdf4c to cadabdd.
I gave some comments on some possible improvements. Overall I think this is looking very promising.
How to express the licensing information (GPLv3) needs some more thought. It shouldn't be terribly complicated but needs a different frame of thinking so I mostly just looked at the code now :)
not_matched.append((kp, self._transform_score(score)))
# Remove duplicate uris, conflating the scores
suggestions = self._combine_suggestions(suggestions)
self.debug('Keyphrases not matched:\n' + '\t'.join(
In a future version, I think these non-matched keyphrases should be propagated back to the user as well, but it could be done in a subsequent PR as it requires a lot more scaffolding.
Force-pushed from 94b5eb5 to 327d3e2.
Kudos, SonarCloud Quality Gate passed! 0 Bugs. No Coverage information.
Force-pushed from 327d3e2 to 0b1cacd.
Rebased & force-pushed.
…raph_project fixture
I updated the PR description and added the evaluation results to the results table for comparison with Maui and MLLM. Still to do: a Wiki page for the backend. I have left some inline comments/questions in the PR. @osma can you take a look at this again?
A draft Wiki page: https://github.com/NatLibFi/Annif/wiki/Backend:-YAKE Another Wiki edit still to do: add installation instructions to the "Optional features and dependencies" page.
Looks good. I gave a few suggestions for small changes. You can decide if you want to address them or not, then it's OK to merge this.
I will write a separate comment about the wiki documentation.
Regarding the wiki page:
This PR adds a new backend to Annif by integrating the YAKE library.
YAKE performs unsupervised automatic keyword extraction. In the Annif backend, the keywords found by YAKE are looked up among the labels of the SKOS vocabulary, and the matches are returned as subject suggestions. The lookup can target prefLabels, altLabels and/or hiddenLabels, as set in the project configuration.
The YAKE backend is a lexical backend, but according to the evaluation results it does not perform as well as the other lexical backends (MLLM, STWFSA or Maui). However, the (free) keyword extraction step makes it possible to add new features to Annif, especially suggesting new terms for a vocabulary (the keywords not found in the vocabulary), see #224. The unsupervised approach can also be useful in some cases, because no training data is needed.
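The matching step described above can be illustrated with a small, self-contained sketch. It is not the code from this PR: `LABEL_INDEX`, `match_keyphrases` and the example URIs are hypothetical names, the keyphrase scores are assumed to be already transformed so that higher is better, and keeping the maximum score per URI is only one possible way to conflate duplicates.

```python
from collections import defaultdict

# Hypothetical label index: normalized label text -> set of concept URIs.
# In a real project this would be built from the prefLabels, altLabels
# and/or hiddenLabels of the SKOS vocabulary.
LABEL_INDEX = {
    "keyword extraction": {"http://example.org/c1"},
    "term extraction": {"http://example.org/c1"},
    "vocabularies": {"http://example.org/c2"},
}


def match_keyphrases(keyphrases, label_index):
    """Split (phrase, score) pairs into matched suggestions and leftovers.

    Returns a dict of URI -> score for phrases found in the index, and a
    list of (phrase, score) pairs that matched no label.
    """
    scores_by_uri = defaultdict(list)
    not_matched = []
    for phrase, score in keyphrases:
        uris = label_index.get(phrase.strip().lower())
        if uris:
            for uri in uris:
                scores_by_uri[uri].append(score)
        else:
            not_matched.append((phrase, score))
    # Assumption: duplicate URIs are conflated by keeping the best score.
    suggestions = {uri: max(scores) for uri, scores in scores_by_uri.items()}
    return suggestions, not_matched


# Hard-coded stand-in for the output of a keyword extractor.
keyphrases = [
    ("Keyword extraction", 0.9),
    ("term extraction", 0.6),
    ("neural networks", 0.8),
]
suggestions, not_matched = match_keyphrases(keyphrases, LABEL_INDEX)
print(suggestions)
print(not_matched)
```

In this sketch the two phrases that map to the same concept collapse into a single suggestion, while the unmatched phrase is kept separately, which is the kind of list that could later feed new-term suggestions as discussed in #224.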