-
Notifications
You must be signed in to change notification settings - Fork 41
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Lexical STWFSAPY Backend #438
Conversation
Thanks for the PR! The eval results are a bit hard to read, any chance you could reformat the block, for example by indenting it? Also I see that stwfsapy is on PyPI, but there is no dependency declared in setup.py. Am I correct that the implementation is pure Python i.e. doesn't require any compiled extensions? In that case, I think it could be added as a core dependency, not an optional one. |
Just forgot the setup.py. |
Good catch, this has now been fixed in YSO. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks very good as a first shot. The dependency should be added to setup.py - currently Travis tests are failing due to the missing dependency.
I gave a few specific comments on the implementation.
@@ -1119,4 +1119,140 @@ | |||
<skos:altLabel xml:lang="sv">sigillvetenskap</skos:altLabel> | |||
<skos:prefLabel xml:lang="sv">sigillografi</skos:prefLabel> | |||
</skos:Concept> | |||
<isothes:ConceptGroup rdf:about="http://www.yso.fi/onto/yso/p26593"> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is having the concepts inside a ConceptGroup/Collection a requirement for using stwfsapy? Or done just to improve the results?
If it's a requirement, I'm a bit worried about the consequences. Many SKOS vocabularies don't have this kind of structure, although both STW and YSO do.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently it is mandatory. But I can change that. Don't know how important this is for predictive performance.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have changed this in the library but will need to update it on pypi. Will do so on monday, after testing it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you still need to introduce this ConceptGroup into yso-archaeology.rdf for the rest of the PR to work, or can this change now be dropped?
Regarding the CI. I can make it a default dependency or add it to the CI depending on the python version and use |
Kudos, SonarCloud Quality Gate passed! 0 Bugs |
Codecov Report
@@ Coverage Diff @@
## master #438 +/- ##
==========================================
+ Coverage 99.41% 99.43% +0.01%
==========================================
Files 65 67 +2
Lines 4627 4777 +150
==========================================
+ Hits 4600 4750 +150
Misses 27 27
Continue to review full report at Codecov.
|
I did some test runs with this backend, evaluating on kirjaesittelyt set (book presentations, 130 short texts like this) while training on kirjaesittelyt, JYU theses or Finna metadata sets. Here are the F1@5 scores:
All these scores are better than for Maui (0.1472) that has been trained on full-text documents from different collections. |
Apologies @mo-fu for taking so long to review this. As @juhoinkinen already noted, this is giving promising results - it appears to be somewhat better than Maui on short documents, as you said. Some quick findings: rdflib version compatibilitystwfsapy seems to require rdflib 4.2.*, while Annif currently doesn't depend on a specific version. In practice when you install Annif from scratch, you will most likely end up with rdflib 5.0.0 which is the most recent version currently. This causes a version conflict if you then try to install stwfsapy and needs to be resolved manually (at least for me) by downgrading rdflib. I think the best option here would be for both stwfsapy and Annif to depend on the most recent version (e.g. 5.0.*). In my experience upgrading to rdflib 5.0.0 shouldn't cause any major compatibility issues for most projects that use rdflib (though I did have to do some fixes in Skosify as it did some special tricks with namespaces). Fragile handling of vocabulary termsAt least with YSO (until we fixed it), and also with a snapshot version of the NAL Thesaurus I tried, some special characters (esp. unmatched braces) seem to be causing parse errors in stwfsapy. Would it be possible to make it more robust? Matching all upper case lettersI happened to notice a regular expression for matching upper case characters in the stwfsapy code base. It hardcodes the set as Backend nameWe talked about this already - I'd hope to see a shorter name for the backend, as |
Ah, forgot this one:
Since this backend and stwfsapy only introduce pure Python dependencies, and in my understanding they support all the Python versions Annif currently supports (3.6-3.8), I think adding it as a default dependency would be OK. |
I can probably work on the issues starting next week, when our annual review has started. Assuming it goes smoothly. |
PR for more robust bracket handling: zbw/stwfsapy#27 I could add only ZBW organization members as reviewers, but please have a look. |
Next PR for stwfsapy with changes for better upper case detection is here: zbw/stwfsapy#29 Please request any changes necessary for adding stwfsa to Annif. |
After changing the rdflib dependency version (zbw/stwfsapy@a654264 ) I renamed the backend to stwfsa. There is a conflict in |
Great!
Whatever is easiest for you...there were some dependency updates recently, which is probably the cause for the conflict. |
This pull request introduces 1 alert when merging b588090 into 98bde23 - view on LGTM.com new alerts:
|
The requested changes should all be done now. Let me know, if I missed something or further changes are needed. |
Thanks @mo-fu , I'll take a look soon! One thing that certainly needs to be done is to create a wiki page for the new backend. Like the other similar pages it should briefly explain what the backend does, how to set it up, and what the configuration parameters are. ...which brings up the issue of config parameters. The backend seems to take quite a few parameters, based on the config example you gave above. Are all of them necessary? Are there any default values? I think it would be helpful if most of the parameters had sane defaults that work with a typical, simple SKOS thesaurus. You may want to take a look at the new MLLM backend, which aims to become a replacement for Maui. I just opened a draft PR #462 with some initial code. There's a tiny bit of overlap there with this PR since MLLM also needs the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did a quick check of the code and gave some comments. I think this is getting really close to merging, basically we just need to work out the best default values (and corresponding documentation) plus a few other small details.
@@ -1119,4 +1119,140 @@ | |||
<skos:altLabel xml:lang="sv">sigillvetenskap</skos:altLabel> | |||
<skos:prefLabel xml:lang="sv">sigillografi</skos:prefLabel> | |||
</skos:Concept> | |||
<isothes:ConceptGroup rdf:about="http://www.yso.fi/onto/yso/p26593"> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you still need to introduce this ConceptGroup into yso-archaeology.rdf for the rest of the PR to work, or can this change now be dropped?
A couple more, pretty minor issues:
|
Should be fixed now
I kept it for the second configuration example and added some more details to the description of the second example. |
Kudos, SonarCloud Quality Gate passed! |
Excellent! Merging this now. 🚀 |
Since this is now merged, I added links to the backend wiki page from the wiki front page and the sidebar. |
Here is the first shot at adding the lexical backend developed at ZBW to Annif. Please suggest changes and missing tests.
*The algorithm works best for the English language or languages that are not morphologically rich.
*Currently it works only with short texts. My current understanding is that this is due to the lack of global features, i.e, every match by the finite state automaton is transformed into features but does not take into account other matches. If there are many matches in a text, all of them will get reported and scored individually. This will drive down precision.
*I tested the algorithm by using the tutorial data sets. I removed the small sample of titles from the larger one to have a training and test set. In the YSO case I also needed to manually edit the ontology because one label of https://finto.fi/yso/en/page/p37741 is malformated. This breaks the automaton construction.
I used the following config:
And got the following results:
It is a little better for the ZBW data set (document average F1 ~0.25). For our internal data set I got an improvement from 0.18 to 0.25 document average F1 when compared to the default configuration of Maui Server