Lexical STWFSAPY Backend #438

Merged
merged 18 commits into from
Jan 26, 2021

mo-fu
Contributor

@mo-fu mo-fu commented Aug 21, 2020

Here is the first shot at adding the lexical backend developed at ZBW to Annif. Please suggest changes and missing tests.

* The algorithm works best for English or other languages that are not morphologically rich.
* Currently it works only with short texts. My current understanding is that this is due to the lack of global features, i.e., every match by the finite state automaton is transformed into features without taking other matches into account. If there are many matches in a text, all of them will get reported and scored individually. This will drive down precision.
* I tested the algorithm by using the tutorial data sets. I removed the small sample of titles from the larger one to have a training and test set. In the YSO case I also needed to manually edit the ontology because one label of https://finto.fi/yso/en/page/p37741 is malformed. This breaks the automaton construction.

I used the following config:

[stwfsapy-yso] 
name=STWFSAPY YSO 
language=en
backend=stwfsapy
vocab=yso  
concept_type_uri=http://www.w3.org/2004/02/skos/core#Concept
sub_thesaurus_type_uri=http://www.w3.org/2004/02/skos/core#Collection
thesaurus_relation_type_uri=http://www.w3.org/2004/02/skos/core#member 
thesaurus_relation_is_specialisation=True
simple_english_plural_rules=True
graph_path=/home/fuer/Annif-tutorial/data-sets/yso-nlf/yso-skos.rdf
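
The configuration above is plain INI syntax, so it can be sanity-checked outside Annif. Here is a minimal sketch, assuming Python's standard `configparser` and only a subset of the keys. How Annif itself loads project files is not shown in this thread, and the `parse_project` helper is hypothetical:

```python
import configparser

# Subset of the project section quoted above; graph_path is left out
# because it points at a local file.
PROJECT_CFG = """
[stwfsapy-yso]
name=STWFSAPY YSO
language=en
backend=stwfsapy
vocab=yso
concept_type_uri=http://www.w3.org/2004/02/skos/core#Concept
thesaurus_relation_is_specialisation=True
simple_english_plural_rules=True
"""


def parse_project(text, project_id):
    """Hypothetical helper: return one project section as a plain dict."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return dict(parser[project_id])


params = parse_project(PROJECT_CFG, "stwfsapy-yso")
# All values are plain strings. The '#' inside the URI survives because
# configparser only treats '#' as a comment marker at the start of a line.
print(params["backend"], params["concept_type_uri"])
```

Note that boolean-looking values such as `True` arrive as strings here; converting them is left to whatever code consumes the parameters.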

And got the following results:

Metric Value
Precision (doc avg): 0.3539953809523809
Recall (doc avg): 0.15510352654022555
F1 score (doc avg): 0.19221541052157337
Precision (subj avg): 0.14931225045169869
Recall (subj avg): 0.07734734983512383
F1 score (subj avg): 0.08407773848545441
Precision (weighted subj avg): 0.45994831973792344
Recall (weighted subj avg): 0.12177820581040807
F1 score (weighted subj avg): 0.16307901726109417
Precision (microavg): 0.42555643541955374
Recall (microavg): 0.12177820581040807
F1 score (microavg): 0.1893667795628981
F1@5: 0.19213065080346786
NDCG: 0.1993514066034938
NDCG@5: 0.22009559912085086
NDCG@10: 0.20175143141259658
Precision@1: 0.39159
Precision@3: 0.35656666666666664
Precision@5: 0.354183
LRAP: 0.1399838756074559
True positives: 60189
False positives: 81247
False negatives: 434062
Documents evaluated: 100000
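
The micro-averaged figures can be cross-checked from the raw counts at the end of the list. This is a short sketch, assuming the usual definitions (precision = TP / (TP + FP), recall = TP / (TP + FN), F1 as the harmonic mean of the two). The function name is illustrative and not part of Annif:

```python
def micro_scores(tp, fp, fn):
    """Micro-averaged precision, recall and F1 from pooled counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Counts taken from the evaluation output above.
precision, recall, f1 = micro_scores(tp=60189, fp=81247, fn=434062)
print(f"{precision:.4f} {recall:.4f} {f1:.4f}")  # 0.4256 0.1218 0.1894
```

The results agree with the reported microavg precision, recall and F1 values.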

It is a little better for the ZBW data set (document average F1 ~0.25). For our internal data set I got an improvement from 0.18 to 0.25 document average F1 when compared to the default configuration of Maui Server.

@osma
Member

osma commented Aug 21, 2020

Thanks for the PR!

The eval results are a bit hard to read, any chance you could reformat the block, for example by indenting it?

Also I see that stwfsapy is on PyPI, but there is no dependency declared in setup.py. Am I correct that the implementation is pure Python i.e. doesn't require any compiled extensions? In that case, I think it could be added as a core dependency, not an optional one.

@mo-fu
Contributor Author

mo-fu commented Aug 21, 2020

> Thanks for the PR!
>
> The eval results are a bit hard to read, any chance you could reformat the block, for example by indenting it?
>
> Also I see that stwfsapy is on PyPI, but there is no dependency declared in setup.py. Am I correct that the implementation is pure Python i.e. doesn't require any compiled extensions? In that case, I think it could be added as a core dependency, not an optional one.

I just forgot the setup.py.
Yes, it is a pure Python implementation. I can change it to a default backend if you want.
The original post should also have better readability now.

@osma osma added this to the Short term milestone Aug 21, 2020
@osma
Member

osma commented Aug 21, 2020

> * I tested the algorithm by using the tutorial data sets. I removed the small sample of titles from the larger one to have a training and test set. In the YSO case I also needed to manually edit the ontology because one label of https://finto.fi/yso/en/page/p37741 is malformed. This breaks the automaton construction.

Good catch, this has now been fixed in YSO.

Member

@osma osma left a comment

Looks very good as a first shot. The dependency should be added to setup.py - currently Travis tests are failing due to the missing dependency.

I gave a few specific comments on the implementation.

@@ -1119,4 +1119,140 @@
<skos:altLabel xml:lang="sv">sigillvetenskap</skos:altLabel>
<skos:prefLabel xml:lang="sv">sigillografi</skos:prefLabel>
</skos:Concept>
<isothes:ConceptGroup rdf:about="http://www.yso.fi/onto/yso/p26593">
Member

Is having the concepts inside a ConceptGroup/Collection a requirement for using stwfsapy? Or done just to improve the results?

If it's a requirement, I'm a bit worried about the consequences. Many SKOS vocabularies don't have this kind of structure, although both STW and YSO do.

Contributor Author

Currently it is mandatory. But I can change that. Don't know how important this is for predictive performance.

Contributor Author

I have changed this in the library but will need to update it on PyPI. I will do so on Monday, after testing it.

Member

Do you still need to introduce this ConceptGroup into yso-archaeology.rdf for the rest of the PR to work, or can this change now be dropped?

annif/backend/stwfsapy.py (outdated, resolved)
annif/backend/stwfsapy.py (outdated, resolved)
@mo-fu
Contributor Author

mo-fu commented Aug 21, 2020

> Looks very good as a first shot. The dependency should be added to setup.py - currently Travis tests are failing due to the missing dependency.
>
> I gave a few specific comments on the implementation.

Regarding the CI: I can either make it a default dependency, or add it to the CI depending on the Python version and use import_or_skip like the other optional backends. Your decision.
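
For readers unfamiliar with the second option: tests for an optional backend are usually skipped when its dependency cannot be imported. pytest offers `pytest.importorskip` for this, which is presumably what "import_or_skip" refers to. The sketch below shows the same idea with the standard library only. The helper and the test class are made up for illustration and are not Annif's test code:

```python
import importlib.util
import unittest


def module_available(name):
    """Return True if a top-level module can be found without importing it."""
    return importlib.util.find_spec(name) is not None


@unittest.skipUnless(module_available("stwfsapy"), "stwfsapy is not installed")
class StwfsapyBackendTest(unittest.TestCase):
    def test_import(self):
        # Only runs when the optional dependency is present.
        import stwfsapy  # noqa: F401
```

Making the package a default dependency removes the need for this guard, at the cost of installing it for every user.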

@sonarcloud

sonarcloud bot commented Aug 24, 2020

Kudos, SonarCloud Quality Gate passed!

Bugs: 0 (rating A)
Vulnerabilities: 0 (rating A), Security Hotspots to review: 0
Code Smells: 0 (rating A)

No coverage information
No duplication information

@codecov

codecov bot commented Aug 24, 2020

Codecov Report

Merging #438 (fc18969) into master (98bde23) will increase coverage by 0.01%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master     #438      +/-   ##
==========================================
+ Coverage   99.41%   99.43%   +0.01%     
==========================================
  Files          65       67       +2     
  Lines        4627     4777     +150     
==========================================
+ Hits         4600     4750     +150     
  Misses         27       27              
Impacted Files Coverage Δ
annif/backend/__init__.py 100.00% <100.00%> (ø)
annif/backend/stwfsa.py 100.00% <100.00%> (ø)
annif/vocab.py 100.00% <100.00%> (ø)
tests/test_backend_stwfsa.py 100.00% <100.00%> (ø)
tests/test_vocab.py 100.00% <100.00%> (ø)
tests/test_corpus.py 100.00% <0.00%> (ø)
annif/backend/maui.py 100.00% <0.00%> (ø)
tests/test_backend_maui.py 100.00% <0.00%> (ø)
annif/backend/nn_ensemble.py 100.00% <0.00%> (ø)
... and 2 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 98bde23...fc18969.

@juhoinkinen
Member

I did some test runs with this backend, evaluating on kirjaesittelyt set (book presentations, 130 short texts like this) while training on kirjaesittelyt, JYU theses or Finna metadata sets.

Here are the F1@5 scores:

| Train set | F1@5 |
|---|---|
| kirjaesittelyt | 0.1697 |
| JYU theses | 0.1536 |
| Finna | 0.1714 |

All these scores are better than the Maui score (0.1472), where Maui had been trained on full-text documents from different collections.

@osma
Member

osma commented Nov 2, 2020

Apologies @mo-fu for taking so long to review this. As @juhoinkinen already noted, this is giving promising results - it appears to be somewhat better than Maui on short documents, as you said.

Some quick findings:

rdflib version compatibility

stwfsapy seems to require rdflib 4.2.*, while Annif currently doesn't depend on a specific version. In practice when you install Annif from scratch, you will most likely end up with rdflib 5.0.0 which is the most recent version currently. This causes a version conflict if you then try to install stwfsapy and needs to be resolved manually (at least for me) by downgrading rdflib.

I think the best option here would be for both stwfsapy and Annif to depend on the most recent version (e.g. 5.0.*). In my experience upgrading to rdflib 5.0.0 shouldn't cause any major compatibility issues for most projects that use rdflib (though I did have to do some fixes in Skosify as it did some special tricks with namespaces).

Fragile handling of vocabulary terms

At least with YSO (until we fixed it), and also with a snapshot version of the NAL Thesaurus I tried, some special characters (esp. unmatched braces) seem to be causing parse errors in stwfsapy. Would it be possible to make it more robust?
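
One way to find the labels that would trip such a parser is to pre-screen the vocabulary for unbalanced brackets. The sketch below only illustrates that check. It is not taken from stwfsapy, and the example labels are invented:

```python
# Map each closing bracket to its opening counterpart.
PAIRS = {")": "(", "]": "[", "}": "{"}
OPENERS = set(PAIRS.values())


def has_balanced_brackets(label):
    """Return True if every bracket in the label is properly matched."""
    stack = []
    for ch in label:
        if ch in OPENERS:
            stack.append(ch)
        elif ch in PAIRS:
            if not stack or stack.pop() != PAIRS[ch]:
                return False
    return not stack


labels = ["sigillography", "taxes (direct)", "broken (label"]
suspicious = [label for label in labels if not has_balanced_brackets(label)]
print(suspicious)  # ['broken (label']
```

A check like this could flag problem labels before automaton construction, although a robust parser should ideally treat them as literal text instead of failing.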

Matching all upper case letters

I happened to notice a regular expression for matching upper case characters in the stwfsapy code base. It hardcodes the set as [A-ZÄÖÜ]. This seems a bit limited, since there are many other upper case letters as well (e.g. Å in Swedish and Õ in Estonian). I know stwfsapy currently targets only English, but these characters may appear in names (and you already include ÄÖÜ). Unfortunately the standard library re module is a bit limited, but it would be possible to use the regex module from PyPI (for this and other solutions, see this SO thread).
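
The limitation can be demonstrated with the standard library alone. This sketch compares the hardcoded character class with a check based on the Unicode general category `Lu` (uppercase letter). The helper name is illustrative, and this is not the fix that was applied in stwfsapy:

```python
import re
import unicodedata

# The hardcoded class [A-ZÄÖÜ], written with escapes to avoid encoding issues.
HARDCODED_UPPER = re.compile("[A-Z\u00c4\u00d6\u00dc]")


def is_upper_letter(ch):
    """True for any single character in Unicode category Lu."""
    return unicodedata.category(ch) == "Lu"


# A, Ä (U+00C4), Å (U+00C5), Õ (U+00D5)
for ch in ["A", "\u00c4", "\u00c5", "\u00d5"]:
    print(ch, bool(HARDCODED_UPPER.fullmatch(ch)), is_upper_letter(ch))
```

The hardcoded class rejects Å and Õ, while the category check accepts all four characters. With the third-party `regex` module the same test can be written inside a pattern as `\p{Lu}`.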

Backend name

We talked about this already - I'd hope to see a shorter name for the backend, as stwfsapy is a bit unwieldy (think about recording a tutorial video about it and having to pronounce it many times!). Maybe the backend could have a shorter name - just shortening it to stwfsa would already be an improvement.

@osma
Member

osma commented Nov 2, 2020

Ah, forgot this one:

> Regarding the CI. I can make it a default dependency or add it to the CI depending on the python version and use import_or_skip like the other optional backends. Your decision.

Since this backend and stwfsapy only introduce pure Python dependencies, and in my understanding they support all the Python versions Annif currently supports (3.6-3.8), I think adding it as a default dependency would be OK.

@mo-fu
Contributor Author

mo-fu commented Nov 5, 2020

I can probably work on the issues starting next week, when our annual review has started. Assuming it goes smoothly.

@mo-fu
Contributor Author

mo-fu commented Nov 17, 2020

PR for more robust bracket handling: zbw/stwfsapy#27. I could add only ZBW organization members as reviewers, but please have a look.

@mo-fu
Contributor Author

mo-fu commented Jan 5, 2021

The next PR for stwfsapy, with changes for better upper case detection, is here: zbw/stwfsapy#29. Please request any changes necessary for adding stwfsa to Annif.

@mo-fu
Contributor Author

mo-fu commented Jan 7, 2021

After changing the rdflib dependency version (zbw/stwfsapy@a654264), I renamed the backend to stwfsa. There is a conflict in setup.py.
Do you prefer merges or rebases for resolving conflicts?

@osma
Member

osma commented Jan 7, 2021

> I renamed the backend to stwfsa.

Great!

> Do you prefer merges or rebases for resolving conflicts?

Whatever is easiest for you...there were some dependency updates recently, which is probably the cause for the conflict.

@lgtm-com

lgtm-com bot commented Jan 7, 2021

This pull request introduces 1 alert when merging b588090 into 98bde23 - view on LGTM.com

new alerts:

  • 1 for Implicit string concatenation in a list

@mo-fu
Contributor Author

mo-fu commented Jan 13, 2021

The requested changes should all be done now. Let me know if I missed something or if further changes are needed.

@osma
Member

osma commented Jan 15, 2021

Thanks @mo-fu , I'll take a look soon!

One thing that certainly needs to be done is to create a wiki page for the new backend. Like the other similar pages it should briefly explain what the backend does, how to set it up, and what the configuration parameters are.

...which brings up the issue of config parameters. The backend seems to take quite a few parameters, based on the config example you gave above. Are all of them necessary? Are there any default values? I think it would be helpful if most of the parameters had sane defaults that work with a typical, simple SKOS thesaurus.

You may want to take a look at the new MLLM backend, which aims to become a replacement for Maui. I just opened a draft PR #462 with some initial code. There's a tiny bit of overlap there with this PR since MLLM also needs the as_graph functionality - I simply copied that code from this PR. This may cause a merge conflict down the line but it should be trivial to resolve - don't worry about it now.

Member

@osma osma left a comment

I did a quick check of the code and gave some comments. I think this is getting really close to merging, basically we just need to work out the best default values (and corresponding documentation) plus a few other small details.

annif/backend/stwfsa.py (resolved)
annif/backend/stwfsa.py (outdated, resolved)
setup.py (outdated, resolved)
annif/backend/stwfsa.py (resolved)
@mo-fu mo-fu requested a review from osma January 25, 2021 12:30
@osma
Member

osma commented Jan 25, 2021

A couple more, pretty minor issues:

  1. Code Climate has some complaints about the PR. You can ignore the duplicate code warnings (the fasttext backend has similar code for handling defaults, but that's OK), but would it be easy to reduce the cyclomatic complexity of the _train method? For example, the transforming of the corpus to the X and y arrays could perhaps be extracted to a separate (static?) helper method or function, and/or introducing the graph variable could be avoided by just calling as_graph inline.
  2. I see that you've improved the wiki page for the backend, great! Now that the concept type defaults to skos:Concept, I think that those settings can be dropped from the example configurations?
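
The extraction suggested in the first point could look roughly like the sketch below. It assumes that each corpus document exposes a `text` and a collection of subject `uris`. The `Document` type, the function name and the example URIs are hypothetical stand-ins, not the code in this PR:

```python
from collections import namedtuple

# Hypothetical stand-in for a corpus document; the field names are assumptions.
Document = namedtuple("Document", ["text", "uris"])


def corpus_to_arrays(documents):
    """Split documents into parallel lists: texts (X) and subject URIs (y)."""
    X = []
    y = []
    for doc in documents:
        X.append(doc.text)
        y.append(doc.uris)
    return X, y


docs = [
    Document("seals and sealing wax", {"http://example.org/c1"}),
    Document("medieval archives", {"http://example.org/c2", "http://example.org/c3"}),
]
X, y = corpus_to_arrays(docs)
print(len(X), len(y))  # 2 2
```

Moving the loop out of `_train` removes one branch-carrying block from that method, which is what lowers its cyclomatic complexity.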

@mo-fu
Contributor Author

mo-fu commented Jan 25, 2021

> A couple more, pretty minor issues:
>
> 1. Code Climate has [some complaints](https://codeclimate.com/github/NatLibFi/Annif/pull/438) about the PR. You can ignore the duplicate code warnings (the fasttext backend has similar code for handling defaults, but that's OK), but would it be easy to reduce the cyclomatic complexity of the `_train` method? For example, the transforming of the corpus to the `X` and `y` arrays could perhaps be extracted to a separate (static?) helper method or function, and/or introducing the `graph` variable could be avoided by just calling `as_graph` inline.

Should be fixed now

> 2. I see that you've improved the [wiki page for the backend](https://github.com/NatLibFi/Annif/wiki/Backend:-STWFSA), great! Now that the concept type defaults to `skos:Concept`, I think that those settings can be dropped from the example configurations?

I kept it for the second configuration example and added some more details to the description of the second example.
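
To illustrate why a setting with a default can be left out of a configuration, here is a minimal sketch of default handling. The only default taken from this thread is the concept type (`skos:Concept`). The dictionary and function names are assumptions and do not reflect the backend's actual code:

```python
SKOS_CONCEPT = "http://www.w3.org/2004/02/skos/core#Concept"

# Only the default confirmed in this conversation is listed here.
DEFAULT_PARAMETERS = {"concept_type_uri": SKOS_CONCEPT}


def effective_parameters(user_params):
    """Overlay user-supplied settings on top of the defaults."""
    params = dict(DEFAULT_PARAMETERS)
    params.update(user_params)
    return params


minimal = effective_parameters({"language": "en", "vocab": "yso"})
print(minimal["concept_type_uri"])
```

A project that needs a different concept type simply sets `concept_type_uri` explicitly, and the explicit value wins over the default.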

@sonarcloud

sonarcloud bot commented Jan 25, 2021

Kudos, SonarCloud Quality Gate passed!

Bugs: 0 (rating A)
Vulnerabilities: 0 (rating A)
Security Hotspots: 0 (rating A)
Code Smells: 0 (rating A)

No coverage information
No duplication information

@osma osma modified the milestones: Short term, 0.51 Jan 26, 2021
@osma
Member

osma commented Jan 26, 2021

Excellent! Merging this now. 🚀

@osma osma merged commit 034a4ae into NatLibFi:master Jan 26, 2021
@osma
Member

osma commented Jan 26, 2021

Since this is now merged, I added links to the backend wiki page from the wiki front page and the sidebar.

3 participants