Lexical STWFSAPY Backend #438

mo-fu · 2020-08-21T08:20:16Z

Here is the first shot at adding the lexical backend developed at ZBW to Annif. Please suggest changes and missing tests.

* The algorithm works best for English or other languages that are not morphologically rich.
* Currently it works only with short texts. My current understanding is that this is due to the lack of global features, i.e., every match by the finite state automaton is transformed into features on its own, without taking other matches into account. If there are many matches in a text, all of them are reported and scored individually, which drives down precision.
* I tested the algorithm using the tutorial data sets. I removed the small sample of titles from the larger one to get a training set and a test set. In the YSO case I also needed to manually edit the ontology because one label of https://finto.fi/yso/en/page/p37741 is malformed. This breaks the automaton construction.

I used the following config:

[stwfsapy-yso] 
name=STWFSAPY YSO 
language=en
backend=stwfsapy
vocab=yso  
concept_type_uri=http://www.w3.org/2004/02/skos/core#Concept
sub_thesaurus_type_uri=http://www.w3.org/2004/02/skos/core#Collection
thesaurus_relation_type_uri=http://www.w3.org/2004/02/skos/core#member 
thesaurus_relation_is_specialisation=True
simple_english_plural_rules=True
graph_path=/home/fuer/Annif-tutorial/data-sets/yso-nlf/yso-skos.rdf

And got the following results:

Metric	Value
Precision (doc avg):	0.3539953809523809
Recall (doc avg):	0.15510352654022555
F1 score (doc avg):	0.19221541052157337
Precision (subj avg):	0.14931225045169869
Recall (subj avg):	0.07734734983512383
F1 score (subj avg):	0.08407773848545441
Precision (weighted subj avg):	0.45994831973792344
Recall (weighted subj avg):	0.12177820581040807
F1 score (weighted subj avg):	0.16307901726109417
Precision (microavg):	0.42555643541955374
Recall (microavg):	0.12177820581040807
F1 score (microavg):	0.1893667795628981
F1@5:	0.19213065080346786
NDCG:	0.1993514066034938
NDCG@5:	0.22009559912085086
NDCG@10:	0.20175143141259658
Precision@1:	0.39159
Precision@3:	0.35656666666666664
Precision@5:	0.354183
LRAP:	0.1399838756074559
True positives:	60189
False positives:	81247
False negatives:	434062
Documents evaluated:	100000
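
The micro-averaged rows in the table can be re-derived from the raw counts at the bottom. A minimal sketch, assuming the standard definitions of precision, recall and F1 over pooled counts; the function name `micro_scores` is illustrative and not part of Annif:

```python
def micro_scores(tp, fp, fn):
    """Return (precision, recall, f1) computed from pooled counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Counts copied from the evaluation table above.
precision, recall, f1 = micro_scores(tp=60189, fp=81247, fn=434062)
print(f"{precision:.4f} {recall:.4f} {f1:.4f}")
```

The low recall follows directly from the large number of false negatives relative to true positives.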

It is a little better for the ZBW data set (document average F1 ~0.25). For our internal data set I got an improvement from 0.18 to 0.25 document average F1 compared to the default configuration of Maui Server.

osma · 2020-08-21T08:27:37Z

Thanks for the PR!

The eval results are a bit hard to read, any chance you could reformat the block, for example by indenting it?

Also I see that stwfsapy is on PyPI, but there is no dependency declared in setup.py. Am I correct that the implementation is pure Python i.e. doesn't require any compiled extensions? In that case, I think it could be added as a core dependency, not an optional one.

mo-fu · 2020-08-21T08:58:43Z

> Thanks for the PR!
>
> The eval results are a bit hard to read, any chance you could reformat the block, for example by indenting it?
>
> Also I see that stwfsapy is on PyPI, but there is no dependency declared in setup.py. Am I correct that the implementation is pure Python i.e. doesn't require any compiled extensions? In that case, I think it could be added as a core dependency, not an optional one.

Just forgot the setup.py.
Yes, it is a pure python implementation. I can change it to a default backend if you want.
The original post should also have better readability now.

osma · 2020-08-21T09:55:03Z

> I tested the algorithm by using the tutorial data sets. I removed the small sample of titles from the larger one to have a training and test set. In the YSO case I also needed to manually edit the ontology because one label of https://finto.fi/yso/en/page/p37741 is malformed. This breaks the automaton construction.

Good catch, this has now been fixed in YSO.

osma

Looks very good as a first shot. The dependency should be added to setup.py - currently Travis tests are failing due to the missing dependency.

I gave a few specific comments on the implementation.

osma · 2020-08-21T10:20:44Z

tests/corpora/archaeology/yso-archaeology.rdf

@@ -1119,4 +1119,140 @@
    <skos:altLabel xml:lang="sv">sigillvetenskap</skos:altLabel>
    <skos:prefLabel xml:lang="sv">sigillografi</skos:prefLabel>
  </skos:Concept>
+    <isothes:ConceptGroup rdf:about="http://www.yso.fi/onto/yso/p26593">


Is having the concepts inside a ConceptGroup/Collection a requirement for using stwfsapy? Or done just to improve the results?

If it's a requirement, I'm a bit worried about the consequences. Many SKOS vocabularies don't have this kind of structure, although both STW and YSO do.

mo-fu · 2020-08-21

Currently it is mandatory, but I can change that. I don't know how important this is for predictive performance.

mo-fu · 2020-08-21

I have changed this in the library but still need to update it on PyPI. I will do so on Monday, after testing it.

mo-fu · 2020-08-21T14:05:39Z

> Looks very good as a first shot. The dependency should be added to setup.py - currently Travis tests are failing due to the missing dependency.
>
> I gave a few specific comments on the implementation.

Regarding the CI: I can make it a default dependency, or add it to the CI depending on the Python version and use import_or_skip like the other optional backends. Your decision.

sonarcloud · 2020-08-24T13:40:07Z

Kudos, SonarCloud Quality Gate passed!

0 Bugs
0 Vulnerabilities (and 0 Security Hotspots to review)
0 Code Smells

No Coverage information
No Duplication information

codecov · 2020-08-24T13:42:34Z

Codecov Report

Merging #438 (fc18969) into master (98bde23) will increase coverage by 0.01%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master     #438      +/-   ##
==========================================
+ Coverage   99.41%   99.43%   +0.01%     
==========================================
  Files          65       67       +2     
  Lines        4627     4777     +150     
==========================================
+ Hits         4600     4750     +150     
  Misses         27       27

Impacted Files	Coverage Δ
annif/backend/__init__.py	`100.00% <100.00%> (ø)`
annif/backend/stwfsa.py	`100.00% <100.00%> (ø)`
annif/vocab.py	`100.00% <100.00%> (ø)`
tests/test_backend_stwfsa.py	`100.00% <100.00%> (ø)`
tests/test_vocab.py	`100.00% <100.00%> (ø)`
tests/test_corpus.py	`100.00% <0.00%> (ø)`
annif/backend/maui.py	`100.00% <0.00%> (ø)`
tests/test_backend_maui.py	`100.00% <0.00%> (ø)`
annif/backend/nn_ensemble.py	`100.00% <0.00%> (ø)`
... and 2 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

juhoinkinen · 2020-11-02T14:29:56Z

I did some test runs with this backend, evaluating on the kirjaesittelyt set (book presentations, 130 short texts) while training on the kirjaesittelyt, JYU theses or Finna metadata sets.

Here are the F1@5 scores:

train set	F1@5
kirjaesittelyt	0.1697
JYU theses	0.1536
Finna	0.1714

All of these scores are better than the score for Maui (0.1472), which was trained on full-text documents from different collections.

osma · 2020-11-02T15:04:50Z

Apologies @mo-fu for taking so long to review this. As @juhoinkinen already noted, this is giving promising results - it appears to be somewhat better than Maui on short documents, as you said.

Some quick findings:

rdflib version compatibility

stwfsapy seems to require rdflib 4.2.*, while Annif currently doesn't depend on a specific version. In practice when you install Annif from scratch, you will most likely end up with rdflib 5.0.0 which is the most recent version currently. This causes a version conflict if you then try to install stwfsapy and needs to be resolved manually (at least for me) by downgrading rdflib.

I think the best option here would be for both stwfsapy and Annif to depend on the most recent version (e.g. 5.0.*). In my experience upgrading to rdflib 5.0.0 shouldn't cause any major compatibility issues for most projects that use rdflib (though I did have to do some fixes in Skosify as it did some special tricks with namespaces).
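
The conflict can be illustrated with a simplified wildcard check. This is only a sketch: `matches_spec` is a hypothetical helper written for this illustration, and it does not reproduce how pip resolves versions (real resolution follows PEP 440).

```python
def matches_spec(version, spec):
    """Check a dotted version string against a wildcard spec such as '4.2.*'.

    Simplified on purpose: plain string comparison per component, and a
    trailing '*' accepts anything from that position on.
    """
    version_parts = version.split(".")
    spec_parts = spec.split(".")
    for have, want in zip(version_parts, spec_parts):
        if want == "*":
            return True
        if have != want:
            return False
    return len(version_parts) == len(spec_parts)


installed = "5.0.0"  # what a fresh install is likely to pull in, per the text above
print(matches_spec(installed, "4.2.*"))  # requirement described for stwfsapy
print(matches_spec(installed, "5.0.*"))  # the proposed common requirement
```

With both projects pinned to the same series, a single installed version can satisfy both requirements.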

Fragile handling of vocabulary terms

At least with YSO (until we fixed it), and also with a snapshot version of the NAL Thesaurus I tried, some special characters (esp. unmatched braces) seem to be causing parse errors in stwfsapy. Would it be possible to make it more robust?
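
As an analogy only, the same class of failure can be reproduced with the standard library `re` module. This sketch does not show stwfsapy's own parser or its fix; the label is made up, and `compile_label` is a hypothetical helper.

```python
import re


def compile_label(label):
    """Compile a vocabulary label as a literal pattern by escaping it first."""
    return re.compile(re.escape(label))


# Hypothetical label containing an unmatched opening parenthesis.
label = "enzyme (EC 3.4"

try:
    re.compile(label)  # treated as a pattern, the lone "(" is a syntax error
    raw_ok = True
except re.error:
    raw_ok = False

# Escaped, the label is matched character by character.
escaped_ok = compile_label(label).search("the enzyme (EC 3.4 family") is not None
print(raw_ok, escaped_ok)
```

Treating label text as data rather than as pattern syntax is the general idea; how that maps onto the automaton construction is up to the library.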

Matching all upper case letters

I happened to notice a regular expression for matching upper case characters in the stwfsapy code base. It hardcodes the set as [A-ZÄÖÜ]. This seems a bit limited, since there are many other upper case letters as well (e.g. Å in Swedish and Õ in Estonian). I know stwfsapy currently targets only English, but these characters may appear in names (and you already include ÄÖÜ). Unfortunately the standard library re module is a bit limited, but it would be possible to use the regex module from PyPI (for this and other solutions, see this SO thread).
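
One standard-library-only alternative is to build the character class from Unicode category data. A minimal sketch, assuming that "upper case letter" means Unicode general category `Lu`; `uppercase_class` is a hypothetical helper and not necessarily what stwfsapy ends up doing:

```python
import re
import sys
import unicodedata


def uppercase_class():
    """Build a regex character class covering every Unicode 'Lu' code point."""
    chars = (chr(cp) for cp in range(sys.maxunicode + 1))
    upper = [c for c in chars if unicodedata.category(c) == "Lu"]
    return "[" + "".join(re.escape(c) for c in upper) + "]"


NARROW = re.compile("[A-ZÄÖÜ]")        # the hardcoded set mentioned above
BROAD = re.compile(uppercase_class())  # built once, reused for all matches

for letter in "ÅÕ":
    print(letter, bool(NARROW.fullmatch(letter)), bool(BROAD.fullmatch(letter)))
```

Building the class scans the whole code point range once, so it is best done at import time. The third-party `regex` module offers `\p{Lu}` for the same purpose.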

Backend name

We talked about this already - I'd hope to see a shorter name for the backend, as stwfsapy is a bit unwieldy (think about recording a tutorial video about it and having to pronounce it many times!). Maybe the backend could have a shorter name - just shortening it to stwfsa would already be an improvement.

osma · 2020-11-02T15:15:10Z

Ah, forgot this one:

> Regarding the CI: I can make it a default dependency, or add it to the CI depending on the Python version and use import_or_skip like the other optional backends. Your decision.

Since this backend and stwfsapy only introduce pure Python dependencies, and in my understanding they support all the Python versions Annif currently supports (3.6-3.8), I think adding it as a default dependency would be OK.

mo-fu · 2020-11-05T08:42:08Z

I can probably work on the issues starting next week, when our annual review has started. Assuming it goes smoothly.

mo-fu · 2020-11-17T15:29:51Z

PR for more robust bracket handling: zbw/stwfsapy#27. I could add only ZBW organization members as reviewers, but please have a look.

mo-fu · 2021-01-05T08:56:32Z

The next PR for stwfsapy, with changes for better upper case detection, is here: zbw/stwfsapy#29. Please request any changes necessary for adding stwfsa to Annif.

mo-fu · 2021-01-07T14:03:40Z

After changing the rdflib dependency version (zbw/stwfsapy@a654264) I renamed the backend to stwfsa. There is a conflict in setup.py.
Do you prefer merges or rebases for resolving conflicts?

osma · 2021-01-07T14:05:55Z

> I renamed the backend to stwfsa.

Great!

> Do you prefer merges or rebases for resolving conflicts?

Whatever is easiest for you...there were some dependency updates recently, which is probably the cause for the conflict.

lgtm-com · 2021-01-07T14:49:03Z

This pull request introduces 1 alert when merging b588090 into 98bde23 - view on LGTM.com

new alerts:

1 for Implicit string concatenation in a list

mo-fu · 2021-01-13T08:20:41Z

The requested changes should all be done now. Let me know if I missed something or if further changes are needed.

osma · 2021-01-15T08:38:20Z

Thanks @mo-fu, I'll take a look soon!

One thing that certainly needs to be done is to create a wiki page for the new backend. Like the other similar pages it should briefly explain what the backend does, how to set it up, and what the configuration parameters are.

...which brings up the issue of config parameters. The backend seems to take quite a few parameters, based on the config example you gave above. Are all of them necessary? Are there any default values? I think it would be helpful if most of the parameters had sane defaults that work with a typical, simple SKOS thesaurus.
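
The idea of sane defaults can be sketched as a small merge step. This is a minimal sketch and not the backend's actual code: `DEFAULT_PARAMS` and `effective_params` are hypothetical names, and the default values are simply copied from the example configuration earlier in this thread.

```python
SKOS = "http://www.w3.org/2004/02/skos/core#"

# Hypothetical defaults for a plain SKOS thesaurus (assumption: values taken
# from the example configuration, not from the backend implementation).
DEFAULT_PARAMS = {
    "concept_type_uri": SKOS + "Concept",
    "sub_thesaurus_type_uri": SKOS + "Collection",
    "thesaurus_relation_type_uri": SKOS + "member",
    "thesaurus_relation_is_specialisation": True,
}


def effective_params(configured):
    """Overlay the explicitly configured values on top of the defaults."""
    params = dict(DEFAULT_PARAMS)  # copy so the defaults stay untouched
    params.update(configured)
    return params


# A project section then only needs to list what differs from the defaults.
print(effective_params({"thesaurus_relation_is_specialisation": False}))
```

With this pattern a minimal project configuration would not need to mention any of the type URIs.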

You may want to take a look at the new MLLM backend, which aims to become a replacement for Maui. I just opened a draft PR #462 with some initial code. There's a tiny bit of overlap there with this PR since MLLM also needs the as_graph functionality - I simply copied that code from this PR. This may cause a merge conflict down the line but it should be trivial to resolve - don't worry about it now.

osma

I did a quick check of the code and gave some comments. I think this is getting really close to merging, basically we just need to work out the best default values (and corresponding documentation) plus a few other small details.

osma · 2021-01-25T10:01:31Z

tests/corpora/archaeology/yso-archaeology.rdf

@@ -1119,4 +1119,140 @@
    <skos:altLabel xml:lang="sv">sigillvetenskap</skos:altLabel>
    <skos:prefLabel xml:lang="sv">sigillografi</skos:prefLabel>
  </skos:Concept>
+    <isothes:ConceptGroup rdf:about="http://www.yso.fi/onto/yso/p26593">


Do you still need to introduce this ConceptGroup into yso-archaeology.rdf for the rest of the PR to work, or can this change now be dropped?

osma · 2021-01-25T15:56:20Z

A couple more, pretty minor issues:

1. Code Climate has some complaints about the PR. You can ignore the duplicate code warnings (the fasttext backend has similar code for handling defaults, but that's OK), but would it be easy to reduce the cyclomatic complexity of the _train method? For example, the transforming of the corpus to the X and y arrays could perhaps be extracted to a separate (static?) helper method or function, and/or introducing the graph variable could be avoided by just calling as_graph inline.
2. I see that you've improved the wiki page for the backend, great! Now that the concept type defaults to skos:Concept, I think that those settings can be dropped from the example configurations?
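
The suggested extraction could look roughly like the following. It is a sketch under assumptions, not the code that was merged: `corpus_to_arrays` is a hypothetical name, and the `Document` stand-in with `text` and `uris` fields is an assumption about the corpus objects.

```python
from collections import namedtuple

# Stand-in for a corpus document; the field names are assumptions.
Document = namedtuple("Document", ["text", "uris"])


def corpus_to_arrays(documents):
    """Collect texts (X) and subject URI lists (y) in a single pass."""
    X, y = [], []
    for doc in documents:
        X.append(doc.text)
        y.append(list(doc.uris))
    return X, y


docs = [
    Document("a short title", ["http://example.org/subject/1"]),
    Document("another title", ["http://example.org/subject/2",
                               "http://example.org/subject/3"]),
]
X, y = corpus_to_arrays(docs)
print(len(X), len(y))
```

Moving the loop out of the training method leaves that method with fewer branches, which is what the complexity metric counts.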

mo-fu · 2021-01-25T16:55:51Z

> A couple more, pretty minor issues:
>
> 1. Code Climate has [some complaints](https://codeclimate.com/github/NatLibFi/Annif/pull/438) about the PR. You can ignore the duplicate code warnings (the fasttext backend has similar code for handling defaults, but that's OK), but would it be easy to reduce the cyclomatic complexity of the `_train` method? For example, the transforming of the corpus to the `X` and `y` arrays could perhaps be extracted to a separate (static?) helper method or function, and/or introducing the `graph` variable could be avoided by just calling `as_graph` inline.

Should be fixed now.

> 2. I see that you've improved the [wiki page for the backend](https://github.com/NatLibFi/Annif/wiki/Backend:-STWFSA), great! Now that the concept type defaults to `skos:Concept`, I think that those settings can be dropped from the example configurations?

I kept it for the second configuration example and added some more details to the description of the second example.

sonarcloud · 2021-01-25T16:57:51Z

Kudos, SonarCloud Quality Gate passed!

0 Bugs
0 Vulnerabilities
0 Security Hotspots
0 Code Smells

No Coverage information
No Duplication information

osma · 2021-01-26T07:29:19Z

Excellent! Merging this now. 🚀

osma · 2021-01-26T15:45:10Z

Since this is now merged, I added links to the backend wiki page from the wiki front page and the sidebar.

osma added the enhancement label Aug 21, 2020

osma added this to the Short term milestone Aug 21, 2020

osma reviewed Aug 21, 2020

mo-fu added 9 commits January 7, 2021 15:20

Added group membership to test thesaurus.

5c1e9ca

Added stwfsapy backend.

662f01f

Add STWFSAPY as dependency to setup.py

4a97213

Add as_graph method to vocabulary.

2b0bb2c

Use internal graph in stwfsapy backend.

84539a8

Perform only single iteration through document corpus when training stwfsapy backend.

176888f

Further specify test for as_graph method of vocab.

c498676

Add default arguments forstwfsapy backend thesaurus type and thesaurus relation.

b93e221

Make stwfsapy a default backend.

b299772

mo-fu force-pushed the master branch from 40c8562 to b588090 Compare January 7, 2021 14:22

Renamed stwfaspy backend to stwfsa.

82a6d1c

mo-fu force-pushed the master branch from b588090 to 82a6d1c Compare January 7, 2021 15:07

Add input_limit parameter to stwfsa backend.

783cea8

osma requested changes Jan 25, 2021

mo-fu added 6 commits January 25, 2021 11:41

Better defaults for concepts, thesauri and their relation in the STWFSA backend.

f1b24ed

Remove group info from archaeology test vocabulary.

85ed020

Use atomic_save in STWFSA backend.

2f93653

Add version constraint to rdflib.

920305c

Add test for uninitialized STWFSA backend.

3c400a8

Remove blank lines in STWFSA test file.

0e1e881

mo-fu requested a review from osma January 25, 2021 12:30

Try to reduce cyclomatic complexity in stwfsa backend.

fc18969

mo-fu force-pushed the master branch from 694560d to fc18969 Compare January 25, 2021 16:56

osma approved these changes Jan 26, 2021

osma modified the milestones: Short term, 0.51 Jan 26, 2021

osma merged commit 034a4ae into NatLibFi:master Jan 26, 2021
