Use set as container of uris instead of list in DocumentFile #510

juhoinkinen · 2021-08-12T12:30:47Z

Makes the type of the uris container in DocumentFile and DocumentDirectory the same, i.e. set.

Closes #502.

juhoinkinen · 2021-08-12T13:02:35Z

Just changing the list to a set in DocumentFile makes an SVC test fail occasionally (about half the time). During training, when a document has multiple subjects, "arkeologit" is then no longer necessarily taken as the target subject; a random one is taken instead. Previously "arkeologit" was always the target subject because it is the first of the subjects in the training file, e.g. in here.
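To illustrate the difference, here is a minimal sketch. It is not Annif's actual code: `pick_target` is a hypothetical helper and the URIs are placeholders. With a list, the first element is always the one written first in the training file. With a set, iteration order is arbitrary, and for strings it can vary between interpreter runs because of hash randomization.

```python
def pick_target(uris):
    """Return the first URI yielded by iterating the container (hypothetical helper)."""
    return next(iter(uris))


# Placeholder URIs, not taken from the Annif test corpus.
subjects = [
    "http://example.org/arkeologit",
    "http://example.org/kaivaukset",
    "http://example.org/muinaisjaannokset",
]

# A list preserves insertion order, so the result is always the first item.
from_list = pick_target(subjects)

# A set gives no ordering guarantee: the result is some member of the set,
# and which one may differ from run to run.
from_set = pick_target(set(subjects))
```

A test that relies on the first-listed subject being the training target is therefore stable with a list but can become flaky with a set.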

Increasing the number of requested subjects to 50 helps to ensure that "arkeologit" is always among the suggested subjects (even 40 is not enough), but I wonder if there is a better way to handle this?

One possibility would be to tweak the SVC training to work with multiple target subjects, instead of taking a random one, by making as many copies of the text as there are URIs, with something like:

for uri in doc.uris:
    texts.append(doc.text)
    classes.append(uri)
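A self-contained sketch of that idea might look as follows. The `Document` tuple and the `corpus_to_rows` function are hypothetical stand-ins for illustration, not Annif's real classes, and the URIs are placeholders. Sorting the URIs is an added assumption here, used only to keep the row order reproducible when `uris` is a set.

```python
from collections import namedtuple

# Hypothetical stand-in for a corpus document; Annif's real class may differ.
Document = namedtuple("Document", ["text", "uris"])


def corpus_to_rows(documents):
    """Emit one (text, class) training row per subject URI of each document."""
    texts, classes = [], []
    for doc in documents:
        # sorted() keeps the row order deterministic even though uris is a set
        for uri in sorted(doc.uris):
            texts.append(doc.text)
            classes.append(uri)
    return texts, classes


# Placeholder documents and URIs, not taken from the Annif test corpus.
docs = [
    Document("text about ziggurats", {"http://example.org/zikkuratit"}),
    Document(
        "text about excavations",
        {"http://example.org/arkeologit", "http://example.org/kaivaukset"},
    ),
]

texts, classes = corpus_to_rows(docs)
```

With this input the second document contributes two rows with identical text and different classes, so no subject is dropped at random.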

osma · 2021-08-12T13:14:49Z

That's a bit unfortunate... In part it stems from using, in the SVC unit tests, a test corpus that is not really intended for multiclass classification.

The fix you suggested is possible but does it really help? It would change the way SVC works, perhaps not a lot, but it would affect the results in a paper on Libris DDC classification I'm writing :) Not that it matters much since this change would be in a specific release, and the paper can specify the version of Annif that was used.

Can you make a PR implementing this change (perhaps just adding to this one)? Then I could check that it won't adversely affect the results I'm getting with SVC on the Libris-DDC data set.

juhoinkinen · 2021-08-12T13:17:50Z

A simple option would be to change the suggest test to use some other input text that always gives a predictable subject.

osma · 2021-08-12T16:47:39Z

That's a great idea - do you have anything specific in mind? Maybe there is some subject in the training set that appears alone in some document.

It wouldn't hurt to have another test corpus for SVC and similar multiclass algorithms, but that's a bit more work...

sonarcloud · 2021-08-12T19:17:40Z

Kudos, SonarCloud Quality Gate passed!

0 Bugs
0 Vulnerabilities
0 Security Hotspots
0 Code Smells

No Coverage information
0.0% Duplication

codecov · 2021-08-12T19:21:28Z

Codecov Report

Merging #510 (11dfc63) into master (02111ca) will increase coverage by 0.00%.
The diff coverage is 100.00%.

@@           Coverage Diff           @@
##           master     #510   +/-   ##
=======================================
  Coverage   99.51%   99.51%           
=======================================
  Files          82       82           
  Lines        5771     5809   +38     
=======================================
+ Hits         5743     5781   +38     
  Misses         28       28

Impacted Files	Coverage Δ
annif/corpus/document.py	`100.00% <100.00%> (ø)`
tests/test_backend_svc.py	`100.00% <100.00%> (ø)`
annif/backend/nn_ensemble.py	`99.40% <0.00%> (-0.60%)`	⬇️
annif/backend/stwfsa.py	`100.00% <0.00%> (+1.56%)`	⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

juhoinkinen · 2021-08-12T19:33:24Z

I thought it would be trivial to produce an input text for which SVC always gives the same subject, but it was not... A document about zikkuratit is the only document with the "zikkuratit" subject, but to make the unit test pass reliably the input text had to contain only "zikkuratit"; adding basically anything else gave many other subject suggestions, and sometimes the "zikkuratit" subject was missing. However, using a short text including "arkeologia" and getting 20 suggestions seems to work.
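The testing approach (request enough suggestions and assert that the expected subject is among them, rather than asserting that it is ranked first) can be sketched like this. The `suggest` function, the subject names, and the scores are all made up for illustration; this is not the actual test in tests/test_backend_svc.py.

```python
def suggest(scores, limit):
    """Return the `limit` highest-scoring subjects (hypothetical stand-in for a backend)."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [subject for subject, _ in ranked[:limit]]


# Made-up scores: the expected subject is plausible but not the top hit.
scores = {
    "subject-a": 0.35,
    "expected-subject": 0.31,
    "subject-b": 0.29,
    "subject-c": 0.05,
}

results = suggest(scores, limit=3)

# Membership check: robust as long as the expected subject stays within the limit.
expected_found = "expected-subject" in results
```

A membership check tolerates small changes in ranking, at the cost of being a weaker assertion than checking the top result.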

(There is a similar indeterminacy issue with the stwfsa suggest test that I've encountered about 5 times, but I think it's not related to any recent changes; it has been appearing for some time already.)

osma

So much effort for such a small change! Looks good to me.

Use set as container of uris instead of list in DocumentFile

bb1ad33

juhoinkinen added the bug label Aug 12, 2021

juhoinkinen added this to the 0.54 milestone Aug 12, 2021

juhoinkinen requested a review from osma August 12, 2021 13:02

Increase probability of correct suggestion in SVC suggest test

11dfc63

juhoinkinen marked this pull request as ready for review August 12, 2021 19:34

osma approved these changes Aug 13, 2021

juhoinkinen merged commit 5efde6b into master Aug 13, 2021

juhoinkinen deleted the issue502-discrepancy-in-types-of-document-subjects branch August 13, 2021 07:40
