Use set as container of uris instead of list in DocumentFile #510
Just changing the list to a set in DocumentFile makes a test for SVC fail occasionally (about half the time). The reason is that, in training, the "arkeologit" subject is then not necessarily taken as the target subject when a document has multiple subjects; a random one is taken instead. Previously the "arkeologit" subject was always the target subject because it is the first of the subjects in the training file, e.g. in here. Increasing the number of requested subjects to 50 helps to ensure "arkeologit" is always one of the suggested subjects (even 40 is not enough), but I wonder if there is a better way to handle this. One possibility would be to tweak the SVC training to work with many target subjects instead of taking a random one, by making as many copies of the text as there are uris, with something like:

    for uri in doc.uris:
        texts.append(doc.text)
        classes.append(uri)
|
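To make the two options concrete, here is a minimal, self-contained sketch. It is not Annif's actual SVC code: the `Document` class and both helper functions are hypothetical stand-ins, and sorting the URIs is an added assumption to make the order of training examples reproducible. It contrasts taking one arbitrary URI per document (whose result depends on set iteration order) with emitting one training example per URI.

```python
from dataclasses import dataclass


@dataclass
class Document:
    """Hypothetical stand-in for a training document: a text and its subject URIs."""
    text: str
    uris: set


def single_target_pairs(docs):
    """One training example per document, using an arbitrary URI as the class.

    With a set, which URI comes out first is not guaranteed, so the chosen
    class can differ between runs when a document has several subjects.
    """
    texts, classes = [], []
    for doc in docs:
        texts.append(doc.text)
        classes.append(next(iter(doc.uris)))
    return texts, classes


def multi_target_pairs(docs):
    """One training example per (document, URI) pair.

    The text is repeated once for every URI, so no subject is dropped.
    Sorting (an assumption of this sketch) fixes the order of the examples.
    """
    texts, classes = [], []
    for doc in docs:
        for uri in sorted(doc.uris):
            texts.append(doc.text)
            classes.append(uri)
    return texts, classes


docs = [
    Document("text about excavations", {"arkeologit", "kaivaukset"}),
    Document("text about ziggurats", {"zikkuratit"}),
]
texts, classes = multi_target_pairs(docs)
```

In this sketch `multi_target_pairs` always yields three examples for the two documents above, whereas the class that `single_target_pairs` assigns to the first document may be either of its two subjects. The trade-off is that the classifier then sees the same text with conflicting labels, which is why the effect on results would need checking.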
That's a bit unfortunate. In part it stems from using a test corpus that is not really intended for multiclass classification in the SVC unit tests. The fix you suggested is possible, but does it really help? It would change the way SVC works, perhaps not a lot, but it would affect the results in a paper on Libris DDC classification I'm writing :) Not that it matters much, since this change would be in a specific release, and the paper can specify the version of Annif that was used. Can you make a PR implementing this change (perhaps just adding to this one?), so that I can check that it won't adversely affect the results I'm getting with SVC on the Libris-DDC data set? |
A simple option would be to change the suggest test to use some other input text that always gives a predictable subject. |
That's a great idea - do you have anything specific in mind? Maybe there is some subject in the training set that appears alone in some document. It wouldn't hurt to have another test corpus for SVC and similar multiclass algorithms, but that's a bit more work... |
Codecov Report
@@           Coverage Diff           @@
##           master     #510   +/-   ##
=======================================
  Coverage   99.51%   99.51%
=======================================
  Files          82       82
  Lines        5771     5809    +38
=======================================
+ Hits         5743     5781    +38
  Misses         28       28
|
I thought it would be trivial to produce an input text for which SVC always gives the same subject, but it was not... A document about zikkuratit is the only document with the "zikkuratit" subject, but to make the unit test pass reliably the input text had to contain only "zikkuratit"; adding basically anything else gave many other subject suggestions, and sometimes the "zikkuratit" subject was missing. However, using a short text that includes "arkeologia" and requesting 20 suggestions seems to work. (There is a similar indeterminacy issue with the stwfsa suggest test that I've encountered ~5 times, but I think it's not related to any recent changes; it has been appearing for some time already.) |
So much effort for such a small change! Looks good to me.
Makes the type of the uris container in DocumentFile and DocumentDirectory the same, i.e. set.
Closes #502.