Fix training SVC on fulltext corpus #501
Merged
Claudia reported on the annif-users mailing list that they encountered problems when trying to train the SVC backend on a full-text corpus. Their initial problem was not the one that this PR fixes, but their report brought this up.
Training the SVC backend on a fulltext corpus fails with `TypeError: 'set' object is not subscriptable`. This is because `DocumentDirectory` defines the uris of a document as a set, while the SVC backend assumed that uris are a list or another subscriptable type (which is true for `DocumentFile`).

This PR simply makes sure the uris are a list. If a document has multiple uris, a `NotSupportedException` is raised: a set is unordered, so otherwise an arbitrary uri from the ones defined for the training document would be taken.

To make the SVC training unit tests work I made a clumsy fixture, `document_corpus_single_subject`, that uses the regular `document_corpus` fixture but includes only one subject for each document. This could surely be improved in some way.

I will also open an issue for the discrepancy of uri container types (set from `DocumentDirectory` vs list from `DocumentFile`).
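
For reviewers, here is a minimal standalone sketch of the failure and of the kind of guard described above. It is not the actual diff: the helper name `single_uri` and the example URIs are hypothetical, and the `NotSupportedException` class below is only a local stand-in so the snippet runs without Annif installed.

```python
class NotSupportedException(Exception):
    """Local stand-in for the Annif exception (assumption, for illustration)."""


def single_uri(uris):
    """Return the single uri of a training document.

    Hypothetical helper: accepts a set or a list, converts it to a list,
    and refuses documents with more than one uri, because picking one
    from an unordered set would be arbitrary.
    An empty container is not handled in this sketch.
    """
    uri_list = list(uris)
    if len(uri_list) > 1:
        raise NotSupportedException(
            "SVC training needs exactly one uri per document"
        )
    return uri_list[0]


# Indexing a set directly reproduces the reported error.
uris_as_set = {"http://example.org/subject1"}
try:
    uris_as_set[0]
except TypeError as err:
    print(err)

# Converting to a list first works for both container types.
print(single_uri(uris_as_set))
print(single_uri(["http://example.org/subject1"]))
```

The conversion makes no promise about ordering, which is why the multi-uri case raises instead of silently picking one.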