Fix training SVC on fulltext corpus #501

juhoinkinen · 2021-07-01T06:32:15Z

Claudia reported on the annif-users mailing list that they encountered problems when trying to train the SVC backend on a full-text corpus. Their initial problem was not the one that this PR fixes, but their report brought this one to light.

Training the SVC backend on a fulltext corpus fails with TypeError: 'set' object is not subscriptable. This is because DocumentDirectory defines the uris of a document as a set, while the SVC backend assumed that uris is a list or another subscriptable type (which is true for DocumentFile).

This PR simply makes sure the uris are a list. If a document has multiple uris, a NotSupportedException is raised: a set is unordered, so indexing into the converted list would pick an arbitrary uri from those defined for the training document.
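The behaviour described above can be illustrated with a minimal sketch. It is not the actual code in annif/backend/svc.py: the helper name single_uri is hypothetical, and NotSupportedException is redefined locally as a stand-in for Annif's exception so that the snippet runs on its own.

```python
class NotSupportedException(Exception):
    """Local stand-in for Annif's exception of the same name (assumption)."""


def single_uri(uris):
    """Return the only URI of a training document.

    Accepts a set (as produced by DocumentDirectory) or a list (as produced
    by DocumentFile). Empty input is not handled in this sketch.
    """
    uris = list(uris)  # a set cannot be indexed, a list can
    if len(uris) > 1:
        # Taking uris[0] from what used to be a set would be arbitrary,
        # so refuse instead of silently picking one.
        raise NotSupportedException(
            "SVC training does not support documents with multiple subjects")
    return uris[0]


# The original failure: indexing a set raises TypeError.
uris_from_directory = {"http://example.org/subject1"}
try:
    uris_from_directory[0]
except TypeError as err:
    print(err)

# After conversion, both container types behave the same way.
print(single_uri(uris_from_directory))
print(single_uri(["http://example.org/subject1"]))
```

Converting at the point of use keeps the change local to the SVC backend; the container type discrepancy itself is a separate matter.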

To make the SVC training unit tests work, I made a clumsy fixture document_corpus_single_subject that uses the regular document_corpus fixture but includes only one subject for each document. This could surely be improved in some way.
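As a rough idea of what such a reduction could look like, here is a sketch under stated assumptions. It is not the fixture in tests/conftest.py: the Document tuple and the function single_subject_docs are hypothetical, and keeping the first uri in sorted order is just one possible way to pick a single subject deterministically.

```python
from collections import namedtuple

# Hypothetical stand-in for a corpus document: text plus a set of subject URIs.
Document = namedtuple("Document", ["text", "uris"])


def single_subject_docs(documents):
    """Yield each document restricted to a single subject URI."""
    for doc in documents:
        # sorted() makes the choice deterministic even when uris is a set.
        first = sorted(doc.uris)[:1]
        yield Document(text=doc.text, uris=set(first))


docs = [
    Document("first text", {"http://example.org/b", "http://example.org/a"}),
    Document("second text", {"http://example.org/c"}),
]
for doc in single_subject_docs(docs):
    print(doc.text, doc.uris)
```

In a test suite, a generator like this could be wrapped in a pytest fixture that takes the regular corpus fixture as its argument.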

I will also open an issue about the discrepancy in uri container types (set from DocumentDirectory vs. list from DocumentFile).

codecov · 2021-07-01T06:32:45Z

Codecov Report

Merging #501 (efced0a) into master (2b8e3c9) will increase coverage by 0.00%.
The diff coverage is 100.00%.

@@           Coverage Diff           @@
##           master     #501   +/-   ##
=======================================
  Coverage   99.48%   99.49%           
=======================================
  Files          78       78           
  Lines        5672     5687   +15     
=======================================
+ Hits         5643     5658   +15     
  Misses         29       29

Impacted Files	Coverage Δ
annif/backend/svc.py	`100.00% <100.00%> (ø)`
tests/conftest.py	`100.00% <100.00%> (ø)`
tests/test_backend_svc.py	`100.00% <100.00%> (ø)`

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

sonarcloud · 2021-07-01T06:32:53Z

Kudos, SonarCloud Quality Gate passed!

0 Bugs
0 Vulnerabilities
0 Security Hotspots
0 Code Smells

No Coverage information
0.0% Duplication

juhoinkinen added 3 commits June 29, 2021 22:50

Convert set of uris from DocumentDirectory to list of uris

3fb6176

Raise NotSupportedException for training SVC on docs with many subjects

38d7aec

Fixture of corpus of docs with one subject only for SVC backend training

efced0a

juhoinkinen added the bug label Jul 1, 2021

juhoinkinen added this to the 0.54 milestone Jul 1, 2021

juhoinkinen merged commit ec8762d into master Jul 1, 2021

juhoinkinen mentioned this pull request Jul 1, 2021

Discrepancy in types of document subjects #502

Closed

juhoinkinen mentioned this pull request Aug 10, 2021

Warn instead of error in case of multiple subjects per doc in SVC training #509

Merged

osma deleted the fix-training-svc-on-fulltext-corpus branch August 19, 2021 14:23

juhoinkinen modified the milestones: 0.54, 0.53 Aug 23, 2021
