Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix training SVC on fulltext corpus #501

Merged
merged 3 commits into from
Jul 1, 2021

Conversation

juhoinkinen
Copy link
Member

@juhoinkinen juhoinkinen commented Jul 1, 2021

Claudia reported in annif-users email list that they encountered problems when trying to train the SVC backend on full-text corpus. Their initial problem was not the one that this PR fixes, but their report brought this up.

Training SVC backend on fulltext corpus does not work but fails with TypeError: 'set' object is not subscriptable. This is because DocumentDirectory defines uris for documents as a set, while in SVC it was assumed that uris are list or other subscriptable (which for DocumentFile is true).

This PR simply makes sure the uris are a list. If a document has multiple uris a NotSupportedException is raised, because a set is not ordered so a random uri from the ones defined for the training document would be taken (if there are many).

To make SVC training unit tests work I made a clumsy fixture document_corpus_single_subject that uses the regular document_corpus fixture but includes only one subject for each document. This could surely be improved in some way.

I will also make an issue for the discrepancy of uri container types (set from from DocumentDirectory vs list from DocumentFile).

@juhoinkinen juhoinkinen added the bug label Jul 1, 2021
@juhoinkinen juhoinkinen added this to the 0.54 milestone Jul 1, 2021
@codecov
Copy link

codecov bot commented Jul 1, 2021

Codecov Report

Merging #501 (efced0a) into master (2b8e3c9) will increase coverage by 0.00%.
The diff coverage is 100.00%.

Impacted file tree graph

@@           Coverage Diff           @@
##           master     #501   +/-   ##
=======================================
  Coverage   99.48%   99.49%           
=======================================
  Files          78       78           
  Lines        5672     5687   +15     
=======================================
+ Hits         5643     5658   +15     
  Misses         29       29           
Impacted Files Coverage Δ
annif/backend/svc.py 100.00% <100.00%> (ø)
tests/conftest.py 100.00% <100.00%> (ø)
tests/test_backend_svc.py 100.00% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2b8e3c9...efced0a. Read the comment docs.

@sonarcloud
Copy link

sonarcloud bot commented Jul 1, 2021

Kudos, SonarCloud Quality Gate passed!

Bug A 0 Bugs
Vulnerability A 0 Vulnerabilities
Security Hotspot A 0 Security Hotspots
Code Smell A 0 Code Smells

No Coverage information No Coverage information
0.0% 0.0% Duplication

@juhoinkinen juhoinkinen merged commit ec8762d into master Jul 1, 2021
@osma osma deleted the fix-training-svc-on-fulltext-corpus branch August 19, 2021 14:23
@juhoinkinen juhoinkinen modified the milestones: 0.54, 0.53 Aug 23, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant