Refactor SubjectSet and Document to store subject IDs instead of URIs and labels #606

osma · 2022-08-12T07:59:47Z

This PR refactors the SubjectSet class, which is used to represent gold standard (manually indexed) subjects for training and evaluation documents, so that it stores only numeric subject IDs instead of subject URIs and labels. It also changes the Document class to contain a SubjectSet instead of separate fields for subject URIs and labels. As a result, numeric IDs are used internally much more than before, and the conversion from concept URIs and/or labels to subject IDs happens earlier. The changes should simplify things overall and will likely also improve efficiency (both RAM and CPU), although I haven't measured the difference.
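
To illustrate the idea, here is a minimal sketch of the shape described above. It is not the code from this PR: `ToySubjectIndex`, `SubjectSetSketch`, `DocumentSketch`, the `by_uri` and `from_uris` methods, and the choice to silently skip unknown URIs are all assumptions made for illustration.

```python
from dataclasses import dataclass


class ToySubjectIndex:
    """Hypothetical stand-in for a subject index: maps URIs to integer IDs."""

    def __init__(self, uris):
        # The ID of a subject is simply its position in the input sequence.
        self._id_by_uri = {uri: i for i, uri in enumerate(uris)}

    def by_uri(self, uri):
        # Return the numeric ID for a URI, or None if the URI is unknown.
        return self._id_by_uri.get(uri)


class SubjectSetSketch:
    """Holds only numeric subject IDs, deduplicated, in first-seen order."""

    def __init__(self, subject_ids=()):
        # dict.fromkeys drops duplicates while preserving insertion order.
        self._subject_ids = list(dict.fromkeys(subject_ids))

    @classmethod
    def from_uris(cls, uris, index):
        # Convert URIs to IDs once, up front; unknown URIs are skipped
        # (an assumption of this sketch).
        ids = (index.by_uri(uri) for uri in uris)
        return cls(sid for sid in ids if sid is not None)

    def __len__(self):
        return len(self._subject_ids)

    def __iter__(self):
        return iter(self._subject_ids)

    def __eq__(self, other):
        if not isinstance(other, SubjectSetSketch):
            return NotImplemented
        return self._subject_ids == other._subject_ids


@dataclass
class DocumentSketch:
    """A document carries its text and one subject set, not URIs and labels."""

    text: str
    subject_set: SubjectSetSketch
```

With this shape, URIs are resolved to IDs when the document is built, so downstream code only deals with integers, for example `DocumentSketch("some text", SubjectSetSketch.from_uris(uris, index))`.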

This PR is very similar to PR #604 which did the same kind of overhaul for the SubjectSuggestion class.

codecov · 2022-08-12T08:04:36Z

Codecov Report

Merging #606 (9f152c7) into master (391ed7c) will decrease coverage by 0.00%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master     #606      +/-   ##
==========================================
- Coverage   99.55%   99.55%   -0.01%     
==========================================
  Files          86       86              
  Lines        5673     5663      -10     
==========================================
- Hits         5648     5638      -10     
  Misses         25       25

Impacted Files	Coverage Δ
annif/backend/pav.py	`98.88% <ø> (-0.02%)`	⬇️
annif/corpus/combine.py	`100.00% <ø> (ø)`
annif/project.py	`99.38% <ø> (-0.01%)`	⬇️
tests/test_eval.py	`100.00% <ø> (ø)`
annif/backend/dummy.py	`100.00% <100.00%> (ø)`
annif/backend/ensemble.py	`100.00% <100.00%> (ø)`
annif/backend/fasttext.py	`100.00% <100.00%> (ø)`
annif/backend/mllm.py	`100.00% <100.00%> (ø)`
annif/backend/nn_ensemble.py	`99.29% <100.00%> (-0.01%)`	⬇️
annif/backend/omikuji.py	`97.46% <100.00%> (-0.10%)`	⬇️
... and 20 more

sonarcloud · 2022-08-12T08:32:40Z

Kudos, SonarCloud Quality Gate passed!

0 Bugs
0 Vulnerabilities
0 Security Hotspots
0 Code Smells

No Coverage information
0.0% Duplication

osma added 2 commits August 12, 2022 10:57

Refactor SubjectSet to store subject_ids instead of uris and labels

d19a9a2

remove unused method set_subject_index

da3e398

osma added 3 commits August 12, 2022 11:06

remove unused method _uris_to_subj_ids

542b1f9

add more tests for SubjectSet nonequality

60aaa0e

remove unused parameter vocab

9f152c7

osma marked this pull request as ready for review August 12, 2022 09:12

osma requested a review from juhoinkinen August 12, 2022 09:13

osma changed the title ~~Refactor SubjectSet to store subject IDs instead of uris and labels~~ Refactor SubjectSet and Document to store subject IDs instead of URIs and labels Aug 12, 2022

osma self-assigned this Aug 12, 2022

osma added the maintenance label Aug 12, 2022

osma added this to the 0.59 milestone Aug 12, 2022

juhoinkinen approved these changes Aug 12, 2022

osma merged commit 1c2e849 into master Aug 12, 2022

osma deleted the refactor-subjectset-subject-id branch August 12, 2022 10:49

osma mentioned this pull request Aug 15, 2022

multilingual SubjectIndex backed by CSV file #608

Merged
