Refactor SubjectSet and Document to store subject IDs instead of URIs and labels #606

Merged
osma merged 5 commits into master from refactor-subjectset-subject-id
Aug 12, 2022

Member

@osma osma commented Aug 12, 2022

This PR refactors the SubjectSet class, which is used to represent gold standard (manually indexed) subjects for training and evaluation documents, so that it stores only numeric subject IDs instead of subject URIs and labels. It also changes the Document class to contain a SubjectSet instead of separate fields for subject URIs and labels. As a result, numeric IDs are used much more widely internally, and the conversion from concept URIs and/or labels to subject IDs happens earlier than before. The changes should simplify things overall and likely also improve efficiency (both RAM and CPU), although I haven't measured the difference.
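The idea can be illustrated with a minimal sketch. This is not the actual Annif code: `ToySubjectIndex`, the `from_uris` constructor, and the example URIs are all hypothetical stand-ins. The sketch only shows the general pattern of resolving URIs to numeric IDs once, at load time, and having a document carry a single subject set.

```python
from dataclasses import dataclass


class ToySubjectIndex:
    """Hypothetical vocabulary index mapping subject URIs to numeric IDs."""

    def __init__(self, uris):
        self._uris = list(uris)
        self._ids = {uri: i for i, uri in enumerate(self._uris)}

    def by_uri(self, uri):
        # Returns None for URIs that are not in the vocabulary.
        return self._ids.get(uri)

    def uri_of(self, subject_id):
        return self._uris[subject_id]


class SubjectSet:
    """Sketch of a subject set that stores only numeric subject IDs."""

    def __init__(self, subject_ids=()):
        # Keep first-seen order and drop duplicate IDs.
        self._subject_ids = list(dict.fromkeys(subject_ids))

    @classmethod
    def from_uris(cls, uris, index):
        # Convert URIs to IDs once, up front; unknown URIs are skipped.
        ids = (index.by_uri(uri) for uri in uris)
        return cls(i for i in ids if i is not None)

    def __len__(self):
        return len(self._subject_ids)

    def __iter__(self):
        return iter(self._subject_ids)

    def __eq__(self, other):
        return isinstance(other, SubjectSet) and list(self) == list(other)


@dataclass
class Document:
    """Sketch of a document holding one SubjectSet instead of URI/label fields."""

    text: str
    subject_set: SubjectSet


index = ToySubjectIndex(
    ["http://example.org/a", "http://example.org/b", "http://example.org/c"]
)
subjects = SubjectSet.from_uris(
    ["http://example.org/c", "http://example.org/a", "http://example.org/unknown"],
    index,
)
doc = Document(text="An example document", subject_set=subjects)
```

In this sketch, URIs and labels are only needed at the edges (input parsing and output formatting); everything in between can work with small integers and look the URI up again via the index when required.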

This PR is very similar to PR #604 which did the same kind of overhaul for the SubjectSuggestion class.


codecov bot commented Aug 12, 2022

Codecov Report

Merging #606 (9f152c7) into master (391ed7c) will decrease coverage by 0.00%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master     #606      +/-   ##
==========================================
- Coverage   99.55%   99.55%   -0.01%     
==========================================
  Files          86       86              
  Lines        5673     5663      -10     
==========================================
- Hits         5648     5638      -10     
  Misses         25       25              
Impacted Files                  Coverage Δ
annif/backend/pav.py            98.88% <ø> (-0.02%) ⬇️
annif/corpus/combine.py         100.00% <ø> (ø)
annif/project.py                99.38% <ø> (-0.01%) ⬇️
tests/test_eval.py              100.00% <ø> (ø)
annif/backend/dummy.py          100.00% <100.00%> (ø)
annif/backend/ensemble.py       100.00% <100.00%> (ø)
annif/backend/fasttext.py       100.00% <100.00%> (ø)
annif/backend/mllm.py           100.00% <100.00%> (ø)
annif/backend/nn_ensemble.py    99.29% <100.00%> (-0.01%) ⬇️
annif/backend/omikuji.py        97.46% <100.00%> (-0.10%) ⬇️
... and 20 more



sonarcloud bot commented Aug 12, 2022

Kudos, SonarCloud Quality Gate passed!

0 Bugs (rating A)
0 Vulnerabilities (rating A)
0 Security Hotspots (rating A)
0 Code Smells (rating A)

No Coverage information
0.0% Duplication

@osma osma marked this pull request as ready for review August 12, 2022 09:12
@osma osma requested a review from juhoinkinen August 12, 2022 09:13
@osma osma changed the title Refactor SubjectSet to store subject IDs instead of uris and labels Refactor SubjectSet and Document to store subject IDs instead of URIs and labels Aug 12, 2022
@osma osma self-assigned this Aug 12, 2022
@osma osma added this to the 0.59 milestone Aug 12, 2022
@osma osma merged commit 1c2e849 into master Aug 12, 2022
@osma osma deleted the refactor-subjectset-subject-id branch August 12, 2022 10:49