Include labels without language tag and concepts without labels in vocabulary #597
Hmm, I think at least the MLLM backend, possibly also YAKE and (less likely) STWFSA will need to be changed so that they don't rely on the label stored in the vocabulary. Otherwise they could be confused by the qnames.
Codecov Report
@@            Coverage Diff             @@
##           master     #597      +/-   ##
==========================================
+ Coverage   99.52%   99.54%   +0.01%
==========================================
  Files          86       86
  Lines        5636     5653      +17
==========================================
+ Hits         5609     5627      +18
+ Misses         27       26       -1
Adjusted the MLLM code so it reads prefLabels directly. Checked YAKE and STWFSA; they are already OK, as they do not use the labels from the vocabulary either.
Ready for wider testing. Code Climate still has a couple of complaints, but I can't figure out how to address them without making the code harder to understand.
3d3fa1d to 73176b4
Kudos, SonarCloud Quality Gate passed! 0 Bugs No Coverage information
Rebased on current master (after the 0.59 release) and force-pushed.
Fixes #556 by modifying the way concepts from SKOS vocabularies are loaded. There are two main changes:
yso:p12345
orlcsh:sh85061212
)

This should improve support for multilingual vocabularies and handle cases where SKOS data is missing language tags, which can happen, for example, when converting MARC21 records to SKOS, as @macsag did when reporting #556.
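
As a rough illustration of the behaviour described here (labels without a language tag are kept, and concepts with no usable label stay in the vocabulary under a qname such as yso:p12345), label selection could be sketched as below. The function name pick_label, the (text, tag) pair representation, the sample labels and the fallback order are all assumptions made for this sketch, not Annif's actual code.

```python
from typing import List, Optional, Tuple

# Hypothetical representation: each label is a (text, language_tag) pair,
# with language_tag set to None when the SKOS literal carries no tag.
Label = Tuple[str, Optional[str]]


def pick_label(labels: List[Label], language: str, qname: str) -> str:
    """Choose a display label for a concept (illustrative sketch only)."""
    # 1. Prefer a label whose tag equals the configured language.
    for text, tag in labels:
        if tag == language:
            return text
    # 2. Otherwise accept a label that has no language tag at all.
    for text, tag in labels:
        if tag is None:
            return text
    # 3. No usable label: keep the concept and fall back to its qname.
    return qname


# Made-up sample data; only the qnames come from the PR description.
print(pick_label([("cat", "en"), ("kissa", "fi")], "fi", "yso:p12345"))
print(pick_label([("Horses", None)], "en", "orlcsh:sh85061212"))
print(pick_label([("kissa", "fi")], "sv", "yso:p12345"))
```

The third call shows why backends should not treat the stored label as real text: for a concept with no label in the configured language, the sketch returns the qname itself.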
Note that unlike the solution drafted in this comment, there is no BCP 47 style matching of language tag variants (e.g. where en in the SKOS file would match the configured language en-US). I considered this out of scope for now (YAGNI principle). It could easily be added later, but it would require using a library such as langcodes for the actual language tag matching.

This PR may change the results for some multilingual corpora, for example the YSO based corpora used to train and evaluate models for Finto AI, because the vocabulary will now be larger in some cases. YSO usually lacks Swedish and/or English labels for some recently added concepts. These concepts used to be dropped when loading the vocabulary, but will now be included after this PR.
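
For reference, the language tag variant matching that this PR leaves out could look roughly like the sketch below. It is a deliberately simplified stand-in that compares only the primary language subtag. It is not what langcodes does, and real BCP 47 matching also has to consider scripts, regions and related details. The function names primary_subtag and tags_match are hypothetical.

```python
def primary_subtag(tag: str) -> str:
    """Return the lowercased primary language subtag, e.g. 'en' for 'en-US'."""
    # Accept underscore-separated variants such as 'en_US' as well.
    return tag.replace("_", "-").split("-")[0].lower()


def tags_match(label_tag: str, configured: str) -> bool:
    """Simplified matching: two tags match if their primary subtags agree."""
    return primary_subtag(label_tag) == primary_subtag(configured)


print(tags_match("en", "en-US"))  # a label tagged 'en' vs. configured 'en-US'
print(tags_match("fi", "sv"))
```

With this approach, a label tagged en would be accepted for a project configured with en-US, which is the case mentioned in the note above. A dedicated library would still be needed to match tags correctly in general.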