-
Notifications
You must be signed in to change notification settings - Fork 41
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add language detection to REST API #659
Conversation
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #659 +/- ##
=======================================
Coverage 99.64% 99.65%
=======================================
Files 91 93 +2
Lines 6831 6889 +58
=======================================
+ Hits 6807 6865 +58
Misses 24 24 ☔ View full report in Codecov by Sentry. |
I see this is only a draft at the moment, but I took a glance and I think it would be better that the end-point name had hyphen instead of underscore ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like a good start. I gave some comments on individual code lines. In addition to those:
- black formatting should be applied (see details)
- there should be a unit test in tests/test_rest.py which exercises the detect_language method
Thanks for adding tests. A few more things:
|
Right now, when making a request with no candidates or unknown candidates, the endpoint returns an empty list and when making a request with no text, it returns I also tested making a request with all 48 possible language candidates. I had about 4 GB of free memory which was used amost completely after making the request. The memory isn't released automatically afterwards but it is freed if the endpoint is accessed again (simplemma is run again) or annif is restarted. Making other requests also slows down a lot after runnig simplemma with all candidates. |
The good news is that it's not crashing! 😁 My opinions on these cases:
There should be unit tests to check that these are indeed the results.
I think this is OK, but there should also be a unit test for this special case.
Great, thanks for testing! This is mostly what I suspected, although it's a surprise that accessing the endpoint again will free the memory. (Maybe this has to do with Flask running in development mode?) This has some potential for DoS situations (intended or not), but I guess it's hard for us to avoid that given how Simplemma works. We could, however, limit the number of candidate languages per request to, say, at most 5. What do others think? @juhoinkinen ? We could also try to work with the Simplemma maintainer if we want to change the way Simplemma allocates and releases models. For example, it could be possible to ask Simplemma to release the memory immediately or after a set period like 60 seconds after use. |
Limiting the number of candidate languages seems reasonable. If there is no simple way to make the limit configurable, 5 could be a good number for that.
I noticed there is an issue in Simplemma repository about loading models to memory, which was opened just yesterday. |
Kudos, SonarCloud Quality Gate passed! 0 Bugs No Coverage information |
Just started to think, if some some testing could be performed also in tests/test_swagger.py. I don't remember just what more functionality does the tests in |
Background to the question of |
34c2538
to
1cd8003
Compare
Rebased on PR #724, adapted to use annif.simplemma_util and Connexion 3 support, force-pushed. |
1cd8003
to
7ccbbf0
Compare
7ccbbf0
to
065bfeb
Compare
… name to languages
c37ee41
to
36b479a
Compare
Quality Gate passedIssues Measures |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tested this with some quick ways, works nicely!
This PR adds the ability to detect the language of a text to the REST API. The language detection uses the simplemma python library.
A POST method is added to the end-point
/detect-language
. It expects the request body to include a json object with the text whose language is to be detected and a list of candidate languages as their IETF BCP 47ISO 639-1 codes. For example:The response is a json object with the format:
where the scores range from 0 to 1 and a
null
value is used for an unknown language.Implements REST API part of #631.
@juhoinkinen: I edited this for the latest changes (parameter name change candidates -> languages).