Limit document number CLI option #465
Codecov Report
@@ Coverage Diff @@
## master #465 +/- ##
==========================================
+ Coverage 99.41% 99.44% +0.02%
==========================================
Files 65 67 +2
Lines 4631 4850 +219
==========================================
+ Hits 4604 4823 +219
Misses 27 27
Continue to review full report at Codecov.
Unix command-line options are generally either short (one dash) or long (two dashes), and short options are a single character. I suggest changing -dl to -d so that it follows this convention. No Annif command currently uses a -d option, so it is available.
Other than that, the implementation looks very nice and good for merging!
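To illustrate the short/long convention, here is a minimal sketch using Python's standard-library argparse. This is only a stand-in for illustration, not Annif's actual CLI code; the parser name "example" and the help text are hypothetical.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a toy parser with one short and one long spelling of the option."""
    # "example" is a hypothetical program name, not an Annif command.
    parser = argparse.ArgumentParser(prog="example")
    parser.add_argument(
        "-d",            # short form: one dash, one character
        "--docs-limit",  # long form: two dashes, descriptive name
        type=int,
        default=None,
        metavar="N",
        help="maximum number of documents to process (illustrative only)",
    )
    return parser


# Both spellings populate the same attribute, docs_limit.
short_form = build_parser().parse_args(["-d", "100"])
long_form = build_parser().parse_args(["--docs-limit", "100"])
no_limit = build_parser().parse_args([])
```

With a single-character short option, the two spellings are interchangeable and the option can be omitted entirely to mean "no limit".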
Kudos, SonarCloud Quality Gate passed! 0 Bugs. No Coverage information.
LGTM!
In learning-curve runs I noticed that training an nn-ensemble model on zero or one documents fails:
But that crash comes from TensorFlow, and also training on I don't remember why but the nn-ensemble was meant to be trainable on an empty corpus (otherwise checking the corpus by |
The idea here was IIRC that you could train on an empty corpus so you would end up with an NN ensemble model that is essentially equivalent to a plain ensemble - it just does averaging, with no adjustment of scores. Then you could use |
This adds a CLI option --docs-limit/-d to the commands where it is applicable. The option limits the number of documents to process, which is useful for creating learning-curve data, for example. Learning curves can help to estimate how much training data is enough. The option can be used in a shell script with training and evaluation steps, so the complex implementation of #364 can be abandoned.
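As a rough sketch of the idea (not the code in this PR), a document limit can be applied lazily to any document stream with itertools.islice, and learning-curve subsets are then just the same corpus truncated at increasing sizes. The helper name limit_documents, the toy corpus and the chosen sizes are all hypothetical.

```python
from itertools import islice
from typing import Iterable, Iterator, Optional, TypeVar

T = TypeVar("T")


def limit_documents(docs: Iterable[T], docs_limit: Optional[int] = None) -> Iterator[T]:
    """Yield at most docs_limit documents; None means no limit.

    Hypothetical helper for illustration, not part of Annif.
    """
    if docs_limit is None:
        return iter(docs)
    if docs_limit < 0:
        raise ValueError("docs_limit must be non-negative")
    # islice stops after docs_limit items without reading the rest of the stream.
    return islice(docs, docs_limit)


# Toy corpus standing in for a real document collection.
corpus = [f"doc-{i}" for i in range(10)]

# Training-set sizes for a learning curve (assumed values).
curve_sizes = [2, 5, 10]
subsets = {n: list(limit_documents(corpus, n)) for n in curve_sizes}
```

In a real learning-curve run, each subset size would correspond to one training and evaluation pass with a different --docs-limit value.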