Limit document number CLI option #465

Merged: 4 commits merged into master from limit-document-number on Feb 3, 2021

Conversation

@juhoinkinen (Member) commented on Feb 1, 2021

This adds a CLI option --docs-limit/-d to the commands where it is applicable. The option can be used to limit the number of documents to process, for example to create learning-curve data. Learning curves can help estimate how much training data is enough.

This CLI option can be used in a shell script that runs training and evaluation steps, so the more complex implementation proposed in #364 can be abandoned.
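For illustration, such a learning-curve run could be a short shell loop over increasing document limits. This is only a sketch; the project ID, corpus paths and limit values are placeholders, not anything prescribed by this PR:

# train on progressively larger slices of the training corpus and evaluate each model
for limit in 100 500 1000 5000; do
    annif train yso-fi /path/to/train-corpus/ --docs-limit "$limit"
    annif eval yso-fi /path/to/eval-corpus/ | tee "eval-$limit.txt"
done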

@juhoinkinen added this to the 0.51 milestone on Feb 1, 2021
@codecov (bot) commented on Feb 1, 2021

Codecov Report

Merging #465 (a59efc7) into master (267cc85) will increase coverage by 0.02%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #465      +/-   ##
==========================================
+ Coverage   99.41%   99.44%   +0.02%     
==========================================
  Files          65       67       +2     
  Lines        4631     4850     +219     
==========================================
+ Hits         4604     4823     +219     
  Misses         27       27              
Impacted Files Coverage Δ
annif/corpus/__init__.py 100.00% <ø> (ø)
annif/cli.py 99.61% <100.00%> (+0.01%) ⬆️
annif/corpus/document.py 100.00% <100.00%> (ø)
tests/test_cli.py 100.00% <100.00%> (ø)
tests/test_corpus.py 100.00% <100.00%> (ø)
annif/vocab.py 100.00% <0.00%> (ø)
tests/test_vocab.py 100.00% <0.00%> (ø)
annif/backend/__init__.py 100.00% <0.00%> (ø)
annif/backend/stwfsa.py 100.00% <0.00%> (ø)
tests/test_backend_stwfsa.py 100.00% <0.00%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 267cc85...a59efc7.

@osma (Member) left a comment

Unix command line options are generally either short (with one dash) or long (with two dashes). Short options are just one character. I suggest changing -dl to -d so it follows this convention. No Annif command currently uses a -d option so it's available.

Other than that, the implementation looks very nice and good for merging!
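For example, assuming the option is renamed as suggested, the same limit could then be given in either the long or the short form (project ID and corpus path are placeholders):

$ annif train yso-fi /path/to/train-corpus/ --docs-limit 100
$ annif train yso-fi /path/to/train-corpus/ -d 100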

@sonarcloud (bot) commented on Feb 3, 2021

Kudos, SonarCloud Quality Gate passed!

0 Bugs (rating A)
0 Vulnerabilities (rating A)
0 Security Hotspots (rating A)
0 Code Smells (rating A)

No coverage information
0.0% duplication

@osma (Member) left a comment

LGTM!

@juhoinkinen (Member, Author) commented

In learning-curve runs I noticed that training an nn-ensemble model on zero or one documents fails:

$ annif train yso-fi ../Annif-corpora/fulltext/jyu-theses/fin-maui-train/ -d 1
Backend nn_ensemble: creating NN ensemble model
2021-02-03 12:04:07.753489: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-02-03 12:04:07.757431: I tensorflow/core/platform/profile_utils/cpu_utils.cc:104] CPU Frequency: 1800000000 Hz
2021-02-03 12:04:07.757935: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x55b0747e3930 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2021-02-03 12:04:07.757957: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
Epoch 1/20
Traceback (most recent call last):
  File "/home/local/jmminkin/git/Annif/venv/bin/annif", line 33, in <module>
    sys.exit(load_entry_point('annif', 'console_scripts', 'annif')())
  File "/home/local/jmminkin/git/Annif/venv/lib/python3.7/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/home/local/jmminkin/git/Annif/venv/lib/python3.7/site-packages/flask/cli.py", line 586, in main
    return super(FlaskGroup, self).main(*args, **kwargs)
  File "/home/local/jmminkin/git/Annif/venv/lib/python3.7/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/home/local/jmminkin/git/Annif/venv/lib/python3.7/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/local/jmminkin/git/Annif/venv/lib/python3.7/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/local/jmminkin/git/Annif/venv/lib/python3.7/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/home/local/jmminkin/git/Annif/venv/lib/python3.7/site-packages/click/decorators.py", line 21, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/home/local/jmminkin/git/Annif/venv/lib/python3.7/site-packages/flask/cli.py", line 426, in decorator
    return __ctx.invoke(f, *args, **kwargs)
  File "/home/local/jmminkin/git/Annif/venv/lib/python3.7/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/home/local/jmminkin/git/Annif/annif/cli.py", line 211, in run_train
    proj.train(documents, backend_params)
  File "/home/local/jmminkin/git/Annif/annif/project.py", line 191, in train
    self.backend.train(corpus, beparams)
  File "/home/local/jmminkin/git/Annif/annif/backend/backend.py", line 79, in train
    return self._train(corpus, params=beparams)
  File "/home/local/jmminkin/git/Annif/annif/backend/nn_ensemble.py", line 159, in _train
    self._fit_model(corpus, epochs=int(params['epochs']))
  File "/home/local/jmminkin/git/Annif/annif/backend/nn_ensemble.py", line 197, in _fit_model
    self._model.fit(seq, verbose=True, epochs=epochs)
  File "/home/local/jmminkin/git/Annif/venv/lib/python3.7/site-packages/tensorflow/python/keras/engine/training.py", line 108, in _method_wrapper
    return method(self, *args, **kwargs)
  File "/home/local/jmminkin/git/Annif/venv/lib/python3.7/site-packages/tensorflow/python/keras/engine/training.py", line 1104, in fit
    epoch_logs = copy.copy(logs)
UnboundLocalError: local variable 'logs' referenced before assignment

But that crash comes from TensorFlow, and training on /dev/null produces it too, so it's not caused by this PR and could be handled somewhere else (if needed). I'll merge this.

I don't remember why, but the nn-ensemble was meant to be trainable on an empty corpus (otherwise the corpus could be checked with is_empty before training).
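As a stopgap in learning-curve scripts, the failing case could simply be skipped on the shell side. A sketch with a placeholder project ID and corpus path; the threshold of 2 only reflects the zero-or-one-document failures described above:

if [ "$limit" -lt 2 ]; then
    echo "skipping nn_ensemble training with only $limit documents"
else
    annif train yso-fi /path/to/train-corpus/ -d "$limit"
fi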

@osma (Member) commented on Feb 3, 2021

I don't remember why, but the nn-ensemble was meant to be trainable on an empty corpus (otherwise the corpus could be checked with is_empty before training).

The idea here, IIRC, was that you could train on an empty corpus and end up with an NN ensemble model that is essentially equivalent to a plain ensemble: it just averages the scores, with no adjustment. Then you could use learn to train it incrementally. But it's possible that this has never actually worked...
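If that workflow does work, it would look roughly like this (the project ID and corpus path are placeholders, and whether the initial empty training currently succeeds is exactly what is in question above):

$ annif train yso-fi-nn /dev/null                # initialize the NN ensemble without training data
$ annif learn yso-fi-nn /path/to/new-documents/  # then adjust it incrementally as documents arrive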

@juhoinkinen merged commit c2cd230 into master on Feb 3, 2021
@juhoinkinen deleted the limit-document-number branch on Feb 3, 2021, 11:59
@juhoinkinen mentioned this pull request on Feb 3, 2021