Limit document number CLI option #465

Merged: 4 commits merged into master from limit-document-number on Feb 3, 2021

Conversation

@juhoinkinen (Member) commented on Feb 1, 2021

This adds a CLI option --docs-limit/-d to the commands where it is applicable. The option can be used to limit the number of documents to process, for example to create learning-curve data. Learning curves can help estimate how much training data is enough.

This CLI option can be used in a shell script that runs training and evaluation steps, so the more complex implementation proposed in #364 can be abandoned.
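For illustration, such a learning-curve run could be a short shell loop over increasing document limits. This is only a sketch; the project ID, corpus paths and limit values are placeholders, not anything prescribed by this PR:

# train on progressively larger slices of the training corpus and evaluate each model
for limit in 100 500 1000 5000; do
    annif train yso-fi /path/to/train-corpus/ --docs-limit "$limit"
    annif eval yso-fi /path/to/eval-corpus/ | tee "eval-$limit.txt"
done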

@juhoinkinen added this to the 0.51 milestone on Feb 1, 2021
@codecov (bot) commented on Feb 1, 2021

Codecov Report

Merging #465 (a59efc7) into master (267cc85) will increase coverage by 0.02%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #465      +/-   ##
==========================================
+ Coverage   99.41%   99.44%   +0.02%     
==========================================
  Files          65       67       +2     
  Lines        4631     4850     +219     
==========================================
+ Hits         4604     4823     +219     
  Misses         27       27              
Impacted Files Coverage Δ
annif/corpus/__init__.py 100.00% <ø> (ø)
annif/cli.py 99.61% <100.00%> (+0.01%) ⬆️
annif/corpus/document.py 100.00% <100.00%> (ø)
tests/test_cli.py 100.00% <100.00%> (ø)
tests/test_corpus.py 100.00% <100.00%> (ø)
annif/vocab.py 100.00% <0.00%> (ø)
tests/test_vocab.py 100.00% <0.00%> (ø)
annif/backend/__init__.py 100.00% <0.00%> (ø)
annif/backend/stwfsa.py 100.00% <0.00%> (ø)
tests/test_backend_stwfsa.py 100.00% <0.00%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 267cc85...a59efc7.

@osma (Member) left a comment

Unix command line options are generally either short (with one dash) or long (with two dashes). Short options are just one character. I suggest changing -dl to -d so it follows this convention. No Annif command currently uses a -d option so it's available.

Other than that, the implementation looks very nice and good for merging!
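For example, assuming the option is renamed as suggested, the same limit could then be given in either the long or the short form (project ID and corpus path are placeholders):

$ annif train yso-fi /path/to/train-corpus/ --docs-limit 100
$ annif train yso-fi /path/to/train-corpus/ -d 100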

@sonarcloud (bot) commented on Feb 3, 2021

Kudos, SonarCloud Quality Gate passed!

0 Bugs (rating A)
0 Vulnerabilities (rating A)
0 Security Hotspots (rating A)
0 Code Smells (rating A)

No coverage information
0.0% duplication

@osma (Member) left a comment

LGTM!

@juhoinkinen (Member, Author) commented

In learning-curve runs I noticed that training an nn-ensemble model on zero or one documents fails:

$ annif train yso-fi ../Annif-corpora/fulltext/jyu-theses/fin-maui-train/ -d 1
Backend nn_ensemble: creating NN ensemble model
2021-02-03 12:04:07.753489: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-02-03 12:04:07.757431: I tensorflow/core/platform/profile_utils/cpu_utils.cc:104] CPU Frequency: 1800000000 Hz
2021-02-03 12:04:07.757935: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x55b0747e3930 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2021-02-03 12:04:07.757957: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
Epoch 1/20
Traceback (most recent call last):
  File "/home/local/jmminkin/git/Annif/venv/bin/annif", line 33, in <module>
    sys.exit(load_entry_point('annif', 'console_scripts', 'annif')())
  File "/home/local/jmminkin/git/Annif/venv/lib/python3.7/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/home/local/jmminkin/git/Annif/venv/lib/python3.7/site-packages/flask/cli.py", line 586, in main
    return super(FlaskGroup, self).main(*args, **kwargs)
  File "/home/local/jmminkin/git/Annif/venv/lib/python3.7/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/home/local/jmminkin/git/Annif/venv/lib/python3.7/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/local/jmminkin/git/Annif/venv/lib/python3.7/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/local/jmminkin/git/Annif/venv/lib/python3.7/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/home/local/jmminkin/git/Annif/venv/lib/python3.7/site-packages/click/decorators.py", line 21, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/home/local/jmminkin/git/Annif/venv/lib/python3.7/site-packages/flask/cli.py", line 426, in decorator
    return __ctx.invoke(f, *args, **kwargs)
  File "/home/local/jmminkin/git/Annif/venv/lib/python3.7/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/home/local/jmminkin/git/Annif/annif/cli.py", line 211, in run_train
    proj.train(documents, backend_params)
  File "/home/local/jmminkin/git/Annif/annif/project.py", line 191, in train
    self.backend.train(corpus, beparams)
  File "/home/local/jmminkin/git/Annif/annif/backend/backend.py", line 79, in train
    return self._train(corpus, params=beparams)
  File "/home/local/jmminkin/git/Annif/annif/backend/nn_ensemble.py", line 159, in _train
    self._fit_model(corpus, epochs=int(params['epochs']))
  File "/home/local/jmminkin/git/Annif/annif/backend/nn_ensemble.py", line 197, in _fit_model
    self._model.fit(seq, verbose=True, epochs=epochs)
  File "/home/local/jmminkin/git/Annif/venv/lib/python3.7/site-packages/tensorflow/python/keras/engine/training.py", line 108, in _method_wrapper
    return method(self, *args, **kwargs)
  File "/home/local/jmminkin/git/Annif/venv/lib/python3.7/site-packages/tensorflow/python/keras/engine/training.py", line 1104, in fit
    epoch_logs = copy.copy(logs)
UnboundLocalError: local variable 'logs' referenced before assignment

But that crash comes from TensorFlow, and training on /dev/null produces it too, so it's not caused by this PR and could be handled somewhere else (if needed). I'll merge this.

I don't remember why, but the nn-ensemble was meant to be trainable on an empty corpus (otherwise the corpus could be checked with is_empty before training).
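As a stopgap in learning-curve scripts, the failing case could simply be skipped on the shell side. A sketch with a placeholder project ID and corpus path; the threshold of 2 only reflects the zero-or-one-document failures described above:

if [ "$limit" -lt 2 ]; then
    echo "skipping nn_ensemble training with only $limit documents"
else
    annif train yso-fi /path/to/train-corpus/ -d "$limit"
fi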

@osma (Member) commented on Feb 3, 2021

I don't remember why, but the nn-ensemble was meant to be trainable on an empty corpus (otherwise the corpus could be checked with is_empty before training).

The idea here, IIRC, was that you could train on an empty corpus and end up with an NN ensemble model that is essentially equivalent to a plain ensemble: it just averages the scores, with no adjustment. Then you could use learn to train it incrementally. But it's possible that this has never actually worked...
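If that workflow does work, it would look roughly like this (the project ID and corpus path are placeholders, and whether the initial empty training currently succeeds is exactly what is in question above):

$ annif train yso-fi-nn /dev/null                # initialize the NN ensemble without training data
$ annif learn yso-fi-nn /path/to/new-documents/  # then adjust it incrementally as documents arrive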

@juhoinkinen merged commit c2cd230 into master on Feb 3, 2021
@juhoinkinen deleted the limit-document-number branch on Feb 3, 2021, 11:59
@juhoinkinen mentioned this pull request on Feb 3, 2021