Conversation

@j-rossi-nl
Contributor

@j-rossi-nl j-rossi-nl commented Jul 23, 2020

A first PR.

Passed:

  • make test
  • make style
  • make quality

Modifies:

  • trainer.py: fixes issue
  • test_trainer.py: calls Trainer.train()

It only fixes the case where the TRAINING dataset does not have a __len__ method.

The distinction is not between Dataset and IterableDataset, but between objects that are instances of a class where __len__ is implemented or not. As pointed out in the PyTorch source, the implementation of __len__ is left to the user.
The test is therefore: isinstance(dataset, collections.Sized)

NB: fixing this for the EVAL and TEST datasets will require more refactoring: everything gets funneled into Trainer._prediction_loop() without keeping track of whether it is EVAL or TEST, which makes it impossible to rely on TrainingArguments.eval_steps (not to mention that there is no test_steps field in TrainingArguments).
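The check can be illustrated with plain Python classes (a minimal sketch: collections.abc.Sized is the modern spelling of collections.Sized, and the two toy classes below are hypothetical stand-ins for PyTorch datasets, not Trainer code):

```python
import collections.abc

class StreamingDataset:
    """Stand-in for an IterableDataset: iterable, but no __len__."""
    def __iter__(self):
        return iter(range(10))

class MapStyleDataset:
    """Stand-in for a map-style Dataset that implements __len__."""
    def __init__(self, items):
        self.items = items
    def __len__(self):
        return len(self.items)
    def __getitem__(self, i):
        return self.items[i]

# The check is on __len__ availability, not on the class hierarchy:
assert not isinstance(StreamingDataset(), collections.abc.Sized)
assert isinstance(MapStyleDataset([1, 2, 3]), collections.abc.Sized)
```

isinstance(obj, collections.abc.Sized) works here because Sized defines a __subclasshook__ that only looks for a __len__ method, regardless of what the object inherits from.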

Not all datasets implement the __len__ method, so the trainer should not assume it is available.

The use case is:

  • Dataset with __len__: use num_train_epochs or max_steps to specify how long training should run
  • Dataset without __len__: use only max_steps

The limitation still holds for the EVAL / TEST datasets, which still have to implement __len__.
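The two use cases above boil down to a simple branch. Here is a hedged sketch of that control flow — resolve_training_steps is a hypothetical helper for illustration, not the actual Trainer code:

```python
import collections.abc

def resolve_training_steps(dataset, num_train_epochs=1, max_steps=-1, batch_size=8):
    """Hypothetical helper mirroring the two use cases above."""
    if isinstance(dataset, collections.abc.Sized):
        # Map-style dataset: either num_train_epochs or max_steps works.
        if max_steps > 0:
            return max_steps
        steps_per_epoch = max(len(dataset) // batch_size, 1)
        return steps_per_epoch * num_train_epochs
    # No __len__: only max_steps can bound training.
    if max_steps <= 0:
        raise ValueError("a dataset without __len__ requires max_steps > 0")
    return max_steps

# 100 examples, batch size 10, 3 epochs -> 30 steps
assert resolve_training_steps(list(range(100)), num_train_epochs=3, batch_size=10) == 30
# a generator has no __len__, so max_steps is mandatory
assert resolve_training_steps((x for x in range(5)), max_steps=7) == 7
```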

@j-rossi-nl j-rossi-nl marked this pull request as draft July 23, 2020 12:42
Collaborator

@sgugger sgugger left a comment


Thanks a lot for your PR. Note that there is some moving around of the code in Trainer coming in #5982 (I'll probably merge it today) so you may need to adapt a bit the code.
Note that there is no test_steps field since the evaluation is supposed to be complete on the test set (users can always pass along a shorter dataset if they want).

Corrected for successful make test
Corrected for no warnings make test
@j-rossi-nl
Contributor Author

Note that there is some moving around of the code in Trainer coming in #5982 (I'll probably merge it today) so you may need to adapt a bit the code.

I'll wait for the merge of #5982 and introduce the fix for #5990 after.

Note that there is no test_steps field since the evaluation is supposed to be complete on the test set (users can always pass along a shorter dataset if they want).

I see.
At the moment, the behavior is that a dataset that does not implement __len__ will be refused, whether it is the EVAL dataset or the TEST dataset.

@j-rossi-nl j-rossi-nl changed the title Trainer supports all Datasets as train_dataset, with/without __len__ #5990 [WIP] Trainer supports all Datasets as train_dataset, with/without __len__ #5990 Jul 23, 2020
Merged with PR #5982
@j-rossi-nl j-rossi-nl marked this pull request as ready for review July 24, 2020 19:53
@codecov

codecov bot commented Jul 24, 2020

Codecov Report

Merging #5995 into master will increase coverage by 0.40%.
The diff coverage is 78.57%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #5995      +/-   ##
==========================================
+ Coverage   78.50%   78.90%   +0.40%     
==========================================
  Files         146      146              
  Lines       26249    26264      +15     
==========================================
+ Hits        20606    20723     +117     
+ Misses       5643     5541     -102     
Impacted Files Coverage Δ
src/transformers/trainer.py 60.63% <78.57%> (+19.75%) ⬆️
src/transformers/generation_tf_utils.py 86.21% <0.00%> (+0.75%) ⬆️
src/transformers/tokenization_bert.py 92.23% <0.00%> (+0.91%) ⬆️
src/transformers/optimization.py 97.36% <0.00%> (+1.31%) ⬆️
src/transformers/modeling_auto.py 77.90% <0.00%> (+3.48%) ⬆️
src/transformers/training_args.py 86.73% <0.00%> (+6.12%) ⬆️

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c69ea5e...4066671. Read the comment docs.

@j-rossi-nl
Contributor Author

Hi,

  • Merged to include all changes from Cleanup Trainer and expose customization points #5982
  • Now accepts IterableDataset for any dataset (TRAIN, EVAL, TEST)
  • Displays information (how many steps, etc.)
  • test_trainer.py includes an end-to-end test of train() and predict()

It fixes issue #5990.

Checklist:

  • make test passes (all tests passed, none failed)
  • make style and make quality

Looking forward to review.

I'm still unsure about how to test what the dataset is:

  • the type hint says train_dataset: Dataset
  • PyTorch indicates it is good practice to implement __len__ on a map-style dataset, but there is no way to enforce this in code
  • PyTorch relies only on convention: a user will inherit from Dataset and implement __len__, or inherit from IterableDataset and not implement it. In DataLoader, whenever there is a doubt, it checks whether the object is an instance of Dataset or IterableDataset
  • in my code, I followed the first rule: either the object has __len__ or it does not

Still wondering if I should code it PyTorch-style and blindly trust the class hierarchy, or keep it the way it is, which is a bit more paranoid.
Any comment appreciated.
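The difference between the two approaches can be sketched with plain classes (hypothetical stand-ins for the PyTorch types; the failure mode is a map-style subclass whose author forgot __len__):

```python
class Dataset:                     # stand-in for torch.utils.data.Dataset
    pass

class IterableDataset(Dataset):    # stand-in for torch.utils.data.IterableDataset
    def __iter__(self):
        return iter(())

class ForgotLen(Dataset):          # map-style subclass that never implemented __len__
    def __getitem__(self, i):
        return i

def hierarchy_says_sized(ds):
    # PyTorch-style: trust the class hierarchy.
    return not isinstance(ds, IterableDataset)

def capability_says_sized(ds):
    # "Paranoid" style: check the actual capability.
    return hasattr(ds, "__len__")

ds = ForgotLen()
assert hierarchy_says_sized(ds)        # trusted here, but len(ds) would crash later
assert not capability_says_sized(ds)   # rejected up front
```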

@j-rossi-nl j-rossi-nl requested a review from sgugger July 24, 2020 20:08
Collaborator

@sgugger sgugger left a comment


Thanks a lot for taking care of the merge! There were still some conflicts, but I dealt with them.

I like the fact that you directly test whether the Dataset has a length or not. It seems more solid, since PyTorch does not check during Dataset init whether it actually has a length, so we can't really rely on the Dataset/IterableDataset distinction.

I just think that since (for now) the choice is to reject "infinite" datasets at evaluation/prediction (which seems fine to me, since I don't see how that could actually produce results), we should be consistent in the code. I've made a few comments to that effect, if you can adapt your PR.

improved test coverage wrt use cases.
iterable datasets allowed only for training.
exceptions raised for evaluation / prediction.
@j-rossi-nl
Contributor Author

j-rossi-nl commented Jul 25, 2020

  • the test is whether a Dataset object has __len__ or not
  • an iterable dataset is OK for training, but only if max_steps has a strictly positive value
  • an iterable dataset is not acceptable for evaluation or prediction

The confusion about eval_steps has been cleared up.

The test test_trainer_iterable_dataset in test_trainer.py has been extended to check corner cases and the associated exceptions. Only the exception type is checked, not the exception message.
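Checking only the exception type can be done without inspecting the message. A hypothetical stdlib-only sketch of that pattern (the real test lives in test_trainer.py and drives the actual Trainer; train and assert_raises here are illustrative stand-ins):

```python
def train(dataset, max_steps=-1):
    """Stand-in for Trainer.train(): rejects unsized datasets without max_steps."""
    if not hasattr(dataset, "__len__") and max_steps <= 0:
        raise ValueError("max_steps must be > 0 for a dataset without __len__")
    return "trained"

def assert_raises(exc_type, fn, *args, **kwargs):
    """Check only the exception type, not the message."""
    try:
        fn(*args, **kwargs)
    except exc_type:
        return
    raise AssertionError(f"{exc_type.__name__} was not raised")

stream = (x for x in range(5))            # a generator has no __len__
assert_raises(ValueError, train, stream)  # corner case: rejected
assert train(stream, max_steps=3) == "trained"
```

In a pytest-based suite the same idea is usually written with pytest.raises(ValueError), which likewise matches on the type unless a match pattern is given.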

Checklist:

  • make test passed (no failures)
  • make style
  • make quality

Looking forward to review.

@j-rossi-nl j-rossi-nl requested a review from sgugger July 25, 2020 08:10
Collaborator

@sgugger sgugger left a comment


Just two nitpicks, but this looks great to me, thanks for the work!

@sgugger sgugger requested review from LysandreJik and julien-c July 27, 2020 13:09
testing whether the dataset implements __len__ was done multiple times on the call stack. Once is enough.
@j-rossi-nl
Contributor Author

j-rossi-nl commented Jul 27, 2020

Nitpicking is fine in my book.

  • removed redundant test on dataset (eval / test)

Checklist:

  • make test passed (no failures)
  • make style
  • make quality

Up for review again.

@j-rossi-nl j-rossi-nl requested a review from sgugger July 30, 2020 07:40
Member

@LysandreJik LysandreJik left a comment


Great!

@stale

stale bot commented Oct 4, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix label Oct 4, 2020
@stale stale bot closed this Oct 11, 2020
@carson-sestili

Hi, is this PR still being reviewed? I would like to use Trainer with an IterableDataset, and this looks like exactly what's needed to make that happen. If you have time, I would greatly appreciate this PR getting into the next version :) thank you!

@j-rossi-nl
Contributor Author

I realize a part has been merged, but not everything.

@j-rossi-nl
Contributor Author

And I can't seem to find a way to reopen this PR. So I guess I should open a new one and link to this one...

@j-rossi-nl
Contributor Author

@carson-sestili The PR #7858 has been merged to master and fixes the bug.
You can already use it by installing from source.

@carson-sestili

@j-rossi-nl Thank you very much!
