Conversation

@j-rossi-nl
Contributor

@j-rossi-nl j-rossi-nl commented Jul 23, 2020

A first PR.

Passed:

  • make test
  • make style
  • make quality

Modifies:

  • trainer.py: fixes issue
  • test_trainer.py: calls Trainer.train()

It only fixes the case where the TRAINING dataset does not have a __len__ method.

The distinction is not between Dataset and IterableDataset, but between objects that are instances of a class where __len__ is implemented or not. As pointed out in the PyTorch source, the implementation of __len__ is left to the user.
The test is therefore: isinstance(dataset, collections.Sized)

NB: fixing this for the EVAL and TEST datasets will require more refactoring: everything gets funneled into Trainer._prediction_loop() without keeping track of whether it is EVAL or TEST, which makes it impossible to rely on TrainingArguments.eval_steps (not to mention that there is no test_steps field in TrainingArguments).
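The check can be illustrated with plain Python classes (a minimal sketch: collections.abc.Sized is the modern spelling of collections.Sized, and the two toy classes below are hypothetical stand-ins for PyTorch datasets, not Trainer code):

```python
import collections.abc

class StreamingDataset:
    """Stand-in for an IterableDataset: iterable, but no __len__."""
    def __iter__(self):
        return iter(range(10))

class MapStyleDataset:
    """Stand-in for a map-style Dataset that implements __len__."""
    def __init__(self, items):
        self.items = items
    def __len__(self):
        return len(self.items)
    def __getitem__(self, i):
        return self.items[i]

# The check is on __len__ availability, not on the class hierarchy:
assert not isinstance(StreamingDataset(), collections.abc.Sized)
assert isinstance(MapStyleDataset([1, 2, 3]), collections.abc.Sized)
```

isinstance(obj, collections.abc.Sized) works here because Sized defines a __subclasshook__ that only looks for a __len__ method, regardless of what the object inherits from.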

Not all datasets implement the __len__ method, so the trainer should not assume it is available.

The use case is:

  • Dataset with __len__: use num_train_epochs or max_steps to specify how long training should run
  • Dataset without __len__: use only max_steps

The limitation still holds for the EVAL / TEST datasets, which still have to implement __len__.
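The two use cases above boil down to a simple branch. Here is a hedged sketch of that control flow — resolve_training_steps is a hypothetical helper for illustration, not the actual Trainer code:

```python
import collections.abc

def resolve_training_steps(dataset, num_train_epochs=1, max_steps=-1, batch_size=8):
    """Hypothetical helper mirroring the two use cases above."""
    if isinstance(dataset, collections.abc.Sized):
        # Map-style dataset: either num_train_epochs or max_steps works.
        if max_steps > 0:
            return max_steps
        steps_per_epoch = max(len(dataset) // batch_size, 1)
        return steps_per_epoch * num_train_epochs
    # No __len__: only max_steps can bound training.
    if max_steps <= 0:
        raise ValueError("a dataset without __len__ requires max_steps > 0")
    return max_steps

# 100 examples, batch size 10, 3 epochs -> 30 steps
assert resolve_training_steps(list(range(100)), num_train_epochs=3, batch_size=10) == 30
# a generator has no __len__, so max_steps is mandatory
assert resolve_training_steps((x for x in range(5)), max_steps=7) == 7
```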

@j-rossi-nl j-rossi-nl marked this pull request as draft July 23, 2020 12:42
Collaborator

@sgugger sgugger left a comment


Thanks a lot for your PR. Note that there is some moving around of the code in Trainer coming in #5982 (I'll probably merge it today) so you may need to adapt a bit the code.
Note that there is no test_steps field since the evaluation is supposed to be complete on the test set (users can always pass along a shorter dataset if they want).

Corrected for successful make test
Corrected for no warnings make test
@j-rossi-nl
Contributor Author

Note that there is some moving around of the code in Trainer coming in #5982 (I'll probably merge it today) so you may need to adapt a bit the code.

I'll wait for the merge of #5982 and introduce the fix for #5990 after.

Note that there is no test_steps field since the evaluation is supposed to be complete on the test set (users can always pass along a shorter dataset if they want).

I see.
At the moment, the behavior is that a dataset that does not implement __len__ will be refused, whether it is the EVAL dataset or the TEST dataset.

@j-rossi-nl j-rossi-nl changed the title Trainer supports all Datasets as train_dataset, with/without __len__ #5990 [WIP] Trainer supports all Datasets as train_dataset, with/without __len__ #5990 Jul 23, 2020
Merged with PR #5982
@j-rossi-nl j-rossi-nl marked this pull request as ready for review July 24, 2020 19:53
@codecov

codecov bot commented Jul 24, 2020

Codecov Report

Merging #5995 into master will increase coverage by 0.40%.
The diff coverage is 78.57%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #5995      +/-   ##
==========================================
+ Coverage   78.50%   78.90%   +0.40%     
==========================================
  Files         146      146              
  Lines       26249    26264      +15     
==========================================
+ Hits        20606    20723     +117     
+ Misses       5643     5541     -102     
Impacted Files Coverage Δ
src/transformers/trainer.py 60.63% <78.57%> (+19.75%) ⬆️
src/transformers/generation_tf_utils.py 86.21% <0.00%> (+0.75%) ⬆️
src/transformers/tokenization_bert.py 92.23% <0.00%> (+0.91%) ⬆️
src/transformers/optimization.py 97.36% <0.00%> (+1.31%) ⬆️
src/transformers/modeling_auto.py 77.90% <0.00%> (+3.48%) ⬆️
src/transformers/training_args.py 86.73% <0.00%> (+6.12%) ⬆️

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c69ea5e...4066671. Read the comment docs.

@j-rossi-nl
Contributor Author

Hi,

  • Merged to include all changes from Cleanup Trainer and expose customization points #5982
  • Now accepts IterableDataset for any dataset (TRAIN, EVAL, TEST)
  • Displays information (how many steps, etc.)
  • test_trainer.py includes an end-to-end test of train() and predict()

It fixes issue #5990.

Checklist:

  • make test passes (all tests passed, none failed)
  • make style and make quality

Looking forward to review.

I'm still unsure about how to test what the dataset is:

  • the type hint says train_dataset: Dataset
  • PyTorch indicates it is good practice to implement __len__ on a map-style dataset, but there is no way to enforce this in code
  • PyTorch relies only on convention: a user will inherit from Dataset and implement __len__, or inherit from IterableDataset and not implement it. In DataLoader, whenever there is a doubt, it checks whether the object is an instance of Dataset or IterableDataset
  • in my code, I followed the first rule: either the object has __len__ or it does not

Still wondering if I should code it PyTorch-style and blindly trust the class hierarchy, or keep it the way it is, which is a bit more paranoid.
Any comment appreciated.
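The difference between the two approaches can be sketched with plain classes (hypothetical stand-ins for the PyTorch types; the failure mode is a map-style subclass whose author forgot __len__):

```python
class Dataset:                     # stand-in for torch.utils.data.Dataset
    pass

class IterableDataset(Dataset):    # stand-in for torch.utils.data.IterableDataset
    def __iter__(self):
        return iter(())

class ForgotLen(Dataset):          # map-style subclass that never implemented __len__
    def __getitem__(self, i):
        return i

def hierarchy_says_sized(ds):
    # PyTorch-style: trust the class hierarchy.
    return not isinstance(ds, IterableDataset)

def capability_says_sized(ds):
    # "Paranoid" style: check the actual capability.
    return hasattr(ds, "__len__")

ds = ForgotLen()
assert hierarchy_says_sized(ds)        # trusted here, but len(ds) would crash later
assert not capability_says_sized(ds)   # rejected up front
```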

@j-rossi-nl j-rossi-nl requested a review from sgugger July 24, 2020 20:08
Collaborator

@sgugger sgugger left a comment


Thanks a lot for taking care of the merge! There were still some conflicts, but I dealt with them.

I like the fact that you directly test whether the Dataset has a length or not. It seems more solid, since PyTorch does not check during Dataset init whether it actually has a length, so we can't really rely on the Dataset/IterableDataset distinction.

I just think that since (for now) the choice is to reject "infinite" datasets at evaluation/prediction (which seems fine to me, since I don't see how that could actually produce results), we should be consistent in the code. I've made a few comments to that effect, if you can adapt your PR.

improved test coverage wrt use cases.
iterable datasets allowed only for training.
exceptions raised for evaluation / prediction.
@j-rossi-nl
Contributor Author

j-rossi-nl commented Jul 25, 2020

  • the test is whether a Dataset object has __len__ or not
  • an iterable dataset is OK for training, but only if max_steps has a strictly positive value
  • an iterable dataset is not acceptable for evaluation or prediction

The confusion about eval_steps has been cleared up.

The test test_trainer_iterable_dataset in test_trainer.py has been extended to check corner cases and the associated exceptions. Only the exception type is checked, not the exception message.
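Checking only the exception type can be done without inspecting the message. A hypothetical stdlib-only sketch of that pattern (the real test lives in test_trainer.py and drives the actual Trainer; train and assert_raises here are illustrative stand-ins):

```python
def train(dataset, max_steps=-1):
    """Stand-in for Trainer.train(): rejects unsized datasets without max_steps."""
    if not hasattr(dataset, "__len__") and max_steps <= 0:
        raise ValueError("max_steps must be > 0 for a dataset without __len__")
    return "trained"

def assert_raises(exc_type, fn, *args, **kwargs):
    """Check only the exception type, not the message."""
    try:
        fn(*args, **kwargs)
    except exc_type:
        return
    raise AssertionError(f"{exc_type.__name__} was not raised")

stream = (x for x in range(5))            # a generator has no __len__
assert_raises(ValueError, train, stream)  # corner case: rejected
assert train(stream, max_steps=3) == "trained"
```

In a pytest-based suite the same idea is usually written with pytest.raises(ValueError), which likewise matches on the type unless a match pattern is given.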

Checklist:

  • make test passed (no failures)
  • make style
  • make quality

Looking forward to review.

@j-rossi-nl j-rossi-nl requested a review from sgugger July 25, 2020 08:10
Collaborator

@sgugger sgugger left a comment


Just two nitpicks, but this looks great to me, thanks for the work!

@sgugger sgugger requested review from LysandreJik and julien-c July 27, 2020 13:09
testing whether the dataset implements __len__ was done multiple times on the call stack. Once is enough.
@j-rossi-nl
Contributor Author

j-rossi-nl commented Jul 27, 2020

Nitpicking is fine in my book.

  • removed redundant test on dataset (eval / test)

Checklist:

  • make test passed (no failures)
  • make style
  • make quality

Up for review again.

@j-rossi-nl j-rossi-nl requested a review from sgugger July 30, 2020 07:40
Member

@LysandreJik LysandreJik left a comment


Great!

@stale

stale bot commented Oct 4, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix label Oct 4, 2020
@stale stale bot closed this Oct 11, 2020
@carson-sestili

Hi, is this PR still being reviewed? I would like to use Trainer with an IterableDataset, and this looks like exactly what's needed to make that happen. If you have time, I would greatly appreciate this PR getting into the next version :) thank you!

@j-rossi-nl
Contributor Author

I realize a part has been merged, but not everything.

@j-rossi-nl
Contributor Author

And I can't seem to find a way to reopen this PR. So I guess I should open a new one and link to this one...

@j-rossi-nl
Contributor Author

@carson-sestili The PR #7858 has been merged to master and fixes the bug.
You can already use it by installing from source.

@carson-sestili

@j-rossi-nl Thank you very much!
