[WIP] Trainer supports all Datasets as train_dataset, with/without __len__ #5990 #5995
Conversation
A first PR. Passed: make test, make style, make quality. Modifies: Trainer.train().

Not all datasets have an implementation of the __len__ method, therefore the trainer should not assume it is available. The use cases are:
- Dataset with __len__: use num_train_epochs or max_steps to specify how long training should run
- Dataset without __len__: use only max_steps

This limitation still holds for the EVAL / TEST datasets, which still have to implement __len__.

It fixes only the case where the TRAINING dataset lacks the __len__ method. The distinction is not between Dataset and IterableDataset, but between objects that are instances of a class where __len__ is implemented or not: as pointed out in the PyTorch source, implementing __len__ is up to the user. The test is therefore: isinstance(dataset, collections.Sized).

NB: fixing this for the EVAL and TEST datasets will require more refactoring: everything gets funneled to Trainer._prediction_loop() without keeping track of whether it is EVAL or TEST, which makes relying on TrainingArguments.eval_steps impossible (not to mention, there is no test_steps field in TrainingArguments).
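To make the two use cases concrete, here is a minimal sketch (illustrative names, not the actual Trainer internals) of how the training horizon could be chosen depending on whether the dataset is sized; collections.abc.Sized is the modern spelling of the collections.Sized check mentioned above:

import collections.abc

def training_horizon(train_dataset, batch_size, num_train_epochs, max_steps):
    """Pick the total number of training steps for either kind of dataset."""
    if isinstance(train_dataset, collections.abc.Sized):
        # Sized dataset: either max_steps or num_train_epochs is valid.
        if max_steps > 0:
            return max_steps
        steps_per_epoch = max(len(train_dataset) // batch_size, 1)
        return steps_per_epoch * num_train_epochs
    # Unsized dataset (e.g. a streaming IterableDataset): epochs are
    # undefined, so an explicit max_steps is the only way to stop.
    if max_steps <= 0:
        raise ValueError("Datasets without __len__ require max_steps > 0")
    return max_steps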
sgugger
left a comment
Thanks a lot for your PR. Note that there is some moving around of the code in Trainer coming in #5982 (I'll probably merge it today), so you may need to adapt the code a bit.
Note that there is no test_steps field since the evaluation is supposed to be complete on the test set (users can always pass along a shorter dataset if they want).
I'll wait for the merge of #5982 and introduce the fix for #5990 after.
I see.
Codecov Report
@@            Coverage Diff             @@
##           master    #5995      +/-   ##
==========================================
+ Coverage   78.50%   78.90%   +0.40%
==========================================
  Files         146      146
  Lines       26249    26264      +15
==========================================
+ Hits        20606    20723     +117
+ Misses       5643     5541     -102
Continue to review full report at Codecov.
Hi,
It fixes issue #5990 with my code. Checklist:
Looking forward to review. I'm still unsure about how to test what the dataset is, and I'm still wondering if I should code it.
sgugger
left a comment
Thanks a lot for taking care of the merge! There were still some conflicts, but I dealt with them.
I like the fact that you directly test whether the Dataset has a length or not. It seems more solid, since PyTorch does not check at Dataset init whether subclasses actually implement a length, so we can't really rely on the Dataset/IterableDataset distinction.
I just think that since (for now) the choice is to reject "infinite" datasets at evaluation/prediction (which seems fine to me, since I don't see how that could actually produce results), we should be consistent in the code. I've made a few comments to that effect if you can adapt your PR.
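A small demonstration of that point (illustrative classes, not library code): a type check against IterableDataset says nothing about whether __len__ exists, while collections.abc.Sized tests for it directly.

import collections.abc
from torch.utils.data import Dataset, IterableDataset

class Stream(IterableDataset):
    # A streaming dataset: defines __iter__ but no __len__.
    def __iter__(self):
        return iter(range(10**9))

class Lengthless(Dataset):
    # A map-style Dataset that forgot __len__:
    # PyTorch does not complain at init time.
    def __getitem__(self, idx):
        return idx

class Sized10(Dataset):
    def __getitem__(self, idx):
        return idx
    def __len__(self):
        return 10

print(isinstance(Stream(), collections.abc.Sized))      # False
print(isinstance(Lengthless(), collections.abc.Sized))  # False: the
# Dataset/IterableDataset distinction alone cannot be trusted.
print(isinstance(Sized10(), collections.abc.Sized))     # True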
improved test coverage wrt use cases. iterable datasets allowed only for training. exceptions raised for evaluation / prediction.
…trainer_iterable_datasets
The confusion about the test should be cleared up by the commits above. Checklist:
Looking forward to review.
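A hedged sketch of the guard those commits describe (the helper name is hypothetical, not the merged code): evaluation and prediction reject datasets that do not implement __len__, while training accepts them.

import collections.abc

def _require_sized(dataset, description):
    # Evaluation and prediction iterate the dataset exactly once in full,
    # so a dataset without __len__ (potentially infinite) is rejected early.
    if dataset is not None and not isinstance(dataset, collections.abc.Sized):
        raise ValueError(f"{description} requires a dataset implementing __len__")

# e.g. at the top of evaluate() / predict():
# _require_sized(eval_dataset, "evaluation")
# _require_sized(test_dataset, "prediction")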
sgugger
left a comment
Just two nit-picks, but this looks great to me. Thanks for the work!
testing for dataset implementing __len__ done multiple times on the call stack. Once is enough.
Nitpicking is fine in my book.
Checklist:
Up for review again.
LysandreJik
left a comment
Great!
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Hi, is this PR still being reviewed? I would like to use it.
I realize part of it has been merged, but not everything.
And I can't seem to find a way to re-open this PR, so I guess I should open a new one and link to this one...
@carson-sestili PR #7858 has been merged into master and fixes the bug.
@j-rossi-nl Thank you very much!