Avoid accessing .dataset of a DataLoader in Trainer by sanderland · Pull Request #16451 · huggingface/transformers

sanderland · 2022-03-28T14:19:33Z

What does this PR do?

Makes Trainer respect get_train_dataloader and the other dataloader getters, rather than going back to .train_dataset or requiring attributes of the dataloader to be directly accessible.
- This allows overriding it with any object that implements the methods required of a DataLoader (__len__ and __iter__), without additional requirements.
- The original motivation was to train on a multi-task dataloader which defers to multiple dataloaders.
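As an illustration of the kind of object this is meant to allow, here is a minimal sketch of a multi-task loader that defers to several underlying loaders. The class name `MultiTaskLoader` and the plain lists standing in for real dataloaders are assumptions made for this example; they are not part of the PR.

```python
import itertools
from typing import Iterable, Iterator, Sequence


class MultiTaskLoader:
    """Hypothetical wrapper that chains several sized iterables of batches.

    It exposes only __len__ and __iter__ and deliberately has no
    .dataset attribute.
    """

    def __init__(self, loaders: Sequence[Iterable]):
        self.loaders = list(loaders)

    def __len__(self) -> int:
        # Total number of batches across all wrapped loaders.
        return sum(len(loader) for loader in self.loaders)

    def __iter__(self) -> Iterator:
        # Yield every batch of the first loader, then the second, and so on.
        return itertools.chain.from_iterable(self.loaders)


# Plain lists of "batches" stand in for real DataLoader objects here.
task_a = [{"task": "a", "batch": i} for i in range(3)]
task_b = [{"task": "b", "batch": i} for i in range(2)]
loader = MultiTaskLoader([task_a, task_b])
```

A Trainer subclass could return such an object from an overridden get_train_dataloader, which only works if the training loop never reaches for `loader.dataset`.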

Before submitting

- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
  - Discussed in Confusing interaction between training dataloaders and datasets in Trainer #16388
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
- Did you write any new necessary tests?

HuggingFaceDocBuilderDev · 2022-03-28T14:33:10Z

The documentation is not available anymore as the PR was closed or merged.

sanderland · 2022-03-29T10:22:20Z

@sgugger this should be ready for review.
You were right that there were a couple more places to change, and the logic is quite inconsistent in places. I've tried to be on the defensive side in covering cases:

- dataloader.dataset can exist or not, and have a length or not
- dataloader always has a len, but it can raise an exception in fairly common cases
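The first case can be sketched as a defensive example count. The helper name `estimate_num_examples`, its `batch_size` parameter, and the three stand-in classes are hypothetical and only illustrate the idea; this is not claimed to be the code in the PR.

```python
def estimate_num_examples(dataloader, batch_size: int) -> int:
    """Hypothetical helper: prefer len(dataloader.dataset), else estimate."""
    try:
        return len(dataloader.dataset)
    except (NameError, AttributeError, TypeError):
        # No .dataset attribute, or a dataset without a usable length:
        # estimate from the number of batches instead.
        return len(dataloader) * batch_size


class WithDataset:
    dataset = list(range(10))

    def __len__(self):
        return 3  # 10 examples in batches of 4


class WithoutDataset:
    def __len__(self):
        return 3


class UnsizedDataset:
    dataset = iter(range(10))  # iterators do not support len()

    def __len__(self):
        return 3
```

The fallback still calls `len(dataloader)`, so it covers only the first bullet; a dataloader whose own length raises needs a separate check.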

This implementation works for my particular case, giving the same output in training+evaluation as before, but without the really painful workarounds.

I had a look at tests and they look complicated, so I will add some after getting confirmation that this is ok otherwise.

sgugger

Thanks for your PR! It tries to do too much at the same time however. There is no reason to change the signature of the function get_train_dataloader, so it should be left as is IMO. Even if it's a change we would like to implement, it should be done in its own separate PR.

Then there is a lot of code that could be refactored using the has_length function (and improving it a tiny bit).

Lastly, this PR breaks the current logging of the number of examples; this should be fixed before we can merge it.

sgugger · 2022-03-29T14:23:19Z

src/transformers/trainer.py

+        len_dataloader = None
+        try:
+            len_dataloader = len(train_dataloader)
+        except (NameError, TypeError):  # Default dataloader calls len(dataset), which may not exist


We have a function has_length that would simplify the code greatly here, we can add the NameError inside it.

Refactoring as suggested, although has_length is a somewhat confusing name for "__len__ does not raise an exception".
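To make the distinction concrete, here is a minimal sketch of a check in the spirit of has_length with NameError added, as suggested above. The name `has_usable_length` and the `DelegatingLoader` class are assumptions for this example and are not the library's actual code.

```python
def has_usable_length(obj) -> bool:
    """Hypothetical check: True only if len(obj) can actually be evaluated."""
    try:
        return len(obj) is not None
    except (NameError, TypeError):
        # __len__ is missing, or it exists but raises when called.
        return False


class DelegatingLoader:
    """Stand-in for a loader whose __len__ defers to its dataset."""

    def __init__(self, dataset):
        self.dataset = dataset

    def __len__(self):
        return len(self.dataset)


sized = DelegatingLoader([1, 2, 3])
unsized = DelegatingLoader(x for x in range(3))  # generators have no len()
```

Both objects define `__len__`, so an attribute check alone cannot tell them apart; only calling `len` and catching the error can.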

sgugger · 2022-03-29T14:24:34Z

src/transformers/trainer.py

-        )
-
        logger.info("***** Running training *****")
        logger.info(f"  Num examples = {num_examples}")


The code will error here since you're not defining num_examples anymore.

num_examples was moved up inside the if statements that deal with the len/steps/size cases

sanderland · 2022-03-29T16:16:56Z

src/transformers/trainer.py

                num_train_epochs = math.ceil(args.num_train_epochs)
-                num_train_samples = len(self.train_dataset) * args.num_train_epochs
-        else:
-            # see __init__. max_steps is set when the dataset has no __len__


Note that this comment was incorrect: max_steps would still be -1, which causes strange outputs. I have changed it to make it explicit that max_steps should be set.
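The resulting branching can be sketched roughly as follows. The function `plan_training`, its parameters, and the error message are hypothetical simplifications for illustration, not the Trainer code itself; in particular, `len_dataloader=None` stands for a dataloader without a usable length.

```python
import math
import sys


def plan_training(len_dataloader, max_steps, num_train_epochs, grad_accum_steps=1):
    """Hypothetical helper returning (max_steps, num_epochs).

    len_dataloader is None when the dataloader has no usable length.
    """
    if len_dataloader is not None:
        steps_per_epoch = max(len_dataloader // grad_accum_steps, 1)
        if max_steps > 0:
            # An explicit step budget wins; derive the epochs needed to reach it.
            return max_steps, math.ceil(max_steps / steps_per_epoch)
        total_steps = math.ceil(num_train_epochs * steps_per_epoch)
        return total_steps, math.ceil(num_train_epochs)
    if max_steps > 0:
        # Unknown length: rely entirely on max_steps, with an open-ended epoch count.
        return max_steps, sys.maxsize
    # Fail loudly instead of silently carrying max_steps = -1 forward.
    raise ValueError(
        f"max_steps must be positive if the dataloader has no length, was {max_steps}"
    )
```

Raising here, rather than continuing with -1, is what makes the requirement explicit.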

sgugger

Thanks for adapting, I added a few comments on the tests.

tests/trainer/test_trainer.py

sgugger · 2022-03-29T19:00:15Z

Thanks for implementing all the tweaks!

Sander Land added 2 commits March 28, 2022 16:15

Avoid accessing .dataset of a dataloader

aaf22e9

style

c87e3f7

Sander Land added 5 commits March 28, 2022 16:45

fix

41598a6

cleaning up, reverting some misunderstandings

00b82f6

black

4ee0a57

add train_dataset argument to get_train_dataloader, and fix other instances of length checks

815960c

flake8

4c02f8c

sanderland marked this pull request as ready for review March 29, 2022 10:17

LysandreJik requested a review from sgugger March 29, 2022 10:49

sgugger suggested changes Mar 29, 2022

Sander Land added 2 commits March 29, 2022 17:54

address comments

87f45ec

fix bug

761ab98

sanderland commented Mar 29, 2022

Sander Land added 2 commits March 29, 2022 18:18

cleanup

eef37c1

add test

a169724

sgugger approved these changes Mar 29, 2022

sgugger reviewed Mar 29, 2022

sanderland and others added 5 commits March 29, 2022 19:19

Update tests/trainer/test_trainer.py

2ca2f3f

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

under torch

b36f278

Merge branch 'trainer-better-length-check' of github.com:sanderland/transformers into trainer-better-length-check

488905e

merge

5ee2d7b

stylistic suggestion

1f8081f

sgugger merged commit d7c8ce5 into huggingface:main Mar 29, 2022
