
Conversation

@sshleifer (Contributor) commented Sep 29, 2020

  1. Fix DDP access to model.config (we could also set self.config = model.config earlier in __init__); a sketch of the failure follows this list.
  2. Switch torch.Tensor -> torch.tensor. The latter "infers the dtype automatically".

After these changes, the command from Seq2SeqTrainer Distributed: AttributeError and the RuntimeError #7460 works.
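A minimal sketch of the first problem, assuming a hypothetical TinyModel standing in for a PreTrainedModel and an already-initialized process group (this is illustrative, not the Trainer code itself): DistributedDataParallel does not forward arbitrary attribute lookups to the wrapped module, so model.config raises once the model has been wrapped.

import torch
from torch.nn.parallel import DistributedDataParallel

# Hypothetical stand-in for a PreTrainedModel carrying a `config` attribute.
class TinyModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.config = {"pad_token_id": 0}  # placeholder for a PretrainedConfig
        self.linear = torch.nn.Linear(4, 4)

    def forward(self, x):
        return self.linear(x)

# Assumes torch.distributed.init_process_group(...) has already been called,
# as it is before the model gets wrapped in distributed training.
model = DistributedDataParallel(TinyModel())

model.module.config  # OK: the attribute still lives on the wrapped module
model.config         # AttributeError: 'DistributedDataParallel' object has no attribute 'config'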

CC @patil-suraj, @TevenLeScao

@sshleifer requested a review from sgugger September 29, 2020 21:18
@sshleifer changed the title from "reset model.config" to "Trainer: reset model.config after calling DDP" Sep 29, 2020
@sshleifer linked an issue (#7460) Sep 29, 2020 that may be closed by this pull request
@sgugger (Collaborator) commented Sep 29, 2020

Can we see where the config is accessed (in your error message)? model.config should be accessed as sparingly as possible in Trainer so that it works with any kind of model, and I'll probably remove the requirement entirely soon.

@sshleifer changed the title from "Trainer: reset model.config after calling DDP" to "Distributed Trainer: 2 little fixes" Sep 29, 2020

# Distributed training (should be after apex fp16 initialization)
if self.args.local_rank != -1:
    config = model.config
@sgugger (Collaborator) commented Sep 29, 2020

We shouldn't assume the model has a config without a proper check; having Trainer work with models that are not PreTrainedModels is a feature that has been requested. If there is an access to config that makes the code fail, we should fix that place instead.

@sshleifer (Contributor, Author) commented Sep 30, 2020

It's already assumed that model.config exists: the base trainer.py accesses model.config 23 times, including in the statement right below this one:

https://github.com/huggingface/transformers/blob/master/src/transformers/trainer.py#L682

@sshleifer (Contributor, Author) commented:

Seq2SeqTrainer uses model.config 8 times, mostly config.pad_token_id, to avoid counting padding in the loss function.
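For context, a hedged sketch of how a pad_token_id is typically used to keep padded label positions out of the loss (the tensors and the pad id are made up for illustration; this is not the Seq2SeqTrainer code):

import torch
from torch.nn import CrossEntropyLoss

pad_token_id = 0  # would normally be read from model.config.pad_token_id
loss_fct = CrossEntropyLoss(ignore_index=pad_token_id)

logits = torch.randn(2, 5, 32)            # (batch, seq_len, vocab_size)
labels = torch.tensor([[4, 7, 2, 0, 0],   # trailing pad positions are
                       [9, 3, 0, 0, 0]])  # ignored by the loss
loss = loss_fct(logits.view(-1, logits.size(-1)), labels.view(-1))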

@sgugger (Collaborator) commented Sep 30, 2020

You should add an assert that the model is a PreTrainedModel at init, just to be clean. Then, for your specific problem, use self._actual_model() to grab the config and avoid your error (e.g., self.model.config -> self._actual_model().config).

Trainer is on its way to fully handling models without a config; see #7464.
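For illustration, a sketch of what such an unwrapping helper could look like; the name actual_model and its exact behavior are assumptions here, and the real Trainer._actual_model() may differ:

import torch
from torch.nn.parallel import DistributedDataParallel

def actual_model(model: torch.nn.Module) -> torch.nn.Module:
    # Unwrap DataParallel / DistributedDataParallel so that attributes such
    # as `config` are read from the underlying (pretrained) model.
    if isinstance(model, (torch.nn.DataParallel, DistributedDataParallel)):
        return model.module
    return model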

@sshleifer (Contributor, Author) commented:

OK, I've reduced the scope of this PR to just the torch.Tensor -> torch.tensor change.
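For reference, a quick illustration of the behavioral difference between the two constructors (not code from this PR):

import torch

print(torch.Tensor([1, 2, 3]).dtype)  # torch.float32 -- the legacy constructor always builds a float tensor
print(torch.tensor([1, 2, 3]).dtype)  # torch.int64   -- torch.tensor infers the dtype from the data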

@sgugger (Collaborator) left a comment

That works for me :-)


Successfully merging this pull request may close: Seq2SeqTrainer Distributed: AttributeError and the RuntimeError (#7460).