
DDP fails to handle "passthrough" tensors in DDPSequentialPlugin #6048

Closed
SebastienEske opened this issue Feb 18, 2021 · 5 comments
Labels: bug (Something isn't working), distributed (Generic distributed-related topic), help wanted (Open to be worked on), won't fix (This will not be worked on)

Comments

@SebastienEske

SebastienEske commented Feb 18, 2021

🐛 Bug

This bug is related to sequential model training. Consider the following network:

input1 -> module1 --\
                     +--> module2 -> output
input2 -------------/

It is possible to make it sequential by having module1 take both inputs as a tuple (input1, input2) and return two outputs (input2, output1), while module2 takes module1's output tuple and returns a single output. The module sequence is
self.sequential_module = nn.Sequential(module1(), module2())

In this case, splitting the network across 2 GPUs (one module per GPU) and using ddp fails during back-propagation with a tensor size mismatch: input2 has no gradient tensor, so the gradient for output1 gets computed against the first tensor in module1's output tuple, which is input2.

This does not fail when using the dp accelerator, or when module1 returns (output1, input2) instead (note the different ordering).
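For illustration, here is a minimal sketch of the arrangement described above (module names and shapes are illustrative, not the actual gist code):

import torch
import torch.nn as nn
import torch.nn.functional as F

class Module1(nn.Module):
    # Consumes (input1, input2); input2 is passed through untouched.
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 64, kernel_size=3, stride=2, padding=1)

    def forward(self, inputs):
        input1, input2 = inputs
        output1 = self.conv(input1)
        # Passthrough tensor placed first in the tuple -> triggers the error
        return input2, output1

class Module2(nn.Module):
    # Consumes the tuple produced by Module1 and returns a single output.
    def __init__(self):
        super().__init__()
        self.head = nn.Conv2d(64 + 3, 3, kernel_size=3, padding=1)

    def forward(self, inputs):
        input2, output1 = inputs
        upsampled = F.interpolate(output1, size=input2.shape[-2:])
        return self.head(torch.cat([upsampled, input2], dim=1))

sequential_module = nn.Sequential(Module1(), Module2())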

Please reproduce using the BoringModel

A gist to reproduce the bug: gist
It is adapted from the minimal working example here

Error message:

Traceback (most recent call last):
  File "bug_example.py", line 184, in <module>
    trainer.fit(model, train_data)
  File "/home/ubuntu/.virtualenvs/pixelz-python36/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 510, in fit
    results = self.accelerator_backend.train()
  File "/home/ubuntu/.virtualenvs/pixelz-python36/lib/python3.6/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 158, in train
    results = self.ddp_train(process_idx=self.task_idx, model=model)
  File "/home/ubuntu/.virtualenvs/pixelz-python36/lib/python3.6/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 307, in ddp_train
    results = self.train_or_test()
  File "/home/ubuntu/.virtualenvs/pixelz-python36/lib/python3.6/site-packages/pytorch_lightning/accelerators/accelerator.py", line 74, in train_or_test
    results = self.trainer.train()
  File "/home/ubuntu/.virtualenvs/pixelz-python36/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 561, in train
    self.train_loop.run_training_epoch()
  File "/home/ubuntu/.virtualenvs/pixelz-python36/lib/python3.6/site-packages/pytorch_lightning/trainer/training_loop.py", line 543, in run_training_epoch
    batch_output = self.run_training_batch(batch, batch_idx, dataloader_idx)
  File "/home/ubuntu/.virtualenvs/pixelz-python36/lib/python3.6/site-packages/pytorch_lightning/trainer/training_loop.py", line 698, in run_training_batch
    self.optimizer_step(optimizer, opt_idx, batch_idx, train_step_and_backward_closure)
  File "/home/ubuntu/.virtualenvs/pixelz-python36/lib/python3.6/site-packages/pytorch_lightning/trainer/training_loop.py", line 484, in optimizer_step
    using_lbfgs=is_lbfgs,
  File "/home/ubuntu/.virtualenvs/pixelz-python36/lib/python3.6/site-packages/pytorch_lightning/core/lightning.py", line 1296, in optimizer_step
    optimizer.step(closure=optimizer_closure)
  File "/home/ubuntu/.virtualenvs/pixelz-python36/lib/python3.6/site-packages/pytorch_lightning/core/optimizer.py", line 286, in step
    self.__optimizer_step(*args, closure=closure, profiler_name=profiler_name, **kwargs)
  File "/home/ubuntu/.virtualenvs/pixelz-python36/lib/python3.6/site-packages/pytorch_lightning/core/optimizer.py", line 144, in __optimizer_step
    optimizer.step(closure=closure, *args, **kwargs)
  File "/home/ubuntu/.virtualenvs/pixelz-python36/lib/python3.6/site-packages/torch/optim/lr_scheduler.py", line 67, in wrapper
    return wrapped(*args, **kwargs)
  File "/home/ubuntu/.virtualenvs/pixelz-python36/lib/python3.6/site-packages/torch/autograd/grad_mode.py", line 15, in decorate_context
    return func(*args, **kwargs)
  File "/home/ubuntu/.virtualenvs/pixelz-python36/lib/python3.6/site-packages/torch/optim/sgd.py", line 86, in step
    loss = closure()
  File "/home/ubuntu/.virtualenvs/pixelz-python36/lib/python3.6/site-packages/pytorch_lightning/trainer/training_loop.py", line 693, in train_step_and_backward_closure
    self.trainer.hiddens
  File "/home/ubuntu/.virtualenvs/pixelz-python36/lib/python3.6/site-packages/pytorch_lightning/trainer/training_loop.py", line 796, in training_step_and_backward
    self.backward(result, optimizer, opt_idx)
  File "/home/ubuntu/.virtualenvs/pixelz-python36/lib/python3.6/site-packages/pytorch_lightning/trainer/training_loop.py", line 823, in backward
    result.closure_loss, optimizer, opt_idx, *args, **kwargs
  File "/home/ubuntu/.virtualenvs/pixelz-python36/lib/python3.6/site-packages/pytorch_lightning/accelerators/accelerator.py", line 109, in backward
    model.backward(closure_loss, optimizer, opt_idx, *args, **kwargs)
  File "/home/ubuntu/.virtualenvs/pixelz-python36/lib/python3.6/site-packages/pytorch_lightning/core/lightning.py", line 1162, in backward
    loss.backward(*args, **kwargs)
  File "/home/ubuntu/.virtualenvs/pixelz-python36/lib/python3.6/site-packages/torch/tensor.py", line 185, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/ubuntu/.virtualenvs/pixelz-python36/lib/python3.6/site-packages/torch/autograd/__init__.py", line 127, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: Mismatch in shape: grad_output[0] has a shape of torch.Size([1, 3, 64, 64]) and output[0] has a shape of torch.Size([1, 64, 32, 32]).

To Reproduce

Use the gist linked above:

python pytorch_lightning_ddp_bug_example.py --gpus=1 --accelerator='ddp' works
python pytorch_lightning_ddp_bug_example.py --gpus=2 --accelerator='ddp' works
python pytorch_lightning_ddp_bug_example.py --gpus=2 --accelerator='dp' --fails works
python pytorch_lightning_ddp_bug_example.py --gpus=2 --accelerator='ddp' --fails fails

Expected behavior

No back-propagation error

Environment

  • CUDA:
    • GPU:
      • Tesla V100-SXM2-32GB
      • Tesla V100-SXM2-32GB
      • Tesla V100-SXM2-32GB
      • Tesla V100-SXM2-32GB
      • Tesla V100-SXM2-32GB
      • Tesla V100-SXM2-32GB
      • Tesla V100-SXM2-32GB
      • Tesla V100-SXM2-32GB
    • available: True
    • version: 10.2
  • Packages:
    • numpy: 1.19.5
    • pyTorch_debug: False
    • pyTorch_version: 1.6.0
    • pytorch-lightning: 1.1.7
    • tqdm: 4.56.0
  • System:
@SebastienEske added the bug (Something isn't working) and help wanted (Open to be worked on) labels on Feb 18, 2021
@tchaton
Contributor

tchaton commented Feb 18, 2021

Dear @SebastienEske,

DDPSequentialPlugin is one of our most experimental features, and we are currently relying on an outdated version of it.

Unfortunately, FairScale Pipe expects gradient flow between GPUs. You might be able to mock it by adding an empty Module that returns zero gradients (see the sketch after the links below), but that is starting to get hacky :)

Here is the doc from FairScale: https://fairscale.readthedocs.io/en/latest/api/nn/pipe.html
https://github.com/facebookresearch/fairscale/blob/master/examples/tutorial_pipe.py
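To illustrate the idea (untested and purely a sketch; the class name and wiring are hypothetical, not a FairScale or Lightning API):

import torch
import torch.nn as nn

class GradPassthrough(nn.Module):
    # Gives the passthrough tensor a trivial differentiable op so the pipe
    # sees a gradient path on this stage; the zero-valued term leaves the
    # value unchanged and contributes no gradient to the parameter.
    def __init__(self):
        super().__init__()
        self.zero = nn.Parameter(torch.zeros(1))

    def forward(self, x):
        return x + self.zero * 0.0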

I will work on updating our integration as soon as they have a more stable version :)

Best,
T.C

@SebastienEske
Author

Hello @tchaton, thanks for replying. :-)
OK, as I said in the bug report, there is a simple workaround: just put the unmodified tensors at the end of the output tuple (a quick sketch below). So it's not a critical bug.

In any case, big thanks for building this feature; it is super helpful for big computer vision models. Looking forward to its future improvements.
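For concreteness, the workaround amounts to something like this (same illustrative Module1 as in the sketch above; only the return order changes):

import torch.nn as nn

class Module1(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 64, kernel_size=3, stride=2, padding=1)

    def forward(self, inputs):
        input1, input2 = inputs
        output1 = self.conv(input1)
        # Workaround: freshly computed tensor first, unmodified passthrough last
        return output1, input2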

@tchaton
Contributor

tchaton commented Feb 18, 2021

Dear @SebastienEske,

We are going to release 1.2 today. DeepSpeed support has been added and can already be used; it should work wonders with your big models. We managed to train a model with several billion parameters on a 3090 with DeepSpeed.
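A minimal sketch of how it is enabled (assuming the 1.2 Trainer API with plugins='deepspeed' and 16-bit precision; check the docs for the exact flags in your release):

from pytorch_lightning import Trainer

# Sketch for PL 1.2: enable the DeepSpeed plugin with 16-bit precision.
# 'model' and 'train_data' are placeholders for your LightningModule and dataloader.
trainer = Trainer(gpus=2, plugins='deepspeed', precision=16)
trainer.fit(model, train_data)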

And I will definitely work on improving Pipe in the coming weeks.

Finally, would you like to clean up our example with real CV models and push that example to Lightning? It is a really brilliant idea!

Best,
T.C

@SebastienEske
Author

Dear @tchaton

would you like to clean up our example with real CV models and push that example to Lightning

What real CV models do you mean? Something like ResNet, as in this example: https://github.com/PyTorchLightning/pytorch-lightning/blob/master/pl_examples/basic_examples/conv_sequential_example.py ?

I actually liked the simple model in the example, but yes, I can try to make an example with ResNet on ImageNet: https://github.com/PyTorchLightning/PyTorch-Lightning-Bolts/blob/master/pl_bolts/datamodules/imagenet_datamodule.py

For future reference, I'll reuse the code here: https://github.com/PyTorchLightning/pytorch-lightning/blob/master/pl_examples/domain_templates/imagenet.py; all we need to do is add

self.sequential_module = nn.Sequential(
    self.conv1,
    self.bn1,
    self.relu,
    self.maxpool,
    self.layer1,
    self.layer2,
    self.layer3,
    self.layer4,
    self.avgpool,
    Flatten(),
    self.fc,
)

with the custom Flatten definition from the current example (something along the lines of the sketch below) and a few fixes to make it all work together.
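For reference, a Flatten module along those lines (illustrative; the actual definition lives in conv_sequential_example.py):

import torch.nn as nn

class Flatten(nn.Module):
    # Flattens everything except the batch dimension so the avgpool output can feed self.fc.
    def forward(self, x):
        return x.view(x.size(0), -1)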

@Borda added the distributed (Generic distributed-related topic) label on Feb 18, 2021
@stale

stale bot commented Mar 21, 2021

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, Pytorch Lightning Team!
