
DDP fails to handle "passthrough" tensors in DDPSequentialPlugin #6048

Closed
SebastienEske opened this issue Feb 18, 2021 · 5 comments
Labels: bug (Something isn't working), distributed (Generic distributed-related topic), help wanted (Open to be worked on), won't fix (This will not be worked on)

Comments

@SebastienEske

SebastienEske commented Feb 18, 2021

🐛 Bug

This bug is related to sequential model training. Consider the following network:

input1 -> module1 --\
                     +--> module2 -> output
input2 -------------/

It is possible to make it sequential by having module1 take both inputs as a tuple (input1, input2) and return two outputs (input2, output1), while module2 takes module1's output tuple and returns a single output. The module sequence is
self.sequential_module = nn.Sequential(module1(), module2())

In this case, splitting the network across 2 GPUs (one module per GPU) and using ddp fails during back-propagation with a tensor size mismatch: input2 has no gradient tensor, so the gradient for output1 gets computed against the first tensor in module1's output tuple, which is input2.

This does not fail when using the dp accelerator, or when module1 returns (output1, input2) instead (note the different ordering).
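For illustration, here is a minimal sketch of the arrangement described above (module names and shapes are illustrative, not the actual gist code):

import torch
import torch.nn as nn
import torch.nn.functional as F

class Module1(nn.Module):
    # Consumes (input1, input2); input2 is passed through untouched.
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 64, kernel_size=3, stride=2, padding=1)

    def forward(self, inputs):
        input1, input2 = inputs
        output1 = self.conv(input1)
        # Passthrough tensor placed first in the tuple -> triggers the error
        return input2, output1

class Module2(nn.Module):
    # Consumes the tuple produced by Module1 and returns a single output.
    def __init__(self):
        super().__init__()
        self.head = nn.Conv2d(64 + 3, 3, kernel_size=3, padding=1)

    def forward(self, inputs):
        input2, output1 = inputs
        upsampled = F.interpolate(output1, size=input2.shape[-2:])
        return self.head(torch.cat([upsampled, input2], dim=1))

sequential_module = nn.Sequential(Module1(), Module2())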

Please reproduce using the BoringModel

A gist to reproduce the bug: gist
It is adapted from the minimal working example here

Error message:

Traceback (most recent call last):
  File "bug_example.py", line 184, in <module>
    trainer.fit(model, train_data)
  File "/home/ubuntu/.virtualenvs/pixelz-python36/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 510, in fit
    results = self.accelerator_backend.train()
  File "/home/ubuntu/.virtualenvs/pixelz-python36/lib/python3.6/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 158, in train
    results = self.ddp_train(process_idx=self.task_idx, model=model)
  File "/home/ubuntu/.virtualenvs/pixelz-python36/lib/python3.6/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 307, in ddp_train
    results = self.train_or_test()
  File "/home/ubuntu/.virtualenvs/pixelz-python36/lib/python3.6/site-packages/pytorch_lightning/accelerators/accelerator.py", line 74, in train_or_test
    results = self.trainer.train()
  File "/home/ubuntu/.virtualenvs/pixelz-python36/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 561, in train
    self.train_loop.run_training_epoch()
  File "/home/ubuntu/.virtualenvs/pixelz-python36/lib/python3.6/site-packages/pytorch_lightning/trainer/training_loop.py", line 543, in run_training_epoch
    batch_output = self.run_training_batch(batch, batch_idx, dataloader_idx)
  File "/home/ubuntu/.virtualenvs/pixelz-python36/lib/python3.6/site-packages/pytorch_lightning/trainer/training_loop.py", line 698, in run_training_batch
    self.optimizer_step(optimizer, opt_idx, batch_idx, train_step_and_backward_closure)
  File "/home/ubuntu/.virtualenvs/pixelz-python36/lib/python3.6/site-packages/pytorch_lightning/trainer/training_loop.py", line 484, in optimizer_step
    using_lbfgs=is_lbfgs,
  File "/home/ubuntu/.virtualenvs/pixelz-python36/lib/python3.6/site-packages/pytorch_lightning/core/lightning.py", line 1296, in optimizer_step
    optimizer.step(closure=optimizer_closure)
  File "/home/ubuntu/.virtualenvs/pixelz-python36/lib/python3.6/site-packages/pytorch_lightning/core/optimizer.py", line 286, in step
    self.__optimizer_step(*args, closure=closure, profiler_name=profiler_name, **kwargs)
  File "/home/ubuntu/.virtualenvs/pixelz-python36/lib/python3.6/site-packages/pytorch_lightning/core/optimizer.py", line 144, in __optimizer_step
    optimizer.step(closure=closure, *args, **kwargs)
  File "/home/ubuntu/.virtualenvs/pixelz-python36/lib/python3.6/site-packages/torch/optim/lr_scheduler.py", line 67, in wrapper
    return wrapped(*args, **kwargs)
  File "/home/ubuntu/.virtualenvs/pixelz-python36/lib/python3.6/site-packages/torch/autograd/grad_mode.py", line 15, in decorate_context
    return func(*args, **kwargs)
  File "/home/ubuntu/.virtualenvs/pixelz-python36/lib/python3.6/site-packages/torch/optim/sgd.py", line 86, in step
    loss = closure()
  File "/home/ubuntu/.virtualenvs/pixelz-python36/lib/python3.6/site-packages/pytorch_lightning/trainer/training_loop.py", line 693, in train_step_and_backward_closure
    self.trainer.hiddens
  File "/home/ubuntu/.virtualenvs/pixelz-python36/lib/python3.6/site-packages/pytorch_lightning/trainer/training_loop.py", line 796, in training_step_and_backward
    self.backward(result, optimizer, opt_idx)
  File "/home/ubuntu/.virtualenvs/pixelz-python36/lib/python3.6/site-packages/pytorch_lightning/trainer/training_loop.py", line 823, in backward
    result.closure_loss, optimizer, opt_idx, *args, **kwargs
  File "/home/ubuntu/.virtualenvs/pixelz-python36/lib/python3.6/site-packages/pytorch_lightning/accelerators/accelerator.py", line 109, in backward
    model.backward(closure_loss, optimizer, opt_idx, *args, **kwargs)
  File "/home/ubuntu/.virtualenvs/pixelz-python36/lib/python3.6/site-packages/pytorch_lightning/core/lightning.py", line 1162, in backward
    loss.backward(*args, **kwargs)
  File "/home/ubuntu/.virtualenvs/pixelz-python36/lib/python3.6/site-packages/torch/tensor.py", line 185, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/ubuntu/.virtualenvs/pixelz-python36/lib/python3.6/site-packages/torch/autograd/__init__.py", line 127, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: Mismatch in shape: grad_output[0] has a shape of torch.Size([1, 3, 64, 64]) and output[0] has a shape of torch.Size([1, 64, 32, 32]).

To Reproduce

Use the gist linked above:

python pytorch_lightning_ddp_bug_example.py --gpus=1 --accelerator='ddp' works
python pytorch_lightning_ddp_bug_example.py --gpus=2 --accelerator='ddp' works
python pytorch_lightning_ddp_bug_example.py --gpus=2 --accelerator='dp' --fails works
python pytorch_lightning_ddp_bug_example.py --gpus=2 --accelerator='ddp' --fails fails

Expected behavior

No back-propagation error

Environment

  • CUDA:
    • GPU:
      • Tesla V100-SXM2-32GB
      • Tesla V100-SXM2-32GB
      • Tesla V100-SXM2-32GB
      • Tesla V100-SXM2-32GB
      • Tesla V100-SXM2-32GB
      • Tesla V100-SXM2-32GB
      • Tesla V100-SXM2-32GB
      • Tesla V100-SXM2-32GB
    • available: True
    • version: 10.2
  • Packages:
    • numpy: 1.19.5
    • pyTorch_debug: False
    • pyTorch_version: 1.6.0
    • pytorch-lightning: 1.1.7
    • tqdm: 4.56.0
  • System:
@SebastienEske added the bug (Something isn't working) and help wanted (Open to be worked on) labels on Feb 18, 2021
@tchaton
Contributor

tchaton commented Feb 18, 2021

Dear @SebastienEske,

DDPSequentialPlugin is one of our most experimental features, and we are currently relying on an outdated version of it.

Unfortunately, FairScale Pipe expects gradient flow between GPUs. You might be able to mock it by adding an empty Module that returns zero gradients (see the sketch after the links below), but that is starting to get hacky :)

Here is the doc from FairScale: https://fairscale.readthedocs.io/en/latest/api/nn/pipe.html
https://github.com/facebookresearch/fairscale/blob/master/examples/tutorial_pipe.py
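To illustrate the idea (untested and purely a sketch; the class name and wiring are hypothetical, not a FairScale or Lightning API):

import torch
import torch.nn as nn

class GradPassthrough(nn.Module):
    # Gives the passthrough tensor a trivial differentiable op so the pipe
    # sees a gradient path on this stage; the zero-valued term leaves the
    # value unchanged and contributes no gradient to the parameter.
    def __init__(self):
        super().__init__()
        self.zero = nn.Parameter(torch.zeros(1))

    def forward(self, x):
        return x + self.zero * 0.0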

I will work on updating our integration as soon as they have a more stable version :)

Best,
T.C

@SebastienEske
Author

Hello @tchaton, thanks for replying. :-)
OK, as I said in the bug report, there is a simple workaround: just put the unmodified tensors at the end of the output tuple (a quick sketch below). So it's not a critical bug.

In any case, big thanks for building this feature; it is super helpful for big computer vision models. Looking forward to its future improvements.
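For concreteness, the workaround amounts to something like this (same illustrative Module1 as in the sketch above; only the return order changes):

import torch.nn as nn

class Module1(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 64, kernel_size=3, stride=2, padding=1)

    def forward(self, inputs):
        input1, input2 = inputs
        output1 = self.conv(input1)
        # Workaround: freshly computed tensor first, unmodified passthrough last
        return output1, input2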

@tchaton
Contributor

tchaton commented Feb 18, 2021

Dear @SebastienEske,

We are going to release 1.2 today. DeepSpeed support has been added and can already be used; it should work wonders with your big models. We managed to train a model with several billion parameters on a 3090 with DeepSpeed.
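A minimal sketch of how it is enabled (assuming the 1.2 Trainer API with plugins='deepspeed' and 16-bit precision; check the docs for the exact flags in your release):

from pytorch_lightning import Trainer

# Sketch for PL 1.2: enable the DeepSpeed plugin with 16-bit precision.
# 'model' and 'train_data' are placeholders for your LightningModule and dataloader.
trainer = Trainer(gpus=2, plugins='deepspeed', precision=16)
trainer.fit(model, train_data)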

And I will definitely work on improving Pipe in the coming weeks.

Finally, would you like to clean up our example with real CV models and push that example to Lightning? It is a really brilliant idea!

Best,
T.C

@SebastienEske
Author

Dear @tchaton

would you like to clean up our example with real CV models and push that example to Lightning

What real CV models do you mean? Something like ResNet, as in this example: https://github.com/PyTorchLightning/pytorch-lightning/blob/master/pl_examples/basic_examples/conv_sequential_example.py ?

I actually liked the simple model in the example, but yes, I can try to make an example with ResNet on ImageNet: https://github.com/PyTorchLightning/PyTorch-Lightning-Bolts/blob/master/pl_bolts/datamodules/imagenet_datamodule.py

For future reference, I'll reuse the code here: https://github.com/PyTorchLightning/pytorch-lightning/blob/master/pl_examples/domain_templates/imagenet.py; all we need to do is add

self.sequential_module = nn.Sequential(
    self.conv1,
    self.bn1,
    self.relu,
    self.maxpool,
    self.layer1,
    self.layer2,
    self.layer3,
    self.layer4,
    self.avgpool,
    Flatten(),
    self.fc,
)

with the custom Flatten definition from the current example (something along the lines of the sketch below) and a few fixes to make it all work together.
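For reference, a Flatten module along those lines (illustrative; the actual definition lives in conv_sequential_example.py):

import torch.nn as nn

class Flatten(nn.Module):
    # Flattens everything except the batch dimension so the avgpool output can feed self.fc.
    def forward(self, x):
        return x.view(x.size(0), -1)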

@Borda added the distributed (Generic distributed-related topic) label on Feb 18, 2021
@stale

stale bot commented Mar 21, 2021

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, Pytorch Lightning Team!
