DDP fails to handle "passthrough" tensors in DDPSequentialPlugin #6048
Comments
Dear @SebastienEske,
DDPSequentialPlugin is one of our most experimental integrations. Unfortunately, FairScale Pipe expects gradient flow between GPUs. Maybe you could mock it by adding an empty Module returning zero gradients, but that's starting to be hacky :)
Here is the doc from FairScale: https://fairscale.readthedocs.io/en/latest/api/nn/pipe.html
I will work on updating our integration as soon as they have a more stable version :)
Best,
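One possible reading of that suggestion, as a hypothetical sketch (not code from the thread): an "empty" module that leaves the tensor value untouched but owns a dummy parameter, so the passthrough tensor gets a gradient path through the pipeline stage.

```python
import torch
import torch.nn as nn

class GradPassthrough(nn.Module):
    # Hypothetical "empty" module: the value of x is unchanged, but the dummy
    # parameter creates a gradient path, and it only ever receives zero gradients.
    def __init__(self):
        super().__init__()
        self.dummy = nn.Parameter(torch.zeros(1))

    def forward(self, x):
        return x + self.dummy * 0.0
```

Applying something like this to the passthrough tensor keeps its value intact while giving it (zero) gradients, which is exactly the kind of workaround the comment calls hacky.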
Hello @tchaton, thanks for replying. :-) In any case, big thanks for building this feature, as it is super helpful for big computer vision models. Looking forward to its future improvements.
Dear @SebastienEske,
We are going to release 1.2 today. DeepSpeed has been added and can already be used; it should work miracles with your big models. We managed to train a several-billion-parameter model on a 3090 with DeepSpeed. And I will definitely work on improving Pipe in the coming weeks.
Finally, would you like to clean up our example with real CV models and push that example to Lightning? It is a really brilliant idea!
Best,
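For reference, a minimal sketch of how DeepSpeed could be enabled from the Trainer in that release; the `plugins='deepspeed'` alias and the arguments shown are assumptions and may differ by version.

```python
from pytorch_lightning import Trainer

# model = MyLightningModule()  # hypothetical LightningModule
trainer = Trainer(
    gpus=1,
    precision=16,           # DeepSpeed is usually combined with mixed precision
    plugins='deepspeed',    # assumed alias for the DeepSpeed plugin
)
# trainer.fit(model)
```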
Dear @tchaton
What real CV models do you mean? Like ResNet or so, as in this example: https://github.com/PyTorchLightning/pytorch-lightning/blob/master/pl_examples/basic_examples/conv_sequential_example.py ? I actually liked the simple model of the example, but yes, I can try to make an example with ResNet on ImageNet: https://github.com/PyTorchLightning/PyTorch-Lightning-Bolts/blob/master/pl_bolts/datamodules/imagenet_datamodule.py
For future reference, I'll reuse the code here: https://github.com/PyTorchLightning/pytorch-lightning/blob/master/pl_examples/domain_templates/imagenet.py
All we need to do is add the sequential model definition, with the custom Flatten definition from the current example and a few fixes to make it all work together.
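A hypothetical sketch of what such a sequential ResNet definition could look like; the layer split, the torchvision `resnet50` backbone, and the `Flatten` helper are assumptions for illustration, not the author's actual code.

```python
import torch.nn as nn
from torchvision.models import resnet50

class Flatten(nn.Module):
    # Plays the same role as the Flatten helper in conv_sequential_example.py
    def forward(self, x):
        return x.view(x.size(0), -1)

def sequential_resnet50(num_classes: int = 1000) -> nn.Sequential:
    # Re-express torchvision's ResNet-50 as nn.Sequential so a pipe-style
    # plugin can balance the stages across GPUs.
    backbone = resnet50(pretrained=False)
    return nn.Sequential(
        backbone.conv1, backbone.bn1, backbone.relu, backbone.maxpool,
        backbone.layer1, backbone.layer2, backbone.layer3, backbone.layer4,
        backbone.avgpool, Flatten(),
        nn.Linear(backbone.fc.in_features, num_classes),
    )
```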
This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, PyTorch Lightning Team!
🐛 Bug
This bug is related to sequential model training. Consider the following network:
input1 -> module1 \
input2 -------------> module2 -> output
It can be made sequential by having module1 take both inputs, (input1, input2), and return two outputs, (input2, output1), while module2 takes the output of module1 and returns a single output. The module sequence is then
self.sequential_module = nn.Sequential(module1(), module2())
In this case, splitting the network across 2 GPUs (one module each) and using ddp fails during backpropagation with a tensor size mismatch: input2 has no gradient tensor, and the gradient of output1 gets computed against the first tensor in module1's output tuple, which is input2.
This does not fail with the dp accelerator, or if the output of module1 is
(output1, input2)
(note the reversed ordering).
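A minimal, hypothetical sketch of the pattern described above (module names and sizes are illustrative and not taken from the gist):

```python
import torch
import torch.nn as nn

class Module1(nn.Module):
    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(32, 32)

    def forward(self, inputs):
        input1, input2 = inputs
        output1 = self.layer(input1)
        # input2 is only passed through; returning it first is the ordering
        # that fails under ddp, while (output1, input2) works
        return input2, output1

class Module2(nn.Module):
    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(64, 1)

    def forward(self, inputs):
        input2, output1 = inputs
        return self.layer(torch.cat([output1, input2], dim=-1))

# inside the LightningModule, as described in the report:
# self.sequential_module = nn.Sequential(Module1(), Module2())
```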
Please reproduce using the BoringModel
A gist to reproduce the bug: gist. It is adapted from the minimal working example here.
Error message:
To Reproduce
Use the gist linked above:
python pytorch_lightning_ddp_bug_example.py --gpus=1 --accelerator='ddp'           # works
python pytorch_lightning_ddp_bug_example.py --gpus=2 --accelerator='ddp'           # works
python pytorch_lightning_ddp_bug_example.py --gpus=2 --accelerator='dp' --fails    # works
python pytorch_lightning_ddp_bug_example.py --gpus=2 --accelerator='ddp' --fails   # fails
Expected behavior
No backpropagation error.
Environment