Data Parallel bug (return outputs not being moved to same device) #4073
Comments
Here is the PyTorch code that shows the problem. It gives the same error as reported by @willprice:

```python
import torch

# this is what is happening in Will's code:
prediction = torch.rand(8, 2, requires_grad=True)

# device 0 computes:
x = torch.nn.functional.cross_entropy(
    prediction,
    torch.ones(len(prediction), dtype=torch.long, device=prediction.device),
)

# device 1 computes:
y = torch.nn.functional.cross_entropy(
    prediction,
    torch.ones(len(prediction), dtype=torch.long, device=prediction.device),
)

# the dp backend calls backward on the stacked tensor
l = torch.stack((x, y))
l.backward()  # error: backward on a non-scalar tensor
```
Conclusion: somewhere in the dp backend the losses get stacked and backward is called on a non-scalar tensor.
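For comparison, a minimal sketch (not taken from the dp backend itself) showing that reducing the stacked losses to a scalar first makes the backward call valid:

```python
import torch

prediction = torch.rand(8, 2, requires_grad=True)
target = torch.ones(len(prediction), dtype=torch.long)

# the same per-"device" losses as in the snippet above
x = torch.nn.functional.cross_entropy(prediction, target)
y = torch.nn.functional.cross_entropy(prediction, target)

# reduce the stacked losses to a 0-dim (scalar) tensor before calling backward
l = torch.stack((x, y)).mean()
l.backward()  # no error, since l is a scalar
```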
@willprice can you check the fix for #4138? For me this worked on the reproduction script.
@willprice friendly ping :)
Hey @edenlightning, I can confirm that this is fixed for me on the reproduction script and my own codebase. Although I did run into this issue when testing out #4138 on my codebase (I have
I think I rediscovered this bug in our examples: |
Hi. My PL version is pytorch-lightning 1.1.3 pyhd8ed1ab_0 (conda-forge). I noticed that I was getting the error
and I did not get the error when I used
The difference being whether I return the dictionary or just the loss.
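For illustration, a rough sketch of the two return styles being contrasted; the module name and dict contents are assumptions, not the commenter's actual code:

```python
import torch
import pytorch_lightning as pl


class LitModel(pl.LightningModule):  # hypothetical module, for illustration only
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = torch.nn.functional.cross_entropy(self(x), y)
        # variant 1: return a dict (the style the commenter reports the error with under dp)
        return {"loss": loss}
        # variant 2: return just the loss tensor (reportedly avoids the error)
        # return loss

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)
```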
I'm getting "grad can be implicitly created only for scalar outputs" with pytorch-lightning 1.1.4, using backend='dp'. Solved it by adding a
Perhaps related - the argument
@MaveriQ's fix does not work with the return value of |
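For context, the workaround commonly suggested for dp in the Lightning docs, which may be what the comments above are describing, is to reduce the per-GPU outputs in training_step_end. A minimal sketch, building on the hypothetical module above:

```python
import torch
import pytorch_lightning as pl


class LitModelWithReduction(pl.LightningModule):  # hypothetical, for illustration only
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = torch.nn.functional.cross_entropy(self.layer(x), y)
        return {"loss": loss}

    def training_step_end(self, outputs):
        # under dp, training_step runs once per GPU on a slice of the batch and the
        # gathered outputs land here; reduce the per-GPU losses to a scalar
        return {"loss": outputs["loss"].mean()}

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)
```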
🐛 Bug
Under backend='dp', Lightning doesn't handle reduction of the loss across multiple GPUs correctly. This is present in v0.10--v1.0.0rc4.
To Reproduce
Code sample
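For reference, a hypothetical minimal reproduction along the lines described in this thread (the module, data, and Trainer arguments are assumptions, not the reporter's actual sample): a LightningModule whose training_step returns the loss, run with the dp backend on two GPUs.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl


class ReproModel(pl.LightningModule):  # hypothetical, not the reporter's code
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.cross_entropy(self.layer(x), y)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)


if __name__ == "__main__":
    dataset = TensorDataset(torch.randn(64, 32), torch.randint(0, 2, (64,)))
    # assumes a 2-GPU machine; 'dp' is the backend discussed in this issue
    trainer = pl.Trainer(gpus=2, accelerator="dp", max_epochs=1)
    trainer.fit(ReproModel(), DataLoader(dataset, batch_size=8))
```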
Produces the following
Specifically note the line saying
Expected behavior
Environment
How you installed PyTorch (conda, pip, source): conda
Additional context
This works on v0.9.0:
but causes this error under v1.0.0rc4