Weird behaviour multi-GPU (dp, gpus > 1) tensors not converted to cuda #4423
Comments
I had the same issue using multiple GPUs; I need to call
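The call itself is truncated above; presumably it amounts to moving the batch onto the module's device by hand. Below is a minimal sketch of that workaround, assuming the self.device attribute that LightningModule provides — not necessarily the exact call the commenter meant:

import torch
from torch import nn
import torch.nn.functional as F
import pytorch_lightning as pl

class Net(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.net = nn.Conv2d(1, 1, kernel_size=1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        # Workaround sketch: move the tensors to the right GPU by hand;
        # self.device is maintained by Lightning.
        x, y = x.to(self.device), y.to(self.device)
        return F.mse_loss(self.net(x), y)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)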
Hey, ah okay, great. I am doing that now, which indeed gets rid of the error. Thanks @maxjeblick. A new error is thrown though:
Which probably happens because the losses at the different GPUs are not combined properly, leaving a vector of length equal to the number of GPUs instead of a summed scalar. So, all in all, "dp" mode does not really seem to work so well in my setting? What am I missing?
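For what it's worth, the DP-era API has dedicated hooks for exactly this: with distributed_backend="dp", each GPU runs training_step on its slice of the batch, and the per-GPU results are gathered and handed to training_step_end / validation_step_end, where they can be reduced to a scalar. A minimal sketch, assuming the losses arrive as one stacked tensor with one entry per GPU:

import pytorch_lightning as pl

class Net(pl.LightningModule):
    # ... __init__, forward, training_step, etc. as elsewhere in this thread

    # Gathered per-GPU losses arrive here under "dp"; reduce to a scalar.
    def training_step_end(self, losses):
        return losses.mean()

    def validation_step_end(self, losses):
        return losses.mean()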
Okay, this error does not occur in "ddp" mode, so I will use that now. :)
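For reference, only the Trainer flag needs to change to switch backends (distributed_backend is the PL 1.0.x spelling; later releases renamed it):

import pytorch_lightning as pl

# DistributedDataParallel: one process per GPU, each computing its own
# loss, so no extra cross-GPU loss-reduction hook is needed.
trainer = pl.Trainer(gpus=4, distributed_backend="ddp", max_epochs=100)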
I'm also seeing this behavior with >1 GPU with the DP accelerator.
I'm seeing this error in PyTorch Lightning 1.0.2 though, so the cause seems to have been introduced in or before the 1.0.2 release.
@jtcramer Any chance you could provide a minimal reproduction example?
Here is an MWE:

import torch
from torch import nn
import torch.nn.functional as F
from torch.utils.data import DataLoader
import pytorch_lightning as pl


class DataModule(pl.LightningDataModule):
    def __init__(self):
        super().__init__()

    def setup(self, stage):
        # called on each GPU
        n = 128
        x = torch.randn(n, 1, 64, 64)
        data = list(zip(x, x))
        self.test = DataLoader(data, batch_size=32)
        self.train = DataLoader(data, batch_size=32)
        self.val = DataLoader(data, batch_size=32)

    def train_dataloader(self):
        return self.train

    def val_dataloader(self):
        return self.val

    def test_dataloader(self):
        return self.test


class Net(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.net = nn.Conv2d(in_channels=1, out_channels=1, kernel_size=(1, 1))

    def forward(self, x):
        return self.net(x)

    def validation_step(self, batch, batch_idx):
        loss = self.compute_batch_loss(batch, batch_idx)
        self.log('val_loss', loss)
        return loss

    def compute_batch_loss(self, batch, batch_idx):
        x, y = batch
        y_hat = self.net(x)
        loss = F.mse_loss(y_hat, y)
        return loss

    def training_step(self, batch, batch_idx):
        loss = self.compute_batch_loss(batch, batch_idx)
        self.log('train_loss', loss)
        return loss

    def configure_optimizers(self):
        optimizer = torch.optim.Adam(self.parameters(), lr=1e-3)
        return optimizer


dm = DataModule()
model = Net()
trainer = pl.Trainer(gpus=4,
                     distributed_backend="dp",
                     max_epochs=100,
                     )
trainer.fit(model, dm)
@jtcramer @ClaartjeBarkhof if you are seeing this, please reopen if needed!
🐛 Bug
I encounter unexpected behaviour for different versions of Pytorch Lightning:

When using PL version 1.0.4, my train step is ignored altogether. When I run with fast_dev_run, my prints in training_step are not printed while the prints in validation_step are...

When using PL version 1.0.2, my train step is not ignored, but other unexpected things happen: with a single GPU, transfer_batch_to_device(self, batch, device) is called, transferring tensors to CUDA (I still think it is strange that I have to implement this function, as I thought Pytorch Lightning promises to do this for you..?); with "dp" and gpus > 1, transfer_batch_to_device(self, batch, device) is not called, and thus I get errors that my tensors are on cpu where cuda is expected.

I feel it is a little unclear which functions are mandatory to implement for which behaviour. In the end I would like to use multiple GPUs on one node.
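For context, the hook under discussion has the following shape in the PL 1.0.x API; this is a sketch of an explicit override for an (x, y) tuple batch, not the library's internal implementation (Lightning's default is supposed to handle tensors and common collections on its own):

import pytorch_lightning as pl

class Net(pl.LightningModule):
    # Explicitly move an (x, y) tuple batch to the target device.
    def transfer_batch_to_device(self, batch, device):
        x, y = batch
        return x.to(device), y.to(device)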
Please reproduce using the BoringModel and post here
I can't reproduce this with the BoringModel, as the issue is about the use of multiple GPUs.
To Reproduce
My code can be found on GitHub in this folder. My main PL code is in NewsVAE.py.
Expected behavior
Environment
Additional context