
Weird behaviour with multi-GPU (dp, gpus > 1): tensors not converted to CUDA #4423

Closed
ClaartjeBarkhof opened this issue Oct 29, 2020 · 11 comments
Labels: bug (Something isn't working) · help wanted (Open to be worked on) · priority: 1 (Medium priority task) · strategy: dp (removed in pl; DataParallel)

@ClaartjeBarkhof

🐛 Bug

I encounter unexpected behaviour with different versions of PyTorch Lightning:

  • When using PL version 1.0.4, my training step is skipped altogether. When I run with fast_dev_run, the prints in my training_step are not printed, while the prints in validation_step are.

  • When using PL version 1.0.2, my training step is not skipped, but other unexpected things happen:

  1. When using 1 GPU, everything is fine and my function transfer_batch_to_device(self, batch, device) is called, transferring the tensors to CUDA (I still find it strange that I have to implement this function myself, as I thought PyTorch Lightning promised to do this for you?). An illustrative sketch of this hook is included below.
  2. When using multiple GPUs, my function transfer_batch_to_device(self, batch, device) is not called, and I therefore get errors saying that my tensors are on the CPU where CUDA tensors are expected.

I feel it is a little unclear which functions are mandatory to implement for which behaviour. In the end, I would like to train on multiple GPUs on a single node.
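
For reference, a minimal sketch of the transfer_batch_to_device hook being discussed, assuming a batch that is a dict of tensors (the MyModule class and the dict layout are illustrative, not the actual NewsVAE code):

import torch
import pytorch_lightning as pl

class MyModule(pl.LightningModule):
    def transfer_batch_to_device(self, batch, device):
        # Hook signature as used in PL 1.0.x; later releases add a dataloader_idx argument.
        if isinstance(batch, dict):
            # Move every tensor value to the target device; leave other values untouched.
            return {k: v.to(device) if torch.is_tensor(v) else v for k, v in batch.items()}
        # Fall back to Lightning's default transfer for standard batch types.
        return super().transfer_batch_to_device(batch, device)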

Please reproduce using the BoringModel and post here

I can't reproduce this with the BoringModel, as the issue concerns the use of multiple GPUs.

To Reproduce

My code can be found on GitHub in this folder. My main PL code is in NewsVAE.py.

Expected behavior

Environment

* CUDA:
        - GPU:
                - GeForce GTX 1080 Ti
                - GeForce GTX 1080 Ti
                - GeForce GTX 1080 Ti
                - GeForce GTX 1080 Ti
        - available:         True
        - version:           10.0
* Packages:
        - numpy:             1.19.1
        - pyTorch_debug:     False
        - pyTorch_version:   1.4.0
        - pytorch-lightning: 1.0.2
        - tqdm:              4.49.0
* System:
        - OS:                Linux
        - architecture:
                - 64bit
                - 
        - processor:         
        - python:            3.6.12
        - version:           #1 SMP Debian 4.19.152-1 (2020-10-18)

Additional context

@ClaartjeBarkhof added the bug (Something isn't working) and help wanted (Open to be worked on) labels on Oct 29, 2020
@maxjeblick
Contributor

I had the same issue when using multiple GPUs; I needed to call batch_to_device explicitly in the training/validation steps.
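
In case it helps, a rough sketch of that kind of explicit transfer; move_batch_to_device is a hypothetical helper (not necessarily the batch_to_device mentioned above), written for batches that may be tensors, lists/tuples, or dicts:

import torch

def move_batch_to_device(batch, device):
    # Recursively move all tensors in a (possibly nested) batch to `device`.
    if torch.is_tensor(batch):
        return batch.to(device)
    if isinstance(batch, (list, tuple)):
        return type(batch)(move_batch_to_device(b, device) for b in batch)
    if isinstance(batch, dict):
        return {k: move_batch_to_device(v, device) for k, v in batch.items()}
    return batch  # leave non-tensor items as they are

# Hypothetical usage at the top of training_step / validation_step:
# batch = move_batch_to_device(batch, self.device)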

@ClaartjeBarkhof
Author

Hey, ah okay, great. I am doing that now, which indeed gets rid of the error. Thanks @maxjeblick

A new error is thrown, though:

RuntimeError: grad can be implicitly created only for scalar outputs

This probably happens because the losses from the different GPUs are not combined properly: instead of being reduced to a single scalar, they end up as a vector with one entry per GPU.

So, all in all, "dp" mode does not really seem to work well in my setting. What am I missing?
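
For context, the error itself can be reproduced outside Lightning; the per-GPU loss values below are made up purely for illustration:

import torch

# Under dp, the values returned by training_step are gathered into a tensor with
# one entry per GPU; calling backward() on a non-scalar tensor raises this error.
per_gpu_losses = torch.tensor([0.52, 0.48, 0.55, 0.50], requires_grad=True)

# per_gpu_losses.backward()       # RuntimeError: grad can be implicitly created only for scalar outputs
per_gpu_losses.mean().backward()  # reducing to a scalar first works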

@ClaartjeBarkhof
Author

Okay, this error does not occur in "ddp" mode, so I will use that now. :)
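
For anyone following along, switching backends is just a Trainer argument change; this uses the PL 1.0.x distributed_backend keyword, as in the MWE further down (the GPU and epoch counts are placeholders):

import pytorch_lightning as pl

# Same setup as before, but with the DistributedDataParallel ("ddp") backend instead of dp.
trainer = pl.Trainer(gpus=4, distributed_backend="ddp", max_epochs=100)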

@auroracramer

(Quoting @ClaartjeBarkhof's comment above about "RuntimeError: grad can be implicitly created only for scalar outputs" and the suspicion that the per-GPU losses are not reduced to a scalar.)

I'm also seeing this behavior with >1 GPU with the DP accelerator.

@auroracramer

I'm seeing this error in PyTorch Lightning 1.0.2 though, so the cause seems to have been introduced in or before the 1.0.2 release.

@edenlightning added this to the 1.0.x milestone on Oct 29, 2020
@justusschock
Member

@jtcramer Any chance you could provide a minimal reproduction example?

@tchaton added the waiting on author (Waiting on user action, correction, or update) label on Nov 10, 2020
@edenlightning modified the milestones: 1.0.x, 1.0.7 on Nov 10, 2020
@jw3126

jw3126 commented Nov 11, 2020

@Borda modified the milestones: 1.0.7, 1.0.x on Nov 11, 2020
@jw3126

jw3126 commented Nov 12, 2020

Here is an MWE:

import torch
from torch import nn
import torch.nn.functional as F
from torch.utils.data import DataLoader
import pytorch_lightning as pl

class DataModule(pl.LightningDataModule):

    def __init__(self):
        super().__init__()
    
    def setup(self, stage):
        # called on each gpu
        n = 128
        x = torch.randn(n, 1, 64,64)
        data = list(zip(x,x))
        self.test  = DataLoader(data, batch_size=32)
        self.train = DataLoader(data, batch_size=32)
        self.val   = DataLoader(data, batch_size=32)
        
    def train_dataloader(self):
        return self.train

    def val_dataloader(self):
        return self.val

    def test_dataloader(self):
        return self.test
    
class Net(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.net = nn.Conv2d(in_channels=1, out_channels=1, kernel_size=(1,1))
    
    def forward(self, x):
        return self.net(x)
    
    def validation_step(self, batch, batch_idx):
        loss = self.compute_batch_loss(batch, batch_idx)
        self.log('val_loss', loss)
        return loss
    
    def compute_batch_loss(self, batch, batch_idx):
        x, y = batch
        y_hat = self.net(x)
        loss = F.mse_loss(y_hat, y)
        return loss

    def training_step(self, batch, batch_idx):
        loss = self.compute_batch_loss(batch, batch_idx)
        self.log('train_loss', loss)
        return loss

    def configure_optimizers(self):
        optimizer = torch.optim.Adam(self.parameters(), lr=1e-3)
        return optimizer

dm = DataModule()
model = Net()
trainer = pl.Trainer(gpus=4, 
                     distributed_backend="dp",
                     max_epochs=100,
                    )
trainer.fit(model, dm)

@edenlightning removed this from the 1.0.x milestone on Nov 13, 2020
@edenlightning added the priority: 1 (Medium priority task) label and removed the waiting on author (Waiting on user action, correction, or update) label on Nov 17, 2020
@awaelchli
Contributor

@jw3126 your code reproduced the bug we fixed in #4138; the fix should hopefully be in the 1.1 release.

@awaelchli
Contributor

@jtcramer @ClaartjeBarkhof if you are seeing this:
RuntimeError: grad can be implicitly created only for scalar outputs
then it could mean you implemented training_step_end incorrectly, or mistook it for training_epoch_end.
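
For readers hitting the same error under dp, a minimal sketch of such a training_step_end, assuming training_step returns the loss tensor directly as in the MWE above (this is an illustration, not the fix that went into #4138):

import torch
import pytorch_lightning as pl

class Net(pl.LightningModule):
    # ... __init__, forward, training_step, configure_optimizers as in the MWE above ...

    def training_step_end(self, step_output):
        # Under dp, step_output gathers one loss per GPU; reduce it to a scalar
        # so that backward() can be called on it implicitly.
        if torch.is_tensor(step_output) and step_output.ndim > 0:
            return step_output.mean()
        return step_output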

@edenlightning
Contributor

please reopen if needed!
