
Weird behaviour with multi-GPU (dp, gpus > 1): tensors not converted to CUDA #4423

Closed
ClaartjeBarkhof opened this issue Oct 29, 2020 · 11 comments
Labels: bug (Something isn't working) · help wanted (Open to be worked on) · priority: 1 (Medium priority task) · strategy: dp (removed in pl; DataParallel)

@ClaartjeBarkhof

🐛 Bug

I encounter unexpected behaviour with different versions of PyTorch Lightning:

  • When using PL version 1.0.4, my training step is skipped altogether. When I run with fast_dev_run, the prints in my training_step are not printed, while the prints in validation_step are.

  • When using PL version 1.0.2, my training step is not skipped, but other unexpected things happen:

  1. When using 1 GPU, everything is fine and my function transfer_batch_to_device(self, batch, device) is called, transferring the tensors to CUDA (I still find it strange that I have to implement this function myself, as I thought PyTorch Lightning promised to do this for you?). An illustrative sketch of this hook is included below.
  2. When using multiple GPUs, my function transfer_batch_to_device(self, batch, device) is not called, and I therefore get errors saying that my tensors are on the CPU where CUDA tensors are expected.

I feel it is a little unclear which functions are mandatory to implement for which behaviour. In the end, I would like to train on multiple GPUs on a single node.
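
For reference, a minimal sketch of the transfer_batch_to_device hook being discussed, assuming a batch that is a dict of tensors (the MyModule class and the dict layout are illustrative, not the actual NewsVAE code):

import torch
import pytorch_lightning as pl

class MyModule(pl.LightningModule):
    def transfer_batch_to_device(self, batch, device):
        # Hook signature as used in PL 1.0.x; later releases add a dataloader_idx argument.
        if isinstance(batch, dict):
            # Move every tensor value to the target device; leave other values untouched.
            return {k: v.to(device) if torch.is_tensor(v) else v for k, v in batch.items()}
        # Fall back to Lightning's default transfer for standard batch types.
        return super().transfer_batch_to_device(batch, device)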

Please reproduce using the BoringModel and post here

I can't reproduce this with the BoringModel, as the issue concerns the use of multiple GPUs.

To Reproduce

My code can be found on GitHub in this folder. My main PL code is in NewsVAE.py.

Expected behavior

Environment

* CUDA:
        - GPU:
                - GeForce GTX 1080 Ti
                - GeForce GTX 1080 Ti
                - GeForce GTX 1080 Ti
                - GeForce GTX 1080 Ti
        - available:         True
        - version:           10.0
* Packages:
        - numpy:             1.19.1
        - pyTorch_debug:     False
        - pyTorch_version:   1.4.0
        - pytorch-lightning: 1.0.2
        - tqdm:              4.49.0
* System:
        - OS:                Linux
        - architecture:
                - 64bit
                - 
        - processor:         
        - python:            3.6.12
        - version:           #1 SMP Debian 4.19.152-1 (2020-10-18)

Additional context

@ClaartjeBarkhof added the bug (Something isn't working) and help wanted (Open to be worked on) labels on Oct 29, 2020
@maxjeblick
Contributor

I had the same issue when using multiple GPUs; I needed to call batch_to_device explicitly in the training/validation steps.
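
In case it helps, a rough sketch of that kind of explicit transfer; move_batch_to_device is a hypothetical helper (not necessarily the batch_to_device mentioned above), written for batches that may be tensors, lists/tuples, or dicts:

import torch

def move_batch_to_device(batch, device):
    # Recursively move all tensors in a (possibly nested) batch to `device`.
    if torch.is_tensor(batch):
        return batch.to(device)
    if isinstance(batch, (list, tuple)):
        return type(batch)(move_batch_to_device(b, device) for b in batch)
    if isinstance(batch, dict):
        return {k: move_batch_to_device(v, device) for k, v in batch.items()}
    return batch  # leave non-tensor items as they are

# Hypothetical usage at the top of training_step / validation_step:
# batch = move_batch_to_device(batch, self.device)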

@ClaartjeBarkhof
Author

Hey, ah okay, great. I am doing that now, which indeed gets rid of the error. Thanks @maxjeblick

A new error is thrown, though:

RuntimeError: grad can be implicitly created only for scalar outputs

This probably happens because the losses from the different GPUs are not combined properly: instead of being reduced to a single scalar, they end up as a vector with one entry per GPU.

So, all in all, "dp" mode does not really seem to work well in my setting. What am I missing?
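
For context, the error itself can be reproduced outside Lightning; the per-GPU loss values below are made up purely for illustration:

import torch

# Under dp, the values returned by training_step are gathered into a tensor with
# one entry per GPU; calling backward() on a non-scalar tensor raises this error.
per_gpu_losses = torch.tensor([0.52, 0.48, 0.55, 0.50], requires_grad=True)

# per_gpu_losses.backward()       # RuntimeError: grad can be implicitly created only for scalar outputs
per_gpu_losses.mean().backward()  # reducing to a scalar first works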

@ClaartjeBarkhof
Author

Okay, this error does not occur in "ddp" mode, so I will use that now. :)
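
For anyone following along, switching backends is just a Trainer argument change; this uses the PL 1.0.x distributed_backend keyword, as in the MWE further down (the GPU and epoch counts are placeholders):

import pytorch_lightning as pl

# Same setup as before, but with the DistributedDataParallel ("ddp") backend instead of dp.
trainer = pl.Trainer(gpus=4, distributed_backend="ddp", max_epochs=100)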

@auroracramer

(Quoting @ClaartjeBarkhof's comment above about "RuntimeError: grad can be implicitly created only for scalar outputs" and the suspicion that the per-GPU losses are not reduced to a scalar.)

I'm also seeing this behavior with >1 GPU with the DP accelerator.

@auroracramer

I'm seeing this error in PyTorch Lightning 1.0.2 though, so the cause seems to have been introduced in or before the 1.0.2 release.

@edenlightning added this to the 1.0.x milestone on Oct 29, 2020
@justusschock
Member

@jtcramer Any chance you could provide a minimal reproduction example?

@tchaton added the waiting on author (Waiting on user action, correction, or update) label on Nov 10, 2020
@edenlightning modified the milestones: 1.0.x, 1.0.7 on Nov 10, 2020
@jw3126

jw3126 commented Nov 11, 2020

@Borda modified the milestones: 1.0.7, 1.0.x on Nov 11, 2020
@jw3126

jw3126 commented Nov 12, 2020

Here is an MWE:

import torch
from torch import nn
import torch.nn.functional as F
from torch.utils.data import DataLoader
import pytorch_lightning as pl

class DataModule(pl.LightningDataModule):

    def __init__(self):
        super().__init__()
    
    def setup(self, stage):
        # called on each gpu
        n = 128
        x = torch.randn(n, 1, 64,64)
        data = list(zip(x,x))
        self.test  = DataLoader(data, batch_size=32)
        self.train = DataLoader(data, batch_size=32)
        self.val   = DataLoader(data, batch_size=32)
        
    def train_dataloader(self):
        return self.train

    def val_dataloader(self):
        return self.val

    def test_dataloader(self):
        return self.test
    
class Net(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.net = nn.Conv2d(in_channels=1, out_channels=1, kernel_size=(1,1))
    
    def forward(self, x):
        return self.net(x)
    
    def validation_step(self, batch, batch_idx):
        loss = self.compute_batch_loss(batch, batch_idx)
        self.log('val_loss', loss)
        return loss
    
    def compute_batch_loss(self, batch, batch_idx):
        x, y = batch
        y_hat = self.net(x)
        loss = F.mse_loss(y_hat, y)
        return loss

    def training_step(self, batch, batch_idx):
        loss = self.compute_batch_loss(batch, batch_idx)
        self.log('train_loss', loss)
        return loss

    def configure_optimizers(self):
        optimizer = torch.optim.Adam(self.parameters(), lr=1e-3)
        return optimizer

dm = DataModule()
model = Net()
trainer = pl.Trainer(gpus=4, 
                     distributed_backend="dp",
                     max_epochs=100,
                    )
trainer.fit(model, dm)

@edenlightning removed this from the 1.0.x milestone on Nov 13, 2020
@edenlightning added the priority: 1 (Medium priority task) label and removed the waiting on author (Waiting on user action, correction, or update) label on Nov 17, 2020
@awaelchli
Contributor

@jw3126 your code reproduced the bug we fixed in #4138; the fix should hopefully be in the 1.1 release.

@awaelchli
Contributor

@jtcramer @ClaartjeBarkhof if you are seeing this:
RuntimeError: grad can be implicitly created only for scalar outputs
then it could mean you implemented training_step_end incorrectly, or mistook it for training_epoch_end.
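
For readers hitting the same error under dp, a minimal sketch of such a training_step_end, assuming training_step returns the loss tensor directly as in the MWE above (this is an illustration, not the fix that went into #4138):

import torch
import pytorch_lightning as pl

class Net(pl.LightningModule):
    # ... __init__, forward, training_step, configure_optimizers as in the MWE above ...

    def training_step_end(self, step_output):
        # Under dp, step_output gathers one loss per GPU; reduce it to a scalar
        # so that backward() can be called on it implicitly.
        if torch.is_tensor(step_output) and step_output.ndim > 0:
            return step_output.mean()
        return step_output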

@edenlightning
Contributor

please reopen if needed!
