error=101 : invalid device ordinal #3791
Comments
I managed to fix it by adding the following line
hey! which version of lightning are you using?
@edenlightning I took the most recent one from
this code is old :) Try the version on master. As you see, this has been fixed.
Tested with this command and it worked:

```bash
CUDA_VISIBLE_DEVICES='2,3' python pl_examples/basic_examples/autoencoder.py --distributed_backend 'ddp' --gpus '0, 1'
```
I'm having this same problem still. I'm on
However, if I run like @williamFalcon did, with `CUDA_VISIBLE_DEVICES='0,2' python train.py --gpus 0,1 --distributed_backend ddp`, then it runs just fine. I thought part of the purpose of the `--gpus` argument was to let you pick a subset of GPUs without having to set `CUDA_VISIBLE_DEVICES` yourself.
it’s not required to specify visible devices... I only did it because that was your example. `gpus` should be a string when called via the CLI: `"0, 2"`
Right. I was just pointing out that when I explicitly specify `CUDA_VISIBLE_DEVICES` it works, but if I try to specify a subset of GPUs to use without setting `CUDA_VISIBLE_DEVICES`, it gives me an error, with or without quotes around the GPU list.
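For context: PyTorch numbers devices relative to whatever `CUDA_VISIBLE_DEVICES` exposes, which is why the two invocations behave differently. A minimal sketch of the remapping (assumes a machine with at least three GPUs; not Lightning-specific):

```python
import os

# Must be set before CUDA is initialized (safest: before importing torch).
os.environ["CUDA_VISIBLE_DEVICES"] = "0,2"

import torch

print(torch.cuda.device_count())  # 2 -- torch only sees the two visible GPUs
torch.cuda.set_device(1)          # OK: cuda:1 is physical GPU 2
# torch.cuda.set_device(2)        # would raise "CUDA error: invalid device
#                                 # ordinal" -- the error this issue is about
```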
Got it. OK, I think something is getting lost in translation haha. Could you please:
Thank you! Very excited to track this down :)
It won't work on

```python
# -*- coding: utf-8 -*-
"""The BoringModel.ipynb

Automatically generated by Colaboratory.

Original file is located at
https://colab.research.google.com/drive/1HvWVVTK8j2Nj52qU4Q4YCyzOm0_aLQF3

# The Boring Model
Replicate a bug you experience, using this model.

[Remember! we're always available for support on Slack](https://join.slack.com/t/pytorch-lightning/shared_invite/zt-f6bl2l0l-JYMK3tbAgAmGRrlNr00f1A)

---
## Setup env
"""

lightning_version = '1.0.0'  #@param ["1.0.0", "0.10.0", "0.9.0", "0.8.5", "0.8.0"]

"""---
## Deps
"""

import os

import torch
from torch import nn
from torch.nn import functional as F
from torch.utils.data import DataLoader, random_split, Dataset
from torchvision.datasets import MNIST
from torchvision import transforms

import pytorch_lightning as pl
from pytorch_lightning.metrics.functional import accuracy

tmpdir = os.getcwd()

"""---
## Data

Random data is best for debugging. If you need special tensor shapes or batch
compositions or dataloaders, modify as needed.
"""

# some other options for random data (requires pl_bolts; commented out here
# because the RandomDataset defined below shadows it anyway)
# from pl_bolts.datasets import RandomDataset, DummyDataset, RandomDictDataset


class RandomDataset(Dataset):
    def __init__(self, size, num_samples):
        self.len = num_samples
        self.data = torch.randn(num_samples, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return self.len


num_samples = 10000

train = RandomDataset(32, num_samples)
train = DataLoader(train, batch_size=32)

val = RandomDataset(32, num_samples)
val = DataLoader(val, batch_size=32)

test = RandomDataset(32, num_samples)
test = DataLoader(test, batch_size=32)

"""---
## Model

Modify this as needed to replicate your bug.
"""

import torch
from pytorch_lightning import LightningModule
from torch.utils.data import Dataset


class BoringModel(LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def forward(self, x):
        return self.layer(x)

    def loss(self, batch, prediction):
        # An arbitrary loss to have a loss that updates the model weights
        # during `Trainer.fit` calls
        return torch.nn.functional.mse_loss(prediction, torch.ones_like(prediction))

    def training_step(self, batch, batch_idx):
        output = self.layer(batch)
        loss = self.loss(batch, output)
        return {"loss": loss}

    def validation_step(self, batch, batch_idx):
        output = self.layer(batch)
        loss = self.loss(batch, output)
        return {"x": loss}

    def configure_optimizers(self):
        optimizer = torch.optim.SGD(self.layer.parameters(), lr=0.1)
        lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1)
        return [optimizer], [lr_scheduler]


"""---
## Define the test

NOTE: in colab, set progress_bar_refresh_rate high or the screen will freeze
because of the rapid tqdm update speed.
"""


def test_x(tmpdir):
    import argparse
    parser = argparse.ArgumentParser()
    parser = pl.Trainer.add_argparse_args(parser)
    args = parser.parse_args()

    # init model
    model = BoringModel()

    # Initialize a trainer
    trainer = pl.Trainer.from_argparse_args(
        args,
        max_epochs=1,
    )

    # Train the model ⚡
    trainer.fit(model, train, val)


"""---
## Run Test
"""

test_x(tmpdir)
```

This command runs successfully:
This one does not:
It gives the error I showed in my last post. |
I'm working on this but no big breakthrough yet. I'm facing some difficulties because there are several global/env variables that determine the GPU selection. For ddp this is quite difficult to debug. |
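If it helps anyone else digging in, here is a quick way to dump the variables discussed in this thread from inside each ddp process (a debugging sketch; the variable list comes from this thread plus the usual `LOCAL_RANK`, and is not exhaustive):

```python
import os

# Print the GPU-selection-related environment variables in each process.
for var in ("CUDA_VISIBLE_DEVICES", "PL_DDP_PID", "LOCAL_RANK"):
    print(f"{var} = {os.environ.get(var, '<unset>')}")
```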
🐛 Bug
When the first entry of `CUDA_VISIBLE_DEVICES` is > 0 and the `ddp` backend is used, a CUDA error `invalid device ordinal` occurs. After digging into the library's code I found the source of the issue in this function:
https://github.com/PyTorchLightning/pytorch-lightning/blob/440f837f6d1b5fc44e6f04475fd2af20e2ed370d/pytorch_lightning/accelerators/ddp_backend.py#L151
which I copy and paste here:
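(The pasted snippet did not survive in this copy of the issue. Based on the line-by-line description below, it looked roughly like the following; the function name and exact code are an approximate reconstruction, not the verbatim source.)

```python
def model_to_device(self, model, process_idx, is_master):                                      # line 1
    gpu_idx = process_idx                                                                      # line 2
    if is_master:                                                                              # line 3
        gpu_idx = int(os.environ['CUDA_VISIBLE_DEVICES'].split(',')[self.trainer.local_rank])  # line 4
    gpu_idx = int(os.environ.get('PL_DDP_PID', gpu_idx))                                       # line 5
                                                                                               # line 6
    torch.cuda.set_device(gpu_idx)                                                             # line 7
```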
Assume `is_master=True`, `self.trainer.local_rank=0`, and `CUDA_VISIBLE_DEVICES=4,5,6,7` in the environment. Then in line 4, `gpu_idx` becomes 4. In line 5, `gpu_idx` remains unchanged because `PL_DDP_PID` is not defined. Finally, in line 7 we get the error: device indices must be in the range [0, 4), but we try to set the device to 4.

The problem here is that `gpu_idx` is taken in absolute terms in line 4, while line 7 requires relative indexing.
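In other words, the fix is to translate the absolute GPU id into its position within `CUDA_VISIBLE_DEVICES` before calling `torch.cuda.set_device`, since that is the numbering torch actually uses. A minimal sketch of that mapping (the helper name is hypothetical, not Lightning API):

```python
import os

def absolute_to_relative(gpu_id: int) -> int:
    """Map an absolute GPU id (e.g. 4) to its index within CUDA_VISIBLE_DEVICES."""
    visible = os.environ.get('CUDA_VISIBLE_DEVICES')
    if visible is None:
        return gpu_id  # nothing remapped: absolute and relative ids coincide
    ids = [int(v) for v in visible.split(',')]
    return ids.index(gpu_id)  # raises ValueError if the GPU is not visible

# With CUDA_VISIBLE_DEVICES=4,5,6,7, absolute id 4 maps to relative index 0,
# which torch.cuda.set_device accepts.
```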