Training on two GPUs #26

AlanStark · 2018-12-19T21:00:55Z

I am trying to set
DEVICE = torch.device('cuda:0' if torch.cuda.is_available() and args.use_cuda else 'cpu')
for one run and
DEVICE = torch.device('cuda:1' if torch.cuda.is_available() and args.use_cuda else 'cpu')
for another, and run the two experiments simultaneously.
The first run works fine and uses a reasonable amount of GPU memory. The second does not work, no matter how small the batch size is, even though GPU 1 has enough free memory for another run.
Do you have any idea what might cause this?
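
One common way to pin each run to its own GPU, independent of how the script builds DEVICE, is to start each process with the CUDA_VISIBLE_DEVICES environment variable set. A process started with CUDA_VISIBLE_DEVICES=1 sees only physical GPU 1, and inside that process it appears as cuda:0. The following is a minimal sketch, not something confirmed to fix this issue: the helper names (env_for_gpu, launch_commands, run_all) are hypothetical, and the script name and any arguments passed to it are placeholders to adapt.

```python
import os
import subprocess
import sys


def env_for_gpu(gpu_index, base_env=None):
    """Return a copy of the environment that exposes only one physical GPU."""
    env = dict(os.environ if base_env is None else base_env)
    # Inside the child process, the selected GPU is addressed as 'cuda:0'.
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_index)
    return env


def launch_commands(script, gpu_indexes, extra_args=()):
    """Build one (command, environment) pair per GPU index."""
    command = [sys.executable, script, *extra_args]
    return [(list(command), env_for_gpu(i)) for i in gpu_indexes]


def run_all(jobs):
    """Start every job at once, then wait for all of them; returns exit codes."""
    processes = [subprocess.Popen(cmd, env=env) for cmd, env in jobs]
    return [p.wait() for p in processes]
```

With this approach both runs can keep DEVICE set to 'cuda:0'; for example, run_all(launch_commands("train_ssd.py", [0, 1])) would start one training process per GPU, assuming the script needs no further arguments.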

qfgaohao · 2018-12-19T21:39:49Z

@AlanStark I haven't tested the code in a multi-GPU environment. https://github.com/pytorch/examples/tree/master/imagenet may be useful as a reference. Good luck!

CoskunGorkem · 2020-03-09T10:21:11Z

Hello @AlanStark,
I changed some lines in train_ssd.py to make all GPUs available, and it worked.
You can modify the train and test functions as follows:

def train(loader, net, criterion, optimizer, device, debug_steps=100, epoch=-1):
    # Wrap the model so each batch is split across all visible GPUs.
    net = nn.DataParallel(net)
    net.train(True)
    # ... rest of the function unchanged

def test(loader, net, criterion, device):
    # Same wrapping for evaluation.
    net = nn.DataParallel(net)
    net.eval()
    # ... rest of the function unchanged

AiueoABC · 2020-04-20T09:14:23Z

Hi @Gorkem7,
Did you get the error below?

RuntimeError: Expected tensor for argument #1 'input' to have the same device as tensor for argument #2 'weight'; but device 1 does not equal 0 (while checking arguments for cudnn_batch_norm)

I got this when I tried your solution for vgg16-ssd training. If you have already solved it, I would like to know how to fix it.

donbonjenbi · 2020-09-22T01:07:58Z

Hi @Gorkem7, @AiueoABC,

I got the same error in vgg16-ssd training when using net = nn.DataParallel(net):

RuntimeError: Expected tensor for argument #1 'input' to have the same device as tensor for argument #2 'weight'; but device 1 does not equal 0 (while checking arguments for cudnn_batch_norm)

Did you find a way around it?

shiyuetianqiang · 2021-01-07T12:26:24Z

> Hello @AlanStark,
> I have changed some lines in train_ssd.py to make all GPUs available and it worked.
> You can manipulate the train and testing functions as:
>
> def train(loader, net, criterion, optimizer, device, debug_steps=100, epoch=-1):
>     net = nn.DataParallel(net)
>     net.train(True)
>
> def test(loader, net, criterion, device):
>     net = nn.DataParallel(net)
>     net.eval()

Hi, I followed your instructions and got the same error as above:
RuntimeError: Expected tensor for argument #1 'input' to have the same device as tensor for argument #2 'weight'; but device 1 does not equal 0 (while checking arguments for cudnn_batch_norm)
Did you encounter this problem as well?
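
One possible cause of the RuntimeError reported above, offered as an assumption and not as a verified diagnosis of this repository: nn.DataParallel replicates only registered submodules. If forward() reaches a layer through a plain Python list or tuple, every replica still calls the original layer on device 0, so the replica running on device 1 feeds a device-1 input to device-0 weights. The sketch below uses two hypothetical toy modules (ViaPlainList, ViaModuleList) to show the pattern and a replica-safe variant. It also wraps the model once, before the training loop, so it is not re-wrapped on every call to train().

```python
import torch
from torch import nn


class ViaPlainList(nn.Module):
    """Pattern that can break under DataParallel (hypothetical toy model)."""

    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 8, kernel_size=3, padding=1)
        bn = nn.BatchNorm2d(8)
        self.registered = nn.ModuleList([bn])  # registered, so .to() moves it
        self.lookup = [bn]                     # plain list holding the same object

    def forward(self, x):
        # On a replica, self.lookup still refers to the ORIGINAL layer on
        # device 0, while self.conv is the replica's copy on its own device.
        return self.lookup[0](self.conv(x))


class ViaModuleList(nn.Module):
    """Replica-safe variant: forward() only uses registered submodules."""

    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 8, kernel_size=3, padding=1)
        self.extras = nn.ModuleList([nn.BatchNorm2d(8)])

    def forward(self, x):
        return self.extras[0](self.conv(x))


def wrap_for_training(net):
    """Move the model to the first GPU (or CPU) and wrap it once if needed."""
    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
    net = net.to(device)
    if torch.cuda.device_count() > 1:
        net = nn.DataParallel(net)  # inputs are scattered from `device`
    return net, device
```

A typical call would be net, device = wrap_for_training(ViaModuleList()), followed by out = net(images.to(device)), where images is a batch tensor from the loader. On a single GPU or on CPU both toy modules behave the same; the difference only shows up once replicas exist on more than one device.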
