Training on two GPUs #26

Open
AlanStark opened this issue Dec 19, 2018 · 5 comments

Comments

@AlanStark

Hello @qfgaohao ,

I am trying to run two experiments simultaneously, one with
DEVICE = torch.device('cuda:0' if torch.cuda.is_available() and args.use_cuda else 'cpu')
and the other with
DEVICE = torch.device('cuda:1' if torch.cuda.is_available() and args.use_cuda else 'cpu')
The first one works fine and uses a reasonable amount of GPU memory. The second does not work, no matter how small the batch size is, even though GPU 1 has enough free memory for another run.
Do you have any idea what could cause this?
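
One way to keep two simultaneous runs apart, offered only as a sketch and not as a confirmed fix for the problem above, is to give each process its own CUDA_VISIBLE_DEVICES value. Each run then sees exactly one GPU, which appears inside that process as cuda:0. The helper name env_for_gpu is hypothetical, and the arguments to train_ssd.py are left elided in the usage comment; substitute your own.

```python
import os


def env_for_gpu(gpu_index, base_env=None):
    """Return a copy of the environment that exposes a single GPU.

    CUDA_VISIBLE_DEVICES is read when CUDA initialises, so it has to be
    set before the training process starts using the GPU.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_index)
    return env


if __name__ == "__main__":
    # Hypothetical usage: launch one training process per GPU.
    #   import subprocess, sys
    #   subprocess.Popen([sys.executable, "train_ssd.py", ...], env=env_for_gpu(1))
    for gpu in (0, 1):
        print(gpu, env_for_gpu(gpu)["CUDA_VISIBLE_DEVICES"])
```

With this approach the script can keep 'cuda:0' in both runs, because each process only sees the GPU it was given.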

@qfgaohao
Owner

@AlanStark I haven't tested the code in a multi-GPU environment. https://github.com/pytorch/examples/tree/master/imagenet may be useful as a reference. Good luck!

@CoskunGorkem

Hello @AlanStark,
I changed a few lines in train_ssd.py to use all available GPUs, and it worked.
You can modify the train and test functions as follows:

def train(loader, net, criterion, optimizer, device, debug_steps=100, epoch=-1):
    net = nn.DataParallel(net)
    net.train(True)

def test(loader, net, criterion, device):
    net = nn.DataParallel(net)
    net.eval()
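
The suggestion above can be sketched as follows. This is a minimal illustration, not the repository's code: TinyNet is a hypothetical stand-in for the SSD network and wrap_for_multi_gpu is a made-up helper name. It wraps the model once, before training starts, instead of re-wrapping it on every call to train or test, and it moves the parameters to the first GPU before wrapping, since nn.DataParallel expects them there.

```python
import torch
from torch import nn


class TinyNet(nn.Module):
    """Hypothetical stand-in for the SSD network."""

    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 4, kernel_size=3, padding=1)
        self.bn = nn.BatchNorm2d(4)

    def forward(self, x):
        return torch.relu(self.bn(self.conv(x)))


def wrap_for_multi_gpu(net):
    """Move net to the first GPU (if any) and wrap it when several GPUs exist."""
    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
    net = net.to(device)
    if torch.cuda.device_count() > 1:
        # DataParallel splits each input batch across the visible GPUs.
        net = nn.DataParallel(net)
    return net, device


if __name__ == "__main__":
    net, device = wrap_for_multi_gpu(TinyNet())
    net.train(True)
    images = torch.randn(8, 3, 32, 32).to(device)
    print(net(images).shape)
```

On a machine without several GPUs the helper returns the unwrapped model, so the same script still runs on a single GPU or on the CPU.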

@AiueoABC

AiueoABC commented Apr 20, 2020

Hi @Gorkem7 ,
Didn't you get the error below?

RuntimeError: Expected tensor for argument #1 'input' to have the same device as tensor for argument #2 'weight'; but device 1 does not equal 0 (while checking arguments for cudnn_batch_norm)

I got this when I tried to use your solution in vgg16-ssd training. If you have already solved it, I would like to know how to fix it.

@donbonjenbi

donbonjenbi commented Sep 22, 2020

Hi @Gorkem7 , @AiueoABC,

I got the same error in vgg16-ssd training when using net = nn.DataParallel(net):

RuntimeError: Expected tensor for argument #1 'input' to have the same device as tensor for argument #2 'weight'; but device 1 does not equal 0 (while checking arguments for cudnn_batch_norm)

Did you find a way around it?
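
An error of this form means that an input on one GPU reached a layer whose weights are on another. One possible cause under nn.DataParallel, stated here as an assumption and not as a confirmed diagnosis of this repository, is a layer that forward reaches through a plain Python list or tuple. Replicas share such containers with the original model, so every replica ends up calling the layer that lives on device 0. The sketch below uses a hypothetical ExtrasNet model and a hypothetical helper, layers_in_plain_containers, to spot layers held that way.

```python
from torch import nn


class ExtrasNet(nn.Module):
    """Hypothetical model whose extra layer is kept in a list or a ModuleList."""

    def __init__(self, register):
        super().__init__()
        self.conv = nn.Conv2d(3, 4, kernel_size=1)
        layers = [nn.BatchNorm2d(4)]
        # A plain list is invisible to .to(), .parameters() and DataParallel.
        self.extras = nn.ModuleList(layers) if register else layers

    def forward(self, x):
        return self.extras[0](self.conv(x))


def _modules_inside(value):
    """Yield nn.Module objects nested in plain lists or tuples."""
    if isinstance(value, nn.Module):
        yield value
    elif isinstance(value, (list, tuple)):
        for item in value:
            yield from _modules_inside(item)


def layers_in_plain_containers(model):
    """Return (attribute name, layer type) for layers held in lists or tuples."""
    found = []
    for name, value in vars(model).items():
        if isinstance(value, (list, tuple)):
            for layer in _modules_inside(value):
                found.append((name, type(layer).__name__))
    return found


if __name__ == "__main__":
    print(layers_in_plain_containers(ExtrasNet(register=False)))
    print(layers_in_plain_containers(ExtrasNet(register=True)))
```

If such a layer turns up, holding it in an nn.ModuleList and indexing that registered copy inside forward lets nn.DataParallel replicate it onto every GPU together with the rest of the model.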

@shiyuetianqiang

Hello @AlanStark,
I changed a few lines in train_ssd.py to use all available GPUs, and it worked.
You can modify the train and test functions as follows:

def train(loader, net, criterion, optimizer, device, debug_steps=100, epoch=-1):
    net = nn.DataParallel(net)
    net.train(True)

def test(loader, net, criterion, device):
    net = nn.DataParallel(net)
    net.eval()

Hi, I followed your instructions and got the same error as above:
RuntimeError: Expected tensor for argument #1 'input' to have the same device as tensor for argument #2 'weight'; but device 1 does not equal 0 (while checking arguments for cudnn_batch_norm)
Did you encounter this problem?
