Hi,
I recently discovered that PyTorch code such as the following:
import torch
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(5, 3).to('cuda')
criterion = nn.MSELoss().to('cuda')
inputs = torch.randn(10, 5).to('cuda')
targets = torch.randn(10, 3).to('cuda')
optimizer = optim.SGD(model.parameters(), lr=0.01)

for epoch in range(1000):
    outputs = model(inputs)
    loss = criterion(outputs, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    print('Epoch %d, Loss: %.3f' % (epoch + 1, loss.item()))
which is registered and managed by nvshare at execution time with torch==1.13.1 and torch==2.0.1, is no longer intercepted with torch==2.1.
The code runs as expected and accesses the GPU defined in CUDA_VISIBLE_DEVICES, but it does so directly, bypassing the controls enforced by nvshare.
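For reference, here is a minimal sanity check (my own sketch, not part of the original script) that can be run in the same process to confirm which torch build is in use and that both environment variables are actually visible:

import os
import torch

# Sanity-check sketch: print the torch build that is running and the
# environment variables that nvshare relies on for interposition.
print('torch version:        %s' % torch.__version__)
print('CUDA runtime (torch): %s' % torch.version.cuda)
print('CUDA_VISIBLE_DEVICES: %s' % os.environ.get('CUDA_VISIBLE_DEVICES'))
print('LD_PRELOAD:           %s' % os.environ.get('LD_PRELOAD'))
print('GPU seen by torch:    %s' % torch.cuda.get_device_name(0))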
My test environment is the following:
- Ubuntu 20.04 / CUDA 12.2
- nvshare compiled and installed following the recommendations in the README
- CUDA_VISIBLE_DEVICES and LD_PRELOAD correctly set (see the launch sketch below)
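For completeness, this is roughly how the script is launched. It is only a sketch: the libnvshare.so path and the train.py filename are placeholders to adjust to the actual install.

import os
import subprocess
import sys

# Launch sketch: LD_PRELOAD is read by the dynamic loader at process start,
# so the training script is run as a child process with the environment
# prepared here. Library path and script name are assumptions.
env = dict(os.environ)
env['CUDA_VISIBLE_DEVICES'] = '0'
env['LD_PRELOAD'] = '/usr/local/lib/libnvshare.so'  # assumed install location
subprocess.run([sys.executable, 'train.py'], env=env, check=True)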
Any idea why this happens, and is there a way to prevent it (either in nvshare or in the PyTorch code) when CUDA_VISIBLE_DEVICES and LD_PRELOAD are correctly set?