
PyTorch 2.1: GPU access not seen / managed by nvshare #11


Description

@t-arsicaud-catie

Hi,

I recently discovered that PyTorch code such as the following:

import torch
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(5, 3).to('cuda')
criterion = nn.MSELoss().to('cuda')

inputs = torch.randn(10, 5).to('cuda')
targets = torch.randn(10, 3).to('cuda')

optimizer = optim.SGD(model.parameters(), lr=0.01)

for epoch in range(1000):
    outputs = model(inputs)
    loss = criterion(outputs, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    print('Epoch %d, Loss: %.3f' % (epoch+1, loss.item()))

which is registered and managed by nvshare at execution time with torch==1.13.1 and torch==2.0.1, is no longer with torch==2.1.

The code runs as expected and accesses the GPU defined in CUDA_VISIBLE_DEVICES, but it does so directly, bypassing the controls made by nvshare.

My test environment is the following:

  • Ubuntu 20.04 / CUDA 12.2
  • nvshare compiled and installed following the recommendations in the README
  • CUDA_VISIBLE_DEVICES and LD_PRELOAD correctly set

Any idea why this happens, and is there a way to prevent it (in nvshare or in the PyTorch code) when CUDA_VISIBLE_DEVICES and LD_PRELOAD are correctly set?
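
For reference, here is a small diagnostic sketch I can run inside the same Python process to confirm that the environment variables are set and that the interposer library is actually mapped. The filename "libnvshare" is an assumption based on the README; adjust it if your build names the library differently.

import os

# Show what the process actually sees, not what was exported in the shell.
print("LD_PRELOAD =", os.environ.get("LD_PRELOAD"))
print("CUDA_VISIBLE_DEVICES =", os.environ.get("CUDA_VISIBLE_DEVICES"))

# List the shared objects mapped into this process (Linux only).
with open("/proc/self/maps") as maps:
    loaded = {line.split()[-1] for line in maps if "/" in line}

# Assumed library name: libnvshare (from the README); change if needed.
hooked = [path for path in loaded if "libnvshare" in path]
print("nvshare library mapped:", hooked if hooked else "NOT FOUND")

In my case the variables are set correctly, so the question is why the torch==2.1 build does not go through the preloaded library.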
