[BUG] `apply_model` fails on multi-GPU due to hardcoded CUDA device #5271

deltheil · 2024-12-13T15:20:33Z

Describe the problem

On a multi-GPU machine, using apply_model() with a HF transformer model gives a runtime error if the model is moved to another GPU than the default one.

Code to reproduce issue

import torch
import fiftyone as fo
import fiftyone.zoo as foz
from transformers import MobileNetV2ForImageClassification

dataset = foz.load_zoo_dataset("quickstart", max_samples=25)
model = MobileNetV2ForImageClassification.from_pretrained("google/mobilenet_v2_1.0_224")
assert torch.cuda.device_count() > 1
model.to("cuda:1")
dataset.apply_model(model, label_field="image_classif", skip_failures=False)

This gives a runtime error due to a device mismatch with the model preprocessor:

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1! (when checking argument for argument weight in method wrapper_CUDA__cudnn_convolution)

System information

OS Platform and Distribution (e.g., Linux Ubuntu 22.04): Ubuntu 24.04 LTS
Python version (python --version): Python 3.12.7
FiftyOne version (fiftyone --version): FiftyOne v1.1.0, Voxel51, Inc.
FiftyOne installed from (pip or source): pip (via rye)

Other info/logs

After model conversion (into a FiftyOne Model), there are three occurrences of hardcoded "cuda" device like this:

fiftyone/fiftyone/utils/transformers.py

Lines 454 to 456 in e7f3edd

    
           self.device = ( 
        
               "cuda" if next(self.model.parameters()).is_cuda else "cpu" 
        
           )

And then, at predict time, self.device is used to move the preprocessed inputs on the GPU:

fiftyone/fiftyone/utils/transformers.py

Lines 702 to 705 in e7f3edd

    
           def _predict(self, inputs): 
        
               with torch.no_grad(): 
        
                   results = self.model(**inputs.to(self.device)) 
        
               return to_classification(results, self.model.config.id2label)

=> Hence the mismatch when the model has been moved to another GPU than cuda:0.

This could be replaced by self.model.device and/or, at CTOR-time, storing the attribute as self.device = self.model.device.

Willingness to contribute

The FiftyOne Community encourages bug fix contributions. Would you or another
member of your organization be willing to contribute a fix for this bug to the
FiftyOne codebase?

Yes. I can contribute a fix for this bug independently
Yes. I would be willing to contribute a fix for this bug with guidance
from the FiftyOne community
No. I cannot contribute a bug fix at this time

cc @brimoor

The text was updated successfully, but these errors were encountered:

deltheil · 2024-12-17T11:14:09Z

I just noticed the problem is not limited to HF transformers models. E.g. because of this hardcoded cuda() call here:

fiftyone/fiftyone/utils/clip/zoo.py

Lines 189 to 190 in b2734ab

    
           if self._using_gpu: 
        
               imgs = imgs.cuda()

The same problem occurs with e.g. a TorchImageModel. To reproduce it:

import fiftyone.zoo as foz
from PIL import Image

model = foz.load_zoo_model("clip-vit-base32-torch", device="cuda:1")
y = model.predict(Image.open("test.jpg"))

Fails with:

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1! (when checking argument for argument weight in method wrapper_CUDA__cudnn_convolution)

deltheil added the bug Bug fixes label Dec 13, 2024

deltheil changed the title ~~[BUG] Hugging Face Transformers: apply_model error due to hardcoded CUDA device~~ [BUG] apply_model fails on multi-GPU due to hardcoded CUDA device Dec 17, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[BUG] `apply_model` fails on multi-GPU due to hardcoded CUDA device #5271

[BUG] `apply_model` fails on multi-GPU due to hardcoded CUDA device #5271

deltheil commented Dec 13, 2024 •

edited

Loading

deltheil commented Dec 17, 2024 •

edited

Loading

[BUG] apply_model fails on multi-GPU due to hardcoded CUDA device #5271

[BUG] apply_model fails on multi-GPU due to hardcoded CUDA device #5271

Comments

deltheil commented Dec 13, 2024 • edited Loading

Describe the problem

Code to reproduce issue

System information

Other info/logs

Willingness to contribute

deltheil commented Dec 17, 2024 • edited Loading

[BUG] `apply_model` fails on multi-GPU due to hardcoded CUDA device #5271

[BUG] `apply_model` fails on multi-GPU due to hardcoded CUDA device #5271

deltheil commented Dec 13, 2024 •

edited

Loading

deltheil commented Dec 17, 2024 •

edited

Loading