cudaMalloc error out of memory #3642

turovapolina · 2021-12-18T09:01:46Z

turovapolina
Dec 18, 2021

Summary

When I am using pytorch-metric-learning package which refers to faiss I am getting an error
Error: 'err == cudaSuccess' failed: StandardGpuResources: alloc fail type TemporaryMemoryBuffer dev 0 space Device stream 0x56211a0ab660 size 1610612736 bytes (cudaMalloc error out of memory [2])
Probably this question has been asked several times before but I didn't find any working instructions or explanations for this. I found in the documentation that temporary memory never exceeds 1.5 Gb (https://faiss.ai/cpp_api/class/classfaiss_1_1gpu_1_1StandardGpuResourcesImpl.html#_CPPv4N5faiss3gpu24StandardGpuResourcesImpl13setTempMemoryE6size_t) but I just want to be sure that nothing can be done to extend this or somehow solve this problem.

Platform

OS: macOS, but calculations are performed on Google Colab Pro+

Faiss version: 1.7.1

Installed from: pip

Faiss compilation options:

Running on:

CPU
GPU

Interface:

C++
Python

Reproduction instructions

I am using pytorch-metric-learning and at the stage of test I am using accuracy_calculator.get_accuracy function (https://kevinmusgrave.github.io/pytorch-metric-learning/accuracy_calculation/). As an architecture I am using ResNet18 with modified lats fc layer (from 1000 to 32).

When this function is called I have this error:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-25-f624cfee57e5> in <module>()
      1 for epoch in range(1, num_epochs + 1):
      2     train(pretrained, loss_func, mining_func, device, train_dataloader, optimizer, epoch)
----> 3     test(training_data, test_data, pretrained, accuracy_calculator)

8 frames
<ipython-input-22-ac1570fcb986> in test(train_set, test_set, model, accuracy_calculator)
     33     print("Computing accuracy")
     34     accuracies = accuracy_calculator.get_accuracy(
---> 35         test_embeddings, train_embeddings, test_labels, train_labels, False
     36     )
     37     print("Test set accuracy (Precision@1) = {}".format(accuracies["precision_at_1"]))

/usr/local/lib/python3.7/dist-packages/pytorch_metric_learning/utils/accuracy_calculator.py in get_accuracy(self, query, reference, query_labels, reference_labels, embeddings_come_from_same_source, include, exclude)
    403 
    404             knn_distances, knn_indices = self.knn_func(
--> 405                 query, num_k, reference, embeddings_come_from_same_source
    406             )
    407 

/usr/local/lib/python3.7/dist-packages/pytorch_metric_learning/utils/inference.py in __call__(self, query, k, reference, embeddings_come_from_same_source)
    189             k,
    190             is_cuda,
--> 191             self.gpus,
    192         )
    193         distances = c_f.to_device(distances, device=device)

/usr/local/lib/python3.7/dist-packages/pytorch_metric_learning/utils/inference.py in try_gpu(index, query, reference, k, is_cuda, gpus)
    260         max_k_for_gpu = 1024 if float(torch.version.cuda) < 9.5 else 2048
    261         if k <= max_k_for_gpu:
--> 262             gpu_index = convert_to_gpu_index(index, gpus)
    263     try:
    264         return add_to_index_and_search(gpu_index, query, reference, k)

/usr/local/lib/python3.7/dist-packages/pytorch_metric_learning/utils/inference.py in convert_to_gpu_index(index, gpus)
    242         return index
    243     if gpus is None:
--> 244         return faiss.index_cpu_to_all_gpus(index)
    245     return faiss.index_cpu_to_gpus_list(index, gpus=gpus)
    246 

/usr/local/lib/python3.7/dist-packages/faiss/__init__.py in index_cpu_to_all_gpus(index, co, ngpu)
    864 
    865 def index_cpu_to_all_gpus(index, co=None, ngpu=-1):
--> 866     index_gpu = index_cpu_to_gpus_list(index, co=co, gpus=None, ngpu=ngpu)
    867     return index_gpu
    868 

/usr/local/lib/python3.7/dist-packages/faiss/__init__.py in index_cpu_to_gpus_list(index, co, gpus, ngpu)
    876         gpus = range(ngpu)
    877     res = [StandardGpuResources() for _ in gpus]
--> 878     index_gpu = index_cpu_to_gpu_multiple_py(res, index, co, gpus)
    879     return index_gpu
    880 

/usr/local/lib/python3.7/dist-packages/faiss/__init__.py in index_cpu_to_gpu_multiple_py(resources, index, co, gpus)
    859         vdev.push_back(i)
    860         vres.push_back(res)
--> 861     index = index_cpu_to_gpu_multiple(vres, vdev, index, co)
    862     return index
    863 

/usr/local/lib/python3.7/dist-packages/faiss/swigfaiss.py in index_cpu_to_gpu_multiple(provider, devices, index, options)
   6640 
   6641 def index_cpu_to_gpu_multiple(provider, devices, index, options=None):
-> 6642     return _swigfaiss.index_cpu_to_gpu_multiple(provider, devices, index, options)
   6643 class GpuProgressiveDimIndexFactory(ProgressiveDimIndexFactory):
   6644     thisown = property(lambda x: x.this.own(), lambda x, v: x.this.own(v), doc="The membership flag")

RuntimeError: Error in virtual void* faiss::gpu::StandardGpuResourcesImpl::allocMemory(const faiss::gpu::AllocRequest&) at /project/faiss/faiss/gpu/StandardGpuResources.cpp:452: Error: 'err == cudaSuccess' failed: StandardGpuResources: alloc fail type TemporaryMemoryBuffer dev 0 space Device stream 0x56211a0ab660 size 1610612736 bytes (cudaMalloc error out of memory [2])

Size of samples in my dataset is 750x750 and I understand that this might me too large. When I am compressing them to 500x500 code works well without this error.
However, these samples are not pictures, they represent results of some chemical experiment (spectra) and I am highly interested in the procedure where I am not compressing them because every point represents specific measurement.
I am using Google Colab Pro+ version and my runtime has 54.8 gigabytes of available RAM.
So for me it seems reasonable to extend size of my temporary memory if it is possible at all.

Or probably I misunderstand the whole situation and if it is so please correct me. I will be glad to provide any other details.

Thank you very much in advance!

Best regards,
Polina

mdouze · 2021-12-25T09:45:40Z

mdouze
Dec 25, 2021
Collaborator

I don't understand what the sample size means. What is the total size of the dataset that must be moved to GPU? If the size exceeds GPU RAM but by less than a factor 2 it may be useful to encode it in 16-bit floats, see

https://github.com/facebookresearch/faiss/blob/main/faiss/gpu/GpuClonerOptions.h

useFloat16

0 replies

Loose-Gu · 2022-10-05T06:21:38Z

Loose-Gu
Oct 5, 2022

I don't understand what the sample size means. What is the total size of the dataset that must be moved to GPU? If the size exceeds GPU RAM but by less than a factor 2 it may be useful to encode it in 16-bit floats, see

https://github.com/facebookresearch/faiss/blob/main/faiss/gpu/GpuClonerOptions.h

useFloat16

can i use float16 in python version when indexing the dataset to gpu?

0 replies

spookyQubit · 2023-08-23T03:52:34Z

spookyQubit
Aug 23, 2023

@turovapolina , did you find a solution to get around the Error: 'err == cudaSuccess' failed: StandardGpuResources: alloc fail type TemporaryMemoryBuffer error?

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

cudaMalloc error out of memory #3642

{{title}}

Replies: 3 comments

{{title}}

{{title}}

{{title}}

Select a reply

cudaMalloc error out of memory #3642

turovapolina Dec 18, 2021

Summary

Platform

Reproduction instructions

Replies: 3 comments

mdouze Dec 25, 2021 Collaborator

Loose-Gu Oct 5, 2022

spookyQubit Aug 23, 2023

turovapolina
Dec 18, 2021

mdouze
Dec 25, 2021
Collaborator

Loose-Gu
Oct 5, 2022

spookyQubit
Aug 23, 2023