
Potential Memory Leak #278

Closed
Phoveran opened this issue Nov 5, 2021 · 4 comments

Comments


Phoveran commented Nov 5, 2021

I'm running examples/vision/anil_fc100.py unmodified, with:

  1. learn2learn 0.1.6 (installed via pip install learn2learn)
  2. PyTorch 1.10
  3. CUDA 11.4
  4. a single RTX 2080 Ti

and it crashes after 3 iterations with a CUDA out-of-memory error:

Iteration 0
Meta Train Error 1.496383372694254
Meta Train Accuracy 0.3362499892245978
Meta Valid Error 1.6015896834433079
Meta Valid Accuracy 0.2937499925028533
Meta Test Error 1.5358335226774216
Meta Test Accuracy 0.3562499899417162


Iteration 1
Meta Train Error 1.4316311068832874
Meta Train Accuracy 0.39999998873099685
Meta Valid Error 1.588501501828432
Meta Valid Accuracy 0.28749999054707587
Meta Test Error 1.45738809928298
Meta Test Accuracy 0.3774999915622175


Iteration 2
Meta Train Error 1.3444917295128107
Meta Train Accuracy 0.47624998819082975
Meta Valid Error 1.5741207413375378
Meta Valid Accuracy 0.28874999145045877
Meta Test Error 1.4722651988267899
Meta Test Accuracy 0.3599999900907278
Traceback (most recent call last):
  File "/home/stan/work/icml2022/test.py", line 207, in <module>
    main()
  File "/home/stan/work/icml2022/test.py", line 179, in main
    evaluation_error, evaluation_accuracy = fast_adapt(batch,
  File "/home/stan/work/icml2022/test.py", line 32, in fast_adapt
    data = features(data)
  File "/home/stan/tool/miniconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/stan/tool/miniconda3/lib/python3.9/site-packages/learn2learn/vision/models/cnn4.py", line 247, in forward
    x = super(CNN4Backbone, self).forward(x)
  File "/home/stan/tool/miniconda3/lib/python3.9/site-packages/torch/nn/modules/container.py", line 141, in forward
    input = module(input)
  File "/home/stan/tool/miniconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/stan/tool/miniconda3/lib/python3.9/site-packages/learn2learn/vision/models/cnn4.py", line 96, in forward
    x = self.relu(x)
  File "/home/stan/tool/miniconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/stan/tool/miniconda3/lib/python3.9/site-packages/torch/nn/modules/activation.py", line 98, in forward
    return F.relu(input, inplace=self.inplace)
  File "/home/stan/tool/miniconda3/lib/python3.9/site-packages/torch/nn/functional.py", line 1299, in relu
    result = torch.relu(input)
RuntimeError: CUDA out of memory. Tried to allocate 14.00 MiB (GPU 0; 10.76 GiB total capacity; 9.25 GiB already allocated; 2.31 MiB free; 9.45 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CON
Phoveran commented Nov 5, 2021

Also, I've checked: the memory usage increases linearly with the number of iterations.
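A common cause of memory that grows linearly with iterations (not confirmed to be the bug here) is holding on to loss tensors across iterations without detaching them, which keeps every iteration's computation graph and activations alive. A minimal CPU-only sketch with hypothetical names:

```python
import torch

def eval_step(model, x, y, detach):
    """One hypothetical evaluation step; the forward pass builds a graph."""
    loss = torch.nn.functional.mse_loss(model(x), y)
    # Returning the raw tensor keeps the whole graph (and its activations)
    # alive for as long as the caller holds the reference; detaching frees it.
    return loss.detach() if detach else loss

model = torch.nn.Linear(4, 1)
x, y = torch.randn(8, 4), torch.randn(8, 1)

leaky = eval_step(model, x, y, detach=False)  # graph retained
safe = eval_step(model, x, y, detach=True)    # graph freed

print(leaky.requires_grad, safe.requires_grad)  # True False
```

If such a reference is accumulated in a list or running sum every iteration, GPU memory grows until the allocator fails, which matches the linear growth described above.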

nightlessbaron (Contributor) commented

I have run the same code on [Colab Notebook], and it works fine for me.
Can you try decreasing the meta batch size to 16 and see if that works?

Phoveran commented Nov 5, 2021

> I have run the same code on [Colab Notebook], and it works fine for me. Can you try decreasing the meta batch size to 16 and see if that works?

Thanks for your timely reply!
I've decreased it all the way to 1, but it still doesn't work.
Maybe there's something wrong with my GPU; I'll test this on another machine and see if it works.

@Phoveran Phoveran closed this as completed Nov 7, 2021
pandeydeep9 commented

I am also getting a similar error. I installed learn2learn using "pip install learn2learn". When I run maml_miniimagenet.py (from learn2learn/examples/vision/maml_miniimagenet.py) with a batch size of 2 and shot = 1, I get the same error after 63 iterations. When I change to shot = 5, I get the error after 3 iterations.

Iteration 63
Meta Train Error 2.0417345762252808
Meta Train Accuracy 0.20000000298023224
Meta Valid Error 1.8002310991287231
Meta Valid Accuracy 0.20000000298023224
Traceback (most recent call last):
  File "/home/deep/Desktop/IMPLEMENTATION/MyTry/MetaSGD/mini_Temp_Test.py", line 156, in <module>
    main()
  File "/home/deep/Desktop/IMPLEMENTATION/MyTry/MetaSGD/mini_Temp_Test.py", line 106, in main
    evaluation_error.backward()
  File "/home/deep/.local/lib/python3.8/site-packages/torch/_tensor.py", line 307, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/home/deep/.local/lib/python3.8/site-packages/torch/autograd/__init__.py", line 154, in backward
    Variable._execution_engine.run_backward(
RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 5.79 GiB total capacity; 3.60 GiB already allocated; 77.56 MiB free; 3.62 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

However, if I comment out the meta-validation loss part (line 114-112 in this script), I don't get the memory leak problem. I wonder how the issue can be solved?
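One general PyTorch pattern that avoids retaining the evaluation graph (a sketch, not necessarily the fix for the learn2learn example script) is to run the evaluation forward pass under torch.no_grad() and accumulate the loss as a plain float via .item():

```python
import torch

model = torch.nn.Linear(4, 2)
data = torch.randn(16, 4)
targets = torch.randint(0, 2, (16,))

total_error = 0.0
# no_grad() stops autograd from recording the forward pass, so activations
# are freed immediately instead of accumulating across iterations
with torch.no_grad():
    for _ in range(3):
        logits = model(data)
        loss = torch.nn.functional.cross_entropy(logits, targets)
        total_error += loss.item()  # .item() yields a float, no graph attached

print(total_error)
```

Note that MAML-style meta-validation still needs gradients for the inner-loop adapt step, so there no_grad() can only wrap the parts that do not adapt; accumulating the final evaluation loss with .item() rather than keeping the tensor is the generally safe part of this pattern.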
