Potential Memory Leak Error #284

pandeydeep9 · 2021-11-17T22:19:21Z

I installed learn2learn using "pip install learn2learn". When I try to run maml_miniimagenet.py (from learn2learn/examples/vision/maml_miniimagenet.py ) with a batch size of 2 and shot = 1, I get the same error after 63 iterations. When I change to shot = 5, I get the error after 3 iterations.

Iteration 63
Meta Train Error 2.0417345762252808
Meta Train Accuracy 0.20000000298023224
Meta Valid Error 1.8002310991287231
Meta Valid Accuracy 0.20000000298023224
Traceback (most recent call last):
File "/home/deep/Desktop/IMPLEMENTATION/MyTry/MetaSGD/mini_Temp_Test.py", line 156, in
main()
File "/home/deep/Desktop/IMPLEMENTATION/MyTry/MetaSGD/mini_Temp_Test.py", line 106, in main
evaluation_error.backward()
File "/home/deep/.local/lib/python3.8/site-packages/torch/_tensor.py", line 307, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
File "/home/deep/.local/lib/python3.8/site-packages/torch/autograd/init.py", line 154, in backward
Variable._execution_engine.run_backward(
RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 5.79 GiB total capacity; 3.60 GiB already allocated; 77.56 MiB free; 3.62 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

When I look at nvidia-smi, the memory usage gradually increases with each iteration.
However, If I comment out the meta-validation loss part, (line 114-112 in this script) then I don't get the memory leak problem. I think the issue is similar to (Potential Memory Leak #278 ) I wonder why this issue is and how the issue can be solved?

Phoveran · 2021-11-18T01:40:49Z

Actually I was just occupied by another project, so I had not solved this but closed the issue.
Maybe the problem does exist.

seba-1511 · 2021-11-20T20:59:40Z

Thanks for raising the issue @pandeydeep9 and @Phoveran,

These leaks are worrisome. Could you share more about your setup? Which GPU, CPU, and versions of Python, PyTorch, and learn2learn? It seems to be hardward-dependent since @nightlessbaron wasn’t able to reproduce the bug on Colab. Also, are you running the mini-imagenet script as-is?

pandeydeep9 · 2021-11-21T14:38:05Z

I reduced the meta_batch_size parameter to 2 and shots to 1. That is the only change I made in the example mini-imagenet script.
My CPU is Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz. My GPU is TU106M [GeForce RTX 2060 Mobile] , version: a1, clock 33MHz.
I tried on Python 3.8.5, learn2learn version 0.1.5, and torch version is '1.10.0+cu102'

tranquangchung · 2021-11-24T03:01:11Z

I have the same issue, even I run "maml_miniimagenet.py" on A5000 with 24GB
After a few iterations, It gives the following error message "CUDA out of memory"
I try many version 0.1.3, 0.1.4, 0.1.5, 0.1.6
So I think that if I use this library for my project, I will face much trouble in the future.
Can you give me some advice or how to fix it?
Many Thanks

seba-1511 · 2021-11-24T03:23:56Z

Thanks for the additional feedback. Are you also using PyTorch v.1.10? And does commenting out the validation step also fix the memory leak?

tranquangchung · 2021-11-24T03:39:38Z

Yes, I use Pytorch 1.10.0, CUDA Version: 11.2, python 3.8.12,

Phoveran · 2021-11-24T04:30:45Z

My setting:

python: 3.9.7
learn2learn: 0.1.6 (using pip install learn2learn)
PyTorch: 1.10
CUDA: 11.4
GPU: RTX 2080Ti

seba-1511 · 2021-11-24T16:35:35Z

Thanks for the extra info, I wonder if the issue cropped up with PyTorch 1.10 on CUDA 11+. As a temporary fix, does changing learner = maml.clone() to learner = maml.clone(first_order=True) on l. 112 solve the leak for you?

pandeydeep9 · 2021-11-24T17:03:07Z

Yes, adding first_order=True on l. 112 solves the leak problem. Also, I guess this should give the expected results as I believe we can use the first order MAML during the validation/test phases and get the same results (i.e. do not need to track gradients for MAML during test/validation phases).

Thanks

ligeng0197 · 2021-12-02T14:18:56Z

We meet the same case and our setting is pytorch 1.10.0, python 3.9.5, cuda 11.5, tesla m40(24G). We are glad to see this issue published since we debug our code repeatedly and have no idea what's causing the increasing cuda memory occupation over val or test iterations.

tobiasvanderwerff · 2021-12-14T09:37:08Z

I was facing the same issue, but managed to solve it by downgrading Pytorch from version 1.10 to 1.9. I was using the following setup:

learn2learn 0.1.6
Python 3.8.6
Pytorch 1.10
GPU: Nvidia V100 (32gb)
Cuda 10.2

Using this setup, memory usage kept increasing over epochs until an out-of-memory error occurred. However, when using Pytorch 1.9, memory usage stabilizes.

sjtugzx · 2021-12-29T02:17:57Z

Thanks for the extra info, I wonder if the issue cropped up with PyTorch 1.10 on CUDA 11+. As a temporary fix, does changing learner = maml.clone() to learner = maml.clone(first_order=True) on l. 112 solve the leak for you?

Honestly, I tried to use maml for finetuning T5 transformer, befor adding "first_order=True", I just could run 2 tps, however, this way couldn't fix my problem. After adding this parameter, I could run 4 tps, but still got memory leak. I gues there are still some problems and exposed by huge networks such as transformer.

learn2learn 0.1.6
Python 3.9
Pytorch 1.10
GPU: 3080 (24GB)
Cuda 10.2

seba-1511 · 2021-12-29T10:46:44Z

The memory leak seems to have been introduced in PyTorch 1.10. @sjtugzx do you also see leaks with T5 on PyTorch 1.9?

I haven't had time to investigate it yet, so help is welcome.

kzhang2 · 2022-01-12T17:28:19Z

I have a suggestion for a potential fix. It is a little bit hacky though. In my observations, the key problem leading to the memory leak seems to be that the compute graph for the gradient update is being created, even when first_order=True. During training, I think the memory doesn't accumulate because the compute graph gets flushed when you do loss.backward(). However, at evaluation time, you never need to call loss.backward(), so there's a possibility the memory usage scales wildly.

In my code, what I've done to get rid of this extra unneeded memory usage at evaluation time is to add a eval flag to the adapt function inside MAML and MetaSGD which causes the gradient update to be wrapped in a no_grad context, so

# Update the module
self.module = maml_update(self.module, self.lr, gradients)

becomes

# Update the module
if eval:
    with torch.no_grad():
        self.module = maml_update(self.module, self.lr, gradients)
    for p in self.module.parameters():
        p.requires_grad = True
else:
    self.module = maml_update(self.module, self.lr, gradients)

I haven't investigated this in detail so I'm not sure if this is the best way to proceed, but let me know if this seems promising and if I should investigate further, and maybe even make a pull request.

seba-1511 · 2022-01-19T02:22:48Z

For people following, @kzhang2 and I have been discussing on slack and we came up with a fix. Expect a PR + release in the next 2 weeks. Meanwhile, the fix is to update the update_module function in learn2learn/utils/__init__.py as follows:

def update_module(module, updates=None, memo=None):
    r"""
    [[Source]](https://github.com/learnables/learn2learn/blob/master/learn2learn/utils.py)

    **Description**

    Updates the parameters of a module in-place, in a way that preserves differentiability.

    The parameters of the module are swapped with their update values, according to:
    \[
    p \gets p + u,
    \]
    where \(p\) is the parameter, and \(u\) is its corresponding update.


    **Arguments**

    * **module** (Module) - The module to update.
    * **updates** (list, *optional*, default=None) - A list of gradients for each parameter
        of the model. If None, will use the tensors in .update attributes.

    **Example**
    ~~~python
    error = loss(model(X), y)
    grads = torch.autograd.grad(
        error,
        model.parameters(),
        create_graph=True,
    )
    updates = [-lr * g for g in grads]
    l2l.update_module(model, updates=updates)
    ~~~
    """
    if memo is None:
        memo = {}
    if updates is not None:
        params = list(module.parameters())
        if not len(updates) == len(list(params)):
            msg = 'WARNING:update_module(): Parameters and updates have different length. ('
            msg += str(len(params)) + ' vs ' + str(len(updates)) + ')'
            print(msg)
        for p, g in zip(params, updates):
            p.update = g

    # Update the params
    for param_key in module._parameters:
        p = module._parameters[param_key]
        if p is not None and hasattr(p, 'update') and p.update is not None:
            if p in memo:
                module._parameters[param_key] = memo[p]
            else:
                updated = p + p.update
                p.update = None
                memo[p] = updated
                module._parameters[param_key] = updated

    # Second, handle the buffers if necessary
    for buffer_key in module._buffers:
        buff = module._buffers[buffer_key]
        if buff is not None and hasattr(buff, 'update') and buff.update is not None:
            if buff in memo:
                module._buffers[buffer_key] = memo[buff]
            else:
                updated = buff + buff.update
                buff.update = None
                memo[buff] = updated
                module._buffers[buffer_key] = updated

    # Then, recurse for each submodule
    for module_key in module._modules:
        module._modules[module_key] = update_module(
            module._modules[module_key],
            updates=None,
            memo=memo,
        )

    # Finally, rebuild the flattened parameters for RNNs
    # See this issue for more details:
    # https://github.com/learnables/learn2learn/issues/139
    if hasattr(module, 'flatten_parameters'):
        module._apply(lambda x: x)
    return module

seba-1511 · 2022-02-10T03:36:56Z

Quick update: this is fixed, tested, and available in the new v0.1.7 release.

aritroCoder · 2024-09-11T15:07:59Z

Hi, I am using learn2learn and getting memory leak error. This is the code I am using:

#Load model weights
model.load_state_dict(torch.load('mnist_model_weights_450.pth', map_location={'cuda:2' : 'cuda:0'}))

# run the test data
meta_test_loss = 0.0
for idx, (context_x, context_y, target_x, target_y) in enumerate(test_loader):
    context_x, context_y, target_x, target_y = context_x.to(device), context_y.to(device), target_x.to(device), target_y.to(device)
    effective_batch_size = context_x.size(0)
    for i in range(effective_batch_size):
        learner = maml.clone(first_order=True)
        x_support, y_support = context_x[i], context_y[i]
        x_query, y_query = target_x[i], target_y[i]
        y_support = y_support.view(-1)
        y_query = y_query.view(-1)
        for _ in range(num_epochs):
            wts, predictions = learner(x_support)
            loss = custom_loss_function(predictions, y_support, wts)
            learner.adapt(loss)
        wts, predictions = learner(x_query)
        loss = custom_loss_function(predictions, y_query, wts)
        meta_test_loss += loss
    meta_test_loss /= effective_batch_size
    if idx % 10 == 0:
        print(f"Iteration: {idx+1}, Meta test loss: {meta_test_loss}")
    
print(f"Final Meta test loss: {meta_test_loss}")

I am getting this error:


OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 MiB. GPU 0 has a total capacity of 14.75 GiB of which 3.06 MiB is free. Process 54265 has 14.74 GiB memory in use. Of the allocated memory 14.51 GiB is allocated by PyTorch, and 102.21 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management

learn2learn 0.2.0
Python 3.9
Pytorch 2.4.0+cu121 (using google colab)
GPU: T4 (15GB)
Cuda 12.2

Can anyone tell me how to fix it? I wrote the training loop similiarly but it runs

pandeydeep9 closed this as completed Nov 24, 2021

seba-1511 reopened this Dec 2, 2021

kzhang2 mentioned this issue Feb 10, 2022

Fix memory leak #307

Merged

4 tasks

seba-1511 closed this as completed in #307 Feb 10, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Potential Memory Leak Error #284

Potential Memory Leak Error #284

pandeydeep9 commented Nov 17, 2021

Phoveran commented Nov 18, 2021

seba-1511 commented Nov 20, 2021

pandeydeep9 commented Nov 21, 2021 •

edited

Loading

tranquangchung commented Nov 24, 2021

seba-1511 commented Nov 24, 2021

tranquangchung commented Nov 24, 2021

Phoveran commented Nov 24, 2021

seba-1511 commented Nov 24, 2021

pandeydeep9 commented Nov 24, 2021

ligeng0197 commented Dec 2, 2021

tobiasvanderwerff commented Dec 14, 2021

sjtugzx commented Dec 29, 2021 •

edited

Loading

seba-1511 commented Dec 29, 2021 •

edited

Loading

kzhang2 commented Jan 12, 2022

seba-1511 commented Jan 19, 2022

seba-1511 commented Feb 10, 2022

aritroCoder commented Sep 11, 2024

Potential Memory Leak Error #284

Potential Memory Leak Error #284

Comments

pandeydeep9 commented Nov 17, 2021

Phoveran commented Nov 18, 2021

seba-1511 commented Nov 20, 2021

pandeydeep9 commented Nov 21, 2021 • edited Loading

tranquangchung commented Nov 24, 2021

seba-1511 commented Nov 24, 2021

tranquangchung commented Nov 24, 2021

Phoveran commented Nov 24, 2021

seba-1511 commented Nov 24, 2021

pandeydeep9 commented Nov 24, 2021

ligeng0197 commented Dec 2, 2021

tobiasvanderwerff commented Dec 14, 2021

sjtugzx commented Dec 29, 2021 • edited Loading

seba-1511 commented Dec 29, 2021 • edited Loading

kzhang2 commented Jan 12, 2022

seba-1511 commented Jan 19, 2022

seba-1511 commented Feb 10, 2022

aritroCoder commented Sep 11, 2024

pandeydeep9 commented Nov 21, 2021 •

edited

Loading

sjtugzx commented Dec 29, 2021 •

edited

Loading

seba-1511 commented Dec 29, 2021 •

edited

Loading