Gluon RNN memory leaks with extra variables #13951
Comments
@mxnet-label-bot Add [Gluon, Performance] |
@mxnet-label-bot add [backend, cuda] |
@yifeim I am looking into this issue. |
@apeforest Why is this not a bug? |
@yifeim Sorry, got too busy and haven't had a chance to dive deep into this yet. Yes, I think it's a bug. @mxnet-label-bot add [Bug] |
The memory leak is related to the extra unused variable you passed into your RNN model, but it is NOT specific to RNN. In your repro script, you created a size-zero ndarray in each loop iteration, which caused the memory leak.
However, since the size-zero ndarray is never used, it is better practice to create it once outside the loop and reuse it throughout training (see the sketch below). The same change applies to the eval() function in your repro script.
With this change, I ran your repro script for 10 epochs with the mxnet_cu90mkl 1.3.1 and 1.4.0 packages and did not see a memory leak. That said, there is indeed an underlying memory leak, which is the root cause of this issue. Please refer to #14358 for more details. |
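To make the suggested change concrete, here is a minimal sketch (not the original repro script; the context, epoch count, and loop body are placeholders) of moving the unused extra NDArray out of the training loop:

```python
import mxnet as mx

context = mx.cpu()  # use mx.gpu(0) to reproduce the reported GPU memory growth

# Leaky pattern from the repro script: a size-zero NDArray is allocated
# inside every iteration.
# for epoch in range(num_epochs):
#     extra = mx.nd.array([], ctx=context)  # re-created each iteration
#     ...

# Suggested pattern: create the unused extra variable once and reuse it.
extra = mx.nd.array([], ctx=context)
num_epochs = 10  # placeholder value
for epoch in range(num_epochs):
    # forward/backward passes would go here, passing the same `extra` in
    pass
```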
@yifeim After a little more digging, I think the issue is specifically related to the use of a size-zero ndarray as your extra variable. If you instead use mx.nd.array([1], ctx=context) as the extra variable in the loop of your repro script, you will not observe any memory leak. The real problem is creating a size-zero ndarray in a loop. |
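A condensed sketch of that observation (the model is omitted and the loop counts are arbitrary; GPU memory can be watched with watch -n0.1 nvidia-smi while it runs):

```python
import mxnet as mx

context = mx.gpu(0)  # the growth was reported against GPU memory

# Reported to leak: repeatedly allocating a size-zero NDArray.
for _ in range(100000):
    leaky = mx.nd.array([], ctx=context)
    mx.nd.waitall()  # flush the async engine so allocations are observable

# Reported not to leak: the same loop with a size-one NDArray.
for _ in range(100000):
    ok = mx.nd.array([1], ctx=context)
    mx.nd.waitall()
```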
Very interesting. Thanks a lot for the insights! |
Thanks for handling @yuxihu! |
@anirudh2290 Could you please reopen this? The original fix has been reverted due to test flakiness. I am working on alternative fix. |
Description
Gluon allows one to define extra variables that do not contribute to the model output. However, having them may cause a memory leak.
Environment info (Required)
Package used (Python/R/Scala/Julia): Python
Error Message:
If you run
watch -n0.1 nvidia-smi
you may observe GPU memory growing by 2MB every few seconds.
Minimum reproducible example
See mxnet-memory-leak.tar.gz
The main differences between the attachment and examples/gluon/language_model/ are to add extra on Line 56 in model.py and to add mx.nd.array([], ctx=context) on Lines 166 and 183 in train.py (sketched below).
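The attachment itself is not reproduced here; the following is a hypothetical sketch of the kind of change described above (class name, argument names, and shapes are illustrative, not the actual contents of model.py or train.py): an unused extra argument is added to the model's forward signature, and a fresh size-zero NDArray is passed in on every call.

```python
import mxnet as mx
from mxnet import gluon

class RNNModel(gluon.Block):
    """Illustrative stand-in for the modified model.py."""
    def __init__(self, vocab_size=100, num_hidden=32, **kwargs):
        super(RNNModel, self).__init__(**kwargs)
        with self.name_scope():
            self.encoder = gluon.nn.Embedding(vocab_size, num_hidden)
            self.rnn = gluon.rnn.LSTM(num_hidden)
            self.decoder = gluon.nn.Dense(vocab_size, flatten=False)

    def forward(self, inputs, hidden, extra):  # `extra` is accepted but never used
        emb = self.encoder(inputs)
        output, hidden = self.rnn(emb, hidden)
        return self.decoder(output), hidden

context = mx.cpu()  # the issue was observed on GPU; use mx.gpu(0) to reproduce
model = RNNModel()
model.initialize(ctx=context)
hidden = model.rnn.begin_state(batch_size=4, func=mx.nd.zeros, ctx=context)
data = mx.nd.random.randint(0, 100, shape=(10, 4)).astype('float32').as_in_context(context)

# As in the modified train.py: a fresh size-zero NDArray is created per call.
output, hidden = model(data, hidden, mx.nd.array([], ctx=context))
```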
Steps to reproduce
Run train.py from the attached archive and monitor GPU memory with watch -n0.1 nvidia-smi as described above.
What have you tried to solve it?
None
input types in the gluon models. Communicated with @szha that this would not be fundamentally challenging. However, this has not been acted upon and may be a low-hanging fruit alongside the memory leak fix.
Related: #13247