Gluon RNN memory leaks with extra variables #13951
Comments
@mxnet-label-bot Add [Gluon, Performance] |
@mxnet-label-bot add [backend, cuda] |
@yifeim I am looking into this issue. |
@apeforest Why is this not a bug? |
@yifeim Sorry, got too busy and haven't had a chance to dive deep into this yet. Yes, I think it's a bug. @mxnet-label-bot add [Bug] |
The memory leak is related to the extra unused variable you passed into your RNN model, but it is NOT specific to RNN. In your repro script, you created a size-zero ndarray in each loop iteration, which caused the memory leak.
However, since the size-zero ndarray is never used, it is better practice to create it once outside the loop and reuse it throughout training (see the sketch below). The same change applies to the eval() function in your repro script.
With this change, I ran your repro script for 10 epochs with the mxnet_cu90mkl 1.3.1 and 1.4.0 packages and did not see a memory leak. That said, there is indeed an underlying memory leak, which is the root cause of this issue. Please refer to #14358 for more details. |
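To make the suggested change concrete, here is a minimal sketch (not the original repro script; the context, epoch count, and loop body are placeholders) of moving the unused extra NDArray out of the training loop:

```python
import mxnet as mx

context = mx.cpu()  # use mx.gpu(0) to reproduce the reported GPU memory growth

# Leaky pattern from the repro script: a size-zero NDArray is allocated
# inside every iteration.
# for epoch in range(num_epochs):
#     extra = mx.nd.array([], ctx=context)  # re-created each iteration
#     ...

# Suggested pattern: create the unused extra variable once and reuse it.
extra = mx.nd.array([], ctx=context)
num_epochs = 10  # placeholder value
for epoch in range(num_epochs):
    # forward/backward passes would go here, passing the same `extra` in
    pass
```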
@yifeim After a little more digging, I think the issue is specifically related to the use of a size-zero ndarray as your extra variable. If you instead use mx.nd.array([1], ctx=context) as the extra variable in the loop of your repro script, you will not observe any memory leak. The real problem is creating a size-zero ndarray in a loop. |
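A condensed sketch of that observation (the model is omitted and the loop counts are arbitrary; GPU memory can be watched with watch -n0.1 nvidia-smi while it runs):

```python
import mxnet as mx

context = mx.gpu(0)  # the growth was reported against GPU memory

# Reported to leak: repeatedly allocating a size-zero NDArray.
for _ in range(100000):
    leaky = mx.nd.array([], ctx=context)
    mx.nd.waitall()  # flush the async engine so allocations are observable

# Reported not to leak: the same loop with a size-one NDArray.
for _ in range(100000):
    ok = mx.nd.array([1], ctx=context)
    mx.nd.waitall()
```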
Very interesting. Thanks a lot for the insights! |
Thanks for handling @yuxihu! |
@anirudh2290 Could you please reopen this? The original fix has been reverted due to test flakiness. I am working on alternative fix. |
Description
Gluon allows one to define extra variables that do not contribute to the model output. However, having them may cause a memory leak.
Environment info (Required)
Package used (Python/R/Scala/Julia): Python
Error Message:
If you run
watch -n0.1 nvidia-smi
you may observe GPU memory growing by 2MB every few seconds.
Minimum reproducible example
See mxnet-memory-leak.tar.gz
The main differences between the attachment and examples/gluon/language_model/ are to add extra on Line 56 in model.py and to add mx.nd.array([], ctx=context) on Lines 166 and 183 in train.py (sketched below).
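The attachment itself is not reproduced here; the following is a hypothetical sketch of the kind of change described above (class name, argument names, and shapes are illustrative, not the actual contents of model.py or train.py): an unused extra argument is added to the model's forward signature, and a fresh size-zero NDArray is passed in on every call.

```python
import mxnet as mx
from mxnet import gluon

class RNNModel(gluon.Block):
    """Illustrative stand-in for the modified model.py."""
    def __init__(self, vocab_size=100, num_hidden=32, **kwargs):
        super(RNNModel, self).__init__(**kwargs)
        with self.name_scope():
            self.encoder = gluon.nn.Embedding(vocab_size, num_hidden)
            self.rnn = gluon.rnn.LSTM(num_hidden)
            self.decoder = gluon.nn.Dense(vocab_size, flatten=False)

    def forward(self, inputs, hidden, extra):  # `extra` is accepted but never used
        emb = self.encoder(inputs)
        output, hidden = self.rnn(emb, hidden)
        return self.decoder(output), hidden

context = mx.cpu()  # the issue was observed on GPU; use mx.gpu(0) to reproduce
model = RNNModel()
model.initialize(ctx=context)
hidden = model.rnn.begin_state(batch_size=4, func=mx.nd.zeros, ctx=context)
data = mx.nd.random.randint(0, 100, shape=(10, 4)).astype('float32').as_in_context(context)

# As in the modified train.py: a fresh size-zero NDArray is created per call.
output, hidden = model(data, hidden, mx.nd.array([], ctx=context))
```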
Steps to reproduce
Run train.py from the attached archive and monitor GPU memory with watch -n0.1 nvidia-smi as described above.
What have you tried to solve it?
None
input types in the gluon models. Communicated with @szha that this would not be fundamentally challenging. However, this has not been acted upon and may be a low-hanging fruit alongside the memory leak fix.
Related: #13247