This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

mxnet R updated version out of memory error #11682

Closed
lichen11 opened this issue Jul 13, 2018 · 5 comments

Comments

@lichen11

I recently updated the mxnet package to 1.3.0 and keep hitting an out-of-memory error. The same code ran on mxnet 1.0.1 and earlier without ever producing the following error.

Error in mx.nd.internal.as.array(nd) :
[00:58:56] src/storage/./pooled_storage_manager.h:118: cudaMalloc failed: out of memory

Stack trace returned 10 entries:
[bt] (0) /home/username/R/x86_64-redhat-linux-gnu-library/3.5/mxnet/libs/libmxnet.so(dmlc::StackTrace()+0x4a) [0x7f6e319552ba]
[bt] (1) /home/username/R/x86_64-redhat-linux-gnu-library/3.5/mxnet/libs/libmxnet.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x21) [0x7f6e319558c1]
[bt] (2) /home/username/R/x86_64-redhat-linux-gnu-library/3.5/mxnet/libs/libmxnet.so(mxnet::storage::GPUPooledStorageManager::Alloc(mxnet::Storage::Handle*)+0x1bb) [0x7f6e3447ed2b]
[bt] (3) /home/username/R/x86_64-redhat-linux-gnu-library/3.5/mxnet/libs/libmxnet.so(mxnet::StorageImpl::Alloc(mxnet::Storage::Handle*)+0x55) [0x7f6e344818a5]
[bt] (4) /home/username/R/x86_64-redhat-linux-gnu-library/3.5/mxnet/libs/libmxnet.so(mxnet::NDArray::CheckAndAlloc() const+0x19b) [0x7f6e31a4a7db]
[bt] (5) /home/username/R/x86_64-redhat-linux-gnu-library/3.5/mxnet/libs/libmxnet.so(+0x331257d) [0x7f6e33fc45

At first, when I encountered this problem, I would restart R using q("no"), and the code would then train without the error. But now simply restarting R no longer solves the issue.

Another thing I noticed is that the official webpage lists mxnet 1.2.0, but my R sessionInfo() shows that mxnet 1.3.0 is loaded.

R version 3.5.0 (2018-04-23)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS/LAPACK: /usr/lib64/R/lib/libRblas.so

locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics grDevices utils datasets
[6] methods base

other attached packages:
[1] caret_6.0-80 ggplot2_2.2.1 lattice_0.20-35
[4] mxnet_1.3.0 readr_1.1.1 dplyr_0.7.5
[7] imager_0.41.1 magrittr_1.5

Please provide some guidance on solving this issue. Thanks!

@zheng-da
Contributor

How did you install mxnet 1.3.0?

Can you provide some code that makes mxnet run out of memory, to help us reproduce the error?

@anirudhacharya
Member

anirudhacharya commented Jul 13, 2018

@lichen11 I am assuming you did a source build as described here, which is why it shows as mxnet 1.3 even though the official release is 1.2.

Also, as zheng-da requested, can you please provide a reproducible example?

@nswamy please label this - "Pending Requester Info", "R", "Memory".

@lichen11
Author

Yes, I built the GPU version from source, as described here: https://github.com/apache/incubator-mxnet/tree/master/R-package. That page shows "bump up version number to 1.3.0 to make nightly build to build with r…", so the mxnet R package is 1.3.0.
Here is a reproducible example: https://github.com/apache/incubator-mxnet/tree/master/example/gan/CGAN_mnist_R.
I was able to run the code on previous versions. On 1.3.0, running CGAN_train.R produces the following error:

Error in mx.nd.internal.as.array(nd) :
[22:09:38] /home/username/incubator-mxnet/3rdparty/mshadow/mshadow/././././cuda/tensor_gpu-inl.cuh:110: Check failed: err == cudaSuccess (2 vs. 0) Name: MapPlanKernel ErrStr:out of memory

Stack trace returned 10 entries:
[bt] (0) /home/username/R/x86_64-redhat-linux-gnu-library/3.5/mxnet/libs/libmxnet.so(dmlc::StackTrace()+0x4a) [0x7fad83b0e2ba]
[bt] (1) /home/username/R/x86_64-redhat-linux-gnu-library/3.5/mxnet/libs/libmxnet.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x21) [0x7fad83b0e8c1]
[bt] (2) /home/username/R/x86_64-redhat-linux-gnu-library/3.5/mxnet/libs/libmxnet.so(void mxnet::ndarray::Eval<mshadow::gpu>(float const&, mxnet::TBlob*, mxnet::RunContext)+0x5a3) [0x7fad87f30913]
[bt] (3) /home/username/R/x86_64-redhat-linux-gnu-library/3.5/mxnet/libs/libmxnet.so(+0x339956b) [0x7fad8620456b]
[bt] (4) /home/username/R/x86_64-redhat-linux-gnu-library/3.5/mxnet/libs/libmxnet.so(+0x37b9f89) [0x7fad86624f89]
[bt] (5) /home/username/R/x86_64-redhat-linux-gnu-library/3.5/mxnet/libs/

When you run the code on the new mxnet version, please change Line 124 of CGAN_train.R to

metric_D_value <- metric_D$update(as.array(mx.nd.array(rep(1, batch_size))), as.array(exec_D$ref.outputs[["D_sym_output"]]), metric_D_value)

Change Line 133 to

metric_D_value <- metric_D$update(as.array(mx.nd.array(rep(1, batch_size))), as.array(exec_D$ref.outputs[["D_sym_output"]]), metric_D_value)

@anirudhacharya
Member

@lichen11 I am not able to reproduce this issue. I ran this tutorial multiple times with both GPU and CPU contexts. There were a couple of other issues with the example, which are fixed in this PR: #12283

But the memory issue is not reproducible. Please let me know if you are still facing this issue.

@anirudhacharya
Member

@sandeep-krishnamurthy please close this issue.
