
MXNetError: cudaMalloc failed: out of memory #257

Closed
vanhelsing18 opened this issue Jun 19, 2018 · 9 comments

@vanhelsing18

I followed your steps, but I ran into this problem. Can anybody suggest a solution?

Traceback (most recent call last):
  File "train_softmax.py", line 485, in <module>
    main()
  File "train_softmax.py", line 482, in main
    train_net(args)
  File "train_softmax.py", line 476, in train_net
    epoch_end_callback = epoch_cb )
  File "/usr/local/lib/python2.7/dist-packages/mxnet/module/base_module.py", line 512, in fit
    self.update()
  File "/usr/local/lib/python2.7/dist-packages/mxnet/module/module.py", line 651, in update
    self._kvstore, self._exec_group.param_names)
  File "/usr/local/lib/python2.7/dist-packages/mxnet/model.py", line 134, in _update_params_on_kvstore
    kvstore.push(name, grad_list, priority=-index)
  File "/usr/local/lib/python2.7/dist-packages/mxnet/kvstore.py", line 232, in push
    self.handle, mx_uint(len(ckeys)), ckeys, cvals, ctypes.c_int(priority)))
  File "/usr/local/lib/python2.7/dist-packages/mxnet/base.py", line 149, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [11:41:11] src/storage/./pooled_storage_manager.h:108: cudaMalloc failed: out of memory

@jackytu256

Perhaps you can adjust the batch size; decreasing it to 64 may fix the problem.
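
For reference, the batch size handed to the data iterator is what drives per-step GPU memory in MXNet. A minimal, generic sketch of where it is set (a toy mx.io.NDArrayIter and toy network, not the insightface train_softmax.py pipeline):

# Toy MXNet example (not the insightface training script): the batch_size given
# to the data iterator determines how much activation memory each step needs.
import mxnet as mx
import numpy as np

batch_size = 64  # try 64 (or lower) instead of the default 128
data = mx.nd.array(np.random.rand(256, 3, 112, 112).astype('float32'))
label = mx.nd.array(np.random.randint(0, 10, (256,)))
train_iter = mx.io.NDArrayIter(data, label, batch_size=batch_size, shuffle=True)

x = mx.sym.Variable('data')
net = mx.sym.FullyConnected(mx.sym.Flatten(data=x), num_hidden=10)
net = mx.sym.SoftmaxOutput(data=net, name='softmax')

mod = mx.mod.Module(symbol=net, context=mx.gpu(0))
mod.fit(train_iter, num_epoch=1, optimizer_params={'learning_rate': 0.01})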

@vanhelsing18
Author

@jackytu256 Thank you, I have tried that, but it does not help.

@jackytu256

May I know how much GPU memory you have, as well as which of the algorithms you are trying to train?

@nttstar
Collaborator

nttstar commented Jun 19, 2018

decrease your batch size until it can run successfully.

@vanhelsing18
Author

@jackytu256

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.111                Driver Version: 384.111                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  Off  | 00000000:02:00.0 Off |                    0 |
| N/A   45C    P0   130W / 250W |   3709MiB / 12193MiB |     83%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla P100-PCIE...  Off  | 00000000:82:00.0 Off |                    0 |
| N/A   28C    P0    32W / 250W |     10MiB / 12193MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla P100-PCIE...  Off  | 00000000:85:00.0 Off |                    0 |
| N/A   34C    P0    34W / 250W |     10MiB / 16276MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla P100-PCIE...  Off  | 00000000:86:00.0 Off |                    0 |
| N/A   34C    P0    33W / 250W |     10MiB / 16276MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
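
In the output above, GPU 0 already has 3709MiB in use while GPUs 1-3 are idle, so how much memory is actually free depends on which device the job lands on. A small sketch to print free vs. total memory per GPU from Python (assuming MXNet >= 1.4, where mx.context.gpu_memory_info is available):

# Print free/total memory for every visible GPU before launching training.
# Requires MXNet >= 1.4 for mx.context.gpu_memory_info.
import mxnet as mx

for i in range(mx.context.num_gpus()):
    free, total = mx.context.gpu_memory_info(i)  # values are in bytes
    print('gpu %d: %.1f GiB free of %.1f GiB' % (i, free / 1024**3, total / 1024**3))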

nttstar closed this as completed Jun 25, 2018
@clhne

clhne commented Jan 9, 2019

decrease your batch size until it can run successfully.

@nttstar
Hi nttstar,
After decreasing the batch size, it does indeed work.
But I still have a question: why does decreasing the batch size help? Is it because, with the default batch size of 128, the training batches loaded into GPU memory are too large? Or is it a GPU resource scheduling issue?
Thanks!
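
Roughly speaking, the activations kept around for the backward pass scale linearly with the batch size, so halving the batch roughly halves that part of the GPU footprint; with the default of 128 the activations plus parameters simply exceed what is free on the card. A back-of-the-envelope sketch with hypothetical fp32 layer shapes (illustrative numbers, not measurements):

# Rough estimate: activation memory grows linearly with batch size.
# The feature-map shapes below are made up for illustration.
def activation_bytes(batch_size, feature_maps):
    # feature_maps: list of (channels, height, width) tensors kept for backprop
    return sum(batch_size * c * h * w * 4 for (c, h, w) in feature_maps)

maps = [(64, 112, 112), (128, 56, 56), (256, 28, 28), (512, 14, 14)]
for bs in (128, 64, 32):
    gib = activation_bytes(bs, maps) / 1024**3
    print('batch %3d -> ~%.2f GiB of activations' % (bs, gib))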

@skyuuka

skyuuka commented Nov 20, 2019

I saw elsewhere that someone tried to use memonger to work around the memory issue; that might be an option, but I haven't tried it. Just FYI.

@hyderit

hyderit commented Apr 22, 2020

I had the same issue. Decreasing the batch size fixed the problem.

@aliyevorkhan

I still get the same error even after decreasing the batch size from 32 to 2. I don't think this is the solution to the problem.
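
If the error persists even at batch size 2, it is worth checking whether the GPU being used is already occupied by another process (as GPU 0 is in the nvidia-smi output earlier in this thread). A minimal sketch that pins the process to an idle GPU before MXNet initializes CUDA (device index 1 is just an example):

# Restrict this process to one idle GPU; the environment variable must be set
# before MXNet/CUDA is initialized, hence before the import.
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '1'  # e.g. physical GPU 1 from nvidia-smi

import mxnet as mx
x = mx.nd.zeros((1,), ctx=mx.gpu(0))  # gpu(0) now maps to physical GPU 1
print(x)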
