
MXNetError: cudaMalloc failed: out of memory #257

Closed
vanhelsing18 opened this issue Jun 19, 2018 · 9 comments

@vanhelsing18

I followed your steps, but I ran into this problem. Can anybody suggest a solution?

Traceback (most recent call last):
  File "train_softmax.py", line 485, in <module>
    main()
  File "train_softmax.py", line 482, in main
    train_net(args)
  File "train_softmax.py", line 476, in train_net
    epoch_end_callback = epoch_cb )
  File "/usr/local/lib/python2.7/dist-packages/mxnet/module/base_module.py", line 512, in fit
    self.update()
  File "/usr/local/lib/python2.7/dist-packages/mxnet/module/module.py", line 651, in update
    self._kvstore, self._exec_group.param_names)
  File "/usr/local/lib/python2.7/dist-packages/mxnet/model.py", line 134, in _update_params_on_kvstore
    kvstore.push(name, grad_list, priority=-index)
  File "/usr/local/lib/python2.7/dist-packages/mxnet/kvstore.py", line 232, in push
    self.handle, mx_uint(len(ckeys)), ckeys, cvals, ctypes.c_int(priority)))
  File "/usr/local/lib/python2.7/dist-packages/mxnet/base.py", line 149, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [11:41:11] src/storage/./pooled_storage_manager.h:108: cudaMalloc failed: out of memory

@jackytu256

Perhaps you can adjust the batch size; decreasing it to 64 may fix the problem.
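
For reference, the batch size handed to the data iterator is what drives per-step GPU memory in MXNet. A minimal, generic sketch of where it is set (a toy mx.io.NDArrayIter and toy network, not the insightface train_softmax.py pipeline):

# Toy MXNet example (not the insightface training script): the batch_size given
# to the data iterator determines how much activation memory each step needs.
import mxnet as mx
import numpy as np

batch_size = 64  # try 64 (or lower) instead of the default 128
data = mx.nd.array(np.random.rand(256, 3, 112, 112).astype('float32'))
label = mx.nd.array(np.random.randint(0, 10, (256,)))
train_iter = mx.io.NDArrayIter(data, label, batch_size=batch_size, shuffle=True)

x = mx.sym.Variable('data')
net = mx.sym.FullyConnected(mx.sym.Flatten(data=x), num_hidden=10)
net = mx.sym.SoftmaxOutput(data=net, name='softmax')

mod = mx.mod.Module(symbol=net, context=mx.gpu(0))
mod.fit(train_iter, num_epoch=1, optimizer_params={'learning_rate': 0.01})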

@vanhelsing18
Author

@jackytu256 Thank you, I have tried that, but it does not help.

@jackytu256

May I know how much GPU memory you have, as well as which of the algorithms you are trying to train?

@nttstar
Collaborator

nttstar commented Jun 19, 2018

decrease your batch size until it can run successfully.

@vanhelsing18
Author

@jackytu256

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.111                Driver Version: 384.111                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  Off  | 00000000:02:00.0 Off |                    0 |
| N/A   45C    P0   130W / 250W |   3709MiB / 12193MiB |     83%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla P100-PCIE...  Off  | 00000000:82:00.0 Off |                    0 |
| N/A   28C    P0    32W / 250W |     10MiB / 12193MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla P100-PCIE...  Off  | 00000000:85:00.0 Off |                    0 |
| N/A   34C    P0    34W / 250W |     10MiB / 16276MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla P100-PCIE...  Off  | 00000000:86:00.0 Off |                    0 |
| N/A   34C    P0    33W / 250W |     10MiB / 16276MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
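
In the output above, GPU 0 already has 3709MiB in use while GPUs 1-3 are idle, so how much memory is actually free depends on which device the job lands on. A small sketch to print free vs. total memory per GPU from Python (assuming MXNet >= 1.4, where mx.context.gpu_memory_info is available):

# Print free/total memory for every visible GPU before launching training.
# Requires MXNet >= 1.4 for mx.context.gpu_memory_info.
import mxnet as mx

for i in range(mx.context.num_gpus()):
    free, total = mx.context.gpu_memory_info(i)  # values are in bytes
    print('gpu %d: %.1f GiB free of %.1f GiB' % (i, free / 1024**3, total / 1024**3))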

nttstar closed this as completed Jun 25, 2018
@clhne

clhne commented Jan 9, 2019

decrease your batch size until it can run successfully.

@nttstar
Hi nttstar,
After decreasing the batch size, it does indeed work.
But I still have a question: why does decreasing the batch size help? Is it because, with the default batch size of 128, the training batches loaded into GPU memory are too large? Or is it a GPU resource scheduling issue?
Thanks!
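
Roughly speaking, the activations kept around for the backward pass scale linearly with the batch size, so halving the batch roughly halves that part of the GPU footprint; with the default of 128 the activations plus parameters simply exceed what is free on the card. A back-of-the-envelope sketch with hypothetical fp32 layer shapes (illustrative numbers, not measurements):

# Rough estimate: activation memory grows linearly with batch size.
# The feature-map shapes below are made up for illustration.
def activation_bytes(batch_size, feature_maps):
    # feature_maps: list of (channels, height, width) tensors kept for backprop
    return sum(batch_size * c * h * w * 4 for (c, h, w) in feature_maps)

maps = [(64, 112, 112), (128, 56, 56), (256, 28, 28), (512, 14, 14)]
for bs in (128, 64, 32):
    gib = activation_bytes(bs, maps) / 1024**3
    print('batch %3d -> ~%.2f GiB of activations' % (bs, gib))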

@skyuuka

skyuuka commented Nov 20, 2019

I saw elsewhere that someone tried to use memonger to work around the memory issue; that might be an option, but I haven't tried it. Just FYI.

@hyderit

hyderit commented Apr 22, 2020

I had the same issue. Decreasing the batch size fixed the problem.

@aliyevorkhan

I still get the same error even after decreasing the batch size from 32 to 2. I don't think this is the solution to the problem.
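
If the error persists even at batch size 2, it is worth checking whether the GPU being used is already occupied by another process (as GPU 0 is in the nvidia-smi output earlier in this thread). A minimal sketch that pins the process to an idle GPU before MXNet initializes CUDA (device index 1 is just an example):

# Restrict this process to one idle GPU; the environment variable must be set
# before MXNet/CUDA is initialized, hence before the import.
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '1'  # e.g. physical GPU 1 from nvidia-smi

import mxnet as mx
x = mx.nd.zeros((1,), ctx=mx.gpu(0))  # gpu(0) now maps to physical GPU 1
print(x)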
