"socket.error: [Errno 111] Connection refused" while training on ADE20K #215
Generally speaking, it's related to a failure of the worker processes, which hang during their job.
We don't have a reliable way to reproduce it yet; we will update once we do.
I had suspected shared memory. What I am observing currently is that shared memory usage continuously increases with each data iteration and never comes down. It is at 30GB at iteration 8561 and then fails with the above error. It only comes down when the participating worker processes exit. Are we leaking resources, or closing them without committing the delete? This is not the case with the stable MXNet release, which trains the same ADE20K dataset (GluonCV sample) well under 2.5GB of shared memory.
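One way to track this is to sample the tmpfs usage from inside the training loop. A minimal sketch, assuming a Linux host where the workers' shared memory lives under /dev/shm (the iteration counter `i` stands in for whatever the training loop already uses):

```python
import os

def shm_usage_gb(path='/dev/shm'):
    """Return (used_gb, total_gb) for the tmpfs mount backing shared memory."""
    st = os.statvfs(path)
    gb = 1024.0 ** 3
    total = st.f_blocks * st.f_frsize / gb
    free = st.f_bavail * st.f_frsize / gb
    return total - free, total

# Inside the training loop, e.g. every 100 iterations:
# used, total = shm_usage_gb()
# print('iter %d: /dev/shm %.1f / %.1f GB' % (i, used, total))
```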
@miteshyh Thanks for reporting! Here's what I got from my experiments. In practice, I am seeing a stable ~100G during training, which is about 10 times the estimated size. So I guess it's Python's lazy garbage collection that's responsible. Once training is finished (validation starts, where the train_dataloader expires), I see the shared memory suddenly drop to 0%, so a memory leak is very unlikely to be involved; the buffers are simply not freed instantly. One workaround is to reduce num_workers and keep the shared memory space large enough.
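As a concrete illustration of that workaround, here is a minimal sketch using the standard gluon.data.DataLoader API; the ArrayDataset is only a stand-in for the real ADE20K training set, and batch_size=16 is an example value:

```python
import mxnet as mx
from mxnet import gluon

# Stand-in for the real ADE20K training set.
dummy = gluon.data.ArrayDataset(mx.nd.zeros((64, 3, 480, 480)),
                                mx.nd.zeros((64, 480, 480)))

# Fewer workers means fewer shared-memory ndarrays in flight at once,
# so /dev/shm stays within bounds until the lazy cleanup catches up.
train_loader = gluon.data.DataLoader(dummy, batch_size=16, shuffle=True,
                                     num_workers=1)

for data, target in train_loader:
    pass  # training step goes here
```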
Thanks @zhreshold. Yes, it seems to be a lazy resource cleanup issue. I could see file descriptors marked as deleted lingering under the worker processes. If I set num_workers to 1 then it works for me. But there is definitely some issue, as the stable MXNet release works fine with multiple worker processes and limited (2.5GB) shared memory. I have raised an issue with MXNet, apache/mxnet#11872, and tagged you there. Thanks again!
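For reference, those lingering descriptors can be listed directly from /proc. A small Linux-only sketch (the worker PID 12345 is just a placeholder):

```python
import os

def deleted_fds(pid):
    """List file descriptors of a process whose backing files were unlinked."""
    fd_dir = '/proc/%d/fd' % pid
    leftovers = []
    for fd in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, fd))
        except OSError:
            continue  # the fd may close between listdir and readlink
        if target.endswith('(deleted)'):
            leftovers.append((fd, target))
    return leftovers

# e.g. for a DataLoader worker process:
# print(deleted_fds(12345))
```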
Let me know if it still exists after apache/mxnet#11908.
Hi,
I am getting the following error after a few data iterations, at 551/22210:
File "train.py", line 201, in
trainer.training(epoch)
File "train.py", line 142, in training
for i, (data, target) in enumerate(tbar):
File "/usr/local/lib/python2.7/dist-packages/tqdm/_tqdm.py", line 930, in iter
for obj in iterable:
File "/usr/local/lib/python2.7/dist-packages/mxnet/gluon/data/dataloader.py", line 222, in next
return self.next()
File "/usr/local/lib/python2.7/dist-packages/mxnet/gluon/data/dataloader.py", line 218, in next
idx, batch = self._data_queue.get()
File "/usr/lib/python2.7/multiprocessing/queues.py", line 117, in get
res = self._recv()
File "/usr/local/lib/python2.7/dist-packages/mxnet/gluon/data/dataloader.py", line 88, in recv
return pickle.loads(buf)
File "/usr/lib/python2.7/pickle.py", line 1388, in loads
return Unpickler(file).load()
File "/usr/lib/python2.7/pickle.py", line 864, in load
dispatchkey
File "/usr/lib/python2.7/pickle.py", line 1139, in load_reduce
value = func(*args)
File "/usr/local/lib/python2.7/dist-packages/mxnet/gluon/data/dataloader.py", line 53, in rebuild_ndarray
fd = multiprocessing.reduction.rebuild_handle(fd)
File "/usr/lib/python2.7/multiprocessing/reduction.py", line 156, in rebuild_handle
conn = Client(address, authkey=current_process().authkey)
File "/usr/lib/python2.7/multiprocessing/connection.py", line 169, in Client
c = SocketClient(address)
File "/usr/lib/python2.7/multiprocessing/connection.py", line 308, in SocketClient
s.connect(address)
File "/usr/lib/python2.7/socket.py", line 228, in meth
return getattr(self._sock,name)(*args)
socket.error: [Errno 111] Connection refused
I am using the latest nightly of MXNet along with SyncBatchNorm; this error occurs both with and without the SyncBatchNorm layer.
I am using the MXNet Docker image.
Any help is much appreciated.