
"socket.error: [Errno 111] Connection refused" while training on ADE20K #215

Closed
miteshyh opened this issue Jul 20, 2018 · 5 comments

@miteshyh

Hi,

I am getting the following error after a few data iterations, at 551/22210:

File "train.py", line 201, in
trainer.training(epoch)
File "train.py", line 142, in training
for i, (data, target) in enumerate(tbar):
File "/usr/local/lib/python2.7/dist-packages/tqdm/_tqdm.py", line 930, in iter
for obj in iterable:
File "/usr/local/lib/python2.7/dist-packages/mxnet/gluon/data/dataloader.py", line 222, in next
return self.next()
File "/usr/local/lib/python2.7/dist-packages/mxnet/gluon/data/dataloader.py", line 218, in next
idx, batch = self._data_queue.get()
File "/usr/lib/python2.7/multiprocessing/queues.py", line 117, in get
res = self._recv()
File "/usr/local/lib/python2.7/dist-packages/mxnet/gluon/data/dataloader.py", line 88, in recv
return pickle.loads(buf)
File "/usr/lib/python2.7/pickle.py", line 1388, in loads
return Unpickler(file).load()
File "/usr/lib/python2.7/pickle.py", line 864, in load
dispatchkey
File "/usr/lib/python2.7/pickle.py", line 1139, in load_reduce
value = func(*args)
File "/usr/local/lib/python2.7/dist-packages/mxnet/gluon/data/dataloader.py", line 53, in rebuild_ndarray
fd = multiprocessing.reduction.rebuild_handle(fd)
File "/usr/lib/python2.7/multiprocessing/reduction.py", line 156, in rebuild_handle
conn = Client(address, authkey=current_process().authkey)
File "/usr/lib/python2.7/multiprocessing/connection.py", line 169, in Client
c = SocketClient(address)
File "/usr/lib/python2.7/multiprocessing/connection.py", line 308, in SocketClient
s.connect(address)
File "/usr/lib/python2.7/socket.py", line 228, in meth
return getattr(self._sock,name)(*args)
socket.error: [Errno 111] Connection refused

I am using the latest nightly build of MXNet along with SyncBatchNorm; this error comes with and without the SyncBatchNorm layer.

I am using the MXNet Docker image.

Any help is much appreciated.

@miteshyh miteshyh changed the title "socket.error: [Errno 111] Connection refused" while training on ADE2K "socket.error: [Errno 111] Connection refused" while training on ADE20K Jul 20, 2018
@zhreshold
Member

Generally speaking, it's related to a worker process failing or hanging during its job.
Two common causes:

  1. Shared memory is not large enough; check it with df -h /dev/shm
  2. The open-file limit is too small; you may want to increase it, see ulimit -n (a quick way to check both from Python is sketched below)

We don't have a reliable way to reproduce this yet, and will update once we do.
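For reference, here is a minimal sketch (not from this thread) that checks both of these limits from Python instead of the shell; the /dev/shm path assumes a standard Linux setup:

```python
import os
import resource

# Free space on the shared-memory mount (equivalent to `df -h /dev/shm`).
stats = os.statvfs('/dev/shm')
free_gb = stats.f_bavail * stats.f_frsize / float(1024 ** 3)
print('free /dev/shm: %.2f GB' % free_gb)

# Per-process open-file limit (equivalent to `ulimit -n`).
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print('open-file limit: soft=%d, hard=%d' % (soft, hard))
```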

@miteshyh
Author

miteshyh commented Jul 23, 2018

I had suspected shared memory.

What I am observing currently is that shared memory usage continuously increases with each data iteration and never comes down. It is at ~30GB at iteration 8561 and then fails with the above error. It only comes down when the participating worker processes exit.

Are we leaking resources, or closing them without the delete ever taking effect?

This is not the case with the stable MXNet release, which trains the same ADE20K dataset (the GluonCV sample) well under 2.5GB of shared memory.
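To illustrate the observation above, a hypothetical monitoring helper could log /dev/shm usage once every few hundred batches; log_shm_every and the interval are made up for illustration and are not part of the training script:

```python
import os

def shm_used_gb():
    # Used space on the shared-memory mount, in GB.
    stats = os.statvfs('/dev/shm')
    return (stats.f_blocks - stats.f_bavail) * stats.f_frsize / float(1024 ** 3)

def log_shm_every(loader, every=500):
    # Wrap any data loader and print /dev/shm usage once per `every` batches.
    for i, batch in enumerate(loader):
        if i % every == 0:
            print('iter %d: /dev/shm used %.2f GB' % (i, shm_used_gb()))
        yield batch

# Hypothetical usage: for data, target in log_shm_every(train_loader): ...
```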

@zhreshold
Member

@miteshyh Thanks for reporting!

Here's what I got from my experiments.
For my workload, I have a data batch of 64x3x608x608, 16 workers, a data queue maximum depth of num_workers x 2, and float32 (4 bytes). Doing the math, we need 64*3*608*608*16*2*4 / 1024/1024/1024 ≈ 8.5GB of shared memory.
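Written out as a quick sanity check, the estimate above is:

```python
# Shared-memory estimate: batch 64x3x608x608, 16 workers,
# queue depth 2 x num_workers, float32 (4 bytes per element).
batch_elems = 64 * 3 * 608 * 608
bytes_needed = batch_elems * 16 * 2 * 4
print('%.1f GB' % (bytes_needed / float(1024 ** 3)))  # ~8.5 GB
```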

In practice, I am seeing a stable ~100G during training, which is about 10 times the estimated size, so I guess Python's lazy garbage collection is responsible.

Once training is finished (validation starts, where the train dataloader expires), I see the shared memory suddenly drop to 0%, so a memory leak is very unlikely; the buffers are simply not freed instantly.

One workaround is to reduce num_workers and keep the shared memory space large enough (a minimal sketch follows below).
This is not a gluon-cv issue but an mxnet backend issue. I suggest you submit an issue to incubator-mxnet, and feel free to ping me in that thread; it will help us fix it in a timely manner!
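A minimal sketch of that workaround, using a small in-memory stand-in dataset rather than the actual ADE20K dataset from the training script:

```python
import mxnet as mx
from mxnet import gluon

# Stand-in dataset; the real script would pass its ADE20K segmentation dataset here.
features = mx.nd.random.uniform(shape=(32, 3, 480, 480))
labels = mx.nd.zeros((32,))
dataset = gluon.data.ArrayDataset(features, labels)

# Fewer workers means fewer decoded batches sitting in /dev/shm at once.
train_loader = gluon.data.DataLoader(dataset, batch_size=4, shuffle=True, num_workers=1)

for data, target in train_loader:
    pass  # forward/backward pass goes here
```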

@miteshyh
Author

miteshyh commented Jul 24, 2018

Thanks @zhreshold,

Yes, it seems to be a lazy resource cleanup issue. I can see file descriptors marked as deleted lingering under the worker processes.

If I set num_workers to 1 then it works for me. But there is definitely some issue, as the stable MXNet release trains just fine with multiple worker processes and limited (2.5GB) shared memory.

I have raised an issue with mxnet apache/mxnet#11872 and tagged you there.

Thanks again!

@zhreshold
Member

Let me know if it still exists after apache/mxnet#11908
