
"socket.error: [Errno 111] Connection refused" while training on ADE20K #215

Closed
miteshyh opened this issue Jul 20, 2018 · 5 comments

@miteshyh

Hi,

I am getting the following error after a few data iterations, at 551/22210:

File "train.py", line 201, in
trainer.training(epoch)
File "train.py", line 142, in training
for i, (data, target) in enumerate(tbar):
File "/usr/local/lib/python2.7/dist-packages/tqdm/_tqdm.py", line 930, in iter
for obj in iterable:
File "/usr/local/lib/python2.7/dist-packages/mxnet/gluon/data/dataloader.py", line 222, in next
return self.next()
File "/usr/local/lib/python2.7/dist-packages/mxnet/gluon/data/dataloader.py", line 218, in next
idx, batch = self._data_queue.get()
File "/usr/lib/python2.7/multiprocessing/queues.py", line 117, in get
res = self._recv()
File "/usr/local/lib/python2.7/dist-packages/mxnet/gluon/data/dataloader.py", line 88, in recv
return pickle.loads(buf)
File "/usr/lib/python2.7/pickle.py", line 1388, in loads
return Unpickler(file).load()
File "/usr/lib/python2.7/pickle.py", line 864, in load
dispatchkey
File "/usr/lib/python2.7/pickle.py", line 1139, in load_reduce
value = func(*args)
File "/usr/local/lib/python2.7/dist-packages/mxnet/gluon/data/dataloader.py", line 53, in rebuild_ndarray
fd = multiprocessing.reduction.rebuild_handle(fd)
File "/usr/lib/python2.7/multiprocessing/reduction.py", line 156, in rebuild_handle
conn = Client(address, authkey=current_process().authkey)
File "/usr/lib/python2.7/multiprocessing/connection.py", line 169, in Client
c = SocketClient(address)
File "/usr/lib/python2.7/multiprocessing/connection.py", line 308, in SocketClient
s.connect(address)
File "/usr/lib/python2.7/socket.py", line 228, in meth
return getattr(self._sock,name)(*args)
socket.error: [Errno 111] Connection refused

I am using the latest nightly build of MXNet along with SyncBatchNorm; this error comes with and without the SyncBatchNorm layer.

I am using the MXNet Docker image.

Any help is much appreciated.

@miteshyh miteshyh changed the title "socket.error: [Errno 111] Connection refused" while training on ADE2K "socket.error: [Errno 111] Connection refused" while training on ADE20K Jul 20, 2018
@zhreshold
Member

Generally speaking, it's related to a worker process failing or hanging during its job.
Two common causes:

  1. Shared memory is not large enough; check it with df -h /dev/shm
  2. The open-file limit is too small; you may want to increase it, see ulimit -n (a quick way to check both from Python is sketched below)

We don't have a reliable way to reproduce this yet, and will update once we do.
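For reference, here is a minimal sketch (not from this thread) that checks both of these limits from Python instead of the shell; the /dev/shm path assumes a standard Linux setup:

```python
import os
import resource

# Free space on the shared-memory mount (equivalent to `df -h /dev/shm`).
stats = os.statvfs('/dev/shm')
free_gb = stats.f_bavail * stats.f_frsize / float(1024 ** 3)
print('free /dev/shm: %.2f GB' % free_gb)

# Per-process open-file limit (equivalent to `ulimit -n`).
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print('open-file limit: soft=%d, hard=%d' % (soft, hard))
```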

@miteshyh
Author

miteshyh commented Jul 23, 2018

I had suspected shared memory.

What I am observing currently is that shared memory usage continuously increases with each data iteration and never comes down. It is at ~30GB at iteration 8561 and then fails with the above error. It only comes down when the participating worker processes exit.

Are we leaking resources, or closing them without the delete ever taking effect?

This is not the case with the stable MXNet release, which trains the same ADE20K dataset (the GluonCV sample) well under 2.5GB of shared memory.
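To illustrate the observation above, a hypothetical monitoring helper could log /dev/shm usage once every few hundred batches; log_shm_every and the interval are made up for illustration and are not part of the training script:

```python
import os

def shm_used_gb():
    # Used space on the shared-memory mount, in GB.
    stats = os.statvfs('/dev/shm')
    return (stats.f_blocks - stats.f_bavail) * stats.f_frsize / float(1024 ** 3)

def log_shm_every(loader, every=500):
    # Wrap any data loader and print /dev/shm usage once per `every` batches.
    for i, batch in enumerate(loader):
        if i % every == 0:
            print('iter %d: /dev/shm used %.2f GB' % (i, shm_used_gb()))
        yield batch

# Hypothetical usage: for data, target in log_shm_every(train_loader): ...
```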

@zhreshold
Member

@miteshyh Thanks for reporting!

Here's what I got from my experiments.
For my workload, I have a data batch of 64x3x608x608, 16 workers, a data queue maximum depth of num_workers x 2, and float32 (4 bytes). Doing the math, we need 64*3*608*608*16*2*4 / 1024/1024/1024 ≈ 8.5GB of shared memory.
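Written out as a quick sanity check, the estimate above is:

```python
# Shared-memory estimate: batch 64x3x608x608, 16 workers,
# queue depth 2 x num_workers, float32 (4 bytes per element).
batch_elems = 64 * 3 * 608 * 608
bytes_needed = batch_elems * 16 * 2 * 4
print('%.1f GB' % (bytes_needed / float(1024 ** 3)))  # ~8.5 GB
```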

In practice, I am seeing a stable ~100G during training, which is about 10 times the estimated size, so I guess Python's lazy garbage collection is responsible.

Once training is finished (validation starts, where the train dataloader expires), I see the shared memory suddenly drop to 0%, so a memory leak is very unlikely; the buffers are simply not freed instantly.

One workaround is to reduce num_workers and keep the shared memory space large enough (a minimal sketch follows below).
This is not a gluon-cv issue but an mxnet backend issue. I suggest you submit an issue to incubator-mxnet, and feel free to ping me in that thread; it will help us fix it in a timely manner!
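A minimal sketch of that workaround, using a small in-memory stand-in dataset rather than the actual ADE20K dataset from the training script:

```python
import mxnet as mx
from mxnet import gluon

# Stand-in dataset; the real script would pass its ADE20K segmentation dataset here.
features = mx.nd.random.uniform(shape=(32, 3, 480, 480))
labels = mx.nd.zeros((32,))
dataset = gluon.data.ArrayDataset(features, labels)

# Fewer workers means fewer decoded batches sitting in /dev/shm at once.
train_loader = gluon.data.DataLoader(dataset, batch_size=4, shuffle=True, num_workers=1)

for data, target in train_loader:
    pass  # forward/backward pass goes here
```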

@miteshyh
Author

miteshyh commented Jul 24, 2018

Thanks @zhreshold,

Yes, it seems to be a lazy resource cleanup issue. I can see file descriptors marked as deleted lingering under the worker processes.

If I set num_workers to 1 then it works for me. But there is definitely some issue, as the stable MXNet release trains just fine with multiple worker processes and limited (2.5GB) shared memory.

I have raised an issue with mxnet apache/mxnet#11872 and tagged you there.

Thanks again!

@zhreshold
Member

Let me know if it still exists after apache/mxnet#11908
