
socket.error in gluoncv ssd #197

Closed
Angzz opened this issue Jul 7, 2018 · 2 comments
Labels
bug Something isn't working

Comments

Angzz commented Jul 7, 2018

1. When I run SSD on a single GPU, I encounter a problem like this:

```
INFO:root:Namespace(batch_size=15, data_shape=512, dataset='voc', epochs=240, gpus='0', log_interval=100, lr=0.001, lr_decay=0.1, lr_decay_epoch='160,200', momentum=0.9, network='resnet50_v1', num_workers=32, resume='', save_interval=10, save_prefix='ssd_512_resnet50_v1_voc', seed=233, start_epoch=0, val_interval=1, wd=0.0005)
INFO:root:Start training from [Epoch 0]
[02:21:20] src/operator/nn/./cudnn/./cudnn_algoreg-inl.h:107: Running performance tests to find the best convolution algorithm, this can take a while... (setting env variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)
INFO:root:[Epoch 0][Batch 99], Speed: 37.221 samples/sec, CrossEntropy=7.690, SmoothL1=3.245
INFO:root:[Epoch 0][Batch 199], Speed: 36.276 samples/sec, CrossEntropy=6.458, SmoothL1=3.105
INFO:root:[Epoch 0][Batch 299], Speed: 38.109 samples/sec, CrossEntropy=5.929, SmoothL1=2.991
Traceback (most recent call last):
  File "scripts/ssd/train_ssd.py", line 259, in <module>
    train(net, train_data, val_data, eval_metric, args)
  File "scripts/ssd/train_ssd.py", line 192, in train
    for i, batch in enumerate(train_data):
  File "/usr/local/lib/python2.7/dist-packages/mxnet/gluon/data/dataloader.py", line 222, in next
    return self.__next__()
  File "/usr/local/lib/python2.7/dist-packages/mxnet/gluon/data/dataloader.py", line 218, in __next__
    idx, batch = self._data_queue.get()
  File "/usr/lib/python2.7/multiprocessing/queues.py", line 117, in get
    res = self._recv()
  File "/usr/local/lib/python2.7/dist-packages/mxnet/gluon/data/dataloader.py", line 88, in recv
    return pickle.loads(buf)
  File "/usr/lib/python2.7/pickle.py", line 1388, in loads
    return Unpickler(file).load()
  File "/usr/lib/python2.7/pickle.py", line 864, in load
    dispatch[key](self)
  File "/usr/lib/python2.7/pickle.py", line 1139, in load_reduce
    value = func(*args)
  File "/usr/local/lib/python2.7/dist-packages/mxnet/gluon/data/dataloader.py", line 53, in rebuild_ndarray
    fd = multiprocessing.reduction.rebuild_handle(fd)
  File "/usr/lib/python2.7/multiprocessing/reduction.py", line 155, in rebuild_handle
    conn = Client(address, authkey=current_process().authkey)
  File "/usr/lib/python2.7/multiprocessing/connection.py", line 169, in Client
    c = SocketClient(address)
  File "/usr/lib/python2.7/multiprocessing/connection.py", line 308, in SocketClient
    s.connect(address)
  File "/usr/lib/python2.7/socket.py", line 228, in meth
    return getattr(self._sock, name)(*args)
socket.error: [Errno 111] Connection refused
```
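For context, `[Errno 111] Connection refused` here usually means the process that owned the socket endpoint (a DataLoader worker, or the parent) is already gone by the time `rebuild_handle` tries to connect back to it. A minimal stdlib repro of the same failure mode, not GluonCV code, just an illustration:

```python
import errno
from multiprocessing.connection import Listener, Client

# Bind a listener, grab its address, then close it -- the endpoint
# disappears exactly like a crashed worker process would.
listener = Listener(('127.0.0.1', 0))
address = listener.address
listener.close()

try:
    Client(address)          # what rebuild_handle effectively does
    caught_errno = None
except OSError as e:         # socket.error is an alias of OSError on Python 3
    caught_errno = e.errno

assert caught_errno == errno.ECONNREFUSED  # Errno 111 on Linux
```

As a workaround, passing `num_workers=0` to the `DataLoader` keeps all data loading in the main process, so this socket path is skipped entirely (at the cost of slower loading).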

2. When I run it on multiple GPUs (e.g. 2), the above error still occurs. Also, although the batch size is doubled compared with a single GPU, the throughput is not doubled; it is only about 43~48 samples/sec, which I think is unreasonable and suggests the training process is not stable.

3. Resume still does not work in GluonCV, and it is very inconvenient to train from epoch 0 every time.
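For now I work around the broken resume with a generic checkpoint pattern like this (a plain-Python sketch, not GluonCV's actual resume API; the helper names are made up, and the real script saves `.params` files under `save_prefix` instead of pickles):

```python
import os
import pickle

def save_checkpoint(params, epoch, prefix='ckpt'):
    """Write one checkpoint file per epoch."""
    with open('%s_%04d.pkl' % (prefix, epoch), 'wb') as f:
        pickle.dump({'epoch': epoch, 'params': params}, f)

def load_latest(prefix='ckpt', directory='.'):
    """Return (params, start_epoch); start from epoch 0 when no checkpoint exists."""
    ckpts = sorted(f for f in os.listdir(directory)
                   if f.startswith(prefix) and f.endswith('.pkl'))
    if not ckpts:
        return None, 0
    with open(os.path.join(directory, ckpts[-1]), 'rb') as f:
        state = pickle.load(f)
    return state['params'], state['epoch'] + 1
```

With helpers like these the training loop can call `load_latest()` once at startup and begin from `start_epoch` instead of 0.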

Can you give me some suggestions? Thanks a lot!

@zhreshold zhreshold added the bug Something isn't working label Jul 11, 2018
@zhreshold (Member) commented:

@Angzz I found Python 2 does have some stability issues during training.
I personally always use Python 3 for training and do not see the issues you reported.
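If Python 2 is indeed the culprit, a small guard at the top of the training script fails fast instead of hitting the flaky multiprocessing path mid-epoch (just a sketch, not part of the actual script):

```python
import sys

# Abort early: the multiprocessing DataLoader is unreliable on Python 2.
if sys.version_info[0] < 3:
    raise RuntimeError('Please run this training script with Python 3; '
                       'the multiprocessing DataLoader is flaky on Python 2.')
```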

@zhreshold (Member) commented:

Let me know if it still exists after apache/mxnet#11908
