
socket.error in gluoncv ssd #197

Closed
Angzz opened this issue Jul 7, 2018 · 2 comments
Labels
bug Something isn't working

Comments

Angzz commented Jul 7, 2018

1. When I run SSD on a single GPU, I encounter a problem like this:

```
INFO:root:Namespace(batch_size=15, data_shape=512, dataset='voc', epochs=240, gpus='0', log_interval=100, lr=0.001, lr_decay=0.1, lr_decay_epoch='160,200', momentum=0.9, network='resnet50_v1', num_workers=32, resume='', save_interval=10, save_prefix='ssd_512_resnet50_v1_voc', seed=233, start_epoch=0, val_interval=1, wd=0.0005)
INFO:root:Start training from [Epoch 0]
[02:21:20] src/operator/nn/./cudnn/./cudnn_algoreg-inl.h:107: Running performance tests to find the best convolution algorithm, this can take a while... (setting env variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)
INFO:root:[Epoch 0][Batch 99], Speed: 37.221 samples/sec, CrossEntropy=7.690, SmoothL1=3.245
INFO:root:[Epoch 0][Batch 199], Speed: 36.276 samples/sec, CrossEntropy=6.458, SmoothL1=3.105
INFO:root:[Epoch 0][Batch 299], Speed: 38.109 samples/sec, CrossEntropy=5.929, SmoothL1=2.991
Traceback (most recent call last):
  File "scripts/ssd/train_ssd.py", line 259, in <module>
    train(net, train_data, val_data, eval_metric, args)
  File "scripts/ssd/train_ssd.py", line 192, in train
    for i, batch in enumerate(train_data):
  File "/usr/local/lib/python2.7/dist-packages/mxnet/gluon/data/dataloader.py", line 222, in next
    return self.__next__()
  File "/usr/local/lib/python2.7/dist-packages/mxnet/gluon/data/dataloader.py", line 218, in __next__
    idx, batch = self._data_queue.get()
  File "/usr/lib/python2.7/multiprocessing/queues.py", line 117, in get
    res = self._recv()
  File "/usr/local/lib/python2.7/dist-packages/mxnet/gluon/data/dataloader.py", line 88, in recv
    return pickle.loads(buf)
  File "/usr/lib/python2.7/pickle.py", line 1388, in loads
    return Unpickler(file).load()
  File "/usr/lib/python2.7/pickle.py", line 864, in load
    dispatch[key](self)
  File "/usr/lib/python2.7/pickle.py", line 1139, in load_reduce
    value = func(*args)
  File "/usr/local/lib/python2.7/dist-packages/mxnet/gluon/data/dataloader.py", line 53, in rebuild_ndarray
    fd = multiprocessing.reduction.rebuild_handle(fd)
  File "/usr/lib/python2.7/multiprocessing/reduction.py", line 155, in rebuild_handle
    conn = Client(address, authkey=current_process().authkey)
  File "/usr/lib/python2.7/multiprocessing/connection.py", line 169, in Client
    c = SocketClient(address)
  File "/usr/lib/python2.7/multiprocessing/connection.py", line 308, in SocketClient
    s.connect(address)
  File "/usr/lib/python2.7/socket.py", line 228, in meth
    return getattr(self._sock, name)(*args)
socket.error: [Errno 111] Connection refused
```
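For context, `[Errno 111] Connection refused` here usually means the process that owned the socket endpoint (a DataLoader worker, or the parent) is already gone by the time `rebuild_handle` tries to connect back to it. A minimal stdlib repro of the same failure mode, not GluonCV code, just an illustration:

```python
import errno
from multiprocessing.connection import Listener, Client

# Bind a listener, grab its address, then close it -- the endpoint
# disappears exactly like a crashed worker process would.
listener = Listener(('127.0.0.1', 0))
address = listener.address
listener.close()

try:
    Client(address)          # what rebuild_handle effectively does
    caught_errno = None
except OSError as e:         # socket.error is an alias of OSError on Python 3
    caught_errno = e.errno

assert caught_errno == errno.ECONNREFUSED  # Errno 111 on Linux
```

As a workaround, passing `num_workers=0` to the `DataLoader` keeps all data loading in the main process, so this socket path is skipped entirely (at the cost of slower loading).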

2. When I run it on multiple GPUs (e.g. 2), the above error still occurs. Also, although the batch size is doubled compared with a single GPU, the throughput is not doubled; it is only about 43~48 samples/sec, which I think is unreasonable and suggests the training process is not stable.

3. Resume still does not work in GluonCV, and it is very inconvenient to train from epoch 0 every time.
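For now I work around the broken resume with a generic checkpoint pattern like this (a plain-Python sketch, not GluonCV's actual resume API; the helper names are made up, and the real script saves `.params` files under `save_prefix` instead of pickles):

```python
import os
import pickle

def save_checkpoint(params, epoch, prefix='ckpt'):
    """Write one checkpoint file per epoch."""
    with open('%s_%04d.pkl' % (prefix, epoch), 'wb') as f:
        pickle.dump({'epoch': epoch, 'params': params}, f)

def load_latest(prefix='ckpt', directory='.'):
    """Return (params, start_epoch); start from epoch 0 when no checkpoint exists."""
    ckpts = sorted(f for f in os.listdir(directory)
                   if f.startswith(prefix) and f.endswith('.pkl'))
    if not ckpts:
        return None, 0
    with open(os.path.join(directory, ckpts[-1]), 'rb') as f:
        state = pickle.load(f)
    return state['params'], state['epoch'] + 1
```

With helpers like these the training loop can call `load_latest()` once at startup and begin from `start_epoch` instead of 0.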

Can you give me some suggestions? Thanks a lot!

@zhreshold zhreshold added the bug Something isn't working label Jul 11, 2018
@zhreshold (Member) commented:

@Angzz I found Python 2 does have some stability issues during training.
I personally always use Python 3 for training and do not see the issues you reported.
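If Python 2 is indeed the culprit, a small guard at the top of the training script fails fast instead of hitting the flaky multiprocessing path mid-epoch (just a sketch, not part of the actual script):

```python
import sys

# Abort early: the multiprocessing DataLoader is unreliable on Python 2.
if sys.version_info[0] < 3:
    raise RuntimeError('Please run this training script with Python 3; '
                       'the multiprocessing DataLoader is flaky on Python 2.')
```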

@zhreshold (Member) commented:

Let me know if it still exists after apache/mxnet#11908
