This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Race condition in downloading model from model zoo in parallel #17332

Closed
eric-haibin-lin opened this issue Jan 15, 2020 · 0 comments · Fixed by #17372


@eric-haibin-lin
Member

When I use Horovod for training and call

model = get_model(model_name, pretrained=True)

it fails with:

Exception in thread Thread-5:
Traceback (most recent call last):
  File "/usr/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "tests/python/unittest/test_gluon_model_zoo.py", line 32, in fn
    model = get_model(model_name, pretrained=True, root='parallel_model/')
  File "/home/ubuntu/src/mxnet/python/mxnet/gluon/model_zoo/vision/__init__.py", line 152, in get_model
    return models[name](**kwargs)
  File "/home/ubuntu/src/mxnet/python/mxnet/gluon/model_zoo/vision/mobilenet.py", line 375, in mobilenet_v2_0_25
    return get_mobilenet_v2(0.25, **kwargs)
  File "/home/ubuntu/src/mxnet/python/mxnet/gluon/model_zoo/vision/mobilenet.py", line 250, in get_mobilenet_v2
    get_model_file('mobilenetv2_%s' % version_suffix, root=root), ctx=ctx)
  File "/home/ubuntu/src/mxnet/python/mxnet/gluon/model_zoo/model_store.py", line 115, in get_model_file
    os.remove(zip_file_path)
FileNotFoundError: [Errno 2] No such file or directory: 'parallel_model/mobilenetv2_0.25-ae8f9392.zip'
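The failing call is os.remove(zip_file_path) in model_store.py. The stdlib-only sketch below is a hypothetical reconstruction of that pattern, not MXNet's actual code: every worker writes the same zip path, and each deletes it after "extracting". The function name demo_shared_path_race is made up for illustration, and a threading.Barrier forces all workers to reach the cleanup step together so the race shows up on every run instead of intermittently.

```python
import os
import tempfile
import threading


def demo_shared_path_race(n_workers=2):
    """Hypothetical reconstruction of the failing pattern: all workers share
    ONE zip path and each removes it after 'extracting'. The barrier makes
    every worker reach the cleanup step together, so the race always fires."""
    root = tempfile.mkdtemp()
    zip_path = os.path.join(root, "mobilenetv2_0.25-ae8f9392.zip")
    barrier = threading.Barrier(n_workers)
    errors = []

    def worker():
        # Stand-in for the download: every worker writes the same shared path.
        with open(zip_path, "wb") as f:
            f.write(b"fake archive bytes")
        # (Extraction of the .params file would happen here.)
        barrier.wait()
        try:
            os.remove(zip_path)  # only the first worker to get here succeeds
        except FileNotFoundError as exc:
            errors.append(exc)

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return errors


errors = demo_shared_path_race()
print(len(errors), type(errors[0]).__name__)  # 1 FileNotFoundError
```

With n workers, exactly n - 1 of them hit the same FileNotFoundError seen in the traceback.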

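The get_model API breaks when several processes (or threads, as in the traceback above) download the same pretrained model into the same root directory at the same time. They all appear to use the same zip path, so once one worker has removed the zip file after extraction, another worker's os.remove(zip_file_path) fails with FileNotFoundError.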
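One way to avoid this, sketched below under stated assumptions, is to give every worker its own temporary zip file and extraction directory, and to publish the final .params file with os.replace, so no worker ever deletes a file another worker still needs. This is not necessarily the approach taken in #17372. get_model_file_safe, fake_download and run_parallel are hypothetical names, and fake_download stands in for the real network download. Threads keep the demo self-contained; os.replace is also atomic across processes (e.g. Horovod workers) as long as the temporaries sit on the same filesystem as root, which is why they are created inside root.

```python
import os
import shutil
import tempfile
import threading
import zipfile


def fake_download(dest_path, name):
    """Hypothetical stand-in for a network download: writes a zip archive
    containing a single '<name>.params' member to dest_path."""
    with zipfile.ZipFile(dest_path, "w") as zf:
        zf.writestr(name + ".params", b"weights for " + name.encode())


def get_model_file_safe(name, root):
    """Return the path of '<name>.params' under root, downloading if needed.

    Each worker uses uniquely named temporaries, and the final file is
    published with os.replace (atomic on the same filesystem)."""
    os.makedirs(root, exist_ok=True)
    file_path = os.path.join(root, name + ".params")
    if os.path.exists(file_path):
        return file_path

    # Unique zip path and extraction dir per worker, instead of a shared '<name>.zip'.
    fd, tmp_zip = tempfile.mkstemp(suffix=".zip", dir=root)
    os.close(fd)
    tmp_dir = tempfile.mkdtemp(dir=root)
    try:
        fake_download(tmp_zip, name)
        with zipfile.ZipFile(tmp_zip) as zf:
            zf.extractall(tmp_dir)
        # If another worker already published the file, this overwrites it
        # with identical content instead of failing.
        os.replace(os.path.join(tmp_dir, name + ".params"), file_path)
    finally:
        # Each worker removes only the temporaries it created itself.
        os.remove(tmp_zip)
        shutil.rmtree(tmp_dir, ignore_errors=True)
    return file_path


def run_parallel(n_workers=8, name="mobilenetv2_0.25"):
    """Call get_model_file_safe from several threads sharing one root."""
    root = tempfile.mkdtemp()
    paths, errors = [], []

    def worker():
        try:
            paths.append(get_model_file_safe(name, root))
        except Exception as exc:  # record failures instead of crashing the thread
            errors.append(exc)

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return errors, paths, root


errors, paths, root = run_parallel(8)
print(len(errors), len(set(paths)))  # 0 1
```

Every worker ends up with the same path, and only the published .params file remains in root afterwards.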
