-
Notifications
You must be signed in to change notification settings - Fork 6.8k
[BUGFIX] fix model zoo parallel download #17336
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently when I ran the mxnet_mnist example in horovod, the data-XXX folder remains even after the training finished. Does this PR also address the auto clean up issue?
@@ -103,16 +106,16 @@ def get_model_file(name, root=os.path.join(base.data_dir(), 'models')): | |||
|
|||
util.makedirs(root) | |||
|
|||
zip_file_path = os.path.join(root, file_name+'.zip') | |||
temp_zip_file_path = os.path.join(root, file_name+random_uuid+'.zip') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you want to use tempfile so it takes care of clean up automatically: https://docs.python.org/3/library/tempfile.html
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe tempfile is typically on a different filesystem, so there is no atomic rename operation to the target directory and filename available.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@leezu What do you mean by a different filesystem? Regarding name, I think you can also specify a file name using the NamedTemporaryfile class?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right, I overlooked the dir
argument.
threads = [] | ||
name = 'mobilenetv2_0.25' | ||
for _ in range(10): | ||
x = threading.Thread(target=fn, args=(name,)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not use multiprocess to test? My understand is that in horovod get_model is parallelized at process level not thread level?
Moved to #17372 |
Description
Fixes #17332
Checklist
Essentials
Please feel free to remove inapplicable items for your PR.
Changes
Comments