[MXNET-342] Fix the multi worker Dataloader #10628
Conversation
@piiswrong if you want to review
@@ -112,6 +113,9 @@ def default_mp_batchify_fn(data):


def worker_loop(dataset, key_queue, data_queue, batchify_fn):
    """Worker loop for multiprocessing DataLoader."""
    if isinstance(dataset, RecordFileDataset):
Don't do this kind of special casing.
How do you suggest doing it then?
In the same file you used special casing several times, e.g.:
if isinstance(data[0], nd.NDArray):
    return nd.stack(*data)
elif isinstance(data[0], tuple):
    data = zip(*data)
@piiswrong we could also have a `reload` method on `Dataset` that does nothing for all datasets except the `RecordFileDataset` ones; that way there is no special casing.
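For illustration (not the actual patch; the constructor and method shapes here are assumptions), the idea would look roughly like this:

```python
import os
from mxnet import recordio

class Dataset(object):
    """Abstract dataset; most subclasses hold no open file handle."""
    def reload(self):
        pass  # default: nothing to reopen

class RecordFileDataset(Dataset):
    def __init__(self, filename):
        self._filename = filename
        self.reload()

    def reload(self):
        # Reopen the record file so the calling (worker) process gets its
        # own open file description with an independent offset.
        idx_file = os.path.splitext(self._filename)[0] + '.idx'
        self._record = recordio.MXIndexedRecordIO(idx_file, self._filename, 'r')
```

Every worker would then call `dataset.reload()` unconditionally, and only record-file-backed datasets would actually do anything.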
python/mxnet/gluon/data/dataset.py
        Reload the record file.
        """
        idx_file = os.path.splitext(self._filename)[0] + '.idx'
        self._record = recordio.MXIndexedRecordIO(idx_file, self._filename, 'r')
We should fix this in the backend. Why would forking cause a problem? File descriptors should be duplicated when forking.
That would be the ideal solution indeed. https://groups.google.com/forum/#!topic/comp.lang.python/x-C31fCSZso
Contrary to what I stated earlier, it looks like the actual problem could be that the file descriptors get closed rather than shared. I don't see an easy way to set `close_fds=False` (https://docs.python.org/3/library/subprocess.html#popen-constructor) since we are using the `multiprocessing` package rather than `subprocess`.
Isn't `close_fds` `False` by default?
For the `subprocess` package, it is `True` by default.
OK, digging a bit more, it seems that the `multiprocessing` package does not close file descriptors, since it is simply calling `os.fork()`. I have updated the description of the PR to reflect the issue. tl;dr: a file description keeps track of the byte offset position in the file. When forking, all child processes get a duplicate of the original file descriptor; however, they all refer to the same file description, and when they try to move the current offset of that file description at the same time, they cause a crash.
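To make the shared-offset behaviour concrete, here is a small self-contained demo (not from the PR; POSIX-only, since it uses `os.fork()`):

```python
import os

# Create a small file to read from.
with open('demo.txt', 'wb') as fh:
    fh.write(b'0123456789')

fd = os.open('demo.txt', os.O_RDONLY)  # one open file description
pid = os.fork()
if pid == 0:
    os.read(fd, 5)   # child: duplicate descriptor, SAME description,
    os._exit(0)      # so this read moves the shared offset to 5
os.waitpid(pid, 0)
# The child's read moved the parent's offset too: prints 5, not 0.
print(os.lseek(fd, 0, os.SEEK_CUR))
os.close(fd)
```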
This is a generic issue, not specific to the record-IO dataset. Any dataset that opens a file could be affected. Does it behave the same way if the Dataset is written in Python and opens the file with `open()`?
It would behave the same way if the dataset relies on reading the file at run-time. To make it clearer, instead of checking the dataset type explicitly, we could rely on a `reload` method as suggested above.
Do you know if PyTorch has a solution to this? PyTorch's DataLoader and dataset work pretty similarly to ours.
Just had a read through their code; I didn't see anything that would mitigate that issue. However, they also don't provide a dataset that would read through a single file and use `lseek`.
Closing for now after @piiswrong's design concerns. The bug is still present though.
I think PyTorch is also suffering from similar problems: pytorch/pytorch#973. I think we can use `ForkingPickler.register(recordio.MXRecordIO, reopen_recordio)` to force reloading record files when workers are forked.
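A rough sketch of that suggestion (the reducer below is an assumption about how `reopen_recordio` could look, and it assumes `MXRecordIO` exposes its `uri` and `flag` as plain Python values):

```python
from multiprocessing.reduction import ForkingPickler
from mxnet import recordio

def reopen_recordio(rio):
    # Reducer invoked when an MXRecordIO handle is pickled to a worker:
    # ship only the constructor arguments, so the worker rebuilds (and
    # thereby reopens) the handle and gets its own file description.
    return recordio.MXRecordIO, (rio.uri, rio.flag)  # attribute names assumed

ForkingPickler.register(recordio.MXRecordIO, reopen_recordio)
```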
Description
MXNET-342
Fix #9974
DataLoader was not compatible with ImageRecordDataset. I initially assumed the file descriptors were closed on forking (`close_fds=True` by default), but the code in DataLoader uses the `multiprocessing` package, which does not close the file descriptors, as it simply calls `os.fork()`. The actual problem is that the record iterator calls `lseek` on the file descriptor of each fork, and all of these descriptors point to the same open file description. Since all the forked processes share one open file description, they all try to set the offset at the same time, which causes a crash: they cannot read the same file at different positions through a single open file description.

See https://stackoverflow.com/questions/4277289/are-file-descriptors-shared-when-forking and https://stackoverflow.com/questions/11733481/can-anyone-explain-a-simple-description-regarding-file-descriptor-after-fork for more information on the distinction between a file descriptor and a file description.

Now the record file is reloaded in each worker, so that each individual process gets its own open file description.
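For reference, the failure mode in #9974 can be exercised with a multi-worker loader over a record-file dataset (the `.rec` path below is a placeholder):

```python
from mxnet.gluon.data import DataLoader
from mxnet.gluon.data.vision import ImageRecordDataset

dataset = ImageRecordDataset('data/val.rec')  # placeholder path
loader = DataLoader(dataset, batch_size=32, num_workers=4)

# With num_workers > 0, each worker is forked; before this fix the
# workers shared one file description and crashed while seeking.
for images, labels in loader:
    pass
```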
Checklist
Essentials
Please feel free to remove inapplicable items for your PR.