
[MXNET-342] Fix the multi worker Dataloader #10628

Closed

Conversation

@ThomasDelteil (Contributor) commented Apr 20, 2018

Description

MXNET-342

Fix #9974

DataLoader was not compatible with ImageRecordDataset when using multiple workers. The initial suspicion was that the file descriptors were closed on forking (close_fds=True by default), but the DataLoader code uses the multiprocessing package, which does not close file descriptors since it simply calls os.fork(). The actual problem is that the recordIter calls lseek on the file descriptor of each fork, and all of these descriptors point to the same open file description. Since all the forked processes share one open file description, they all try to set its offset at the same time, which causes a crash: they cannot read the same file at different positions through a single open file description.

See https://stackoverflow.com/questions/4277289/are-file-descriptors-shared-when-forking
and https://stackoverflow.com/questions/11733481/can-anyone-explain-a-simple-description-regarding-file-descriptor-after-fork for more information on the distinction between a file descriptor and an open file description.

This PR reloads the record file in each worker so that each individual process gets its own open file description.
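
For illustration, here is a minimal standalone sketch (POSIX-only, not MXNet code) of the mechanism: after os.fork(), parent and child hold duplicated file descriptors that refer to the same open file description, so an lseek in one process moves the offset seen by the other.

    import os

    # create a small scratch file so the example is self-contained
    with open('offset_demo.bin', 'wb') as f:
        f.write(b'\x00' * 1024)

    fd = os.open('offset_demo.bin', os.O_RDONLY)
    pid = os.fork()
    if pid == 0:
        # child: this descriptor is a duplicate, but it refers to the same
        # open file description as the parent's, so lseek moves the shared offset
        os.lseek(fd, 512, os.SEEK_SET)
        os._exit(0)
    os.waitpid(pid, 0)
    print(os.lseek(fd, 0, os.SEEK_CUR))  # prints 512, not 0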

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [MXNET-$JIRA_ID], where $JIRA_ID refers to the relevant JIRA issue created (except PRs with tiny changes)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
  • For user-facing API changes, API doc string has been updated.
  • For new C++ functions in header files, their functionalities and arguments are documented.
  • For new examples, README.md is added to explain what the example does, the source of the dataset, expected performance on the test set, and a reference to the original paper if applicable
  • Check the API doc at http://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
  • To the best of my knowledge, examples are either not affected by this change or have been fixed to be compatible with this change

@ThomasDelteil (Contributor, Author)

@piiswrong if you want to review

@@ -112,6 +113,9 @@ def default_mp_batchify_fn(data):

def worker_loop(dataset, key_queue, data_queue, batchify_fn):
    """Worker loop for multiprocessing DataLoader."""
    if isinstance(dataset, RecordFileDataset):
Contributor

don't do this kind of special casing.

@ThomasDelteil (Contributor, Author) Apr 20, 2018

How do you suggest doing it then?

In the same file you used special casing several times, e.g.

    if isinstance(data[0], nd.NDArray):
        return nd.stack(*data)
    elif isinstance(data[0], tuple):
        data = zip(*data)

@ThomasDelteil (Contributor, Author)

@piiswrong we could also have a reload method on Dataset that does nothing for all datasets except the RecordFileDataset ones; that way there is no special casing.

    Reload the record file.
    """
    idx_file = os.path.splitext(self._filename)[0] + '.idx'
    self._record = recordio.MXIndexedRecordIO(idx_file, self._filename, 'r')
Contributor

We should fix this in the backend. Why would forking cause a problem? File descriptors should be duplicated when forking.

@ThomasDelteil (Contributor, Author) Apr 20, 2018

That would be the ideal solution indeed. https://groups.google.com/forum/#!topic/comp.lang.python/x-C31fCSZso
Contrary to what I stated earlier, it looks like the actual problem could be that the file descriptors get closed rather than shared?
I don't see an easy way to set close_fds=False (https://docs.python.org/3/library/subprocess.html#popen-constructor) since we are using the multiprocessing package rather than subprocess.

Contributor

Isn't close_fds False by default?

@ThomasDelteil (Contributor, Author) Apr 20, 2018

For the subprocess package, it is True by default.

@ThomasDelteil (Contributor, Author) Apr 20, 2018

OK, digging a bit more, it seems that the multiprocessing package does not close file descriptors since it is simply calling os.fork(). I have updated the description of the PR to reflect the issue. tl;dr: an open file description keeps track of the current byte offset in the file. When forking, all child processes get a duplicate of the original file descriptor, but they all refer to the same open file description, and when they try to move its offset at the same time they cause a crash.

@piiswrong (Contributor)

This is a generic issue, not specific to RecordIODataset. Any dataset that opens a file could be affected. Does it behave the same way if the Dataset is written in Python and opens the file with open?

@ThomasDelteil (Contributor, Author)

It would behave the same way if the dataset relies on reading the file at run-time.

To make it clearer: instead of checking for RecordIODataset, we could have RecordIODataset inherit from a new abstract class, FileReadingDataset for example, that documents the behavior and has an abstract method reload_file that we call in the worker loop, similar to what is proposed in this PR.
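
A rough sketch of that idea, using the FileReadingDataset and reload_file names proposed above (these are not part of the actual MXNet API; this is only an illustration):

    import os
    from mxnet import recordio
    from mxnet.gluon.data import Dataset

    class FileReadingDataset(Dataset):
        """Abstract base for datasets that keep an open file handle and must
        re-open it in each forked worker to get an independent open file
        description."""
        def reload_file(self):
            raise NotImplementedError

    class RecordFileDataset(FileReadingDataset):
        def __init__(self, filename):
            self._filename = filename
            self.reload_file()

        def reload_file(self):
            # re-open the .rec/.idx pair so this process owns its own offset
            idx_file = os.path.splitext(self._filename)[0] + '.idx'
            self._record = recordio.MXIndexedRecordIO(idx_file, self._filename, 'r')

        def __getitem__(self, idx):
            return self._record.read_idx(self._record.keys[idx])

        def __len__(self):
            return len(self._record.keys)

    def worker_loop(dataset, key_queue, data_queue, batchify_fn):
        """Worker loop for multiprocessing DataLoader."""
        if isinstance(dataset, FileReadingDataset):
            dataset.reload_file()  # one check against the abstract base, no per-class special casing
        # ... rest of the loop unchanged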

@piiswrong (Contributor)

Do you know if PyTorch has a solution to this? PyTorch's DataLoader and Dataset work pretty similarly to ours.

@ThomasDelteil (Contributor, Author)

Just had a read through their code; I didn't see anything that would mitigate this issue. However, they also don't provide a dataset that reads through a single file and uses lseek to access records à la .rec file, so it is likely that they have never encountered this problem.

@ThomasDelteil (Contributor, Author)

Closing for now following @piiswrong's design concerns. The bug is still present, though.

@zhreshold (Member)

I think PyTorch is also suffering from similar problems: pytorch/pytorch#973

I think we can use ForkingPickler.register(recordio.MXRecordIO, reopen_recordio) to force record files to be reloaded when workers are forked.
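
A hedged sketch of what that could look like; reduce_recordio and rebuild_recordio are illustrative names, and the uri/flag attributes read off the record object are assumptions rather than verified MXRecordIO internals:

    from multiprocessing.reduction import ForkingPickler
    from mxnet import recordio

    def rebuild_recordio(uri, flag):
        # runs in the receiving process: re-open the record file so that
        # process gets its own open file description (and its own offset)
        return recordio.MXRecordIO(uri, flag)

    def reduce_recordio(rec):
        # instead of shipping the open handle, ship the recipe to re-open it
        return rebuild_recordio, (rec.uri, rec.flag)

    ForkingPickler.register(recordio.MXRecordIO, reduce_recordio)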

Development

Successfully merging this pull request may close these issues.

DataLoader with workers not compatible with ImageRecordDataset
3 participants