Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixed multiprocessing in generators. #7118

Closed
wants to merge 17 commits into from
Closed

Fixed multiprocessing in generators. #7118

wants to merge 17 commits into from

Conversation

joeyearsley
Copy link
Contributor

@joeyearsley joeyearsley commented Jun 24, 2017

Previously, applying the multiprocessing.Lock did not synchronize variables across processes, to ensure files are not duplicated in an epoch we need these variables to be synchronized.

To ensure they are synchronized the code now creates shared variables which are updated when the lock is applied. Therefore ensuring no duplicates in an epoch and batch.

I've witnessed no overhead of this lock (due to the processing on the batch and minimal time the lock is actually held by a process.

Sequences are recommended to keep order when using the prediction function since multiprocessing/threading can not guarantee threads will return in order. However, you can also use generators in your own multiprocessing pool and then use predict.

@joeyearsley joeyearsley changed the title Fixed Preprocessing to allow generators to be ordered Fixed preprocessing to ensure no duplicates in a batch when preprocessing is used. Jun 24, 2017
@joeyearsley joeyearsley changed the title Fixed preprocessing to ensure no duplicates in a batch when preprocessing is used. Fixed preprocessing to ensure no duplicates in a batch when multiproc is used. Jun 24, 2017
@ahundt
Copy link
Contributor

ahundt commented Jun 24, 2017

General thought: could all processes be given a seed and an integer ID, which they all use to determine generate their batches modulo the total number of processes?

Specific comment: make sure it is possible for the user to specify the exact seed which is used, so the randomness source used does not change if they change the data size or # of workers

@joeyearsley
Copy link
Contributor Author

joeyearsley commented Jun 24, 2017

@ahundt I looked into it. PID's aren't assigned from 0 onwards, they can be 6001-6009 (i.e. 8 workers), how would you then determine the correct modulo?
You could pass it in as an argument through the queue, however, you then are only getting pseudo random files, as each part of the list is split into N workers.
Finally, the shared variables add no visible overhead and I believe it is the correct way to ensure a random shuffle and no duplicates.

Exact seed? I've not altered the seed part from the original code, is it wrong in the original?

@joeyearsley
Copy link
Contributor Author

Closed for now as there is a heisenbug, sometimes the processes all go to sleep with no hint at an error and other times it succeeds. Only happens in the validation stage.

@joeyearsley
Copy link
Contributor Author

Found the issue, will be re-opening once I've cleaned the code up.

@joeyearsley joeyearsley reopened this Jun 28, 2017
@joeyearsley joeyearsley reopened this Jun 28, 2017
@joeyearsley joeyearsley changed the title Fixed preprocessing to ensure no duplicates in a batch when multiproc is used. Fixed multiprocessing in generators. Jun 28, 2017
@Dref360
Copy link
Contributor

Dref360 commented Jun 28, 2017

First of all great work. I just need some clarification :P

Maybe I do not understand this PR, but how does it fix multiprocessing in generators?
From what I understand, your PR is for the DirectoryIterator and not every generators right? Would there be an easy interface that we could extract from this to handle "most" generators?

@ahundt
Copy link
Contributor

ahundt commented Jun 29, 2017

assuming all processes take the same amount of computation time, then all the files should be in order.

Could you elaborate? In general this is never true, even when executing the exact same static program twice, except on very carefully set up real time systems.

@ahundt
Copy link
Contributor

ahundt commented Jun 29, 2017

My comment was to point out one of the assumptions since I happened to come across it. I see performance vs repeatability reasons for each. I think users will need different behavior depending on application, networks for research and publication vs running quick experiments on datasets for business. I don't want to prognosticate too much because I'm mostly using tfrecords right now so I don't have a specific need for this feature and would rather not ask you to write code I don't need yet, so I leave the decision up to you. :-)

@joeyearsley
Copy link
Contributor Author

joeyearsley commented Jul 2, 2017

This is now resolved.

I still prefer @Dref360 Sequences, however for the sake of fixing generators now work properly with multiprocessing and threading. To achieve this it uses far too many locks, whiles and sleeps, yet it is now fully synchronized.

It achieves the same as the sequences, but uses global variables to keep count of who is allowed into what processes (either next or placing on the queue) at what time.

@joeyearsley joeyearsley closed this Jul 2, 2017
@joeyearsley joeyearsley reopened this Jul 2, 2017
@joeyearsley
Copy link
Contributor Author

Closing this issue, even this fix doesn't guarantee ordering, just reduces the likelihood. The proper way is to use sequences.

@joeyearsley joeyearsley closed this Jul 3, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants