-
Notifications
You must be signed in to change notification settings - Fork 19.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fixed multiprocessing in generators. #7118
Conversation
General thought: could all processes be given a seed and an integer ID, which they all use to determine generate their batches modulo the total number of processes? Specific comment: make sure it is possible for the user to specify the exact seed which is used, so the randomness source used does not change if they change the data size or # of workers |
@ahundt I looked into it. PID's aren't assigned from 0 onwards, they can be 6001-6009 (i.e. 8 workers), how would you then determine the correct modulo? Exact seed? I've not altered the seed part from the original code, is it wrong in the original? |
Closed for now as there is a heisenbug, sometimes the processes all go to sleep with no hint at an error and other times it succeeds. Only happens in the validation stage. |
Found the issue, will be re-opening once I've cleaned the code up. |
First of all great work. I just need some clarification :P Maybe I do not understand this PR, but how does it fix multiprocessing in generators? |
Could you elaborate? In general this is never true, even when executing the exact same static program twice, except on very carefully set up real time systems. |
My comment was to point out one of the assumptions since I happened to come across it. I see performance vs repeatability reasons for each. I think users will need different behavior depending on application, networks for research and publication vs running quick experiments on datasets for business. I don't want to prognosticate too much because I'm mostly using tfrecords right now so I don't have a specific need for this feature and would rather not ask you to write code I don't need yet, so I leave the decision up to you. :-) |
This is now resolved. I still prefer @Dref360 Sequences, however for the sake of fixing generators now work properly with multiprocessing and threading. To achieve this it uses far too many locks, whiles and sleeps, yet it is now fully synchronized. It achieves the same as the sequences, but uses global variables to keep count of who is allowed into what processes (either next or placing on the queue) at what time. |
Closing this issue, even this fix doesn't guarantee ordering, just reduces the likelihood. The proper way is to use sequences. |
Previously, applying the
multiprocessing.Lock
did not synchronize variables across processes, to ensure files are not duplicated in an epoch we need these variables to be synchronized.To ensure they are synchronized the code now creates shared variables which are updated when the lock is applied. Therefore ensuring no duplicates in an epoch and batch.
I've witnessed no overhead of this lock (due to the processing on the batch and minimal time the lock is actually held by a process.
Sequences are recommended to keep order when using the prediction function since multiprocessing/threading can not guarantee threads will return in order. However, you can also use generators in your own multiprocessing pool and then use predict.