Add multiprocessing to meeting simulation workflow #972
Very cool!
I'll keep this under WIP while I test it out on my actual simulation tasks.
    for spk_id in this_batch_spk_ids:
        sampler = self.samplers[spk_id]
        try:
            this_batch = next(sampler)
@pzelasko An issue I am facing here is that the whole sampler gets exhausted after sampling just 1 batch. I'm not sure why this is happening. Could you take a look?
Okay, very basic mistake: I forgot to sort the cuts by speaker ID before groupby!
The simulation should be quite fast now. As an example, creating mixtures from ~3.2M source utterances (each used once) takes ~1h using 4 jobs.
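For anyone hitting the same symptom, here is a minimal sketch of the groupby pitfall, assuming the grouping is done with `itertools.groupby`. The `(speaker_id, cut_id)` tuples and the `group_by_speaker` helper are hypothetical stand-ins for illustration, not the actual code in this PR. `groupby` only merges consecutive items with the same key, so on unsorted input each speaker shows up in several small groups, and storing them in a dict keeps only the last fragment.

```python
from itertools import groupby

# Hypothetical stand-in for cuts: (speaker_id, cut_id) pairs in arbitrary order.
cuts = [("spk1", "a"), ("spk2", "b"), ("spk1", "c"), ("spk2", "d"), ("spk1", "e")]


def group_by_speaker(items, sort_first):
    """Collect cut ids per speaker using itertools.groupby."""
    if sort_first:
        # sorted() is stable, so cuts keep their relative order within a speaker.
        items = sorted(items, key=lambda item: item[0])
    groups = {}
    for spk_id, group in groupby(items, key=lambda item: item[0]):
        # A repeated key overwrites the earlier fragment for that speaker.
        groups[spk_id] = [cut_id for _, cut_id in group]
    return groups


unsorted_groups = group_by_speaker(cuts, sort_first=False)
sorted_groups = group_by_speaker(cuts, sort_first=True)

print(unsorted_groups)  # each speaker ends up with a single cut
print(sorted_groups)    # each speaker keeps all of its cuts
```

With only one cut left per speaker, a per-speaker sampler built from the unsorted result would run dry after a single batch, which matches the behavior described above.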
Changes in this PR
`parallel_map` method.
Recommendations for `num_jobs`
When using the `simulate()` method in the workflow, we recommend using 1 job when the number of source utterances is small (up to 50k, for example). For larger inputs, this number can be scaled up gradually, but not beyond 4-8 jobs, to avoid slowdown from multiprocessing overhead.
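The recommendation above could be encoded as a small helper. This is only a sketch: `suggest_num_jobs` is a hypothetical function, not part of the library, and the logarithmic growth rule is an assumption chosen to mirror "scale up slowly". Only the 50k threshold and the 4-8 job ceiling come from the text.

```python
import math


def suggest_num_jobs(num_utterances, small_threshold=50_000, max_jobs=4):
    """Hypothetical heuristic: 1 job for small inputs, slow growth up to a cap."""
    if num_utterances <= small_threshold:
        return 1
    # Assumed growth rule: one extra job per doubling beyond the threshold.
    extra_jobs = math.ceil(math.log2(num_utterances / small_threshold))
    return min(max_jobs, 1 + extra_jobs)


small_input_jobs = suggest_num_jobs(20_000)
medium_input_jobs = suggest_num_jobs(200_000)
large_input_jobs = suggest_num_jobs(3_200_000)
print(small_input_jobs, medium_input_jobs, large_input_jobs)
```

The returned value would then be passed as `num_jobs` when calling `simulate()`. Raising `max_jobs` toward 8 is worth trying only if profiling shows the extra workers actually help.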