Add multiprocessing to meeting simulation workflow #972

Merged (9 commits) on Feb 9, 2023

desh2608 commented Feb 8, 2023

Changes in this PR

  • Refactor the meeting simulation workflow so that utterance group creation is handled by a common sampler.
  • Add an option to use multiple workers for mixing the sampled utterance groups. This can potentially speed up simulation by ~2-3x.
  • Fix parallel_map method.
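
As a rough illustration of the second bullet, the sketch below fans utterance groups out to worker processes using only the standard library. It is not the code from this PR: `mix_group` and `mix_all` are hypothetical names, the string join is a placeholder for the real mixing step, and the chunk size is an arbitrary choice.

```python
from concurrent.futures import ProcessPoolExecutor
from typing import List, Sequence


def mix_group(group: Sequence[str]) -> str:
    """Hypothetical stand-in for mixing one utterance group into a mixture."""
    return "+".join(group)


def mix_all(groups: List[Sequence[str]], num_jobs: int = 1) -> List[str]:
    """Mix every group, serially or with a pool of worker processes."""
    if num_jobs <= 1:
        # Serial path: no process start-up or pickling overhead.
        return [mix_group(g) for g in groups]
    with ProcessPoolExecutor(max_workers=num_jobs) as pool:
        # Sending groups in chunks amortizes inter-process communication.
        # map() yields results in input order.
        return list(pool.map(mix_group, groups, chunksize=64))
```

Calling `mix_all(groups, num_jobs=4)` from a script should be done under an `if __name__ == "__main__":` guard so that worker processes can import the module safely.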

Recommendations for num_jobs

When using the simulate() method in the workflow, we recommend using 1 job when the number of source utterances is small (up to 50k, for example). For larger inputs, this number can be scaled up slowly, but not beyond 4-8 jobs, to avoid slow-down due to multiprocessing overhead.
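
One way to encode this recommendation is a small helper like the one below. This is only a sketch: `suggest_num_jobs` is a hypothetical function, not part of the library, and the step of one extra job per million utterances is an assumed value chosen for illustration. Only the 50k threshold and the upper bound of 8 jobs come from the text above.

```python
def suggest_num_jobs(
    num_utterances: int,
    small_input: int = 50_000,   # threshold from the recommendation above
    step: int = 1_000_000,       # assumption: one extra job per 1M utterances
    max_jobs: int = 8,           # upper end of the recommended 4-8 range
) -> int:
    """Return a conservative job count for a given number of source utterances."""
    if num_utterances <= small_input:
        return 1
    # Scale up slowly and never exceed the cap.
    return min(max_jobs, 1 + num_utterances // step)
```

The value returned could then be passed as the number of jobs when calling simulate().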

pzelasko previously approved these changes Feb 8, 2023

pzelasko left a comment

Very cool!

pzelasko added this to the v1.13 milestone Feb 8, 2023

desh2608 commented Feb 8, 2023

> Very cool!

I'll keep this under WIP while I test it out on my actual simulation tasks.

desh2608 changed the title from "Add mulitprocessing to meeting simulation workflow" to "[WIP] Add mulitprocessing to meeting simulation workflow" Feb 8, 2023
desh2608 changed the title from "[WIP] Add mulitprocessing to meeting simulation workflow" to "[WIP] Add multiprocessing to meeting simulation workflow" Feb 8, 2023

    for spk_id in this_batch_spk_ids:
        sampler = self.samplers[spk_id]
        try:
            this_batch = next(sampler)

@pzelasko An issue I am facing here is that the whole sampler gets exhausted after sampling just 1 batch. I'm not sure why this is happening. Could you take a look?

Okay, very basic mistake: I forgot to sort the cuts by speaker ID before the groupby!
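
A minimal sketch of this pitfall, assuming the groupby in question is Python's `itertools.groupby`, which only groups consecutive items with equal keys. The `(speaker, utterance)` tuples and the `group_by_speaker` helper are hypothetical stand-ins for the real cuts and code.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical (speaker_id, utterance_id) pairs, interleaved by speaker.
cuts = [("spk1", "utt1"), ("spk2", "utt2"), ("spk1", "utt3"), ("spk2", "utt4")]


def group_by_speaker(items, sort_first):
    """Group utterances per speaker, optionally sorting by speaker first."""
    key = itemgetter(0)
    if sort_first:
        items = sorted(items, key=key)  # stable sort keeps utterance order
    return [(spk, [utt for _, utt in grp]) for spk, grp in groupby(items, key=key)]


# Without sorting, every run of equal keys becomes its own tiny group.
unsorted_groups = group_by_speaker(cuts, sort_first=False)
# With sorting, each speaker gets exactly one group with all utterances.
sorted_groups = group_by_speaker(cuts, sort_first=True)
```

If per-speaker samplers were built from the unsorted groups, each one could end up holding only a fragment of a speaker's cuts, which would be consistent with a sampler running out after a single batch.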

desh2608 changed the title from "[WIP] Add multiprocessing to meeting simulation workflow" to "Add multiprocessing to meeting simulation workflow" Feb 8, 2023

desh2608 commented Feb 8, 2023

The simulation should be quite fast now. As an example, creating mixtures from ~3.2M source utterances (each used once) takes ~1h using 4 jobs.

desh2608 merged commit a418912 into lhotse-speech:master Feb 9, 2023
desh2608 deleted the sim_nj branch November 2, 2023 19:18