Add multiprocess mechanism for Common Voice #1025

Merged
merged 22 commits into from
Jun 14, 2023

@yfyeung yfyeung commented Apr 8, 2023

  • Add a multiprocess mechanism to the commonvoice recipe.
  • Use a context manager to disable ffmpeg-torchaudio when processing the Common Voice dataset.
  • Fix a bug in tsvreader.

Here is an example run using num_jobs=16:

Distributing tasks: 16373it [00:00, 27021.30it/s]                                                                        | 0/1 [00:00<?, ?it/s]
Processing: 100%|███████████████████████████████████████████████████████████████████████████████████████| 16373/16373 [00:39<00:00, 419.75it/s]
Distributing tasks: 16373it [00:00, 18163.42it/s]
Processing: 100%|███████████████████████████████████████████████████████████████████████████████████████| 16373/16373 [00:18<00:00, 878.26it/s]
Distributing tasks: 1013969it [00:35, 28613.47it/s]
Processing: 100%|██████████████████████████████████████████████████████████████████████████████████| 1013969/1013969 [15:41<00:00, 1076.88it/s]
Spliting: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [18:57<00:00, 379.13s/it]
Processing CommonVoice languages: 100%|████████████████████████████████████████████████████████████████████████| 1/1 [18:57<00:00, 1137.40s/it

:param lang_path: path to a CommonVoice directory for a specific language
(e.g., "/path/to/cv-corpus-13.0-2023-03-09/pl").
:param num_jobs: How many concurrent workers to use for scanning of the audio files.
:return: a tuple of (RecordingSet, SupervisionSet) objects opened in lazy mode,
Collaborator

This docstring is outdated after your changes because the manifests are no longer created or opened lazily here. That may make the recipe quite memory-intensive; would it be possible to keep the .open_writer() mechanism with your changes? You can open the writers with RecordingSet.open_writer() around lines 206-207 and, instead of appending to a list, write to disk.
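The memory concern above can be illustrated with a minimal standard-library sketch. It does not use Lhotse; `JsonlWriter`, `fake_recordings`, `write_manifest`, and `read_manifest` are hypothetical names made up for this illustration, standing in for the writer-style pattern the reviewer describes (write each item to disk as it is produced instead of collecting everything in a list).

```python
import json
from pathlib import Path
from typing import Dict, Iterable, Iterator


class JsonlWriter:
    """Hypothetical stand-in for a sequential manifest writer (not the Lhotse class)."""

    def __init__(self, path: Path) -> None:
        self.path = Path(path)
        self._file = None

    def __enter__(self) -> "JsonlWriter":
        self._file = self.path.open("w", encoding="utf-8")
        return self

    def __exit__(self, exc_type, exc, tb) -> None:
        # Always close the file, even if the loop body raised.
        self._file.close()

    def write(self, item: Dict) -> None:
        # One JSON object per line; nothing is accumulated in a Python list.
        self._file.write(json.dumps(item) + "\n")


def fake_recordings(n: int) -> Iterator[Dict]:
    """Made-up items standing in for per-file manifest entries."""
    for i in range(n):
        yield {"id": f"utt-{i}", "duration": 1.0 + i}


def write_manifest(items: Iterable[Dict], path: Path) -> int:
    """Stream items to disk and return how many were written."""
    count = 0
    with JsonlWriter(path) as writer:
        for item in items:
            writer.write(item)
            count += 1
    return count


def read_manifest(path: Path) -> Iterator[Dict]:
    """Lazily read the manifest back, one item at a time."""
    with Path(path).open(encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)
```

With this shape, peak memory depends on a single item rather than on the size of the whole split, which matters for a split with roughly a million entries like the one in the log above.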

Contributor Author

Sure

@desh2608
Collaborator

Even with 16 jobs, preparing CommonVoice en takes 8+ hours? That's quite surprising. Do you know where the bottleneck is?

@yfyeung
Contributor Author

yfyeung commented Apr 17, 2023

> Even with 16 jobs, preparing CommonVoice en takes 8+ hours? That's quite surprising. Do you know where the bottleneck is?

@desh2608 Please see issue #1026.
It seems there is something wrong with Recording.from_file("*.mp3"). Without multithreading, CPU usage is over 3500%. With multithreading, it takes about 8 hours no matter what num_jobs is.

@pzelasko @mthrok If we disable this branch:

if (is_mp3 or is_fileobj) and torchaudio_supports_ffmpeg():

we get the following timings:

| method | time | CPU% |
|---|---|---|
| without ThreadPoolExecutor | 4 hours | 100% |
| with ThreadPoolExecutor, num_jobs=1 | 4 hours | 100% |
| with ThreadPoolExecutor, num_jobs=2 | 2.3 hours | 200% |
| with ThreadPoolExecutor, num_jobs=4 | 1.5 hours | 350% |
| with ThreadPoolExecutor, num_jobs=8 | 50 minutes | 600% |
| with ThreadPoolExecutor, num_jobs=16 | 40 minutes | 800% |
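To make the scaling in these timings easier to read, here is a small sketch that converts them to speedup and parallel efficiency relative to the num_jobs=1 run. The minute values are the reported times converted by hand (4 hours = 240, 2.3 hours = 138, 1.5 hours = 90); `timings_min` and `speedup_table` are names introduced only for this illustration.

```python
# Reported wall-clock times from the timings above, converted to minutes.
timings_min = {1: 240.0, 2: 138.0, 4: 90.0, 8: 50.0, 16: 40.0}


def speedup_table(timings):
    """Map num_jobs -> (speedup vs. 1 job, efficiency = speedup / num_jobs)."""
    base = timings[1]
    return {n: (base / t, base / t / n) for n, t in timings.items()}


for n, (speedup, efficiency) in speedup_table(timings_min).items():
    print(f"num_jobs={n:>2}  speedup={speedup:4.2f}x  efficiency={efficiency:5.1%}")
```

An efficiency well below 100% at higher num_jobs indicates sub-linear scaling, which is consistent with the CPU% column staying below num_jobs × 100%.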

@pzelasko
Collaborator

pzelasko commented Apr 17, 2023

Thanks for reporting this @yfyeung. AFAIK this mechanism was introduced because torchaudio.info would return invalid duration/num samples for some mp3 files and/or file objects. I think maybe the best solution here is to introduce an env var or top-level API (such as lhotse.set_ffmpeg_torchaudio_info_enabled(True/False)) that lets the user control the behavior. We could disable it at the beginning of the CV recipe and re-enable it at the end (a context manager would be great for this). Would you or @desh2608 be willing to make that change?

BTW with ProcessPoolExecutor instead of ThreadPoolExecutor you may be able to observe more linear speedup.
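A minimal sketch of the executor swap suggested above, using only `concurrent.futures` from the standard library. The `probe` worker and the `scan_files` helper are hypothetical and are not the recipe's actual code; `probe` merely stands in for CPU-bound per-file work. Because both executor classes share the same `Executor` interface, the pool type can be passed in as a parameter.

```python
from concurrent.futures import Executor, ProcessPoolExecutor
from typing import Callable, Dict, List, Sequence, Type


def probe(path: str) -> Dict:
    """Hypothetical per-file work; a real recipe would read audio metadata here."""
    return {"path": path, "checksum": sum(path.encode("utf-8")) % 997}


def scan_files(
    paths: Sequence[str],
    num_jobs: int,
    executor_cls: Type[Executor] = ProcessPoolExecutor,
    worker: Callable[[str], Dict] = probe,
) -> List[Dict]:
    """Run `worker` over `paths` with `num_jobs` workers, preserving input order."""
    with executor_cls(max_workers=num_jobs) as pool:
        # Executor.map yields results in input order. `chunksize` batches items
        # sent to worker processes and is ignored by thread pools.
        return list(pool.map(worker, paths, chunksize=64))
```

With the default `ProcessPoolExecutor`, the worker must be a picklable top-level function, and on platforms that spawn worker processes the call should sit under an `if __name__ == "__main__":` guard. Processes sidestep the GIL for CPU-bound Python work, which is the likely reason a more linear speedup is expected.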

@@ -42,6 +46,8 @@
COMMONVOICE_SPLITS = ("train", "dev", "test", "validated", "invalidated", "other")
COMMONVOICE_DEFAULT_SPLITS = ("train", "dev", "test")

set_ffmpeg_torchaudio_info_enabled(False)
Collaborator

This shouldn't be in the global scope, because importing this file is going to have a side effect. Please move this function call to the place just before RecordingSet is created inside prepare_commonvoice, and then re-enable it just after the RecordingSet is constructed.

Collaborator

I am thinking perhaps we can make these functions usable as a context manager? That would be more "pythonic", I feel.

Collaborator

+1
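The context-manager idea in this thread could look roughly like the sketch below. It is an illustration under assumptions, not Lhotse's implementation: the module-level flag `_FFMPEG_INFO_ENABLED` and the helpers `is_ffmpeg_info_enabled`, `set_ffmpeg_info_enabled`, and `ffmpeg_info_enabled` are hypothetical names. The point is that the previous value is restored on exit, even on error, so importing the module has no side effect.

```python
from contextlib import contextmanager
from typing import Iterator

# Hypothetical module-level flag standing in for a library-wide setting.
_FFMPEG_INFO_ENABLED = True


def is_ffmpeg_info_enabled() -> bool:
    return _FFMPEG_INFO_ENABLED


def set_ffmpeg_info_enabled(enabled: bool) -> None:
    global _FFMPEG_INFO_ENABLED
    _FFMPEG_INFO_ENABLED = enabled


@contextmanager
def ffmpeg_info_enabled(enabled: bool) -> Iterator[None]:
    """Temporarily override the flag, restoring the previous value on exit."""
    previous = is_ffmpeg_info_enabled()
    set_ffmpeg_info_enabled(enabled)
    try:
        yield
    finally:
        # Runs on normal exit and on exceptions alike.
        set_ffmpeg_info_enabled(previous)
```

A recipe function would then wrap only the manifest-building step, for example `with ffmpeg_info_enabled(False): ...`, instead of calling the setter at import time. Note that a flag like this lives in one process; worker processes would need the override applied inside the worker as well.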

@yfyeung yfyeung changed the title from "Add multithreading for Common Voice" to "Add multiprocess mechanism for Common Voice" on Jun 12, 2023
@yfyeung
Contributor Author

yfyeung commented Jun 12, 2023

@desh2608 @pzelasko This is finally ready for review. Could you please re-check the code? Thx.

Collaborator

@pzelasko pzelasko left a comment

LGTM, thanks!

@pzelasko pzelasko enabled auto-merge (squash) June 14, 2023 15:33
@pzelasko pzelasko added this to the v1.16 milestone Jun 14, 2023
@pzelasko pzelasko merged commit 3ee82c1 into lhotse-speech:master Jun 14, 2023
@yfyeung yfyeung deleted the cv branch June 15, 2023 01:41