Add multiprocess mechanism for Common Voice #1025

yfyeung · 2023-04-08T14:40:36Z

Add a multiprocess mechanism to the commonvoice recipe.
Use a context manager to disable ffmpeg-torchaudio when processing the Common Voice dataset.
Fix a bug in tsvreader.

Here is an example using num_jobs=16:

Distributing tasks: 16373it [00:00, 27021.30it/s]
Processing: 100%|███████████████████████████████████████████████████████████████████████████████████████| 16373/16373 [00:39<00:00, 419.75it/s]
Distributing tasks: 16373it [00:00, 18163.42it/s]
Processing: 100%|███████████████████████████████████████████████████████████████████████████████████████| 16373/16373 [00:18<00:00, 878.26it/s]
Distributing tasks: 1013969it [00:35, 28613.47it/s]
Processing: 100%|██████████████████████████████████████████████████████████████████████████████████| 1013969/1013969 [15:41<00:00, 1076.88it/s]
Spliting: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [18:57<00:00, 379.13s/it]
Processing CommonVoice languages: 100%|████████████████████████████████████████████████████████████████████████| 1/1 [18:57<00:00, 1137.40s/it

pzelasko · 2023-04-10T14:26:55Z

lhotse/recipes/commonvoice.py

+    :param lang_path: path to a CommonVoice directory for a specific language
+        (e.g., "/path/to/cv-corpus-13.0-2023-03-09/pl").
+    :param num_jobs: How many concurrent workers to use for scanning of the audio files.
+    :return: a tuple of (RecordingSet, SupervisionSet) objects opened in lazy mode,


This doc is outdated with your changes because the manifests are no longer created or opened lazily here. That may make the recipe quite memory-intensive; would it be possible to keep the .open_writer() mechanism with your changes? You can open the writers with RecordingSet.open_writer() around lines 206-207 and, instead of appending to a list, write to disk.
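
The idea of writing to disk instead of appending to a list can be sketched with the standard library alone. This is a minimal illustration, not the recipe's code: `scan_file` and `write_manifests` are hypothetical names, and a plain JSONL file stands in for the writer that `RecordingSet.open_writer()` would provide. Each result is written as soon as it is ready, so no manifest list is held in memory.

```python
import json
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def scan_file(path: str) -> dict:
    # Hypothetical stand-in for the per-file work (e.g. reading audio metadata).
    return {"id": Path(path).stem, "source": path}


def write_manifests(paths, out_path, num_jobs=4):
    """Stream one JSON line per input to disk; return the number of lines written."""
    count = 0
    with ThreadPoolExecutor(max_workers=num_jobs) as pool, open(out_path, "w") as f:
        # Executor.map yields results in input order, so each item can be
        # written as soon as it is ready and then discarded.
        for item in pool.map(scan_file, paths):
            f.write(json.dumps(item) + "\n")
            count += 1
    return count


out = Path(tempfile.mkdtemp()) / "recordings.jsonl"
n = write_manifests([f"clips/utt{i}.mp3" for i in range(100)], out)
```

The thread pool here only keeps the sketch short; the write loop is the same whichever executor produces the results.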

desh2608 · 2023-04-16T18:25:34Z

Even with 16 jobs, preparing CommonVoice en takes 8+ hours? That's quite surprising --- do you know where the bottleneck is?

yfyeung · 2023-04-17T02:30:03Z

> Even with 16 jobs, preparing CommonVoice en takes 8+ hours? That's quite surprising --- do you know where the bottleneck is?

@desh2608 Please check this issue #1026.
It seems there is something wrong with Recording.from_file("*.mp3"). Without multithreading, CPU usage is over 3500%. With multithreading, it takes about 8 hours no matter what num_jobs is.

@pzelasko @mthrok If we disable the following check (lhotse/lhotse/audio.py, line 1765 in 7edc9ae):

if (is_mp3 or is_fileobj) and torchaudio_supports_ffmpeg():

the timings become:

| method | time | CPU% |
| --- | --- | --- |
| without ThreadPoolExecutor | 4 hours | 100% |
| with ThreadPoolExecutor, num_jobs=1 | 4 hours | 100% |
| with ThreadPoolExecutor, num_jobs=2 | 2.3 hours | 200% |
| with ThreadPoolExecutor, num_jobs=4 | 1.5 hours | 350% |
| with ThreadPoolExecutor, num_jobs=8 | 50 minutes | 600% |
| with ThreadPoolExecutor, num_jobs=16 | 40 minutes | 800% |
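
To make the scaling in the table above easier to read, the timings can be turned into speedup and parallel-efficiency figures. This is just arithmetic on the reported numbers (converted to minutes, with 2.3 hours taken as 138 minutes); the variable names are illustrative.

```python
# Wall-clock times from the table above, in minutes, keyed by num_jobs.
minutes = {1: 240, 2: 138, 4: 90, 8: 50, 16: 40}

baseline = minutes[1]
results = {}
for jobs, t in minutes.items():
    speedup = baseline / t
    efficiency = speedup / jobs  # 1.0 would be perfectly linear scaling
    results[jobs] = (speedup, efficiency)
    print(f"num_jobs={jobs:2d}  speedup={speedup:.2f}x  efficiency={efficiency:.0%}")
```

With these numbers, 16 jobs give a six-fold speedup over one job, which is well short of linear scaling.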

pzelasko · 2023-04-17T16:34:07Z

Thanks for reporting this @yfyeung. AFAIK this mechanism was introduced because torchaudio.info would return invalid duration/num samples for some mp3 files and/or file objects. I think maybe the best solution here is to introduce an env var or top-level API (such as lhotse.set_ffmpeg_torchaudio_info_enabled(True/False)) that lets the user control the behavior. We could disable it at the beginning of the CV recipe and re-enable it at the end (a context manager would be great for this). Would you or @desh2608 be willing to make that change?

BTW with ProcessPoolExecutor instead of ThreadPoolExecutor you may be able to observe more linear speedup.
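
For context, threads in CPython only run in parallel while the work releases the GIL, whereas processes sidestep it at the cost of pickling arguments and results. A minimal sketch of making the executor type a one-argument switch follows; `checksum`, `run_parallel`, and `POOLS` are hypothetical names, and whether the Common Voice scan is actually GIL-bound is not established in this thread.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

POOLS = {"thread": ThreadPoolExecutor, "process": ProcessPoolExecutor}


def checksum(path: str) -> int:
    # Hypothetical CPU-bound stand-in for per-file work. It is defined at
    # module level so that a process pool can pickle a reference to it.
    return sum(path.encode()) % 251


def run_parallel(fn, items, num_jobs, kind="thread"):
    """Apply fn to every item with the chosen pool type, preserving input order."""
    with POOLS[kind](max_workers=num_jobs) as pool:
        return list(pool.map(fn, items))


paths = [f"clips/utt{i}.mp3" for i in range(1000)]
results = run_parallel(checksum, paths, num_jobs=4)
# Switching to processes is run_parallel(..., kind="process"). On platforms
# that spawn workers, make that call under `if __name__ == "__main__":`.
```

Because both executors share the same interface, the two can be timed against each other without touching the worker function.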

Ref: #1025 (comment)

pzelasko · 2023-04-19T16:04:17Z

lhotse/recipes/commonvoice.py

@@ -42,6 +46,8 @@
 COMMONVOICE_SPLITS = ("train", "dev", "test", "validated", "invalidated", "other")
 COMMONVOICE_DEFAULT_SPLITS = ("train", "dev", "test")

+set_ffmpeg_torchaudio_info_enabled(False)


This shouldn't be in the global scope, because importing this file is going to have a side effect. Please move this function call to the place just before RecordingSet is created inside prepare_commonvoice, and then re-enable it just after the RecordingSet is constructed.

I am thinking perhaps we can make these functions usable as a context manager? That would be more "pythonic", I feel.
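
A context-manager form of such a toggle could look roughly like the sketch below. It is not lhotse's implementation: `_INFO_ENABLED`, `set_info_enabled`, `get_info_enabled`, and `info_enabled` are hypothetical stand-ins for the flag and functions discussed here. The point is that the previous value is restored on exit, even if the body raises, so nothing leaks out at import time or after an error.

```python
from contextlib import contextmanager

# Module-level switch; a hypothetical stand-in for the library's internal flag.
_INFO_ENABLED = True


def set_info_enabled(enabled: bool) -> None:
    global _INFO_ENABLED
    _INFO_ENABLED = enabled


def get_info_enabled() -> bool:
    return _INFO_ENABLED


@contextmanager
def info_enabled(enabled: bool):
    """Temporarily override the flag, restoring the previous value on exit."""
    previous = get_info_enabled()
    set_info_enabled(enabled)
    try:
        yield
    finally:
        # Runs even if the body raises, so the override never leaks.
        set_info_enabled(previous)


with info_enabled(False):
    inside = get_info_enabled()  # the override is active here
after = get_info_enabled()  # the previous value is back
```

One caveat with this design: a module-level flag lives in a single process, so whether worker processes see the override depends on how they are started.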

yfyeung · 2023-06-12T09:33:33Z

@desh2608 @pzelasko This is finally ready for review. Could you please re-check the code? Thx.

pzelasko

LGTM, thanks!

Yifan Yang and others added 4 commits April 8, 2023 22:33

Add multithreading for Common Voice

ac636e5

Fix for isort

abe6c2a

Fix for black

0e5c4b3

Update commonvoice.py

b55313a

pzelasko reviewed Apr 10, 2023

Merge branch 'master' into cv

c900743

csukuangfj mentioned this pull request Apr 11, 2023

LibsndfileCompatibleAudioInfo leads to %CPU more than 100% #1026

Closed

Merge branch 'master' into cv

0877c14

csukuangfj mentioned this pull request Apr 16, 2023

Zipformer for Common Voice k2-fsa/icefall#997

Merged

Merge branch 'master' into cv

86ef204

Merge branch 'master' into cv

340d3eb

desh2608 mentioned this pull request Apr 18, 2023

API to enable/disable ffmpeg-torchaudio #1032

Merged

pzelasko added a commit that referenced this pull request Apr 18, 2023

API to enable/disable ffmpeg-torchaudio (#1032)

0f81285

Ref: #1025 (comment)

yfyeung added 6 commits April 19, 2023 00:03

Merge branch 'master' into cv

d5e9d40

Update commonvoice.py

913e42c

Disable ffmpeg-torchaudio by api

3c40377

Update commonvoice.py

eebd220

Fix for isort

2174f0a

Update commonvoice.py

30142c5

pzelasko reviewed Apr 19, 2023

yfyeung and others added 4 commits April 27, 2023 15:50

Merge branch 'lhotse-speech:master' into cv

49eccc1

Merge branch 'lhotse-speech:master' into cv

744a9c1

Merge branch 'lhotse-speech:master' into cv

94a862a

Add contextmanager

478af9c

yfyeung changed the title ~~Add multithreading for Common Voice~~ Add multiprocess mechanism for Common Voice Jun 12, 2023

Add get_ffmpeg_torchaudio_info_enabled,

43f431d

Yifan Yang added 2 commits June 12, 2023 17:57

Fix for style check

2fa67db

Fix for style check

01a3a6a

pzelasko approved these changes Jun 14, 2023

Merge branch 'master' into cv

0d0f28c

pzelasko enabled auto-merge (squash) June 14, 2023 15:33

pzelasko added this to the v1.16 milestone Jun 14, 2023

pzelasko merged commit 3ee82c1 into lhotse-speech:master Jun 14, 2023

yfyeung deleted the cv branch June 15, 2023 01:41

desh2608 mentioned this pull request Nov 1, 2023

What is the reason of disable_ffmpeg_torchaudio_info in common voice recipe #1201

Closed
