-
Notifications
You must be signed in to change notification settings - Fork 223
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add multiprocess mechanism for Common Voice #1025
Conversation
lhotse/recipes/commonvoice.py
Outdated
:param lang_path: path to a CommonVoice directory for a specific language | ||
(e.g., "/path/to/cv-corpus-13.0-2023-03-09/pl"). | ||
:param num_jobs: How many concurrent workers to use for scanning of the audio files. | ||
:return: a tuple of (RecordingSet, SupervisionSet) objects opened in lazy mode, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doc is outdated with your changes because the manifests are no longer created or opened lazily here. That may make the recipe quite memory-intensive; would it be possible to keep the .open_writer()
mechanism with your changes? You can open the writers like RecordingSet.open_writer()
around line 206-207 and instead of appending to a list, write to disk.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure
Even with 16 jobs, preparing CommonVoice en takes 8+ hours? That's quite surprising --- do you know where is the bottleneck? |
@desh2608 Please check this issue #1026. @pzelasko @mthrok If we disable Line 1765 in 7edc9ae
|
Thanks for reporting this @yfyeung. AFAIK this mechanism was introduced because BTW with ProcessPoolExecutor instead of ThreadPoolExecutor you may be able to observe more linear speedup. |
lhotse/recipes/commonvoice.py
Outdated
@@ -42,6 +46,8 @@ | |||
COMMONVOICE_SPLITS = ("train", "dev", "test", "validated", "invalidated", "other") | |||
COMMONVOICE_DEFAULT_SPLITS = ("train", "dev", "test") | |||
|
|||
set_ffmpeg_torchaudio_info_enabled(False) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This shouldn't be in the global scope, because importing this file is going to have a side effect. Please move this function call to the place just before RecordingSet is created inside prepare_commonvoice
, and then re-enable it just after the RecordingSet is constructed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am thinking perhaps we can make these functions possible to be used with a context manager? Would be more "pythonic" I feel.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks!
Here is the example using
num_jobs=16
: