Optimized ReazonSpeech download speed using hf datasets features #1434

Merged on Dec 12, 2024 (2 commits)

@yuta0306 (Contributor) commented Dec 11, 2024

Optimized download speed for ReazonSpeech

The previous download code for ReazonSpeech is slow because it downloads with a single process, even though it is built on HF datasets.

I sped it up using two HF datasets features:

  • Added a num_proc option so the download runs in multiple processes
  • Disabled audio decoding to reduce RAM usage and compute time
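
For readers unfamiliar with these two features, here is a minimal sketch of how they might be combined. It is not the actual diff of this PR. The dataset ID `reazon-research/reazonspeech`, the `"tiny"` subset name, the `"train"` split, and the helper names `build_load_kwargs` and `download_without_decoding` are all assumptions made for illustration. Whether `trust_remote_code` is needed or accepted depends on the installed `datasets` version.

```python
from typing import Any, Dict

# Assumed Hugging Face dataset ID; it is not stated in the PR description.
DATASET_ID = "reazon-research/reazonspeech"


def build_load_kwargs(subset: str, num_proc: int) -> Dict[str, Any]:
    """Collect keyword arguments for datasets.load_dataset (hypothetical helper)."""
    if num_proc < 1:
        raise ValueError("num_proc must be a positive integer")
    return {
        "path": DATASET_ID,
        "name": subset,
        "num_proc": num_proc,  # number of parallel download/preparation processes
        "trust_remote_code": True,  # assumption: the dataset uses a loading script
    }


def download_without_decoding(subset: str = "tiny", num_proc: int = 4):
    """Download one subset in parallel and leave the audio column undecoded."""
    # Imported lazily so build_load_kwargs works without `datasets` installed.
    from datasets import Audio, load_dataset

    # Assumption: the data lives in a "train" split.
    ds = load_dataset(**build_load_kwargs(subset, num_proc))["train"]
    # With decode=False each audio entry is a dict with "bytes" and "path"
    # keys rather than a decoded waveform, so rows are cheap to iterate.
    return ds.cast_column("audio", Audio(decode=False))
```

Calling `download_without_decoding("tiny", num_proc=8)` would then fetch the subset with eight processes and return rows whose `audio` field carries only the file path and raw bytes; decoding can be deferred to whatever later step actually needs the samples.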

I estimate that downloading the full ReazonSpeech corpus (35k hours of audio) takes about three days with the previous script; with these changes it should take roughly a day.

@pzelasko (Collaborator) previously approved these changes Dec 11, 2024

Thanks!!

@pzelasko (Collaborator) commented

Please fix the black and isort checks; otherwise LGTM.

@pzelasko added this to the v1.29.0 milestone Dec 11, 2024
@pzelasko enabled auto-merge (squash) December 12, 2024 02:18
@pzelasko merged commit 31dde5c into lhotse-speech:master Dec 12, 2024
9 checks passed
yfyeung pushed a commit to yfyeung/lhotse that referenced this pull request Jan 8, 2025
…tse-speech#1434)

* Optimize ReazonSpeech download speed using hf datasets features

* fix: format