LibriLight dataset #1014
lhotse/recipes/librilight.py
Outdated
        recording_id=file_name,
    )
    recordings.append(recording)
    segment = SupervisionSegment(
FWIW, the Recording class has a to_cut() method which creates a cut out of the recording, so it may be okay to just return a RecordingSet from LibriLight (like we do with MUSAN). I guess we would miss explicit speaker id information though, so this is fine too.
I'm thinking it's OK to keep the supervisions for speaker and language info, but I would set the text to None (or just avoid setting it, which results in the same).
OK, I have fixed it by avoiding setting the text field.
def prepare_librilight(
    corpus_dir: Pathlike,
    output_dir: Optional[Pathlike] = None,
Since this is a large corpus, perhaps it would be useful to add a num_jobs option to speed up manifest creation? Check the LibriSpeech recipe for an example.
OK, I have implemented this.
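For readers following the thread, here is a minimal sketch of what a num_jobs option can look like. It is not the code from this PR or from the LibriSpeech recipe: `describe_file` and `build_entries` are hypothetical names, the worker returns a plain dict instead of a Lhotse manifest, and a thread pool is assumed as the parallelisation mechanism.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def describe_file(path):
    """Hypothetical per-file worker; a real recipe would build a Recording here."""
    path = Path(path)
    return {"recording_id": path.stem, "path": str(path)}


def build_entries(paths, num_jobs=1):
    """Apply the worker to every path, optionally with num_jobs threads."""
    paths = list(paths)
    if num_jobs <= 1:
        # Serial fallback keeps the default behaviour simple to debug.
        return [describe_file(p) for p in paths]
    with ThreadPoolExecutor(max_workers=num_jobs) as executor:
        # Executor.map yields results in input order, so output stays deterministic.
        return list(executor.map(describe_file, paths))


# Illustrative paths only; nothing is read from disk in this sketch.
demo_paths = [Path("small") / "spk1" / f"utt{i}.flac" for i in range(4)]
serial = build_entries(demo_paths, num_jobs=1)
parallel = build_entries(demo_paths, num_jobs=2)
```

Threads tend to help when the per-file work is dominated by I/O, such as reading audio headers; whether that holds for this corpus is an assumption.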
Libri-light is a benchmark for the training of automatic speech recognition (ASR) systems with limited or no supervision.
It contains a large dataset of 60K hours of unlabelled speech from audiobooks in English and a small labelled dataset (10h, 1h, and 10 min) plus metrics, trainable baseline models, and pretrained models that use these datasets.
It is covered in more detail at https://arxiv.org/abs/1912.07875.
"""
If we're not adding download functions, could you at least provide a link to where this dataset can be obtained from?
Sure.
lhotse/recipes/librilight.py
Outdated
corpus_dir = Path(corpus_dir)
part_path = corpus_dir / subset
audio_paths = []
for root, dirs, files in os.walk(part_path):
Can this loop be turned into a one-liner with Path(part_path).rglob("*.flac")?
Sure.
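To illustrate the suggestion, the sketch below compares an os.walk loop with the rglob one-liner on a throwaway directory tree. The body of the os.walk loop is an assumption, since the diff only shows its first line, and the helper names and directory layout are made up for the demo.

```python
import os
import tempfile
from pathlib import Path


def find_flac_with_walk(part_path):
    """Walk the tree and filter by suffix manually (assumed shape of the old loop)."""
    audio_paths = []
    for root, _dirs, files in os.walk(part_path):
        for name in files:
            if name.endswith(".flac"):
                audio_paths.append(Path(root) / name)
    return sorted(audio_paths)


def find_flac_with_rglob(part_path):
    """Let pathlib do the recursive glob, as suggested in the review."""
    return sorted(Path(part_path).rglob("*.flac"))


# Build a small temporary tree with two .flac files and one unrelated file.
with tempfile.TemporaryDirectory() as tmp:
    part_path = Path(tmp) / "small"
    (part_path / "spk1" / "book1").mkdir(parents=True)
    (part_path / "spk1" / "book1" / "a.flac").touch()
    (part_path / "spk1" / "book1" / "a.json").touch()
    (part_path / "spk2").mkdir()
    (part_path / "spk2" / "b.flac").touch()

    walk_result = find_flac_with_walk(part_path)
    rglob_result = find_flac_with_rglob(part_path)
    same = walk_result == rglob_result
    names = [p.name for p in rglob_result]
```

Both helpers sort their output because neither os.walk nor rglob guarantees a particular traversal order.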
Please re-check the code, thx.
LGTM
This recipe enables preparing LibriLight manifests. Since LibriLight is unlabelled, the text field of each SupervisionSegment is left unset.
Usage: