LibriLight dataset #1014

Merged (7 commits) on Apr 1, 2023

yfyeung (Contributor) commented Mar 30, 2023

This recipe prepares LibriLight manifests.
Because LibriLight is unlabelled, the text field of each SupervisionSegment is left unset.

Usage:

lhotse prepare librilight <librilight_dir> <output_dir> -j <num_jobs>

recording_id=file_name,
)
recordings.append(recording)
segment = SupervisionSegment(
Collaborator

FWIW, the Recording class has a to_cut() method that creates a cut out of the recording, so it may be okay to just return a RecordingSet from LibriLight (like we do with MUSAN). I guess we would miss explicit speaker id information though, so this is fine too.

Collaborator

I'm thinking it's OK to keep the supervisions for speaker and language info, but I would set the text to None (or just avoid setting it, which results in the same).

yfyeung (Contributor Author), Mar 31, 2023

OK, I have fixed it by avoiding setting the text field.


def prepare_librilight(
corpus_dir: Pathlike,
output_dir: Optional[Pathlike] = None,
Collaborator

Since this is a large corpus, perhaps it would be useful to add a num_jobs option to speed up manifest creation? Check the LibriSpeech recipe for an example.

yfyeung (Contributor Author), Mar 31, 2023

OK, I have implemented this.
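
For readers unfamiliar with the pattern, a num_jobs option of the kind requested above can be sketched with the standard library alone. This is a minimal illustration, not the code merged in this PR and not the LibriSpeech recipe: `describe_audio` and `build_entries` are hypothetical names, and the per-file work is a stand-in for reading real audio metadata.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def describe_audio(path):
    # Hypothetical stand-in for the per-file work; an actual recipe
    # would inspect the audio file and build manifest objects here.
    path = Path(path)
    return {"recording_id": path.stem, "path": path.as_posix()}


def build_entries(audio_paths, num_jobs=1):
    # Fan the per-file work out over num_jobs workers.
    # Executor.map returns results in input order, so the output
    # does not depend on how many workers are used.
    with ThreadPoolExecutor(max_workers=num_jobs) as pool:
        return list(pool.map(describe_audio, audio_paths))


if __name__ == "__main__":
    paths = ["small/100/book_a/ch1.flac", "small/200/book_b/ch2.flac"]
    for entry in build_entries(paths, num_jobs=2):
        print(entry)
```

Because the results come back in input order, changing num_jobs only affects speed, not the content or ordering of the resulting manifest.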

Libri-light is a benchmark for the training of automatic speech recognition (ASR) systems with limited or no supervision.
It contains a large dataset of 60K hours of unlabelled speech from audiobooks in English and a small labelled dataset (10h, 1h, and 10 min) plus metrics, trainable baseline models, and pretrained models that use these datasets.
It is covered in more detail at https://arxiv.org/abs/1912.07875.
"""
Collaborator

If we're not adding download functions, could you at least provide a link to where this dataset can be obtained from?

Contributor Author

Sure.

corpus_dir = Path(corpus_dir)
part_path = corpus_dir / subset
audio_paths = []
for root, dirs, files in os.walk(part_path):
Collaborator

Can this loop be turned into a one-liner with Path(part_path).rglob("*.flac")?

Contributor Author

Sure.
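
The suggested change can be sketched as follows. The body of the os.walk loop is not visible in the diff excerpt, so the version below assumes it simply collects files ending in .flac; the function names and the temporary directory layout are hypothetical and exist only to compare the two approaches.

```python
import os
import tempfile
from pathlib import Path


def find_flac_walk(part_path):
    # Assumed shape of the original loop: walk the tree, keep *.flac files.
    audio_paths = []
    for root, dirs, files in os.walk(part_path):
        for name in files:
            if name.endswith(".flac"):
                audio_paths.append(Path(root) / name)
    return sorted(audio_paths)


def find_flac_rglob(part_path):
    # The reviewer's suggestion: a single recursive glob.
    return sorted(Path(part_path).rglob("*.flac"))


def demo():
    # Build a small throwaway tree and list it both ways.
    with tempfile.TemporaryDirectory() as tmp:
        part = Path(tmp) / "small"
        (part / "100" / "book_a").mkdir(parents=True)
        (part / "100" / "book_a" / "chapter_01.flac").touch()
        (part / "100" / "book_a" / "chapter_01.json").touch()
        (part / "200" / "book_b").mkdir(parents=True)
        (part / "200" / "book_b" / "chapter_02.flac").touch()
        walk = [p.relative_to(part).as_posix() for p in find_flac_walk(part)]
        glob = [p.relative_to(part).as_posix() for p in find_flac_rglob(part)]
    return walk, glob


if __name__ == "__main__":
    walk, glob = demo()
    print(walk == glob)
```

Sorting both results makes the comparison independent of filesystem traversal order, which neither os.walk nor rglob guarantees.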

yfyeung (Contributor Author) commented Apr 1, 2023

Please re-check the code, thx.

pzelasko (Collaborator) left a comment

LGTM

pzelasko merged commit a4d9430 into lhotse-speech:master Apr 1, 2023
pzelasko added this to the v1.14 milestone Apr 1, 2023
yfyeung deleted the librilight branch April 2, 2023 06:15

3 participants