Normalize output path names for recipes #712
BTW about 1/4th of the recipes already follow this naming (e.g. gigaspeech, spgispeech, swbd, ...) |
A big +1 from me, thank you for this effort 🙏🏻 Some suggestions:
If you agree, could you add these tips to https://github.com/lhotse-speech/lhotse/blob/master/docs/corpus.rst#adding-new-corpora ? |
If so, then why not just have them gzipped by default? Some of them are and some of them aren't.
y.
…On Mon, May 16, 2022 at 1:39 PM Piotr Żelasko ***@***.***> wrote:
A big +1 from me, thank you for this effort 🙏🏻
Some suggestions:
- we could adopt a convention that corpora without splits would be named <corpus-name>_<manifest-type>_all.jsonl.
- when a corpus specifies multiple types of sub-splits (e.g. mtedx has language), put these extra parts before <manifest-type>, so that if we do parts = path.stem.split("_") then the following two are always true: parts[-1] == '<train/dev/test/all-split>' and parts[-2] == '<manifest-type>'. Maybe that can help some automation if people combine more data together.
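For illustration, the parts[-1] / parts[-2] idea above could be sketched roughly as follows. This is only a sketch under assumptions: the helper name parse_manifest_name and the ManifestName tuple are made up for this example and are not part of lhotse, and the mtedx file name in the usage note is hypothetical. One thing to watch for is that Path.stem strips only the last suffix, so a gzipped name such as "x.jsonl.gz" would leave ".jsonl" attached to the split; the sketch strips both extensions by hand.

```python
from pathlib import Path
from typing import NamedTuple


class ManifestName(NamedTuple):
    """Hypothetical container for the pieces of a normalized manifest name."""
    prefix: str         # corpus name plus any extra sub-split parts (e.g. language)
    manifest_type: str  # e.g. "recordings" or "supervisions"
    split: str          # e.g. "train", "dev", "test" or "all"


def parse_manifest_name(path) -> ManifestName:
    """Split '<corpus-name>[_<extra>]_<manifest-type>_<split>.jsonl[.gz]' into parts."""
    name = Path(path).name
    # Path.stem would drop only the final suffix, so remove ".gz" and ".jsonl" explicitly.
    for ext in (".gz", ".jsonl"):
        if name.endswith(ext):
            name = name[: -len(ext)]
    parts = name.split("_")
    if len(parts) < 3:
        raise ValueError(f"Name does not follow the convention: {path}")
    # The last two parts are always the split and the manifest type;
    # everything before them is the corpus name and optional sub-split parts.
    return ManifestName("_".join(parts[:-2]), parts[-2], parts[-1])
```

With this sketch, parse_manifest_name("ami_recordings_dev.jsonl") yields the prefix "ami", the type "recordings" and the split "dev", and a hypothetical "mtedx_fr_supervisions_train.jsonl.gz" yields the prefix "mtedx_fr". A split name that itself contains an underscore would be mis-parsed, so the convention only helps if splits avoid underscores.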
|
I defer to Piotr's opinion :)
y.
…On Mon, May 16, 2022 at 1:56 PM Desh Raj ***@***.***> wrote:
@pzelasko <https://github.com/pzelasko> @jtrmal
<https://github.com/jtrmal> both of those suggestions make sense to me. I
didn't add any gzipping since my understanding was we mainly do it for
large(ish) manifests, but I can add that too if it's the consensus.
|
I don't see anything to lose and I guess it makes people's lives easier. |
Done :) |
Awesome! |
@desh2608 - Just adding the comment that I sent to Piotr. I did this several weeks ago, and the motivation was that I wanted to use recipes/utils on Kaldi-imported data. I found that with the following modifications to recipes/utils.py my life becomes super easy. But this will not work with the lhotse recipes, because the naming convention for Kaldi imports is different from the convention for lhotse recipes. How about making both the same, i.e. dataset_name/supervisions.jsonl.gz etc.? Please take a look. Here is a diff -
|
This PR "normalizes" the names of the output manifests for the recipes to <corpus-name>_<manifest-type>_<part>.jsonl, e.g., "ami_recordings_dev.jsonl". This ensures that manifests from multiple corpora can be saved to the same directory without the name clashes that would occur with names like "recordings_dev.jsonl".
Obviously, we would have to modify the data preparation in icefall to be compatible with these changes, but I feel it's better to switch to this normalized naming scheme now rather than later, when it would break a lot more recipes.
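To make the motivation concrete, here is a minimal sketch of the naming scheme and of the clash it avoids. The helper manifest_path and the "manifests" directory are assumptions made for this example, not functions or paths taken from lhotse or from this PR; the corpus and split names are only illustrative.

```python
from pathlib import Path


def manifest_path(output_dir, corpus, manifest_type, part, suffix=".jsonl"):
    """Hypothetical helper: build '<corpus-name>_<manifest-type>_<part>.jsonl'."""
    return Path(output_dir) / f"{corpus}_{manifest_type}_{part}{suffix}"


corpora = ("ami", "gigaspeech")

# Old-style names carry no corpus prefix, so both corpora map to the same file.
old_style = {Path("manifests") / "recordings_dev.jsonl" for _ in corpora}

# Normalized names stay distinct when written to one shared directory.
new_style = {manifest_path("manifests", c, "recordings", "dev") for c in corpora}
```

Here old_style collapses to a single path while new_style keeps one path per corpus, which is the name-based confusion the description refers to.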