Normalize output path names for recipes #712

desh2608 · 2022-05-16T17:31:54Z

This PR "normalizes" the output manifests for the recipes:

- All JSON-type outputs are changed to JSONL (see this comment).
- Manifests are named as <corpus-name>_<manifest-type>_<part>.jsonl, e.g., "ami_recordings_dev.jsonl". This ensures that manifests from multiple corpora can be saved to the same directory without the name clashes that would occur with names like "recordings_dev.jsonl".

Obviously, we would have to modify the data preparation in icefall to be compatible with these changes, but I feel it's better to switch to this normalized naming scheme now rather than when it would break a lot more recipes.
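For illustration, the naming scheme described above can be sketched as a small helper. This is only a sketch: `manifest_path` is a hypothetical function, not part of Lhotse, and the `manifests` directory is an assumed example location. The corpus names are taken from this thread.

```python
from pathlib import Path


def manifest_path(output_dir, corpus, manifest_type, part, suffix="jsonl"):
    """Hypothetical helper: build <corpus-name>_<manifest-type>_<part>.<suffix>."""
    return Path(output_dir) / f"{corpus}_{manifest_type}_{part}.{suffix}"


# Two corpora with a "dev" part can share one output directory,
# because the corpus name is part of the file name.
ami = manifest_path("manifests", "ami", "recordings", "dev")
giga = manifest_path("manifests", "gigaspeech", "recordings", "dev")

print(ami.name)
print(giga.name)
```

Without the corpus prefix, both calls would resolve to the same "recordings_dev.jsonl" file.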

desh2608 · 2022-05-16T17:37:30Z

BTW about 1/4th of the recipes already follow this naming (e.g. gigaspeech, spgispeech, swbd, ...)

pzelasko · 2022-05-16T17:38:46Z

A big +1 from me, thank you for this effort 🙏🏻

Some suggestions:

- We could adopt a convention that corpora without splits are named <corpus-name>_<manifest-type>_all.jsonl.
- When a corpus specifies multiple types of sub-splits (e.g. mtedx has language), put these extra parts before <manifest-type>, so that if we do parts = path.stem.split("_") then the following two are always true: parts[-1] == '<train/dev/test/all-split>' and parts[-2] == '<manifest-type>'. That may help some automation if people combine more data together.

If you agree, could you add these tips to https://github.com/lhotse-speech/lhotse/blob/master/docs/corpus.rst#adding-new-corpora ?
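The parsing convention suggested above can be sketched as follows. `parse_manifest_name` and `ManifestName` are hypothetical names, not Lhotse API, and the file names are made-up examples. The sketch assumes that neither the manifest type nor the split contains an underscore or a dot. It strips everything after the first dot instead of using `path.stem`, because `Path.stem` removes only the last suffix and would leave "dev.jsonl" as the final part of a ".jsonl.gz" name.

```python
from pathlib import Path
from typing import NamedTuple


class ManifestName(NamedTuple):
    corpus: str         # corpus name plus any extra sub-split parts
    manifest_type: str  # e.g. "recordings" or "supervisions"
    split: str          # e.g. "train", "dev", "test", or "all"


def parse_manifest_name(path):
    """Hypothetical parser for <corpus...>_<manifest-type>_<split>.<suffix>."""
    # Drop all suffixes (".jsonl" as well as ".jsonl.gz").
    stem = Path(path).name.split(".", 1)[0]
    parts = stem.split("_")
    if len(parts) < 3:
        raise ValueError(f"Unexpected manifest name: {path}")
    return ManifestName("_".join(parts[:-2]), parts[-2], parts[-1])


simple = parse_manifest_name("manifests/ami_recordings_dev.jsonl")
nested = parse_manifest_name("manifests/mtedx_es_supervisions_train.jsonl.gz")
print(simple)
print(nested)
```

Because the extra sub-split parts sit before the manifest type, the last two fields can be read off without knowing how many parts the corpus name has.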

jtrmal · 2022-05-16T17:40:37Z

If we are doing this, then why not also have them gzipped by default? Some of them are and some of them aren't. y.

desh2608 · 2022-05-16T17:55:56Z

@pzelasko @jtrmal both of those suggestions make sense to me. I didn't add any gzipping since my understanding was we mainly do it for large(ish) manifests, but I can add that too if it's the consensus.
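As a rough illustration of what gzipping a JSONL manifest involves, here is a sketch using only the Python standard library. It does not use Lhotse's own serialization code; `write_jsonl_gz`, `read_jsonl_gz`, and the record fields are assumptions made up for the example.

```python
import gzip
import json
import tempfile
from pathlib import Path


def write_jsonl_gz(records, path):
    """Write one JSON object per line into a gzip-compressed text file."""
    with gzip.open(path, "wt", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")


def read_jsonl_gz(path):
    """Read the records back, skipping empty lines."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Illustrative records only; these are not the real manifest schema.
records = [{"id": "rec-1", "duration": 1.5}, {"id": "rec-2", "duration": 2.0}]

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "ami_recordings_dev.jsonl.gz"
    write_jsonl_gz(records, path)
    loaded = read_jsonl_gz(path)

print(loaded)
```

Since JSONL is line-oriented, the compressed file can still be read one record at a time.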

jtrmal · 2022-05-16T17:58:56Z

I defer to Piotr's opinion :) y.

pzelasko · 2022-05-16T18:07:58Z

I don't see anything to lose and I guess it makes people's lives easier.

desh2608 · 2022-05-16T18:30:01Z

Done :)

pzelasko · 2022-05-16T19:27:38Z

Awesome!

ngoel17 · 2022-05-16T19:54:13Z

@desh2608 - Just adding my comment that I sent to Piotr. I did this several weeks ago - and the motivation was that I wanted to use recipes/utils on kaldi imported data.

I found that if I do the following modifications to the recipes/utils.py my life becomes super easy. But this will not work with the lhotse recipes because the naming convention for kaldi imports is different than the convention for lhotse recipes.

How about making both the same? i.e. dataset_name/supervisions.jsonl.gz etc. Please look.

Here is a diff -

:param prefix: Optional common prefix for the manifest files (underscore is automatically added).
@@ -39,8 +42,6 @@ def read_manifests_if_cached(
     """
     if output_dir is None:
         return {}
-    if prefix and not prefix.endswith("_"):
-        prefix = f"{prefix}_"
     if suffix.startswith("."):
         suffix = suffix[1:]
     if lazy and not suffix.startswith("jsonl"):
@@ -48,17 +49,19 @@ def read_manifests_if_cached(
             f"Only JSONL manifests can be opened lazily (got suffix: '{suffix}')"
         )
     manifests = defaultdict(dict)
-    for part in dataset_parts:
-        for manifest in types:
-            path = output_dir / f"{prefix}{manifest}_{part}.{suffix}"
-            if not path.is_file():
-                continue
-            if lazy:
-                manifests[part][manifest] = TYPES_TO_CLASSES[manifest].from_jsonl_lazy(
-                    path
-                )
-            else:
-                manifests[part][manifest] = load_manifest(path)
+    for dataset in datasets:
+        for part in dataset_parts:
+            for manifest in types:
+                path = output_dir / f"./{prefix}/{dataset}/{manifest}{part}.{suffix}"
+                if not path.is_file():
+                    logging.info(f'{path} not found')
+                    continue
+                if lazy:
+                    manifests[dataset][manifest] = TYPES_TO_CLASSES[manifest].from_jsonl_lazy(
+                        path
+                    )
+                else:
+                    manifests[dataset][manifest] = load_manifest(path)
     return dict(manifests)
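To make the difference between the two layouts in this diff concrete, the sketch below builds the path that each version would probe. It mirrors only the f-strings from the removed and added lines. `flat_path` and `nested_path` are hypothetical names, and the `data`, `kaldi`, and `ami` values are assumed examples rather than values from the actual recipes.

```python
from pathlib import Path


def flat_path(output_dir, prefix, manifest, part, suffix):
    """Layout from the removed lines: <prefix>_<manifest>_<part>.<suffix>."""
    if prefix and not prefix.endswith("_"):
        prefix = f"{prefix}_"
    return Path(output_dir) / f"{prefix}{manifest}_{part}.{suffix}"


def nested_path(output_dir, prefix, dataset, manifest, part, suffix):
    """Layout from the added lines: <prefix>/<dataset>/<manifest><part>.<suffix>."""
    return Path(output_dir) / f"./{prefix}/{dataset}/{manifest}{part}.{suffix}"


flat = flat_path("data", "ami", "supervisions", "dev", "jsonl.gz")
# With an empty part, the nested layout gives dataset_name/supervisions.jsonl.gz.
nested = nested_path("data", "kaldi", "ami", "supervisions", "", "jsonl.gz")

print(flat.as_posix())
print(nested.as_posix())
```

Note that the added lines put no separator between the manifest type and the part, so a non-empty part would be appended directly to the manifest type.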

desh2608 added 2 commits May 16, 2022 13:24

normalize output path for recipes

43a1a29

minor fix

7d92c6f

pzelasko added this to the v1.2 milestone May 16, 2022

Merge branch 'master' of https://github.com/lhotse-speech/lhotse into corpus/out_name

98c6e01

desh2608 added 2 commits May 16, 2022 14:27

make gzip default; add _all in case of no split

4efb008

minor fix

e2e9b91

desh2608 and others added 2 commits May 16, 2022 14:45

minor fix in musan

05e4e51

Merge branch 'master' into corpus/out_name

2d0b54b

pzelasko merged commit f2c2ce4 into lhotse-speech:master May 16, 2022

desh2608 deleted the corpus/out_name branch May 16, 2022 19:34

desh2608 mentioned this pull request May 17, 2022

[ali-meeting] Change jsonl to json #715

Closed

entn-at mentioned this pull request May 23, 2022

[egs] Add prefix when reading manifests due to recent lhotse changes k2-fsa/icefall#382

Merged
