Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Normalize output path names for recipes #712

Merged
merged 7 commits into from
May 16, 2022

Conversation

desh2608
Copy link
Collaborator

This PR "normalizes" the output manifests for the recipes:

  • All JSON type outputs are changed to JSONL (ses this comment)
  • Manifests are named as <corpus-name>_<manifest-type>_<part>.jsonl, e.g., "ami_recordings_dev.jsonl". This is to ensure that manifests from multiple corpora can be saved to same directory without name-based confusion as would occur with names like "recordings_dev.jsonl".

Obviously, we would have to modify the data preparation in icefall to be compatible with these changes, but I feel it's better to switch to this normalized naming scheme now rather than when it would break a lot more recipes.

@desh2608
Copy link
Collaborator Author

BTW about 1/4th of the recipes already follow this naming (e.g. gigaspeech, spgispeech, swbd, ...)

@pzelasko
Copy link
Collaborator

pzelasko commented May 16, 2022

A big +1 from me, thank you for this effort 🙏🏻

Some suggestions:

  • we could adopt a convention that corpora without splits would be named <corpus-name>_<manifest-type>_all.jsonl.
  • when a corpus specifies multiple types of sub-splits (e.g. mtedx has language) put these extra parts before <manifest-type> so that if we do parts = path.stem.split("_") then the following two are always true: parts[-1] == '<train/dev/test/all-split>', and parts[-2] == <manifest-type>, maybe that can help some automation if people combine more data together.

If you agree, could you add these tips to https://github.com/lhotse-speech/lhotse/blob/master/docs/corpus.rst#adding-new-corpora ?

@pzelasko pzelasko added this to the v1.2 milestone May 16, 2022
@jtrmal
Copy link
Collaborator

jtrmal commented May 16, 2022 via email

@desh2608
Copy link
Collaborator Author

@pzelasko @jtrmal both of those suggestions make sense to me. I didn't add any gzipping since my understanding was we mainly do it for large(ish) manifests, but I can add that too if it's the consensus.

@jtrmal
Copy link
Collaborator

jtrmal commented May 16, 2022 via email

@pzelasko
Copy link
Collaborator

I don't see anything to lose and I guess it makes people's lives easier.

@desh2608
Copy link
Collaborator Author

Done :)

@pzelasko
Copy link
Collaborator

Awesome!

@pzelasko pzelasko merged commit f2c2ce4 into lhotse-speech:master May 16, 2022
@desh2608 desh2608 deleted the corpus/out_name branch May 16, 2022 19:34
@ngoel17
Copy link

ngoel17 commented May 16, 2022

@desh2608 - Just adding my comment that I sent to Piotr. I did this several weeks ago - and the motivation was that I wanted to use recipes/utils on kaldi imported data.

I found that if I do the following modifications to the recipes/utils.py my life becomes super easy. But this will not work with the lhotse recipes because the naming convention for kaldi imports is different than the convention for lhotse recipes.

How about making both the same? i.e. dataset_name/supervisons.jsonl.gz etc. Please look.

Here is a diff -

:param prefix: Optional common prefix for the manifest files (underscore is automatically added).
@@ -39,8 +42,6 @@ def read_manifests_if_cached(
     """
     if output_dir is None:
         return {}
-    if prefix and not prefix.endswith("_"):
-        prefix = f"{prefix}_"
     if suffix.startswith("."):
         suffix = suffix[1:]
     if lazy and not suffix.startswith("jsonl"):
@@ -48,17 +49,19 @@ def read_manifests_if_cached(
             f"Only JSONL manifests can be opened lazily (got suffix: '{suffix}')"
         )
     manifests = defaultdict(dict)
-    for part in dataset_parts:
-        for manifest in types:
-            path = output_dir / f"{prefix}{manifest}_{part}.{suffix}"
-            if not path.is_file():
-                continue
-            if lazy:
-                manifests[part][manifest] = TYPES_TO_CLASSES[manifest].from_jsonl_lazy(
-                    path
-                )
-            else:
-                manifests[part][manifest] = load_manifest(path)
+    for dataset in datasets:
+        for part in dataset_parts:
+            for manifest in types:
+                path = output_dir / f"./{prefix}/{dataset}/{manifest}{part}.{suffix}"
+                if not path.is_file():
+                    [logging.info](http://logging.info/)(f'{path} not found')
+                    continue
+                if lazy:
+                    manifests[dataset][manifest] = TYPES_TO_CLASSES[manifest].from_jsonl_lazy(
+                        path
+                    )
+                else:
+                    manifests[dataset][manifest] = load_manifest(path)
     return dict(manifests)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants