adding Earnings-21 dataset from rev-dot-com #709

Merged
merged 1 commit into from
May 14, 2022

jtrmal
Collaborator

@jtrmal jtrmal commented May 13, 2022

I'm ready to add Earnings-22 as well, as soon as this passes review (the corpus structure is similar).

@jtrmal
Collaborator Author

jtrmal commented May 13, 2022

@desh2608 at your convenience please review and comment

validate_recordings_and_supervisions(recording_set, supervision_set)
if output_dir is not None:
    supervision_set.to_json(output_dir / "supervisions.json")
    recording_set.to_json(output_dir / "recordings.json")
Collaborator

I recommend using jsonl as the default; in retrospect, I should never have added the "json" format support :)
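
To illustrate the difference behind this suggestion, here is a minimal standard-library sketch of the JSONL layout: one JSON object per line, so items can be written and read one at a time instead of as a single large document. The record fields and the helper names `write_jsonl` and `read_jsonl` are hypothetical and are not lhotse's implementation; in the recipe itself the change would presumably amount to writing the manifests with a ".jsonl" file name through the library's own writer, which is an assumption not verified here.

```python
import json
import tempfile
from pathlib import Path


def write_jsonl(items, path):
    """Write one JSON object per line (hypothetical helper)."""
    with open(path, "w", encoding="utf-8") as f:
        for item in items:
            f.write(json.dumps(item) + "\n")


def read_jsonl(path):
    """Yield items one at a time without loading the whole file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)


# Hypothetical supervision-like records, for illustration only.
records = [
    {"id": "rec1-0", "recording_id": "rec1", "text": "good morning"},
    {"id": "rec2-0", "recording_id": "rec2", "text": "thank you"},
]

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "supervisions.jsonl"
    write_jsonl(records, path)
    loaded = list(read_jsonl(path))
```

Because each line is independent, a reader can stop early or stream through a large manifest, which a single top-level JSON array does not allow without parsing the whole file.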

pzelasko
pzelasko previously approved these changes May 13, 2022
Collaborator

@pzelasko pzelasko left a comment

Looks good to me! Would be cool if that data had timestamps too, but it is what it is.

@desh2608
Collaborator

In the SPGISpeech recipe, we additionally have an option for text normalization, since the transcripts are orthographic. Perhaps it makes sense to add a similar option here? (see this method)
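
As a rough idea of what such an option could look like, the sketch below adds an opt-in normalization step for orthographic transcripts. It is not the SPGISpeech recipe's method: the function names, the `normalize` flag, and the specific rules (lowercasing, replacing punctuation other than apostrophes with spaces, squeezing whitespace) are all assumptions chosen for illustration.

```python
import re


def normalize_text(text: str) -> str:
    """Hypothetical normalizer for orthographic transcripts."""
    text = text.lower()
    # Replace anything that is not a word character, whitespace,
    # or an apostrophe with a space.
    text = re.sub(r"[^\w\s']", " ", text)
    # Collapse runs of whitespace left behind by the substitution.
    return re.sub(r"\s+", " ", text).strip()


def prepare_text(text: str, normalize: bool = False) -> str:
    """Return the transcript unchanged unless normalization is requested."""
    return normalize_text(text) if normalize else text


raw = "Revenue grew 12%, didn't it?"
kept = prepare_text(raw)
cleaned = prepare_text(raw, normalize=True)
```

Keeping the flag off by default would leave the original orthographic transcripts untouched for users who want them.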

@jtrmal
Collaborator Author

jtrmal commented May 13, 2022 via email

@jtrmal
Collaborator Author

jtrmal commented May 13, 2022

rebased on master and incorporated suggestions

@desh2608
Collaborator

LGTM.

@pzelasko pzelasko merged commit de75634 into lhotse-speech:master May 14, 2022
@pzelasko pzelasko added this to the v1.2 milestone May 16, 2022
if "earnings21" in f:
    zip.extract(f, path=target_dir)

shutil.move(target_dir / "speech-datasets-main" / "earnings21", target_dir)
Collaborator

Just found out that for Python < 3.9, shutil.move(src, dst) has a bug when src is a PosixPath and dst is an existing directory, causing the download to fail with the following error:

13:55 $ lhotse download earnings21 /export/c07/draj/
2022-05-21 14:15:47,389 INFO [earnings21.py:64] Downloading Earnings21 from github repository is not very efficient way how to obtain the corpus. You will be downloading other data as well.
Traceback (most recent call last):
  File "/home/draj/anaconda3/envs/scale/bin/lhotse", line 33, in <module>
    sys.exit(load_entry_point('lhotse', 'console_scripts', 'lhotse')())
  File "/home/draj/anaconda3/envs/scale/lib/python3.8/site-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/home/draj/anaconda3/envs/scale/lib/python3.8/site-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/home/draj/anaconda3/envs/scale/lib/python3.8/site-packages/click/core.py", line 1659, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/draj/anaconda3/envs/scale/lib/python3.8/site-packages/click/core.py", line 1659, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/draj/anaconda3/envs/scale/lib/python3.8/site-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/draj/anaconda3/envs/scale/lib/python3.8/site-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/export/c07/draj/mini_scale_2022/lhotse/lhotse/bin/modes/recipes/earnings21.py", line 12, in earnings21
    download_earnings21(target_dir)
  File "/export/c07/draj/mini_scale_2022/lhotse/lhotse/recipes/earnings21.py", line 91, in download_earnings21
    shutil.move(
  File "/home/draj/anaconda3/envs/scale/lib/python3.8/shutil.py", line 787, in move
    real_dst = os.path.join(dst, _basename(src))
  File "/home/draj/anaconda3/envs/scale/lib/python3.8/shutil.py", line 750, in _basename
    return os.path.basename(path.rstrip(sep))
AttributeError: 'PosixPath' object has no attribute 'rstrip'

This can be resolved by simply passing str(src) in the function call. I will make this change in my next PR.
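
A minimal sketch of that workaround, run against a temporary directory that mimics the layout in the traceback. The placeholder file is hypothetical; the only point is that both arguments are converted to str before calling shutil.move, which sidesteps the AttributeError on older interpreters and behaves the same on newer ones.

```python
import shutil
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    target_dir = Path(tmp)

    # Recreate <target_dir>/speech-datasets-main/earnings21 as in the recipe.
    src = target_dir / "speech-datasets-main" / "earnings21"
    src.mkdir(parents=True)
    (src / "placeholder.txt").write_text("demo")  # hypothetical content

    # Passing plain strings avoids calling .rstrip() on a PosixPath
    # inside shutil on Python < 3.9.
    shutil.move(str(src), str(target_dir))

    moved_ok = (target_dir / "earnings21" / "placeholder.txt").is_file()
    source_gone = not src.exists()
```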

@jtrmal
Collaborator Author

jtrmal commented May 21, 2022 via email

@jtrmal
Collaborator Author

jtrmal commented Oct 11, 2022 via email
