adding Earnings-21 dataset from rev-dot-com #709
@desh2608, at your convenience, please review and comment.
lhotse/recipes/earnings21.py
Outdated
validate_recordings_and_supervisions(recording_set, supervision_set)
if output_dir is not None:
    supervision_set.to_json(output_dir / "supervisions.json")
    recording_set.to_json(output_dir / "recordings.json")
I recommend using jsonl as the default. In retrospect, I should never have added the "json" format support :)
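To illustrate the difference between the two formats, here is a minimal standard-library sketch (not Lhotse's own serialization code). JSON Lines stores one manifest item per line, so a file can be written and read item by item instead of as one large array. The helper names `write_jsonl` and `read_jsonl` and the sample fields are hypothetical.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, Iterable, Iterator


def write_jsonl(items: Iterable[Dict], path: Path) -> None:
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for item in items:
            f.write(json.dumps(item, ensure_ascii=False) + "\n")


def read_jsonl(path: Path) -> Iterator[Dict]:
    """Lazily yield one object per non-empty line."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)


# Hypothetical manifest entries, for illustration only.
supervisions = [
    {"id": "rec1-0", "recording_id": "rec1", "text": "good morning"},
    {"id": "rec2-0", "recording_id": "rec2", "text": "thank you"},
]

with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "supervisions.jsonl"
    write_jsonl(supervisions, out)
    loaded = list(read_jsonl(out))
```

Because each line is an independent record, a reader can stop early or stream through a large manifest without loading all of it into memory.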
Looks good to me! Would be cool if that data had timestamps too, but it is what it is.
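In the SPGISpeech recipe, we additionally have an option for text normalization, since the transcripts are orthographic. Perhaps it makes sense to add a similar option here? (See this method: https://github.com/lhotse-speech/lhotse/blob/ef50dabf46e9d94b7516a6165933e0e2d5e91f6f/lhotse/recipes/spgispeech.py#L55)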
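As a rough idea of what such an option could look like, here is a sketch of an opt-in normalizer for orthographic transcripts. It is not the SPGISpeech method linked above: the rules (replace punctuation other than apostrophes with spaces, collapse whitespace, uppercase) and the names `normalize_text` and `prepare_text` are assumptions chosen for illustration.

```python
import re


def normalize_text(text: str) -> str:
    """Hypothetical normalizer: strip punctuation (keeping apostrophes),
    collapse whitespace, and uppercase."""
    text = re.sub(r"[^\w\s']", " ", text)
    return re.sub(r"\s+", " ", text).strip().upper()


def prepare_text(text: str, normalize: bool = False) -> str:
    """Return the transcript unchanged unless normalization is requested."""
    return normalize_text(text) if normalize else text


raw = "Thank you, and good morning."
normalized = prepare_text(raw, normalize=True)
untouched = prepare_text(raw)
```

Keeping the flag off by default would leave the original orthographic transcripts intact for users who want them.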
Ok. Will add.
Y.
(Sent by email in reply to Desh Raj's comment of Fri, May 13, 2022 at 15:39.)
Rebased on master and incorporated suggestions.
LGTM.
        if "earnings21" in f:
            zip.extract(f, path=target_dir)

shutil.move(target_dir / "speech-datasets-main" / "earnings21", target_dir)
Just found out that for Python < 3.9, shutil.move(src, dst) has a bug (https://bugs.python.org/issue39140) when src is of type PosixPath, causing the download to fail with the following error:
13:55 $ lhotse download earnings21 /export/c07/draj/
2022-05-21 14:15:47,389 INFO [earnings21.py:64] Downloading Earnings21 from github repository is not very efficient way how to obtain the corpus. You will be downloading other data as well.
Traceback (most recent call last):
  File "/home/draj/anaconda3/envs/scale/bin/lhotse", line 33, in <module>
    sys.exit(load_entry_point('lhotse', 'console_scripts', 'lhotse')())
  File "/home/draj/anaconda3/envs/scale/lib/python3.8/site-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/home/draj/anaconda3/envs/scale/lib/python3.8/site-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/home/draj/anaconda3/envs/scale/lib/python3.8/site-packages/click/core.py", line 1659, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/draj/anaconda3/envs/scale/lib/python3.8/site-packages/click/core.py", line 1659, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/draj/anaconda3/envs/scale/lib/python3.8/site-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/draj/anaconda3/envs/scale/lib/python3.8/site-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/export/c07/draj/mini_scale_2022/lhotse/lhotse/bin/modes/recipes/earnings21.py", line 12, in earnings21
    download_earnings21(target_dir)
  File "/export/c07/draj/mini_scale_2022/lhotse/lhotse/recipes/earnings21.py", line 91, in download_earnings21
    shutil.move(
  File "/home/draj/anaconda3/envs/scale/lib/python3.8/shutil.py", line 787, in move
    real_dst = os.path.join(dst, _basename(src))
  File "/home/draj/anaconda3/envs/scale/lib/python3.8/shutil.py", line 750, in _basename
    return os.path.basename(path.rstrip(sep))
AttributeError: 'PosixPath' object has no attribute 'rstrip'
This can be resolved by simply passing str(src) instead of src in the function call. I will make this change when I create a PR next.
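A small self-contained sketch of that workaround, under some assumptions: it builds a toy archive rather than downloading the real one, and the helper name `extract_subdir` is hypothetical. It mirrors the recipe's pattern of extracting only the members whose path contains "earnings21" and then moving the nested directory up, with both arguments to shutil.move converted to str.

```python
import shutil
import tempfile
import zipfile
from pathlib import Path


def extract_subdir(zip_path: Path, target_dir: Path, name: str = "earnings21") -> Path:
    """Extract only archive members whose path contains `name`, then move
    the nested directory up into `target_dir`."""
    with zipfile.ZipFile(zip_path) as zf:
        for member in zf.namelist():
            if name in member:
                zf.extract(member, path=target_dir)
    src = target_dir / "speech-datasets-main" / name
    # Passing str(...) avoids the AttributeError shown in the traceback above.
    shutil.move(str(src), str(target_dir))
    return target_dir / name


with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    zip_path = tmp_dir / "speech-datasets-main.zip"
    # Toy archive standing in for the real download.
    with zipfile.ZipFile(zip_path, "w") as zf:
        zf.writestr("speech-datasets-main/earnings21/README.md", "demo")
        zf.writestr("speech-datasets-main/other/README.md", "skip me")
    target = tmp_dir / "out"
    target.mkdir()
    moved = extract_subdir(zip_path, target)
    moved_ok = (moved / "README.md").read_text() == "demo"
    other_extracted = (target / "speech-datasets-main" / "other").exists()
```

Converting to str is harmless on newer Python versions too, so the same call works regardless of interpreter version.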
ack... thanks. and sad :)
y.
(Sent by email in reply to Desh Raj's review comment of Sat, May 21, 2022 at 2:20 PM, on this hunk of lhotse/recipes/earnings21.py:)
+        logging.info(f"Skipping - {completed_detector} exists.")
+        return extracted_dir
+
+    if force_download or not zip_path.is_file():
+        urlretrieve_progress(
+            url, filename=zip_path, desc="Getting speech-datasets-main.zip"
+        )
+
+    shutil.rmtree(extracted_dir, ignore_errors=True)
+
+    with zipfile.ZipFile(zip_path) as zip:
+        for f in zip.namelist():
+            if "earnings21" in f:
+                zip.extract(f, path=target_dir)
+
+    shutil.move(target_dir / "speech-datasets-main" / "earnings21", target_dir)
Let me change the default format and we can commit
Y.
(Sent by email in reply to Piotr Żelasko's approving review of Fri, May 13, 2022 at 15:00.)
I'm ready to add Earnings-22 as well, as soon as this passes review (the corpus structure is similar).