Air Traffic Control (ATC) corpora #1061

Merged
merged 3 commits into lhotse-speech:master from air-traffic-control-corpora on May 23, 2023

Conversation

rouseabout
Contributor

No description provided.

@@ -63,6 +64,7 @@
from .timit import *
from .vctk import *
from .voxceleb import *
from .uwb_atcc import *
Collaborator

Should go above vctk

)
recording = Recording.from_file(wav_path, recording_id=row.recording_id)
segment = SupervisionSegment(
id="atcosim_%s_%06d_%06d" % (row.filename, 0, row.length_sec * 100),
Collaborator

We generally use f-strings for these since they are more readable, but it's your call.
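As a rough illustration of the suggestion above, the `%`-style ID can be rewritten with an f-string. The function names and the sample filename are hypothetical, not taken from the patch. One detail to watch: `%06d` silently truncates a float, whereas the `d` format spec in an f-string rejects floats, so the sketch converts with `int()` first.

```python
def segment_id_percent(filename: str, length_sec: float) -> str:
    # Original style from the diff: "%d" truncates the float operand.
    return "atcosim_%s_%06d_%06d" % (filename, 0, length_sec * 100)


def segment_id_fstring(filename: str, length_sec: float) -> str:
    # f-string equivalent; int() makes the truncation explicit because
    # the "d" format spec raises ValueError for float values.
    end = int(length_sec * 100)
    return f"atcosim_{filename}_{0:06d}_{end:06d}"


# Hypothetical filename, used only for illustration.
example_id = segment_id_fstring("sm1_01_001", 3.5)
```

Both helpers produce the same string for the same inputs, so switching styles should not change existing segment IDs.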

from lhotse.utils import Pathlike, is_module_available, resumable_download


def safe_extract_rar(
Collaborator

Could you move this to lhotse/utils.py (since we also have safe_extract() there)?

@desh2608
Collaborator

Please also fix the style issues.

@rouseabout
Contributor Author

Thanks for the review! Have two questions on supervision text normalization:

  • Should supervision text be upper or lowercase? It seems many recipes are using uppercase.

  • How best to express individually spelt-out English letters? The current patch uses tilde prefixes inspired by the ATCOSIM paper. For example, the supervision text "~a ~b ~c" is spoken as "aye bee see".

@desh2608
Collaborator

> Thanks for the review! Have two questions on supervision text normalization:
>
>   • Should supervision text be upper or lowercase? It seems many recipes are using uppercase.
>   • How best to express individually spelt-out English letters? The current patch uses tilde prefixes inspired by the ATCOSIM paper. For example, the supervision text "~a ~b ~c" is spoken as "aye bee see".

Usually the normalization is recipe-specific; we don't enforce any particular normalization standard. If there is an existing Kaldi or ESPnet (or other popular toolkit) recipe for the data, we encourage recipe writers to provide a normalization option that matches those recipes, so that results are comparable. You can find such methods in lhotse/recipes/utils.py. You can add your normalization method to that file if you think it may apply to other corpora as well. For ASR corpora, if orthographic transcription is not required, you can provide upper-casing and punctuation removal as normalization.

Regarding your other question, usually you would just write them out separately in the transcript, e.g., "A B C", and most modern end-to-end ASR models should be able to learn that individually spelled-out characters correspond to such sounds.
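A minimal sketch of the two ideas in this reply, assuming ATCOSIM-style tilde-prefixed letters in the raw text: spelled-out letters are written as separate upper-case characters, and an optional ASR normalization upper-cases the text and removes punctuation. The function names are hypothetical and are not part of lhotse/recipes/utils.py.

```python
import re
import string


def spell_out_letters(text: str) -> str:
    # Turn tilde-prefixed letters ("~a ~b") into plain upper-case letters ("A B").
    return re.sub(r"~([A-Za-z])\b", lambda m: m.group(1).upper(), text)


def normalize_for_asr(text: str) -> str:
    # Handle tilde letters first, since "~" is itself a punctuation character.
    text = spell_out_letters(text)
    text = text.upper()
    # Drop all ASCII punctuation (this also removes apostrophes).
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Collapse any whitespace left behind by removed punctuation.
    return " ".join(text.split())
```

With this sketch, `normalize_for_asr("~a ~b ~c")` yields `"A B C"`. Whether apostrophes should survive depends on the recipe being matched, so the punctuation set is a choice to revisit per corpus.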

@danpovey
Collaborator

Just for your interest--
for future systems we have in mind totally un-normalized operation, where we do essentially no normalization at all on the source text except removing things like HTML markup. And we'll use BPE encoding with failover to bytes (or just bytes themselves) so that any UTF-8 sequences can be encoded.
The results of training a system like this on a bunch of librivox data seem, anecdotally, very good-- it outputs good quality punctuation.
Of course we'll still optionally do normalization for scoring purposes so we can compare with the traditional WER metric.
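To make "failover to bytes" concrete, here is a toy sketch. It is not real BPE: it uses greedy longest-match against a small hypothetical vocabulary instead of learned merges, and the `<0xNN>` byte-token spelling is an assumption chosen for illustration. The point it shows is that any character missing from the vocabulary is emitted as its UTF-8 bytes, so every input string stays encodable and decodable.

```python
def encode(text: str, vocab: set) -> list:
    # Greedy longest-match over vocab; uncovered characters become byte tokens.
    tokens = []
    max_len = max((len(piece) for piece in vocab), default=0)
    i = 0
    while i < len(text):
        for n in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + n]
            if piece in vocab:
                tokens.append(piece)
                i += n
                break
        else:
            # No vocabulary piece starts here: fall back to UTF-8 bytes.
            tokens.extend(f"<0x{b:02X}>" for b in text[i].encode("utf-8"))
            i += 1
    return tokens


def decode(tokens: list) -> str:
    # Reassemble bytes from both ordinary pieces and byte tokens.
    out = bytearray()
    for tok in tokens:
        if len(tok) == 6 and tok.startswith("<0x") and tok.endswith(">"):
            out.append(int(tok[3:5], 16))
        else:
            out.extend(tok.encode("utf-8"))
    return out.decode("utf-8")
```

Because the fallback covers every byte value, un-normalized text with punctuation or non-ASCII characters round-trips without an unknown-token symbol.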

@rouseabout rouseabout force-pushed the air-traffic-control-corpora branch from 7241876 to ae21640 Compare May 18, 2023 12:56
@rouseabout
Contributor Author

Another question - Any special reason why many recipes round supervision segment duration to ndigits=8?

I am currently not using rounding in my ATC recipes. I observe that when iterating over the datasets using K2SpeechRecognitionDataset, for some batches, the batch input features T dimension (num_frames) does not match the batch supervisions num_frames max value. For these bad batches, the batch input T dimension is always 1 frame larger than the maximum value of MonoCut.features.num_frames of the batch.

@pzelasko
Collaborator

> Another question - Any special reason why many recipes round supervision segment duration to ndigits=8?

Because duration is often computed as end - start from another source, and that introduces float precision errors which are annoying.
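A small sketch of the float precision issue described here, using made-up start and end times. The helper name `segment_duration` is hypothetical; only the `ndigits=8` rounding comes from the discussion.

```python
def segment_duration(start: float, end: float, ndigits: int = 8) -> float:
    # Subtracting two binary floats can leave a tiny representation error;
    # rounding to 8 decimal places removes it without losing real precision.
    return round(end - start, ndigits=ndigits)


raw = 0.3 - 0.1                       # not exactly 0.2 in binary floating point
clean = segment_duration(0.1, 0.3)    # rounded back to 0.2
```

Eight decimal places is far finer than a single audio sample at common sampling rates, so the rounding only removes arithmetic noise.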

> I am currently not using rounding in my ATC recipes. I observe that when iterating over the datasets using K2SpeechRecognitionDataset, for some batches, the batch input features T dimension (num_frames) does not match the batch supervisions num_frames max value. For these bad batches, the batch input T dimension is always 1 frame larger than the maximum value of MonoCut.features.num_frames of the batch.

I don't think rounding would contribute to that. The issue you are seeing may be related to some padding/collation edge cases, we could try to debug it with more info.

@pzelasko
Collaborator

@rouseabout @desh2608 Is this ready to merge? If yes, LGTM :)

@desh2608
Collaborator

> @rouseabout @desh2608 Is this ready to merge? If yes, LGTM :)

It's good to go from my side.

@pzelasko pzelasko enabled auto-merge (squash) May 23, 2023 20:30
@pzelasko pzelasko merged commit ed8620d into lhotse-speech:master May 23, 2023
@pzelasko pzelasko added this to the v1.15 milestone May 23, 2023