Add option to split AMI segments similar to Kaldi #889

desh2608 · 2022-11-12T17:58:02Z

Kaldi's AMI data preparation splits long segments at full stops and commas. This is useful for training ASR models, since otherwise we would throw away a lot of data. Here are the segment counts before and after adding this option:

| # segments | Train | Dev | Test |
|---|---|---|---|
| Kaldi | 105149 | 13059 | 12612 |
| Lhotse (w/o splitting) | 65557 | 8665 | 7490 |
| Lhotse (w/ splitting) | 103377 | 12614 | 12121 |

The remaining difference may be because we don't "over-segment": segments that are already under the maximum word count are left unsplit.
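
For readers unfamiliar with this kind of splitting, here is a minimal sketch of the idea, not the code in this PR or in Kaldi. It assumes word-level timestamps with punctuation as separate tokens, splits at full stops first, and falls back to commas only for pieces that are still too long. The names `Word`, `Segment`, `split_segment`, and the default `max_words=30` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List

PUNCT = {".", ","}  # assumed: punctuation arrives as separate tokens


@dataclass
class Word:
    text: str
    start: float
    end: float


@dataclass
class Segment:
    start: float
    end: float
    text: str


def _num_words(tokens: List[Word]) -> int:
    # Count real words only; punctuation tokens do not count.
    return sum(t.text not in PUNCT for t in tokens)


def _split_on(tokens: List[Word], sep: str) -> List[List[Word]]:
    # Break the token list at every occurrence of `sep`, dropping `sep` itself.
    pieces, current = [], []
    for t in tokens:
        if t.text == sep:
            if current:
                pieces.append(current)
                current = []
        else:
            current.append(t)
    if current:
        pieces.append(current)
    return pieces


def _to_segment(tokens: List[Word]) -> Segment:
    # Segment boundaries come from the first and last real word.
    words = [t for t in tokens if t.text not in PUNCT]
    return Segment(words[0].start, words[-1].end, " ".join(w.text for w in words))


def split_segment(tokens: List[Word], max_words: int = 30) -> List[Segment]:
    # Segments already short enough are left alone (no "over-segmentation").
    if _num_words(tokens) <= max_words:
        pieces = [tokens]
    else:
        pieces = []
        for sentence in _split_on(tokens, "."):
            if _num_words(sentence) <= max_words:
                pieces.append(sentence)
            else:
                # Still too long: fall back to splitting at commas.
                pieces.extend(_split_on(sentence, ","))
    return [_to_segment(p) for p in pieces if _num_words(p) > 0]
```

Because short segments are returned unchanged, a scheme like this produces fewer segments than one that always splits at punctuation, which is consistent with the explanation suggested above.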

desh2608 · 2022-11-12T18:06:44Z

(@HuangZiliAndy and I have both observed that decoding WERs can be significantly better with the segment splitting.)

pzelasko · 2022-11-12T21:55:38Z

Great!

add option to split segments similar to Kaldi

b5e6fe3

pzelasko approved these changes Nov 12, 2022


pzelasko merged commit defdafc into lhotse-speech:master Nov 12, 2022

pzelasko added this to the v1.10 milestone Nov 12, 2022

desh2608 deleted the recipe/ami_split_segs branch November 2, 2023 19:19
