
Experimental Lhotse feature: corpus creation tools (workflows), starting with OpenAI Whisper support #824

Merged 2 commits into master on Sep 27, 2022

Conversation

pzelasko (Collaborator)

It's admittedly a very experimental feature, but people seem to be getting satisfying results running Whisper on podcasts, videos, their own speech, etc., so maybe it can be used for weak labeling of unlabeled datasets. I added a "dummy-friendly" interface in Lhotse that can create a full CutSet manifest, with multiple supervisions per cut, from just a directory of recordings. It doesn't try to be super efficient (no dataloading, batching, etc.).

Usage:

lhotse workflows annotate-with-whisper -r my_recordings -e flac cuts.jsonl.gz 

The results are saved to cuts.jsonl.gz (it can also be cuts.jsonl).
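The output manifest is gzipped JSON Lines: one cut per line, each with a list of supervisions. A minimal sketch of inspecting such a file with only the standard library (no Lhotse import); the field names and values below are an illustrative sample mimicking a typical cut, not real transcription output:

```python
import gzip
import json
import os
import tempfile

# A made-up cut record with two supervisions, shaped like a Lhotse cut.
sample_cut = {
    "id": "rec1-0",
    "start": 0.0,
    "duration": 7.5,
    "supervisions": [
        {"id": "rec1-0-sup0", "start": 0.0, "duration": 3.2, "text": "hello world"},
        {"id": "rec1-0-sup1", "start": 3.2, "duration": 4.3, "text": "second segment"},
    ],
}

path = os.path.join(tempfile.mkdtemp(), "cuts.jsonl.gz")

# Write one JSON object per line, gzip-compressed.
with gzip.open(path, "wt", encoding="utf-8") as f:
    f.write(json.dumps(sample_cut) + "\n")

# Read it back and collect all supervision texts.
with gzip.open(path, "rt", encoding="utf-8") as f:
    cuts = [json.loads(line) for line in f]

texts = [s["text"] for c in cuts for s in c["supervisions"]]
print(texts)  # ['hello world', 'second segment']
```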

@pzelasko pzelasko added this to the v1.8 milestone Sep 27, 2022
@pzelasko pzelasko marked this pull request as ready for review September 27, 2022 01:44
mthrok commented Sep 27, 2022

Question out of curiosity: Is alignment part of this?

pzelasko (Collaborator, Author)

> Question out of curiosity: Is alignment part of this?

Not right now, but after some quick skimming I think that info is available inside whisper.transcribe, so it could be added too.

danpovey (Collaborator)

Cool!

@pzelasko pzelasko merged commit a9a849c into master Sep 27, 2022
desh2608 (Collaborator)

Perhaps we can also add a function that takes a CutSet as input instead of a RecordingSet, and modifies the supervisions. I am thinking of cases when we have long recordings with segments (e.g., obtained from a VAD), and we want to fill in the text attribute for the supervisions.

pzelasko (Collaborator, Author)

> Perhaps we can also add a function that takes a CutSet as input instead of a RecordingSet, and modifies the supervisions. I am thinking of cases when we have long recordings with segments (e.g., obtained from a VAD), and we want to fill in the text attribute for the supervisions.

Makes sense. If it's something that will help you, feel free to take it; my main aim for now is to find some time to dig into that word-level alignment thing, since a lot of people seem interested in it.

desh2608 (Collaborator)

> > Perhaps we can also add a function that takes a CutSet as input instead of a RecordingSet, and modifies the supervisions. I am thinking of cases when we have long recordings with segments (e.g., obtained from a VAD), and we want to fill in the text attribute for the supervisions.
>
> Makes sense. If it's something that will help you, feel free to take it; my main aim for now is to find some time to dig into that word-level alignment thing, since a lot of people seem interested in it.

Sure, will create a PR.
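The CutSet-based workflow suggested above could be sketched as follows. This is a hypothetical illustration using plain dicts rather than Lhotse's actual classes: given supervisions whose boundaries are already known (e.g., from a VAD), fill in each segment's `text` attribute by running ASR on that span. `fake_asr` is a stand-in for a real Whisper call:

```python
def annotate_supervisions(cut, asr):
    """Fill the `text` field of every supervision using `asr(start, duration)`.

    `cut` is a plain dict mimicking a Lhotse cut; `asr` is any callable that
    transcribes the audio span [start, start + duration).
    """
    for sup in cut["supervisions"]:
        if sup.get("text") is None:
            sup["text"] = asr(sup["start"], sup["duration"])
    return cut


def fake_asr(start, duration):
    # Stand-in for transcribing audio[start : start + duration] with Whisper.
    return f"transcript of [{start:.1f}, {start + duration:.1f}]"


# A long recording with VAD-derived segments that lack transcripts.
cut = {
    "id": "long-recording-0",
    "supervisions": [
        {"start": 0.0, "duration": 4.0, "text": None},
        {"start": 4.5, "duration": 3.0, "text": None},
    ],
}
cut = annotate_supervisions(cut, fake_asr)
print([s["text"] for s in cut["supervisions"]])
```

The key design point is that the segment boundaries are kept as-is and only the `text` attribute is modified, unlike the RecordingSet workflow where Whisper also decides the segmentation.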

pzelasko (Collaborator, Author)

OK, I misread the Whisper paper: they only provide segment-level start/end timestamps; I thought they had word-level timestamps too. I only figured it out while spending more time with the internals. Still, it would be very cool to provide a robust out-of-the-box word-level forced aligner for data annotation. Maybe some Icefall models can be used for that? @danpovey @csukuangfj what do you think is the best approach here?
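Those segment-level timestamps are what the workflow turns into supervisions. A rough sketch of the conversion: openai-whisper's transcribe() returns a dict with a "segments" list whose entries carry "start", "end", and "text"; the sample result below is made up for illustration, not real model output, and the helper is not Lhotse's actual implementation:

```python
def segments_to_supervisions(recording_id, result):
    """Map a Whisper-style result dict to supervision-like dicts.

    Each segment's [start, end] becomes a supervision's start/duration;
    the text is stripped because Whisper segments carry leading spaces.
    """
    return [
        {
            "id": f"{recording_id}-{i}",
            "recording_id": recording_id,
            "start": seg["start"],
            "duration": round(seg["end"] - seg["start"], 3),
            "text": seg["text"].strip(),
        }
        for i, seg in enumerate(result["segments"])
    ]


# A fabricated result in the shape whisper.transcribe() produces.
fake_result = {
    "text": " hello there general",
    "segments": [
        {"start": 0.0, "end": 2.4, "text": " hello there"},
        {"start": 2.4, "end": 4.0, "text": " general"},
    ],
}
sups = segments_to_supervisions("rec1", fake_result)
print(sups)
```

Word-level timestamps, as noted above, are not in the result, so any per-word alignment would need a separate forced aligner.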

csukuangfj (Contributor)

We have forced-alignment code in icefall with RNN-T:
k2-fsa/icefall#239

But the results may not be as accurate as those obtained using HMM-GMM.

I am not sure whether it is worthwhile to extract the HMM-GMM part from Kaldi as an external project, rewrite it with libtorch, and wrap it in Python.

desh2608 (Collaborator)

I think @huangruizhe was working on some neural aligners.

huangruizhe

Yes, but it is still ongoing.

supervisions = _postprocess_timestamps(supervisions)
cut.supervisions = list(
    trim_supervisions_to_recordings(
        recordings=recordings, supervisions=supervisions, verbose=False
Collaborator

Shouldn't this be recordings=recording instead of recordings=recordings?

Collaborator

I guess it would still work since internally it checks only the corresponding recording_id, but we are passing around the full manifest for no reason.

Collaborator, Author

Good point!
