
Experimental Lhotse feature: corpus creation tools (workflows), starting with OpenAI Whisper support #824

Merged 2 commits into master on Sep 27, 2022

Conversation

pzelasko (Collaborator)

It's admittedly a very experimental feature, but people seem to be getting satisfying results running Whisper on podcasts, videos, their own speech, etc., so maybe it can be used for weak labeling of unlabeled datasets. I added a "dummy-friendly" interface in Lhotse that can create a full CutSet manifest, with multiple supervisions per cut, from just a directory of recordings. It doesn't try to be super efficient (no dataloading, batching, etc.).

Usage:

lhotse workflows annotate-with-whisper -r my_recordings -e flac cuts.jsonl.gz 

The results are saved to cuts.jsonl.gz (it can also be cuts.jsonl).
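The output manifest is gzipped JSON Lines: one cut per line, each with a list of supervisions. A minimal sketch of inspecting such a file with only the standard library (no Lhotse import); the field names and values below are an illustrative sample mimicking a typical cut, not real transcription output:

```python
import gzip
import json
import os
import tempfile

# A made-up cut record with two supervisions, shaped like a Lhotse cut.
sample_cut = {
    "id": "rec1-0",
    "start": 0.0,
    "duration": 7.5,
    "supervisions": [
        {"id": "rec1-0-sup0", "start": 0.0, "duration": 3.2, "text": "hello world"},
        {"id": "rec1-0-sup1", "start": 3.2, "duration": 4.3, "text": "second segment"},
    ],
}

path = os.path.join(tempfile.mkdtemp(), "cuts.jsonl.gz")

# Write one JSON object per line, gzip-compressed.
with gzip.open(path, "wt", encoding="utf-8") as f:
    f.write(json.dumps(sample_cut) + "\n")

# Read it back and collect all supervision texts.
with gzip.open(path, "rt", encoding="utf-8") as f:
    cuts = [json.loads(line) for line in f]

texts = [s["text"] for c in cuts for s in c["supervisions"]]
print(texts)  # ['hello world', 'second segment']
```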

@pzelasko pzelasko added this to the v1.8 milestone Sep 27, 2022
@pzelasko pzelasko marked this pull request as ready for review September 27, 2022 01:44
mthrok commented Sep 27, 2022

Question out of curiosity: Is alignment part of this?

pzelasko (Collaborator, Author)

> Question out of curiosity: Is alignment part of this?

Not right now, but after some quick skimming I think that info is available inside whisper.transcribe, so it could be added too.

danpovey (Collaborator)

Cool!

@pzelasko pzelasko merged commit a9a849c into master Sep 27, 2022
desh2608 (Collaborator)

Perhaps we can also add a function that takes a CutSet as input instead of a RecordingSet, and modifies the supervisions. I am thinking of cases when we have long recordings with segments (e.g., obtained from a VAD), and we want to fill in the text attribute for the supervisions.

pzelasko (Collaborator, Author)

> Perhaps we can also add a function that takes a CutSet as input instead of a RecordingSet, and modifies the supervisions. I am thinking of cases when we have long recordings with segments (e.g., obtained from a VAD), and we want to fill in the text attribute for the supervisions.

Makes sense. If it's something that will help you, feel free to take it; my main aim for now is to find some time to dig into that word-level alignment thing, since a lot of people seem interested in it.

desh2608 (Collaborator)

> > Perhaps we can also add a function that takes a CutSet as input instead of a RecordingSet, and modifies the supervisions. I am thinking of cases when we have long recordings with segments (e.g., obtained from a VAD), and we want to fill in the text attribute for the supervisions.
>
> Makes sense. If it's something that will help you, feel free to take it; my main aim for now is to find some time to dig into that word-level alignment thing, since a lot of people seem interested in it.

Sure, will create a PR.
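The CutSet-based workflow suggested above could be sketched as follows. This is a hypothetical illustration using plain dicts rather than Lhotse's actual classes: given supervisions whose boundaries are already known (e.g., from a VAD), fill in each segment's `text` attribute by running ASR on that span. `fake_asr` is a stand-in for a real Whisper call:

```python
def annotate_supervisions(cut, asr):
    """Fill the `text` field of every supervision using `asr(start, duration)`.

    `cut` is a plain dict mimicking a Lhotse cut; `asr` is any callable that
    transcribes the audio span [start, start + duration).
    """
    for sup in cut["supervisions"]:
        if sup.get("text") is None:
            sup["text"] = asr(sup["start"], sup["duration"])
    return cut


def fake_asr(start, duration):
    # Stand-in for transcribing audio[start : start + duration] with Whisper.
    return f"transcript of [{start:.1f}, {start + duration:.1f}]"


# A long recording with VAD-derived segments that lack transcripts.
cut = {
    "id": "long-recording-0",
    "supervisions": [
        {"start": 0.0, "duration": 4.0, "text": None},
        {"start": 4.5, "duration": 3.0, "text": None},
    ],
}
cut = annotate_supervisions(cut, fake_asr)
print([s["text"] for s in cut["supervisions"]])
```

The key design point is that the segment boundaries are kept as-is and only the `text` attribute is modified, unlike the RecordingSet workflow where Whisper also decides the segmentation.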

pzelasko (Collaborator, Author)

OK, I misread the Whisper paper: they only provide segment-level start/end timestamps; I thought they had word-level timestamps too. I only figured it out while spending more time with the internals. Still, it would be very cool to provide a robust out-of-the-box word-level forced aligner for data annotation. Maybe some Icefall models can be used for that? @danpovey @csukuangfj what do you think is the best approach here?
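Those segment-level timestamps are what the workflow turns into supervisions. A rough sketch of the conversion: openai-whisper's transcribe() returns a dict with a "segments" list whose entries carry "start", "end", and "text"; the sample result below is made up for illustration, not real model output, and the helper is not Lhotse's actual implementation:

```python
def segments_to_supervisions(recording_id, result):
    """Map a Whisper-style result dict to supervision-like dicts.

    Each segment's [start, end] becomes a supervision's start/duration;
    the text is stripped because Whisper segments carry leading spaces.
    """
    return [
        {
            "id": f"{recording_id}-{i}",
            "recording_id": recording_id,
            "start": seg["start"],
            "duration": round(seg["end"] - seg["start"], 3),
            "text": seg["text"].strip(),
        }
        for i, seg in enumerate(result["segments"])
    ]


# A fabricated result in the shape whisper.transcribe() produces.
fake_result = {
    "text": " hello there general",
    "segments": [
        {"start": 0.0, "end": 2.4, "text": " hello there"},
        {"start": 2.4, "end": 4.0, "text": " general"},
    ],
}
sups = segments_to_supervisions("rec1", fake_result)
print(sups)
```

Word-level timestamps, as noted above, are not in the result, so any per-word alignment would need a separate forced aligner.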

csukuangfj (Contributor)

We have forced-alignment code in icefall with RNN-T:
k2-fsa/icefall#239

But the results may not be as accurate as those obtained using HMM-GMM.

I am not sure whether it is worthwhile to extract the HMM-GMM part from Kaldi as an external project, rewrite it with libtorch, and wrap it in Python.

desh2608 (Collaborator)

I think @huangruizhe was working on some neural aligners.

huangruizhe

Yes, but it is still ongoing.

supervisions = _postprocess_timestamps(supervisions)
cut.supervisions = list(
    trim_supervisions_to_recordings(
        recordings=recordings, supervisions=supervisions, verbose=False
Collaborator

Shouldn't this be recordings=recording instead of recordings=recordings?

Collaborator

I guess it would still work since internally it checks only the corresponding recording_id, but we are passing around the full manifest for no reason.

Collaborator, Author

Good point!
