Experimental Lhotse feature: corpus creation tools (workflows), starting with OpenAI Whisper support #824
Conversation
Question out of curiosity: Is alignment part of this?

Not right now, but after some quick skimming I think that info is available inside

Cool!

Perhaps we can also add a function that takes a CutSet as input instead of a RecordingSet, and modifies the supervisions. I am thinking of cases where we have long recordings with segments (e.g., obtained from a VAD), and we want to fill in the text; see the sketch below.
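A minimal sketch of what such a helper could look like (hedged: the name `fill_supervision_texts`, running Whisper's `transcribe` once per cut, and the one-VAD-segment-per-cut assumption are all illustrative, not this PR's API):

```python
import whisper

from lhotse import CutSet


def fill_supervision_texts(cuts, model_name="base"):
    # Load the Whisper model once and reuse it for every cut.
    model = whisper.load_model(model_name)
    for cut in cuts:
        # Whisper expects 16 kHz mono audio; take the first channel.
        audio = cut.resample(16000).load_audio()[0]
        text = model.transcribe(audio)["text"].strip()
        # Fill in the text of the supervisions; assumes each cut holds
        # a single (e.g., VAD-derived) segment.
        for supervision in cut.supervisions:
            supervision.text = text
        yield cut


# Usage: lazily annotate an existing CutSet and save the result.
cuts = CutSet.from_file("vad_cuts.jsonl.gz")
annotated = CutSet.from_cuts(fill_supervision_texts(cuts))
annotated.to_file("annotated_cuts.jsonl.gz")
```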
Makes sense. If it's something that will help you, feel free to take it; my main aim for now is to find some time to dig into that word-level alignment thing, since a lot of people seem interested in it.

Sure, will create a PR.

OK, I misread the Whisper paper: they only provide segment-level start/end timestamps; I thought they had word-level timestamps too. I only figured it out while spending more time with the internals. Still, it would be very cool to provide a robust out-of-the-box word-level forced aligner for data annotation. Maybe some Icefall models can be used for that? @danpovey @csukuangfj what do you think is the best approach here?

We have forced-alignment code in icefall with RNN-T, but the results may not be as accurate as those obtained using HMM-GMM. I am not sure whether it is worthwhile to extract the HMM-GMM part from Kaldi as an external project, rewrite it with libtorch, and wrap it in Python.

I think @huangruizhe was working on some neural aligners.

Yes, but it is still ongoing.
```python
supervisions = _postprocess_timestamps(supervisions)
cut.supervisions = list(
    trim_supervisions_to_recordings(
        recordings=recordings, supervisions=supervisions, verbose=False
    )
)
```
Shouldn't this be `recordings=recording` instead of `recordings=recordings`?
I guess it would still work since internally it checks only the corresponding recording_id, but we are passing around the full manifest for no reason.
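In context, the suggested fix would look roughly like this (a sketch: it assumes the enclosing loop binds the single current recording as `recording`, and that `trim_supervisions_to_recordings` accepts a single Recording as well as a full manifest):

```python
cut.supervisions = list(
    trim_supervisions_to_recordings(
        # Pass only the recording this cut belongs to, rather than the
        # full RecordingSet manifest.
        recordings=recording, supervisions=supervisions, verbose=False
    )
)
```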
Good point!
It's actually a very experimental feature, but people seem to be getting satisfying results running Whisper on podcasts, videos, their own speech, etc., so maybe it can be used for weak labeling of unlabeled datasets. I added a "dummy-friendly" interface in Lhotse that can create a full CutSet manifest with multiple supervisions per cut from just a directory of recordings. It doesn't try to be super efficient (no dataloading, batching, etc.).
Usage:
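A reconstructed usage sketch (hedged: the exact entry point `annotate_with_whisper` under `lhotse.workflows` and the `RecordingSet.from_dir` scan are assumptions based on the description above, not verbatim from this PR):

```python
from lhotse import CutSet, RecordingSet
from lhotse.workflows import annotate_with_whisper

# Scan a directory of audio files into a RecordingSet.
recordings = RecordingSet.from_dir("my_recordings/", pattern="*.wav")

# Run Whisper on every recording; each yielded cut carries one or more
# supervisions holding the transcribed segments and their timestamps.
cuts = CutSet.from_cuts(annotate_with_whisper(recordings))

cuts.to_file("cuts.jsonl.gz")
```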
The results are saved to `cuts.jsonl.gz` (can also be `cuts.jsonl`).