Bug when transcribing audio with Whisper #891
Comments
Can you trace at which point you got overlapping supervisions from Whisper and post more information? I think that generally shouldn't happen; it may be a bug in the way we parse the Whisper output.
I am having the same problem when force-aligning Whisper output with wav2vec via annotate_with_whisper: I'm getting a 14.4% failure rate due to this issue. I ran two disjoint sets of 90 speech audio files (all files contain different speech) and had 13 fail in this way in each set. The failing files vary in duration (10 to 60 minutes), noise, and accent (different UK regions), and I could not discern any obvious commonality between them. I am using the Whisper settings device="cuda", language="en", and model_name="small.en".
Can you show an example of a cut with overlapping supervisions? Maybe that would help trace whether it's a bug in how we parse the Whisper output or something else.
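For example, something along these lines could print the offending pairs (a rough sketch; `cuts` here is assumed to be the CutSet returned by `annotate_with_whisper`):

```python
# Sketch: list supervision pairs that overlap within each cut.
for cut in cuts:
    sups = sorted(cut.supervisions, key=lambda s: s.start)
    for prev, nxt in zip(sups, sups[1:]):
        if nxt.start < prev.end:  # next supervision starts before the previous one ends
            print(f"Overlap in cut {cut.id}:")
            print(f"  {prev.start:.2f}-{prev.end:.2f} {prev.text!r}")
            print(f"  {nxt.start:.2f}-{nxt.end:.2f} {nxt.text!r}")
```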
I might be seeing the same or similar issues (I think), but perhaps I'm doing something wrong. Code:

```python
from lhotse import (
    CutSet,
    Recording,
    RecordingSet,
    align_with_torchaudio,
    annotate_with_whisper,
)
from pathlib import Path
from pprint import pprint

recording = Recording.from_file(Path("./bug2.wav"))
recordings = RecordingSet.from_recordings([recording])
cuts = annotate_with_whisper(recordings, model_name="tiny", device="cuda")
cuts_aligned = align_with_torchaudio(cuts)
next(cuts_aligned)
```

Error: [...]
Added logging of the cut, here is the full output: https://gist.github.com/mattzque/0e048e68ff5a0034a8c7d21511d2e2b0

When I change the model from tiny to medium, I get this error: [...]
Full output of the cut again: https://gist.github.com/mattzque/6762b16ea6645454c1cb250a86aecb5b

Using lhotse==1.11.0 and whisper==1.1.10
I think both of these bugs may be related to the timestamp post-processing that is performed on the segments generated by Whisper (see here). We truncate a segment's end time to the start time of the next segment, so that no overlapping segments are generated. I think a small bug might be that the segments are not sorted by start time before the post-processing. Could you try adding a sort by start time before that step and checking whether it resolves the issue?

However, I feel that there is a larger issue here. This kind of post-processing probably causes a lot of long segments to get truncated. Consider the following case: Whisper recognizes a long segment and a short overlapping segment. If we apply the post-processing, the long segment's end will get truncated to the start time of the short segment, which cuts away a lot of the audio. I feel like we should disable the post-processing by default and only provide it as an additional option to be enabled.

This ties into the issue with overlapping supervisions in the forced alignment workflow as follows: the alignment workflow asserts that cuts have no overlapping supervisions, so any overlaps left in the Whisper output will trigger the error reported above.
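To make the failure mode concrete, here is a rough sketch of that truncation idea on plain (start, end, text) tuples (not the actual lhotse code):

```python
def force_non_overlapping(segments):
    # Sort by start time first, then truncate each end to the next segment's start.
    segments = sorted(segments, key=lambda s: s[0])
    out = []
    for i, (start, end, text) in enumerate(segments):
        if i + 1 < len(segments):
            end = min(end, segments[i + 1][0])
        out.append((start, end, text))
    return out

# A long segment overlapped by a short one: the long segment loses most of its audio.
print(force_non_overlapping([(0.0, 30.0, "long"), (2.0, 4.0, "short")]))
# -> [(0.0, 2.0, 'long'), (2.0, 4.0, 'short')]
```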
If I add this before the assertion in [...]:

```python
for cut in cuts:
    supervisions = cut.supervisions
    supervisions = sorted(supervisions, key=lambda s: s.start)
    cut.supervisions = supervisions
    assert not cut.has_overlapping_supervisions, [...]
```

I get the same error "We don't support forced alignment of cuts with overlapping supervisions (cut ID: 'bug2')". If I remove the assertion, I get this error: [...]
No, I meant that you may need to bypass the whole [...]
Also, I meant to add the sorting at the start of the [...]
But I think that Whisper cannot recognize/annotate overlapping speech segments; it would simply show them as a single recognition/segment. I don't really understand where the overlaps are coming from. I'll try to find some time to debug this, but it might have to be after the holidays.

If we must live with overlapping segments from Whisper, then I agree, this solution looks reasonable.
I looked at the Whisper output on the audio, and it seems the timestamps from Whisper itself are erroneous on that segment [...]
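For anyone who wants to check the raw timestamps independently of lhotse, here is a minimal sketch using the openai-whisper package directly (with the `bug2.wav` example file from above):

```python
import whisper

# Sketch: flag overlapping timestamps in the raw Whisper output.
model = whisper.load_model("tiny")
result = model.transcribe("bug2.wav", language="en")
segments = result["segments"]
for prev, nxt in zip(segments, segments[1:]):
    if nxt["start"] < prev["end"]:
        print("Whisper itself emitted overlapping timestamps:")
        print(prev["start"], prev["end"], prev["text"])
        print(nxt["start"], nxt["end"], nxt["text"])
```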
Thanks for looking into this; see also the example audio file I attached above.
Thanks guys!

@mattzque the above PR should fix your issue.
This PR addresses #891.
* Remove segments with non-positive duration from the Whisper output
* Segment post-processing to force non-overlapping segments is made optional (disabled by default)
* Allow overlapping segments in the forced alignment workflow
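For illustration only (this is not the PR's actual code), the first two bullets amount to something like the following, applied to Whisper's raw segment dicts:

```python
def clean_segments(segments, force_non_overlapping=False):
    # Drop segments with non-positive duration.
    segments = [s for s in segments if s["end"] > s["start"]]
    if force_non_overlapping:
        # Optional, now disabled by default: truncate each end to the next start.
        segments = sorted(segments, key=lambda s: s["start"])
        for cur, nxt in zip(segments, segments[1:]):
            cur["end"] = min(cur["end"], nxt["start"])
    return segments
```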
@desh2608 Thank you! This fixes the issues for me!

Resolved by #928.
Hi there,
I am currently working on a project where we first transcribe audio recordings with Whisper and then force align them with Wav2Vec.
When I pass the cuts generated by the method "annotate_with_whisper" directly into the method "align_with_torchaudio", I get the following error:
AssertionError: We don't support forced alignment of cuts with overlapping supervisions (cut ID: 'A011221._001_TR6')
When I try to trim the supervisions, I run into another issue where supervisions with a negative duration suddenly appear.
So far, I could not find a fix myself. Do you have any suggestions?
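In case it helps anyone hitting the same thing, a small sketch to surface the negative-duration supervisions described above (assuming `cuts` is the CutSet returned by `annotate_with_whisper`):

```python
# Sketch: list supervisions whose duration is not positive after trimming.
for cut in cuts:
    for sup in cut.supervisions:
        if sup.duration <= 0:
            print(f"cut={cut.id} start={sup.start:.2f} duration={sup.duration:.2f} text={sup.text!r}")
```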